Shahzad Bhatti Welcome to my ramblings and rants!

April 14, 2026

How Not to Write a Design Document

Filed under: Computing,Uncategorized — admin @ 8:22 pm

I have written design docs in large organizations where they were mandatory, and in startups where nobody asked for them. I still wrote them because I hate expensive surprises. A good design doc is the cheapest place to catch bad assumptions. It is where you discover that the problem is not what the team thinks it is, that the current system is ugly for a reason, that the migration is harder than the redesign.

A bad design doc does the opposite. It makes the solution sound inevitable, skips trade-offs, and pushes the hard questions into implementation. That feels fast right up until production starts collecting interest on every shortcut. Years ago, many teams overdesigned everything. Then Agile arrived, BDUF became taboo, and that correction was needed. But like most pendulum swings in software, we overcorrected. “Don’t overdesign” slowly became “don’t think too much.” That is usually how bad design docs fail: not in review, but later, in production. This post is about those failures.


A design doc is not documentation

A design doc is not a status update. It is not proof that architecture was “discussed” and we can start coding. A design doc is a decision document. It should answer a small number of questions clearly:

  • What problem are we solving?
  • What is wrong with the current system?
  • What options did we consider?
  • Why is this option better?
  • What does it cost us?
  • How will it behave in production?
  • How will we deploy it, test it, observe it, and back it out?

If the document cannot answer those questions, it is not a design doc. It is a sales pitch. The biggest value of a design doc is that it forces clarity. Full sentences are harder to write than bullets. They expose fuzzy thinking. They expose fake trade-offs. If you cannot explain the problem crisply in prose, you probably do not understand it well enough to build the solution.


Not every task needs a design doc

I am not arguing for a memo before every commit. But if the change has a large blast radius, touches customer-facing behavior, takes weeks or months to implement, adds new dependencies, or changes the operational model, then skipping the design doc is usually just deferred thinking. A proof of concept can help explore a technology. It cannot make the design decision for you.

That is another trap teams fall into. They build a small prototype, get something working, and then quietly promote the prototype into the architecture. A PoC can answer whether something is possible. It rarely answers whether it is the right choice once requirements, scale, operations, migration, and failure modes enter the picture.


Common design document anti-patterns

1. The doc starts with the solution

This is the most common failure. The title says:

  • “Move to Event-Driven Architecture”
  • “Build a Shared Workflow Engine”
  • “Adopt gRPC Internally”

By page two, the author is trying to invent a problem that justifies the answer already chosen. That is not design. That is confirmation bias. A real design doc starts with pain:

  • what is broken,
  • who feels it,
  • how often it happens,
  • what it costs,
  • and why now matters.

If the first section cannot explain the problem without naming the preferred technology, the doc is already weak.


2. The problem statement is vague

Bad docs hide behind words like: scalable, flexible, reliable, modern, future-proof. Those words mean nothing without numbers and constraints. Scalable to what? Reliable under what failure mode? A good design doc can explain the problem in one simple sentence. That sentence does not need to be clever. It needs to be clear.


3. No current-state analysis

A surprising number of redesigns are written as if the current system is too embarrassing to discuss. That is a mistake. Before proposing change, the document must explain:

  • what exists today,
  • what works,
  • what does not,
  • what improvements were already tried,
  • and which constraints came from history rather than incompetence.

Otherwise the new design floats in empty space. Reviewers cannot judge whether the proposal is necessary, proportional, or even safer than what exists now. I have seen teams rebuild old mistakes in new codebases because nobody bothered to explain why the old system looked the way it did.


4. No explicit decision points

One of the easiest ways to waste a review is to leave nobody sure what decision is actually needed. You invite ten people. You walk through twelve pages. You get comments on naming, schemas, and edge cases. Then the meeting ends with “good discussion.” Good discussion about what? A strong design doc names the decisions up front:

  • Should this stay synchronous or become asynchronous?
  • Should we improve the current system or replace it?
  • Should we optimize for near-term delivery or long-term reuse?
  • Should this roll out in phases or all at once?

If reviewers do not know what they are approving, the meeting is not a design review. It is architecture theater.


5. Only one option is presented

A doc with one option is not doing design. It is asking for permission. A real alternatives section should compare at least:

  1. the current system,
  2. an incremental improvement,
  3. a larger redesign.

And it should evaluate each one against the same criteria: complexity, delivery time, migration cost, operational risk, long-term fit, and rollback difficulty. Weak alternatives are easy to spot. They exist only to make the preferred answer look inevitable. That is not analysis. That is stage lighting.


6. The doc is all diagrams and no behavior

A bad architecture diagram looks clean because it omits every painful thing.

What is missing?

  • retries/timeouts,
  • queues,
  • failure paths,
  • consistency model,
  • startup/shutdown behavior,
  • observability,
  • rollout boundaries.

A useful design doc explains system behavior, not just topology.

A diagram should force the hard questions, not hide them.


7. “Flexible” is used to hide indecision

This shows up everywhere: the generic workflow engine, the abstraction layer, the configurable state machine, the future-proof resource model, the plugin architecture. Flexibility is not free. It adds code, states, tests, docs, and future confusion. If the document argues for flexibility, it should name the exact variation it is buying. Otherwise “flexible” usually means “we do not want to decide yet.”


8. No stakeholders, only authors

A design doc written as if only the authors matter is usually missing half the constraints. A strong document names:

  • customers/downstream consumers,
  • partner teams,
  • SRE or operations owners,
  • security and compliance reviewers,
  • migration owners,
  • and the people who will actually operate the result.

9. No supporting data

Many bad docs are built entirely on intuition: “customers want this”, “performance is a concern”, “the current solution does not scale”. Maybe. But show me. Use data where it matters:

  • latency numbers,
  • failure rates,
  • support burden,
  • cost profile,
  • customer pain,
  • migration friction,
  • adoption gaps.

And if the data is incomplete, say so. Honest uncertainty beats fake precision every time.


10. The document ignores requirements and jumps to implementation

A lot of docs rush into endpoints, services, queues, schemas, and state machines before they have separated:

  • business requirements,
  • technical requirements,
  • non-requirements,
  • and nice-to-haves.

That is how teams build the implementation they like instead of the system the problem actually requires. A good design doc works backward from requirements. It does not reverse-engineer requirements from the chosen design.


11. Functional requirements are detailed, non-functional ones are hand-wavy

This is one of the most expensive mistakes in design docs. The author carefully explains resource models and workflows. Then the non-functional requirements get three weak lines: “must be secure, must be scalable, must be observable.” A serious design doc must be concrete about:

  • latency and performance,
  • availability and recovery,
  • scale assumptions,
  • capacity limits,
  • security boundaries,
  • privacy impact,
  • cost,
  • testing,
  • operations,
  • visibility,
  • monitoring,
  • alarming,
  • and release strategy.

Most painful incidents come from things that were “out of scope” in design but very much in scope in reality.


12. Observability is missing or lacking

This is the fastest path to production blindness. Bad docs do not define:

  • what metrics matter,
  • what logs matter,
  • what traces matter,
  • what dashboards must exist,
  • what alerts page on-call,
  • how operators diagnose dependencies, latency, or error spikes.

If the document cannot answer, “How will on-call debug this at 2 a.m.?” it is incomplete.


13. No test plan

“Unit tests will cover this” is not a test strategy. A real design doc should say how the change will be validated across:

  • unit tests,
  • integration tests,
  • end-to-end tests,
  • load tests,
  • canaries,
  • failure injection,
  • rollback validation,
  • and game days where appropriate.

A system that cannot be tested safely cannot be changed safely.


14. No deployment or release plan

The code path is described. The rollout path is not. Bad docs ignore:

  • phased rollout,
  • canaries,
  • feature flags,
  • cell or region rollout,
  • migration sequencing,
  • readiness checks,
  • automatic rollback,
  • launch criteria,
  • and customer onboarding gates.

Good design does not stop at build-time behavior. It includes how the system gets to production without hurting customers.


15. No rollback story

A deployment section without a rollback section is half a design. What happens if:

  • the canary regresses latency,
  • the schema change is wrong,
  • the queue backs up,
  • downstream clients fail,
  • or the new workflow leaves resources in a mixed state?

Every risky design needs a big red button. Not a vague hope. A real action:

  • stop traffic,
  • disable the feature,
  • revert the config,
  • drain the workers,
  • route to a degraded path,
  • return a controlled error,
  • or restore the last known good state.

If rollback is an afterthought, the rollout plan is fiction.


16. The doc describes the steady state but not the failure state

Most architecture docs assume every dependency is healthy and every component behaves. Real systems do not. A strong design doc explains:

  • what happens when a dependency times out,
  • when startup occurs during an outage,
  • when shutdown interrupts in-flight work,
  • when a rollout fails halfway,
  • and when rollback itself is imperfect.

17. The document is too long because it has no spine

Some docs are not too detailed. They are simply undisciplined. They include: screenshots, random notes, every edge case ever mentioned, and multiple separable topics jammed into one review. If the document cannot be read and discussed in one serious session, it is probably trying to do too much. Split the deep dives. Split the migration plan. Split the deployment details. Keep the core decision document focused on the actual decision.


18. The appendix carries the real argument

The main doc is vague. The important material is buried in appendices or links. That is backwards. The appendix should support the argument, not contain it. If reviewers need four extra docs to understand the recommendation, the author has not done the work.


19. The writing is vague because the thinking is vague

This is where writing quality matters more than most engineers admit. Weak design docs hide behind passive voice, overloaded jargon, bullets that dump unrelated ideas, and paragraphs that never land a clear point. Bad writing is often a design smell. The fastest way to discover a weak design is often to force it into full sentences. Full sentences make you commit to claims, assumptions, and trade-offs. They remove the hiding place. Writing is not separate from design. Writing is where the design proves whether it makes sense.


20. The review process is treated as ceremony

This is another place where teams lose value. They schedule a review too early, or too late. They invite the wrong people. They do not define the decisions needed. They edit the document while people are reading it. They leave without summarizing outcomes. Then they schedule a second review without properly addressing the first. A review should have a point:

  • what decision needs to be made,
  • who must be in the room,
  • what feedback is blocking,
  • what can be handled offline,
  • and what the next step is.

Reviewer time is expensive. Churn is self-inflicted damage.


21. No path forward after approval

Another common failure: the document ends at “approved.” No phases, no milestones, no follow-up docs, no migration steps. Approval is not the end of the design. It is the start of accountable execution. A design doc should leave the reader knowing what happens next.


22. No ADRs or recorded decisions

The meeting happens. Trade-offs are discussed. A few choices are accepted. Then nothing is written down. Six months later nobody remembers:

  • why sync beat async,
  • why replacement beat incremental improvement,
  • why a dependency was accepted,
  • or why a future extension was deferred.

That is how architecture drifts. If a decision matters enough to debate, it matters enough to record.
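
An ADR does not need to be heavyweight. Here is a minimal sketch of what recording one of the decisions above could look like; the number, dates, and details are illustrative, not a mandated format:

ADR-017: Keep order processing synchronous
Status: Accepted (2026-04-10)
Context: Async processing was proposed to absorb traffic spikes, but p99
  latency is within budget and the team has no queue operations experience.
Decision: Stay synchronous; revisit if sustained load exceeds twice current peak.
Consequences: Simpler rollback and debugging now; a future move to async
  will require a follow-up design doc.

Five lines is enough. The point is that the trade-off and its expiry condition are written down where the next engineer will find them.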


23. The doc has no long-term point of view

This appears in two forms. The first is naive short-termism: the document solves the immediate issue but never explains where the architecture is heading. The second is fake future-proofing: the design becomes bloated with speculative flexibility. The right middle is simple:

  • say what this design intentionally does not solve,
  • state how it fits long-term goals,
  • and explain whether it can evolve in stages.

24. The document reads like it is trying to get approved, not trying to be right

This is the meta anti-pattern behind all the others. You can feel it when reading: the tone is too certain, the trade-offs are too clean, the unknowns are hidden, and the alternatives are weak. The best docs do not sound like that. They sound like real engineering:

  • here is the problem,
  • here is the current state,
  • here are the options,
  • here is why I prefer this one,
  • here is what it costs,
  • here is what can go wrong,
  • and here is what I still do not know.

That tone earns trust. The polished sales pitch does not.


The essential sections every good design doc should include

This is the part too many teams skip or dilute. If these sections are weak, the design is weak.

1. Executive summary and purpose

Keep it short. State the problem, the proposed direction, and the exact decision needed. This section should make it obvious why the reviewer is reading the document.

2. Background, problem statement, and current state

Explain what led to this proposal, what is working, what is not, what previous attempts were made, and why the current system is no longer enough.

3. Proposal, stakeholders, and supporting data

This is the core decision section. It should include the preferred option, stakeholders, supporting evidence, assumptions, constraints, risks, and whether the decision is reversible or one-way.

4. Architecture

This section should include a diagram, but also explain components, interactions, dependencies, data flow, control flow, consistency boundaries, and failure paths.

5. Alternatives

Compare the chosen approach with real alternatives: current state, incremental improvement, broader redesign. Use the same criteria for all of them. Be candid about the downsides of your preferred option.

6. Functional requirements

This section should cover interfaces, workflows, dependencies, data model or schema changes, lifecycle states, scalability assumptions, and reasons for adopting new technologies.

7. Non-functional requirements

This section should include performance, scale, availability, fault tolerance, rollback and recovery, security, privacy, compliance, testing, cost, operations, visibility, monitoring, and on-call support.

8. Future plans, release plan, and appendices

It should close with phased delivery, rollout gates, migration plan, open questions, references, FAQ, glossary, and a change log. Do not use appendices to smuggle in major new arguments. Use them to support the story the main document already told.


Writing advice most engineers ignore

This part matters because bad writing usually exposes bad thinking.

  • Keep the narrative tight: A design doc should read like an argument, not like a paste dump. The table of contents should tell a story: problem, current state, options, recommendation, trade-offs, rollout. If the table of contents itself is confused, the design probably is too.
  • Use full sentences: Bullets are useful. They are not enough. Full sentences force the author to commit to claims, assumptions, and trade-offs. They expose fuzzy logic faster than any architecture diagram.
  • Keep it short enough to review: If the document cannot be read and discussed in one serious session, split it. High-level design, deep dives, migration strategy, deployment details, and error-handling internals do not always belong in the same review.
  • Use diagrams carefully: Diagrams should reduce ambiguity, not add decoration. Name them, keep them consistent, and use them to show boundaries and flows.
  • Define acronyms once: Every team overestimates how obvious its vocabulary is. The doc should not require tribal knowledge to parse it.
  • Do not hide the hard part in links: Links reduce clutter. They do not replace the core argument. The main decisions must be understandable from the document itself.

What good looks like

A good design doc is not flashy. It is specific, honest and operational. It makes trade-offs visible. It gives reviewers something real to approve or reject. Most importantly, it treats writing as engineering work. The quality of the writing often exposes the quality of the thinking. If the problem is fuzzy, the writing will be fuzzy. If the decision is weak, the language will hide behind buzzwords. If the architecture has no operational model, the document will go strangely quiet around deployment, monitoring, and rollback.


Final thought

People say design docs slow teams down. Bad ones, ceremonial ones, bloated ones do. Good design docs save time because they move the expensive mistakes earlier, when they are still cheap. The real waste is not spending an extra day writing a serious design doc. The real waste is spending eighteen months undoing a design that nobody challenged properly because the document never forced the right conversation. That is how not to write a design document.

API Anti-Patterns: 50+ Mistakes That Will Break Your Production Systems

Filed under: Computing,Microservices — admin @ 2:25 pm

Over the past years I have written extensively about what makes distributed APIs fail. In How Abstraction Is Killing Software I showed how each layer crossing a network boundary multiplies latency and failure probability. In Transaction Boundaries: The Foundation of Reliable Systems and How Duplicate Detection Became the Dangerous Impostor of True Idempotency, I showed how subtle contract violations produce data corruption. Building Robust Error Handling with gRPC and REST, Zero-Downtime Services with Lifecycle Management, and Robust Retry Strategies for Building Resilient Distributed Systems explained error handling and operational health. My production checklist and fault tolerance deep-dive made those lessons actionable before a deployment. I also built an open-source API mock and contract testing framework, available at github.com/bhatti/api-mock-service, which addresses how few teams verify their API contracts before clients discover the gaps in production. And in Agentic AI for Automated PII Detection I showed how AI-driven scanning can find the sensitive data leaking through APIs that manual review misses. Here, I cover 50+ anti-patterns across seven categories, each with a real-world example. Two laws sit at the foundation of everything that follows.

Hyrum’s Law: With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

Postel’s Law (the Robustness Principle): Be conservative in what you send, be liberal in what you accept.


The Anatomy of an API Failure

The diagram below maps where anti-patterns activate in a production request lifecycle. Red nodes are failure hotspots.


Section 1: API Design Philosophy Anti-Patterns

Design philosophy determines everything downstream.


1.1 Bottom-Up API Design: Annotation-Driven and Implementation-First

I have seen this pattern countless times: the team builds the service, then adds Swagger/OpenAPI annotations to the Java or TypeScript classes to generate the API spec automatically. The spec is an artifact of the implementation, and field names are whatever the ORM column is called. Endpoints are organized around the service layer, not the consumer’s mental model. The spec is generated post-hoc, often incomplete, and rarely reviewed before clients onboard.

In the end, you get an API that perfectly describes your internal implementation and is poorly shaped for external callers. Names leak internal terminology. Refactoring the implementation silently changes the API contract. The APIs are also strongly coupled to the UI that the same team is building, and clients who onboard during development find a moving target.

Better approach: Spec-First Design: Write the OpenAPI or Protobuf spec before writing any implementation code. Use the spec as the contract that drives both the server implementation and the client SDK. Review the spec with consumers before implementation begins. Use code generation to produce server stubs from the spec.

# spec-first: openapi.yaml is the source of truth, written before implementation
openapi: "3.1.0"
info:
  title: Order Service
  version: "1.0.0"
paths:
  /v1/orders:
    post:
      operationId: createOrder
      summary: Create a new order
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateOrderRequest'
      responses:
        '201':
          description: Order created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Order'
        '400':
          $ref: '#/components/responses/ValidationError'
        '409':
          $ref: '#/components/responses/ConflictError'

For gRPC: write the .proto file first. The proto is the spec. Code-generate both server stubs and client libraries from it. Also, Google’s API Improvement Proposals (AIP) define a spec-first methodology for gRPC APIs that also maps to HTTP via the google.api.http annotation. A single proto definition can serve both gRPC clients and REST/JSON clients through a transcoding layer (Envoy, gRPC-Gateway), giving you the performance of binary protobuf and the accessibility of JSON from one spec:

service OrderService {
  rpc CreateOrder(CreateOrderRequest) returns (Order) {
    option (google.api.http) = {
      post: "/v1/orders"
      body: "*"
    };
  }
  rpc ListOrders(ListOrdersRequest) returns (ListOrdersResponse) {
    option (google.api.http) = {
      get: "/v1/orders"
    };
  }
}

1.2 Bloated API Surface: Non-Composable, UI-Coupled APIs

Another common pattern I have seen at a lot of companies is a service with hundreds or thousands of endpoints, because every new feature needed some new data or behavior. Another artifact of poorly designed APIs is a bloated response with every field and every related resource, deeply nested, because the first consumer needed everything and nobody added projection. This often occurs because the API is built by the same team building the UI. When the UI changes, new endpoints are added rather than the existing ones being generalized.

As a result, integration without documentation becomes impossible. New clients must read everything to understand what to call. Duplicate endpoints proliferate: three different endpoints do approximately the same thing because each was built for a different screen without awareness of the others.

Composability principle: A well-designed API surface should be small enough that a competent developer can understand its structure in 30 minutes. The surface should consist of small, focused operations that can be combined.

// Anti-pattern: purpose-built for one UI screen
rpc GetCheckoutPageData(GetCheckoutPageDataRequest) returns (CheckoutPageData);
// CheckoutPageData contains customer, cart, inventory, shipping, payment — all tightly coupled to one view

// Better: composable operations that any client can combine
rpc GetCustomer(GetCustomerRequest) returns (Customer);
rpc GetCart(GetCartRequest) returns (Cart);
rpc ListShippingOptions(ListShippingOptionsRequest) returns (ListShippingOptionsResponse);
// BFF layer aggregates these for the UI — keeps the core API clean

On API surface size: prefer a small number of well-understood, stable operations over a large surface of purpose-built ones. Use field masks or projections so callers opt-in to the fields they need.
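
For gRPC APIs, field masks make that opt-in explicit in the contract. Here is a sketch in the style of Google’s AIP-157 partial responses; the message and field names are illustrative:

import "google/protobuf/field_mask.proto";

message GetOrderRequest {
  string order_id = 1;
  // Caller opts in to exactly the fields it needs,
  // e.g. read_mask.paths = ["order_id", "status", "items"].
  // An empty mask can default to a small, documented field set.
  google.protobuf.FieldMask read_mask = 2;
}

For REST, the usual equivalent is a fields query parameter, e.g. GET /v1/orders/o-123?fields=id,status,items, so one endpoint serves both the list screen and the detail screen without a bloated default payload.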


1.3 Improper Namespace and Resource URI Design

Most companies provide REST-based APIs, but endpoints are often organized around verbs instead of resources: /getOrder, /createOrder, /deleteOrder, /updateOrderStatus. There is no consistent hierarchy. Related resources are scattered across URL spaces: /orders, /order-history, and /customer-purchases all refer to variants of the same concept with no clear relationship. Different teams own overlapping namespaces. A service called UserService has endpoints for users, preferences, addresses, payment methods, and audit logs with no sub-resource structure.

The fundamental concept in REST is that URLs identify resources with nouns and HTTP verbs express actions on those resources. A resource hierarchy expresses relationships. This is not an aesthetic preference; it is the architectural model that makes REST APIs predictable without documentation.

# Anti-pattern: verb-based, flat, unorganized
GET    /getUser?id=123
POST   /createOrder
POST   /updateOrderStatus
GET    /getUserOrders?userId=123
DELETE /cancelOrder?orderId=456
GET    /getOrderHistory?customerId=123

# Correct: resource-oriented hierarchy
GET    /v1/users/{userId}                        # get user
POST   /v1/orders                                # create order
PATCH  /v1/orders/{orderId}                      # partial update (including status)
GET    /v1/users/{userId}/orders                 # orders for a user
DELETE /v1/orders/{orderId}                      # cancel order
GET    /v1/users/{userId}/orders?status=completed # filtered history

Namespace discipline: Keep related resources under the same base path. OrderService owns /v1/orders/**. UserService owns /v1/users/**. Related sub-resources live under their parent: /v1/orders/{orderId}/items, /v1/orders/{orderId}/events. Do not scatter related concepts across different roots based on internal team ownership.

Avoiding duplicate APIs: Before creating a new endpoint, ask whether an existing one can be parameterized to serve the new use case.
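
For example, a hypothetical “recent shipped orders” feature rarely needs its own endpoint; the existing list operation can usually absorb it with filter, sort, and paging parameters (paths and parameter names below are illustrative):

# Instead of adding a purpose-built endpoint for one screen:
GET /v1/users/123/recent-shipped-orders

# Parameterize the existing list endpoint:
GET /v1/users/123/orders?status=shipped&order_by=created_at+desc&page_size=10

The second form keeps the surface small, and the next feature that needs “recent cancelled orders” costs a query parameter instead of another endpoint.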


1.4 The Execute Anti-Pattern: Bag of Params for Different Actions

In contrast to the bloated surface, this anti-pattern reuses the same endpoint for different actions depending on which parameters are present. The operation is effectively execute(action, params...) with a bag of optional fields, where different combinations of fields trigger different code paths.

// Anti-pattern: one RPC that does many things depending on type
message ProcessOrderRequest {
  string order_id = 1;
  string action = 2;           // "cancel", "ship", "refund", "update", "hold"
  string cancel_reason = 3;    // only used when action = "cancel"
  string tracking_number = 4;  // only used when action = "ship"
  double refund_amount = 5;    // only used when action = "refund"
  Address new_address = 6;     // only used when action = "update"
  string hold_until = 7;       // only used when action = "hold"
}

It feels like one operation (“do something with this order”), it minimizes the number of endpoints, and it is easy to add a new action without changing the RPC signature.

The result: callers cannot understand what the operation does without documentation explaining every action variant. Validation becomes a conditional maze: cancel_reason is required when action = "cancel" but ignored otherwise. Generated SDK method signatures carry no useful type information. Tests multiply exponentially.

Better approach: Separate operations for separate actions. Use oneof in protobuf for requests that have genuinely mutually exclusive parameter sets:

// Better: explicit operations, each with a clear contract
rpc CancelOrder(CancelOrderRequest) returns (Order);
rpc ShipOrder(ShipOrderRequest) returns (Order);
rpc RefundOrder(RefundOrderRequest) returns (Refund);

message CancelOrderRequest {
  string order_id = 1;
  string reason = 2;   // always relevant, always validated
}

// If you truly need a polymorphic command, use oneof to make it explicit:
message UpdateOrderRequest {
  string order_id = 1;
  oneof update {
    ShippingAddressUpdate shipping_address = 2;
    StatusUpdate status = 3;
    ContactUpdate contact = 4;
  }
  // oneof makes it structurally impossible to send two update types at once
  // Generated SDKs expose typed accessors — no stringly-typed action field
}

gRPC’s required/optional semantics: proto3 makes all fields optional by default. Use proto3’s optional keyword explicitly when a field’s absence carries meaning. You can use Protocol Buffer Validation to add more validation and enforce it in your boundary validation layer.
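
A sketch of explicit field presence in proto3 (message and field names are illustrative):

syntax = "proto3";

message UpdateUserRequest {
  string user_id = 1;
  // Without `optional`, an unset display_name is indistinguishable from "".
  // With it, the generated code exposes a presence check (e.g. has_display_name()),
  // so the server can tell "clear this field" apart from "leave it unchanged".
  optional string display_name = 2;
}

This matters most for update operations, where “the caller did not send this field” and “the caller sent the zero value” must mean different things.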


1.5 NIH Syndrome: Custom RPC Protocols Instead of Standards

At other places, I have seen teams build their own binary protocol over raw TCP because “gRPC has too much overhead.” The protocol has custom framing, error codes, and multiplexing; it runs on a non-standard port and needs special firewall rules. More often the cause is NIH (Not Invented Here) syndrome: believing that the standard tools are not good enough, combined with underestimating the operational cost of maintaining a custom protocol.

In the end, custom protocols do not work through corporate proxies, CDNs, API gateways, or load balancers that only speak HTTP. Many enterprise environments permit only HTTP/HTTPS outbound and a custom port means the integration simply cannot be used. Tools like Wireshark, curl, Postman, and every observability platform will not understand your protocol. Debugging becomes dramatically harder because the entire ecosystem of HTTP tooling is unavailable.

What standard protocols actually give you:

Protocol             Best For                                            Transport          Streaming
REST/HTTP            Public APIs, broad compatibility                    HTTP/1.1, HTTP/2   No (use SSE)
gRPC                 High-performance internal services, strong typing   HTTP/2             Yes (4 modes)
WebSocket            Bidirectional real-time communication               HTTP upgrade       Yes (full-duplex)
GraphQL              Flexible queries, client-driven shape               HTTP/1.1, HTTP/2   Subscriptions
Server-Sent Events   Server-push notification                            HTTP/1.1           Server-to-client

1.6 Badly Designed Streaming APIs

This is similar to the previous pattern: a team that needs real-time data push builds a polling endpoint (GET /events?since=<timestamp>) and expects clients to poll every second. Or it uses raw sockets that send large JSON blobs because “it’s streaming.” Or it uses gRPC streaming but sends the entire dataset in one message instead of streaming rows incrementally. Or it builds a custom long-polling mechanism with complex session state when SSE would have been simpler.

  • gRPC streaming modes:
service DataService {
  // Unary: single request, single response — most operations
  rpc GetOrder(GetOrderRequest) returns (Order);

  // Server streaming: one request triggers a stream of responses
  // Use for: sending large datasets, live feeds, log tailing
  rpc TailOrderEvents(TailOrderEventsRequest) returns (stream OrderEvent);

  // Client streaming: stream of requests, one response
  // Use for: bulk ingest, file upload in chunks
  rpc BulkCreateOrders(stream CreateOrderRequest) returns (BatchCreateOrdersResponse);

  // Bidirectional streaming: both sides stream independently
  // Use for: real-time chat, collaborative editing, game state sync
  rpc SyncOrderState(stream OrderStateUpdate) returns (stream OrderStateUpdate);
}
  • WebSocket is the correct choice for full-duplex browser communication where you need persistent connections with low latency in both directions. It upgrades from HTTP, passes through standard proxies, and is supported universally.
  • Server-Sent Events (SSE) is the correct choice for server-push-only scenarios (notifications, live dashboards) where the client only needs to receive, not send. SSE is HTTP.
  • Never build: custom TCP streaming, custom HTTP long-polling with complex session management, or custom binary framing when gRPC already provides exactly that.

1.7 Ignoring Encoding: JSON Everywhere Regardless of Cost

This anti-pattern surfaces when a high-throughput connection between two internal microservices you control uses JSON over HTTP/1.1 because “it’s simple.” The services process millions of messages per second, serializing and deserializing large JSON payloads. The payload includes deeply nested structures with long field names repeated in every message. No compression. No binary encoding.

The performance reality: JSON is human-readable text with significant overhead:

  • Field names are repeated in every object (bandwidth and parse cost)
  • No schema enforcement at the encoding layer
  • No native binary type (base64 for bytes adds ~33% overhead)
  • UTF-8 string parsing is CPU-intensive at high throughput

Protobuf binary encoding is typically 3–10× smaller than equivalent JSON and 5–10× faster to serialize/deserialize at high volume. For internal service-to-service communication at scale, this is not a micro-optimization, it is a significant infrastructure cost difference.
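The field-name overhead is easy to demonstrate without any protobuf tooling. A minimal sketch comparing a JSON array of messages against a schema-driven fixed-width packing of the same values (the message shape and field names are illustrative):

```python
import json
import struct

# Sketch: field names repeated in every message dominate JSON payload size.
# 1,000 order events, each with three numeric fields. The binary row packs
# the same values with no field names, the way a schema-driven encoding does.
events = [{"order_id": i, "status_code": 2, "amount_cents": 1999}
          for i in range(1000)]

json_bytes = json.dumps(events).encode("utf-8")
# "<IHI" = little-endian uint32 + uint16 + uint32 = 10 bytes per event
binary_bytes = b"".join(
    struct.pack("<IHI", e["order_id"], e["status_code"], e["amount_cents"])
    for e in events)

# The schema'd binary form is several times smaller than the JSON form.
assert len(binary_bytes) == 10 * len(events)
assert len(json_bytes) > 3 * len(binary_bytes)
```

Real protobuf uses varints and tags rather than fixed-width packing, but the ratio it achieves comes from the same two facts this sketch shows: no field names on the wire, and numbers encoded as numbers rather than text.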

Better approach: Choose encoding based on the use case:

ScenarioRecommended Encoding
Public REST API, browser clientsJSON (required for broad compatibility)
Internal service-to-service (high throughput)Protobuf binary over gRPC
Internal service-to-service (moderate)JSON over HTTP/2 with compression is acceptable
Mixed: public + internal clientsgRPC with HTTP/JSON transcoding via AIP
Event streaming (Kafka, Kinesis)Avro or Protobuf with schema registry

gRPC over HTTP/2 gives you multiplexed streams, binary encoding, strongly typed contracts, and bi-directional streaming in one package. For internal services at scale, there is rarely a justification for JSON over HTTP/1.1.

1.8 No Clear Internal/External API Boundary

In many cases, organizations use gRPC internally and REST externally, but in practice the internal gRPC APIs are never held to any standard. For example, field names are inconsistent, operations are not paginated, and there is no versioning.

  • Internal APIs become an inconsistent mess with duplicate functionality. Because internal APIs have no governance, each team designs theirs in isolation. Team A has GetUserProfile. Team B has FetchUser. Team C has LookupUserById. The internal API surface grows without bound.
  • Internal APIs leak into the external surface. The public REST API was designed conservatively, returning only what external callers need. But an internal team needs the same resource with additional fields. Rather than adding a projection or a scoped access tier, the quickest path is to promote the internal API endpoint. Over time, the line between “public” and “internal” API blurs. External clients discover undocumented internal fields (Hyrum’s Law again) and start depending on them.

Better approach — treat internal and external APIs as two tiers of the same governance model:

External API (public)         Internal API (private)
---------------------         ----------------------
Same naming conventions       Same naming conventions
Same error shape              Same error shape
Same pagination model         Same pagination model
Same versioning policy        Same versioning policy — yes, even internally
Minimal response fields       Additional fields gated by internal scope/role
OpenAPI spec enforced         Proto spec enforced with protoc-gen-validate
Published SLA                 Published SLA (even if internal)
Contract tests in CI          Contract tests in CI

The key discipline is that internal APIs must follow the same standards as public APIs for naming, versioning, error shapes, and pagination. The only differences are the data they expose and the authentication model.

Handling the “extra fields” problem: use scoped projections rather than separate endpoints:

message GetOrderRequest {
  string order_id = 1;

  // Callers with INTERNAL_READ scope receive all fields.
  // External callers receive only the public projection.
  // The same RPC serves both — authorization determines the projection.
  FieldMaskScope scope = 2;
}

enum FieldMaskScope {
  FIELD_MASK_SCOPE_PUBLIC = 0;    // external callers: customer-visible fields
  FIELD_MASK_SCOPE_INTERNAL = 1;  // internal callers: + audit, cost, state flags
  FIELD_MASK_SCOPE_ADMIN = 2;     // ops callers: + all internal diagnostics
}

message Order {
  // Public fields — always returned
  string order_id = 1;
  OrderStatus status = 2;
  google.protobuf.Timestamp created_at = 3;

  // Internal fields — returned only to FIELD_MASK_SCOPE_INTERNAL callers
  // Stripped at the API gateway for external requests
  string internal_routing_key = 100;
  CostAllocation cost_allocation = 101;

  // Admin fields — returned only to FIELD_MASK_SCOPE_ADMIN callers
  repeated AuditEvent audit_trail = 200;
}

This approach keeps one canonical API, one proto spec, one set of tests. The authorization layer determines which fields a caller receives. The API gateway strips internal fields from external responses. The same spec, with scope annotations, documents both tiers.
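The gateway-side stripping can be sketched in a few lines of Python. The field tiers mirror the proto field-number ranges above; the function and scope names are illustrative, not from any real gateway:

```python
# Sketch of gateway-side field stripping by caller scope.
PUBLIC_FIELDS = {"order_id", "status", "created_at"}
INTERNAL_FIELDS = PUBLIC_FIELDS | {"internal_routing_key", "cost_allocation"}
ADMIN_FIELDS = INTERNAL_FIELDS | {"audit_trail"}

SCOPE_TO_FIELDS = {"PUBLIC": PUBLIC_FIELDS,
                   "INTERNAL": INTERNAL_FIELDS,
                   "ADMIN": ADMIN_FIELDS}

def project(order: dict, scope: str) -> dict:
    """Return only the fields the caller's scope permits."""
    allowed = SCOPE_TO_FIELDS[scope]
    return {k: v for k, v in order.items() if k in allowed}

order = {"order_id": "o-1", "status": "CONFIRMED", "created_at": "2026-01-01",
         "internal_routing_key": "rk-9", "cost_allocation": "team-a",
         "audit_trail": ["created", "confirmed"]}

assert set(project(order, "PUBLIC")) == PUBLIC_FIELDS
assert "audit_trail" not in project(order, "INTERNAL")
assert "audit_trail" in project(order, "ADMIN")
```

The same handler serves every tier; only the projection applied at the boundary differs.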

On internal API governance: internal APIs need the same review gates as public APIs, even if the review is lighter. Some organizations enforce this via a service registry where every internal API must be registered, and the registry enforces naming and schema standards automatically.

1.9 Mixing Control-Plane and Data-Plane APIs

This anti-pattern occurs when a single API service handles both resource management (create a cluster, update a configuration, rotate a secret) and the high-frequency operational traffic that those resources serve (process a transaction, ingest a telemetry event). The same service, the same load balancer, the same deployment unit. A configuration change that causes a brief control-plane outage also takes down the data plane. A traffic spike on the data plane starves the management operations that operators need most during an incident.

Defining the planes: these terms come from networking and are now standard in cloud platform design.

Plane         | Purpose                                   | Typical TPS              | Latency requirement      | Caller
--------------|-------------------------------------------|--------------------------|--------------------------|-----------------------------
Control plane | Manage and configure resources            | Low (10s–100s/s)         | Relaxed (100ms–seconds)  | Operators, automation, UI
Data plane    | Serve the workload those resources define | High (1,000s–millions/s) | Strict (single-digit ms) | End-users, services, devices

Real-world examples of the split done correctly:

  • Kubernetes: kube-apiserver is the control plane that creates Deployments, updates ConfigMaps, and scales ReplicaSets. The actual pod-to-pod traffic it orchestrates is the data plane. A kube-apiserver brownout does not stop running pods from serving traffic.
  • AWS API Gateway: The management API (create/update/delete routes, authorizers, stages) is the control plane. The actual HTTP proxy that forwards requests to Lambda or ECS is the data plane.

The scaling difference between management traffic and operational traffic is invisible until it isn’t. The consequence is two failure modes, both serious.

  • First, data-plane load starves control-plane availability. A traffic spike on the data plane consumes all available threads, connections, and CPU. Operators cannot reach the management API to make the configuration change that would fix the problem.
  • Second, control-plane deployments risk data-plane availability. A risky configuration change deployed to the unified service takes down both planes together. A misconfigured authentication change gates all traffic, including the operational traffic that cannot tolerate any interruption.

Better approach:

Separate the planes at the service level, not just at the routing level. A reverse proxy that routes /mgmt/* to one backend and /v1/* to another on the same process does not achieve the isolation you need.

// Control-plane API — management operations, low TPS, relaxed latency
service OrderConfigService {
  // Create/update routing rules — takes effect asynchronously
  rpc UpsertRoutingRule(UpsertRoutingRuleRequest) returns (RoutingRule);
  rpc DeleteRoutingRule(DeleteRoutingRuleRequest) returns (google.protobuf.Empty);
  rpc ListRoutingRules(ListRoutingRulesRequest) returns (ListRoutingRulesResponse);

  // Capacity and rate limit configuration
  rpc SetRateLimit(SetRateLimitRequest) returns (RateLimit);

  // Returns async job — config changes propagate eventually to data plane
  rpc TriggerConfigSync(TriggerConfigSyncRequest) returns (ConfigSyncJob);
}

// Data-plane API — operational traffic, high TPS, strict latency
service OrderService {
  // Reads routing rules from LOCAL CACHE — never calls control plane in-band
  rpc CreateOrder(CreateOrderRequest) returns (Order);
  rpc GetOrder(GetOrderRequest) returns (Order);
  rpc ListOrders(ListOrdersRequest) returns (ListOrdersResponse);
}
  • Config propagation: the data plane must not call the control plane synchronously on the hot path. Configuration is pushed from the control plane to the data plane via an event stream or periodically polled and cached locally. The data plane starts with the last known good configuration and operates independently if the control plane is temporarily unavailable.
  • Deployment and SLA differences: control-plane deployments can be careful, canary-gated, and slow because the cost of a management API degradation is low (operators retry). Data-plane deployments should be fast and automated with aggressive auto-rollback because the cost of data-plane degradation is direct user impact.
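The last-known-good configuration pattern from the config-propagation bullet can be sketched as a small cache wrapper. The class name, refresh interval, and fetch callable are all illustrative:

```python
import time

# Sketch: the data plane serves from a locally cached config and keeps the
# last known good copy when the control plane is unreachable.
class ConfigCache:
    def __init__(self, fetch, refresh_interval_s: float = 30.0):
        self._fetch = fetch        # callable that contacts the control plane
        self._config = None        # last known good configuration
        self._last_refresh = 0.0
        self._interval = refresh_interval_s

    def get(self) -> dict:
        now = time.monotonic()
        if self._config is None or now - self._last_refresh >= self._interval:
            try:
                self._config = self._fetch()
                self._last_refresh = now
            except Exception:
                if self._config is None:
                    raise   # no last-known-good yet: fail fast at startup
                # otherwise keep serving the stale-but-valid config
        return self._config

# Stub control plane: succeeds once, then goes down.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("control plane unavailable")
    return {"rate_limit": 100}

cache = ConfigCache(flaky_fetch, refresh_interval_s=0.0)
assert cache.get() == {"rate_limit": 100}   # first fetch succeeds
assert cache.get() == {"rate_limit": 100}   # control plane down: last known good
```

The key property is in the except branch: once a valid configuration exists, control-plane unavailability never propagates to the data-plane hot path.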

Section 2: Contract & Consistency Anti-Patterns


2.1 Inconsistent Naming Across APIs

This anti-pattern is fairly common as APIs evolve. For example, EC2 uses CreateTags, ELB uses AddTags, RDS uses AddTagsToResource, and Auto Scaling uses CreateOrUpdateTags: four different verb shapes for the same semantic across four services.

Better approach: Establish a canonical vocabulary before first public release. For lifecycle operations: Create, Get, List, Update, Delete. Use id (server-assigned) vs name (client-specified) consistently. Use google.protobuf.Timestamp for all time values, never strings, never epoch integers.

message Order {
  string order_id = 1;                          // server-assigned ID
  string customer_name = 2;                     // client-specified name
  google.protobuf.Timestamp created_at = 3;     // typed timestamp, never string
  google.protobuf.Timestamp updated_at = 4;
  OrderStatus status = 5;                       // enum, not string, not int
}

enum OrderStatus {
  ORDER_STATUS_UNSPECIFIED = 0;  // always include; proto3 default
  ORDER_STATUS_PENDING = 1;
  ORDER_STATUS_CONFIRMED = 2;
  ORDER_STATUS_CANCELLED = 3;
}

2.2 Wrong HTTP Verb for the Operation

Despite adopting REST, I have seen companies misuse verbs: a PATCH /orders/{id} that replaces the entire resource, or a GET /reports/generate that inserts a database record.

Note on GraphQL and gRPC: Both protocols legitimately tunnel all operations through HTTP POST. This is an intentional protocol design choice and not an anti-pattern, but it must be documented explicitly, and REST-layer middleware (caches, proxies, WAFs) must be configured to account for it.

Verb   | Semantics                | Idempotent    | Safe
-------|--------------------------|---------------|-----
GET    | Retrieve                 | Yes           | Yes
PUT    | Full replace             | Yes           | No
PATCH  | Partial update           | Conditionally | No
POST   | Create / non-idempotent  | No            | No
DELETE | Remove                   | Yes           | No

2.3 Breaking API Changes Without Versioning

A breaking change without versioning can easily break clients, e.g., a field renamed from customerId to customer_id, an error code that was 400 becomes 422, a previously optional field becomes required.

Safe (no version bump): adding optional request fields, adding response fields, adding new operations, making required fields optional.

Never safe without a version bump: removing or renaming fields, changing field types, changing error codes for existing conditions, splitting an exception type, changing default behavior when optional inputs are absent.
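The safe/unsafe split can be mechanized as a CI gate. A minimal sketch, assuming schemas are snapshotted as field-name-to-type dicts (the representation and helper name are hypothetical):

```python
# Sketch: a minimal schema-diff gate that flags the "never safe" changes.
def breaking_changes(old: dict, new: dict) -> list:
    """Compare two {field_name: type} snapshots; return breaking changes."""
    problems = []
    for field, ftype in old.items():
        if field not in new:
            problems.append(f"removed field: {field}")
        elif new[field] != ftype:
            problems.append(f"type change: {field} {ftype} -> {new[field]}")
    return problems   # fields only in `new` are not flagged: additive is safe

old = {"customerId": "string", "total": "int"}

# A rename shows up as a removal (plus a safe addition) and fails the gate.
assert breaking_changes(old, {"customer_id": "string", "total": "int"}) == \
    ["removed field: customerId"]
assert breaking_changes(old, {"customerId": "string", "total": "string"}) == \
    ["type change: total int -> string"]
assert breaking_changes(old, {**old, "note": "string"}) == []   # additive: safe
```

A real gate would also check error codes and enum values, but even this dict diff catches the rename-without-versioning failure described above.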


2.4 Hyrum’s Law: Changing Semantic Behavior Without Versioning

With this anti-pattern, you fix a bug where ListOrders returned insertion order instead of alphabetical. You update an error message wording. You tighten validation. All of these feel internal. None are.

Better approach: Document everything observable. Use structured error fields (resource IDs, machine-readable codes) so clients never parse message strings. Treat any observable change including ordering, error message wording, validation leniency as potentially breaking.
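The structured-error point is worth making concrete. A sketch of a client that branches only on a machine-readable code, so rewording the message between releases changes nothing (error shapes and codes are illustrative):

```python
# Sketch: clients branch on a machine-readable code, never on message text.
error_v1 = {"code": "ORDER_NOT_FOUND",
            "message": "Order o-1 does not exist",
            "resource_id": "o-1"}
error_v2 = {"code": "ORDER_NOT_FOUND",
            "message": "No order found with the given identifier",
            "resource_id": "o-1"}

def should_retry(error: dict) -> bool:
    # Only transient codes are retryable; the message is never inspected.
    return error["code"] in {"THROTTLED", "INTERNAL"}

def is_missing(error: dict) -> bool:
    return error["code"] == "ORDER_NOT_FOUND"

# The wording changed between v1 and v2; client behavior does not.
assert is_missing(error_v1) and is_missing(error_v2)
assert not should_retry(error_v1) and not should_retry(error_v2)
```

A client that had instead matched on "does not exist" in the message string would break the moment the wording changed, which is exactly the Hyrum's Law trap.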


2.5 Postel’s Law Misapplied: Silently Accepting Bad Input

This anti-pattern occurs when an API accepts quantity: -5 and treats it as 0; when an endpoint silently drops unknown fields, then later adds a field with the same name but different semantics; or when an API accepts both camelCase and snake_case until a new field orderType collides with a legacy alias order_type.

Better approach: Be strict at the boundary. Reject invalid input with a structured ValidationException. Accept unknown fields only if explicitly designed for forward compatibility. Never silently coerce.
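A sketch of strict boundary validation in Python, rejecting both the negative quantity and the unknown field from the examples above (the exception shape mirrors the FieldViolation message later in this post; the field names are illustrative):

```python
# Sketch: strict boundary validation that rejects rather than coerces.
class ValidationError(Exception):
    def __init__(self, violations):
        super().__init__("validation failed")
        self.violations = violations   # list of (field, description)

KNOWN_FIELDS = {"item_id", "quantity"}

def validate_line_item(payload: dict) -> dict:
    violations = []
    for field in payload:
        if field not in KNOWN_FIELDS:
            violations.append((field, "unknown field"))
    qty = payload.get("quantity")
    if not isinstance(qty, int) or qty <= 0:
        violations.append(("quantity", f"must be greater than 0, got {qty!r}"))
    if violations:
        raise ValidationError(violations)   # reject; never silently coerce
    return payload

# Valid input passes through unchanged.
assert validate_line_item({"item_id": "i-1", "quantity": 3})["quantity"] == 3

# Invalid input is rejected with every violation reported at once.
try:
    validate_line_item({"item_id": "i-1", "quantity": -5, "orderType": "x"})
    raise AssertionError("should have raised")
except ValidationError as e:
    assert ("orderType", "unknown field") in e.violations
    assert any(field == "quantity" for field, _ in e.violations)
```

Collecting all violations before raising, rather than failing on the first one, saves callers a round trip per mistake.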


2.6 Bimodal Behavior

In this scenario, under normal load, ListOrders returns a complete consistent list with 200. Under high load, it silently returns a partial list still with 200.

Better approach: Your degraded paths must return consistent response shapes and correct status codes. A timeout is a 503 with Retry-After. A partial result is not a 200.


2.7 Leaky Abstractions

Examples of leaky abstractions include error messages that contain internal ORM table names, and pagination tokens that are readable base64 JSON containing your database cursor.

Better approach: Map your domain model to your API, not your implementation. Pagination tokens must be opaque, encrypted, and versioned. Internal identifiers and infrastructure topology must never be inferred from responses.


2.8 Missing or Inconsistent Input Validation

This occurs when some fields are validated strictly, others silently truncated. The same field accepts null, "", and "0" on different endpoints.

Better approach: Validate at the boundary, consistently, for every operation.

message ValidationException {
  string message = 1;          // human-readable — never parse this in code
  string request_id = 2;
  repeated FieldViolation field_violations = 3;
}
message FieldViolation {
  string field = 1;            // "order.items[2].quantity"
  string description = 2;      // "must be greater than 0, got -5"
}

Section 3: Implementation Efficiency Anti-Patterns


3.1 N+1 Queries and ORM Abuse

In this case, you might have a ListOrders endpoint that fetches the list in one query, then issues a separate query per order for customer details, then another per order for line items. With 100 orders: 201 database round trips for what should be 1.

Network cost: Each cloud database round trip costs 1–5ms. For the 100-order example above, 201 round trips mean 0.2–1 second of pure network overhead before a byte of business logic executes. As covered in How Abstraction Is Killing Software, every layer crossing a network boundary multiplies the failure surface and latency budget.

Better approach: Return summary structures with commonly needed fields. Audit query plans with production-scale data before launch. Use eager loading for related data.
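The fix is structural: one query for the parents, one batched query for the children. A sketch with an in-memory stand-in for the data access layer, counting round trips to make the difference visible (all names and data are illustrative):

```python
# Sketch: collapsing N+1 into two queries by batching the child fetch.
CUSTOMERS = {1: "Ada", 2: "Grace"}
ORDERS = [{"id": 10, "customer_id": 1},
          {"id": 11, "customer_id": 2},
          {"id": 12, "customer_id": 1}]

queries = {"n": 0}   # counts round trips to the "database"

def query_orders():
    queries["n"] += 1
    return list(ORDERS)

def query_customers_by_ids(ids):
    queries["n"] += 1   # ONE batched round trip, e.g. WHERE id IN (...)
    return {i: CUSTOMERS[i] for i in ids}

def list_orders_batched():
    orders = query_orders()
    by_id = query_customers_by_ids({o["customer_id"] for o in orders})
    return [{**o, "customer_name": by_id[o["customer_id"]]} for o in orders]

result = list_orders_batched()
assert queries["n"] == 2     # 2 round trips total, not 1 + len(orders)
assert result[0]["customer_name"] == "Ada"
```

Most ORMs expose this as eager loading (a join or an IN-list prefetch); the sketch shows what that feature is doing on your behalf.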


3.2 Missing Pagination

In this case, you might have a ListOrders endpoint that returns all results in a single response. It works at launch with small datasets. At scale, some accounts have millions of records; responses become hundreds of megabytes, timeouts multiply, and clients start crashing on deserialization. Retrofitting pagination is a breaking change: if your endpoint always returned everything and you start returning a page with a next_page_token, clients that assumed completeness silently miss data. For example, EC2’s original DescribeInstances had no pagination. As customer instance counts grew into the thousands, responses became megabyte-scale XML documents that timed out and crashed clients. Retrofitting required making pagination opt-in; legacy callers continued hitting the unbounded path for years after the fix shipped.

Guidance: every list operation must be paginated before first release:

  1. All List* operations that return a collection MUST be paginated, with no exceptions. The only exemption is a naturally size-limited result like a top-N leaderboard.
  2. Only one list per operation may be paginated. If you need to paginate two independent collections, expose two operations.
  3. Paginated results SHOULD NOT return the same item more than once across pages (disjoint pages). If the sort order is not an immutable strict total ordering, provide a temporally static view or snapshot the result set at the time of the first request and page through the snapshot.
  4. Items deleted during pagination SHOULD NOT appear on later pages.
  5. Newly created items MAY appear on not-yet-seen pages, but MUST appear in sorted order if they do.

The canonical request/response shape (REST and gRPC should follow the same field naming like page_size in, next_page_token out):

message ListOrdersRequest {
  // Optional upper bound — service may return fewer. Default is service-defined.
  // Client MUST NOT assume a full page means there are no more results.
  int32 page_size = 1 [(validate.rules).int32 = {gte: 0, lte: 1000}];

  // Opaque token from previous response. Absent on first call.
  string page_token = 2;

  // Filter parameters — MUST be identical on every page of the same query.
  // Service MUST reject a request where filters change mid-pagination.
  OrderFilter filter = 3;
}

message ListOrdersResponse {
  repeated OrderSummary orders = 1;

  // Absent when there are no more pages. Clients MUST stop when this is absent.
  // Never an empty string — absent means done, empty string is ambiguous.
  string next_page_token = 2;

  // Optional approximate total — document clearly that this is an estimate.
  // Do NOT guarantee an exact count; that requires a full scan on every call.
  int32 approximate_total = 3;
}

page_size is an upper bound, not a target: the service MUST return a next_page_token and stop early when its own threshold is exceeded. Attempting to fill a page to meet page_size for a highly selective filter on a large dataset creates an unbounded operation.

Changing page_size between pages is allowed: it does not change the result set, only how it is partitioned. Changing filter parameters is not allowed and must be rejected.


3.3 Pagination Token Anti-Patterns

Every one of the following mistakes has been made in production by major APIs. Each creates a permanent contract liability.

  • Readable token (leaks implementation): When you restructure your database, the token format is a public contract you cannot change. Clients construct tokens manually to jump to arbitrary offsets, bypassing your access controls. Making backwards-compatible changes to a plain-text token format is nearly impossible.
// Decoded token — client immediately knows your DB cursor format
{ "offset": 500, "shard": "us-east-1a", "table": "orders_v2" }
  • Token derived by client (S3 ListObjects mistake): S3’s original ListObjects required callers to derive the next token themselves: check IsTruncated, use NextMarker if present, otherwise use the Key of the last Contents entry. Every S3 client library had to implement this multi-step derivation. When S3 needed to change the pagination algorithm, all that client logic became incorrect. ListObjectsV2 was the clean-break solution: an explicit opaque ContinuationToken issued by the server.
  • Token that never expires: A non-expiring token makes schema migrations impossible. If your pagination token format encodes version 1 of your database schema and you ship version 2, you must maintain a decoder for every token ever issued indefinitely. A 24-hour expiry gives you a bounded window after which all outstanding tokens are on the current format.
  • Token usable across users: A token generated for user A contains enough context to enumerate user B’s resources if the user check is missing. This is a data isolation vulnerability, not just a correctness bug.
  • Token that influences AuthZ: The service must not evaluate permissions differently based on whether a pagination token is present or what it contains. Authorization must be re-evaluated on every page request using the caller’s current credentials, not credentials cached inside the token.
// What the service stores inside the encrypted token — never visible to callers
message PaginationTokenPayload {
  string account_id = 1;      // bound to caller's account
  int32 version = 2;           // token format version for forward compatibility
  string cursor = 3;           // internal cursor — DB row ID, sort key, etc.
  google.protobuf.Timestamp issued_at = 4;   // for expiry enforcement
  bytes filter_hash = 5;       // hash of filter params — reject if changed
}
// This struct is AES-GCM encrypted before being base64-encoded and returned as next_page_token.
// The client sees only an opaque string. The server decrypts and validates on every use.

Client usage pattern: SDK helpers should abstract this loop, but every client must implement it correctly when calling raw:

page_token = None
while True:
    response = client.list_orders(
        filter={"status": "PENDING"},
        page_size=100,
        page_token=page_token   # None on first call
    )
    process(response.orders)
    page_token = response.next_page_token
    if not page_token:
        break   # no token = no more pages; do NOT check len(orders) < page_size

# NOTE: len(orders) < page_size does NOT mean last page.
# The service may return fewer results for internal reasons (execution time limit,
# scan limit, etc.) and still issue a next_page_token. Always check the token.

The single most common client-side pagination bug is treating a short page as a signal that pagination is complete.


3.4 Filtering Anti-Patterns

Filtering is where inconsistency compounds fastest as every team makes slightly different choices about semantics, validation, and edge cases, and callers cannot predict the behavior without reading the documentation for every endpoint individually.

The standard AND/OR semantic: all filtering implementations should follow EC2’s model: multiple values for a single attribute are OR’d; multiple attributes are AND’d. The order of attributes must not affect the result (commutative).

# EC2 canonical example
aws ec2 describe-instances \
  --filter Name=instance-state-name,Values=running \
  --filter Name=image-id,Values=ami-12345 \
  --filter Name=tag-value,Values=prod,test

# Equivalent SQL semantics:
# (instance-state-name = 'running')
# AND (image-id = 'ami-12345')
# AND (tag-value = 'prod' OR tag-value = 'test')

Swapping the order of the three filter arguments must return an identical result set. Clients must never need to order their filters to get correct behavior.
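The EC2 semantic reduces to a small predicate: values within one attribute are OR'd (set membership), attributes are AND'd, and commutativity falls out because `all(...)` over independent predicates does not depend on order. A sketch with illustrative instance data:

```python
# Sketch: the EC2 AND/OR filter model as a predicate.
def matches(resource: dict, filters: list) -> bool:
    # Each (attr, values) pair: resource value must be IN values (the OR).
    # all(...) requires every pair to match (the AND).
    return all(resource.get(attr) in values for attr, values in filters)

instances = [
    {"state": "running", "image": "ami-12345", "tag": "prod"},
    {"state": "running", "image": "ami-12345", "tag": "dev"},
    {"state": "stopped", "image": "ami-12345", "tag": "prod"},
]
filters = [("state", ["running"]),
           ("image", ["ami-12345"]),
           ("tag", ["prod", "test"])]

result = [i for i in instances if matches(i, filters)]
assert result == [instances[0]]   # only running + ami-12345 + (prod or test)

# Reordering the filters returns the identical result set.
assert [i for i in instances if matches(i, list(reversed(filters)))] == result
```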

Include/exclude filter variants for date, time, and status fields:

# Negation filter: exclude terminated instances from a different AZ
aws ec2 describe-instances \
  --filter Name=instance-state-name,Values=terminated,operator=exclude \
  --filter Name=availability-zone,Values=us-east-1a,operator=include

Timestamp fields MAY support not-before / not-after semantics. When supported, document the semantics exactly and validate that the provided value is a well-formed timestamp.

Filter structure in protobuf: use an enum for attribute names so the set of supported filters is machine-readable, and a validated pattern for values so wildcards and injection vectors are controlled:

message ListOrdersRequest {
  repeated Filter filters = 1 [(validate.rules).repeated.max_items = 10];
  int32 page_size = 2;
  string page_token = 3;
}

message Filter {
  FilterAttribute name = 1;    // enum — only supported attributes accepted
  repeated string values = 2   // OR'd together; max bounded
    [(validate.rules).repeated = {min_items: 1, max_items: 20}];
  FilterOperator operator = 3; // default INCLUDE; EXCLUDE for negation
}

enum FilterAttribute {
  FILTER_ATTRIBUTE_UNSPECIFIED = 0;
  FILTER_ATTRIBUTE_STATUS = 1;       // maps to Order.status
  FILTER_ATTRIBUTE_REGION = 2;       // maps to Order.region
  FILTER_ATTRIBUTE_CREATED_AFTER = 3;  // timestamp lower bound
  FILTER_ATTRIBUTE_CREATED_BEFORE = 4; // timestamp upper bound
  // Every value here must correspond to a field returned in OrderSummary.
  // Never add a filter attribute for an internal field not in the response.
}

enum FilterOperator {
  FILTER_OPERATOR_INCLUDE = 0;  // default — only matching resources returned
  FILTER_OPERATOR_EXCLUDE = 1;  // matching resources excluded from results
}

Filtering vs. specifying a list of IDs: these are different operations and must not be conflated. A filter is a predicate applied to the result set and it does not guarantee fetching a specific resource. Fetching a known set of resource IDs is a batch read (BatchGetOrders) and belongs in the batch operations standard, not in the filter parameter.

Flat parameters vs. structured filter list: two common shapes exist. Flat parameters (?status=PENDING&region=us-east) are simpler for simple cases and easier to cache with HTTP GET semantics. A structured filters list (as above) is more extensible and handles negation, wildcards, and complex predicates cleanly. Do not mix shapes across endpoints.


3.5 Chatty APIs and Network Latency Multiplication

This anti-pattern occurs when rendering a single page requires six sequential API calls. Each is 20ms, so the sequential total is 120ms of pure network time before rendering begins. Netflix’s move to microservices initially produced exactly this. Their solution was the BFF (Backend for Frontend) pattern: a purpose-built aggregation layer that parallelizes the six calls and returns one tailored response to the client.

Better approach: Design batch and composite read operations for primary use cases. Where callers need related resources together, provide projections. Parallelize what can be parallelized in your aggregation layer.


3.6 Synchronous APIs for Long-Running Operations

This is another pattern resulting from a poor understanding of API behavior, e.g., POST /reports/generate blocks for 45 seconds, or it returns 202 Accepted (or 200 OK) with no body, no job ID, no link to check status, no way to cancel, and no way to know when it is safe to retry. A related scenario is an API designed around a specific UI assumption, e.g., “the UI will only ever submit 100 IDs,” but exposed as a general API. When an automation script submits 10,000 IDs, the synchronous operation times out at the load balancer, the client retries, and two copies of the same job are now running. The API has no idempotency token, no job ID to check for an in-progress operation, and no way to cancel the duplicate. The missing async API primitives:

  1. No requestId in the 202 response: the caller has no handle to reference the job in subsequent calls, in logs, or in support tickets
  2. No status endpoint: the caller cannot poll for completion; the only signal is silence until a webhook fires
  3. No cancel operation: a misconfigured job consuming resources cannot be stopped without operator intervention
  4. No idempotency on submission: submitting the same job twice creates two jobs; there is no way to detect an in-progress duplicate
  5. No bounded input validation: the operation accepts an unbounded number of IDs because the UI never sends more than 100, but the API contract enforces no limit; automation sends 100,000 and the job runs for hours

Better approach is complete async job lifecycle:

// Submission: returns immediately with a Job handle
rpc StartExport(StartExportRequest) returns (Job) {
  option (google.api.http) = { post: "/v1/exports", body: "*" };
  // Response: HTTP 202 Accepted
}

// Status + result polling
rpc GetJob(GetJobRequest) returns (Job) {
  option (google.api.http) = { get: "/v1/jobs/{job_id}" };
}

// Cancellation — idempotent; safe to call multiple times
rpc CancelJob(CancelJobRequest) returns (Job) {
  option (google.api.http) = { post: "/v1/jobs/{job_id}:cancel", body: "*" };
}

message StartExportRequest {
  string client_token = 1;  // idempotency — same token returns existing job, not a new one
  repeated string record_ids = 2 [(validate.rules).repeated = {
    min_items: 1,
    max_items: 1000  // enforced at boundary — not a UI assumption baked into code
  }];
  ExportFormat format = 3;
}

message Job {
  string job_id = 1;              // stable handle for all subsequent calls
  string request_id = 2;         // trace ID for this submission specifically
  JobStatus status = 3;
  google.protobuf.Timestamp submitted_at = 4;
  google.protobuf.Timestamp completed_at = 5;  // absent until terminal state
  string result_url = 6;          // present only when status = SUCCEEDED
  JobError error = 7;             // present only when status = FAILED
  string self_link = 8;           // href to GET this job — no client URL construction needed
  string cancel_link = 9;         // href to cancel — clients should use these, not construct URLs
  int32 estimated_seconds = 10;   // hint for polling interval; not a guarantee
}

enum JobStatus {
  JOB_STATUS_UNSPECIFIED = 0;
  JOB_STATUS_QUEUED = 1;
  JOB_STATUS_RUNNING = 2;
  JOB_STATUS_SUCCEEDED = 3;
  JOB_STATUS_FAILED = 4;
  JOB_STATUS_CANCELLED = 5;
  JOB_STATUS_CANCELLING = 6;  // in-progress cancel — may still complete
}

The 202 Accepted response body must include:

  • job_id — the durable handle
  • self_link — the URL to poll (clients must not construct this)
  • cancel_link — the URL to cancel
  • estimated_seconds — polling hint
  • request_id — for logging and support correlation
HTTP 202 Accepted
Location: /v1/jobs/job-a3f9c2
{
  "job_id": "job-a3f9c2",
  "status": "QUEUED",
  "request_id": "req-7d2e1a",
  "self_link": "/v1/jobs/job-a3f9c2",
  "cancel_link": "/v1/jobs/job-a3f9c2:cancel",
  "estimated_seconds": 30
}

The Location header is standard HTTP for a 202 response; include it so HTTP clients that follow redirects and standard-library polling helpers work without custom code.

Idempotency on submission prevents duplicate jobs: if a client submits with client_token: "export-2024-q1" and receives a timeout, the retry with the same token returns the existing Job.

Bounded input enforced at the boundary: the max_items: 1000 constraint in StartExportRequest is enforced by protoc-gen-validate at the gRPC boundary instead of application code. If the constraint needs to change, it changes in the proto spec and the enforcement changes with it.
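On the client side, the lifecycle above reduces to a small polling loop. A sketch of a helper that follows self_link rather than constructing URLs, honors estimated_seconds as a polling hint, and stops on any terminal state (the get_job stand-in and field names mirror the Job message above; the helper itself is hypothetical):

```python
import time

TERMINAL = {"SUCCEEDED", "FAILED", "CANCELLED"}

def wait_for_job(get_job, job, poll_interval_s=1.0, timeout_s=300.0):
    """Poll until the job reaches a terminal state or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while job["status"] not in TERMINAL:
        if time.monotonic() > deadline:
            raise TimeoutError(f"job {job['job_id']} not terminal")
        # estimated_seconds is a hint, never a guarantee.
        time.sleep(min(poll_interval_s,
                       job.get("estimated_seconds", poll_interval_s)))
        job = get_job(job["self_link"])   # follow the link; never build URLs
    return job

# Stub service: the job completes on the second poll.
states = iter([
    {"job_id": "j1", "status": "RUNNING", "self_link": "/v1/jobs/j1",
     "estimated_seconds": 0},
    {"job_id": "j1", "status": "SUCCEEDED", "self_link": "/v1/jobs/j1",
     "result_url": "/v1/exports/e1", "estimated_seconds": 0},
])

final = wait_for_job(lambda link: next(states),
                     {"job_id": "j1", "status": "QUEUED",
                      "self_link": "/v1/jobs/j1", "estimated_seconds": 0})
assert final["status"] == "SUCCEEDED"
assert final["result_url"] == "/v1/exports/e1"
```

A production helper would also add jitter and exponential backoff between polls, but the shape is the same.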


3.7 Batch Operations with Mixed Success/Error Lists

This occurs when a batch endpoint returns a single flat list where successes and failures are distinguished only by the presence of an error field; callers must iterate every entry to determine the outcome. For example, Firehose’s PutRecordBatch uses this anti-pattern with a single mixed list. The correct model (adopted in newer AWS APIs) separates success and failure lists:

message BatchCreateOrdersResponse {
  repeated Order created_orders = 1;
  repeated OrderError failed_orders = 2;
  // HTTP 200 even if all items failed — per-item failure is in failed_orders
  // HTTP 400 only if the batch itself is malformed
}
message OrderError {
  string client_request_id = 1;  // correlates to request entry
  string error_code = 2;
  string message = 3;
}
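With separated lists, consuming the response is a simple partition rather than per-entry inspection. A minimal Python sketch (the dict shapes and error codes are illustrative, not a specific API's):

```python
def summarize_batch(response: dict) -> dict:
    """Partition a separated batch response without per-entry inspection."""
    created = {o["order_id"] for o in response.get("created_orders", [])}
    retryable, permanent = [], []
    for err in response.get("failed_orders", []):
        # Correlate failures back to request entries by client_request_id
        bucket = retryable if err["error_code"] == "THROTTLED" else permanent
        bucket.append(err["client_request_id"])
    return {"created": created, "retry": retryable, "failed": permanent}

resp = {
    "created_orders": [{"order_id": "ord-1"}],
    "failed_orders": [
        {"client_request_id": "req-2", "error_code": "THROTTLED", "message": "slow down"},
        {"client_request_id": "req-3", "error_code": "VALIDATION", "message": "bad quantity"},
    ],
}
summary = summarize_batch(resp)
```

The caller retries only the throttled entries; validation failures go back to the user.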

Section 4: Idempotency & Transaction Anti-Patterns


4.1 Duplicate Detection Masquerading as True Idempotency

I wrote about this previously in How Duplicate Detection Became the Dangerous Impostor of True Idempotency. The issue arises when a create endpoint checks for an existing resource with the same name and returns it if found, calling this “idempotency.”

The correct idempotency token flow:

Stripe’s idempotency key is the canonical implementation. Every POST accepts an Idempotency-Key header. Stripe stores the key and the exact response. Same key within 24 hours replays the original response without re-executing. Same key with a different body returns 422.
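The flow can be sketched server-side. This is a minimal in-memory illustration of the key-replay semantics, not Stripe's implementation; a real service would persist keys with a TTL (Stripe uses 24 hours):

```python
import hashlib
import json

_store = {}  # idempotency key -> (body_hash, saved_response)

def handle_post(idempotency_key, body):
    body_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    if idempotency_key in _store:
        saved_hash, saved_response = _store[idempotency_key]
        if saved_hash != body_hash:
            return 422, {"error": "idempotency key reused with a different body"}
        return 200, saved_response                   # replay, do not re-execute
    response = {"charge_id": "ch_" + body_hash[:8]}  # execute the operation once
    _store[idempotency_key] = (body_hash, response)
    return 201, response
```

The key point is that the stored response, not a fresh lookup by name, is what gets replayed, so the retry can never return a resource the original caller does not control.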

Failure mode of duplicate detection: A response is lost in transit. The client retries. Meanwhile, another actor deleted the resource and a third created a new one with the same name. Your “idempotent” endpoint returns the new resource which the original client neither created nor controls.


4.2 Missing Idempotency Tokens on Create Operations

This scenario may occur when POST /orders returns an order ID without clientToken. The client gets a timeout. Retry = potential duplicate. No retry = potential data loss. For example, early payments APIs had this problem. A double-charge scenario: customer clicks Pay, network times out, app retries, customer charged twice. Stripe, Adyen, and Braintree all mandate idempotency keys for payment operations.

message CreateOrderRequest {
  // SDK auto-generates when absent; callers may provide their own.
  // Recommended: 16-128 printable ASCII characters for uniqueness.
  optional string client_token = 1;
  string customer_id = 2;
  repeated OrderItem items = 3;
}

4.3 Transaction Boundary Violations

I wrote about this anti-pattern previously at Transaction Boundaries: The Foundation of Reliable Systems. This occurs when a single API call updates two separate resources with no atomicity guarantee. The first update succeeds; the service crashes before the second. Caller retries; first update applies twice.

Better approach: Document atomicity guarantees explicitly. For cross-service consistency, use the Saga pattern with compensating transactions.


4.4 Full Update via PATCH (Implicit Field Deletion)

This occurs when PATCH /orders/{id} replaces the entire resource. Fields not included are deleted. A mobile client updating the shipping address silently deletes the contact email. For example, GitHub’s current v3 API is explicit: PATCH applies partial updates, PUT applies full replacement — documented unambiguously for every endpoint.

message UpdateOrderRequest {
  string order_id = 1;
  Order order = 2;
  // Only fields in update_mask are modified.
  // paths = ["shipping_address"] -> only shipping_address is touched
  google.protobuf.FieldMask update_mask = 3;
}
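The field-mask semantics can be sketched with a flat dict (real FieldMask paths can be nested; this illustration handles top-level fields only):

```python
def apply_update_mask(stored: dict, incoming: dict, paths: list) -> dict:
    """Modify only the fields named in the mask; all others survive."""
    updated = dict(stored)
    for path in paths:
        updated[path] = incoming[path]
    return updated

stored = {"shipping_address": "old addr", "contact_email": "a@example.com"}
incoming = {"shipping_address": "new addr"}  # email absent from the request
result = apply_update_mask(stored, incoming, ["shipping_address"])
# contact_email is untouched rather than silently deleted
```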

4.5 Missing Optimistic Concurrency Control

This occurs when two clients GET the same order, both modify it, both PUT back. The last write silently overwrites the first. For example, Kubernetes uses server-side apply with field ownership tracking and returns 409 Conflict with the specific fields in conflict. The ETag / If-Match pattern is the REST equivalent.

GET /orders/123 -> { ..., "version": "v7" }
PATCH /orders/123 + If-Match: v7
# If order is now v8: HTTP 409 Conflict { "current_version": "v8" }
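A minimal server-side sketch of the If-Match check (the handler shape and version format are illustrative):

```python
def patch_order(store: dict, order_id: str, if_match: str, patch: dict):
    order = store[order_id]
    if order["version"] != if_match:
        return 409, {"current_version": order["version"]}  # stale write rejected
    order.update(patch)
    order["version"] = "v" + str(int(order["version"][1:]) + 1)
    return 200, order

store = {"123": {"version": "v7", "status": "OPEN"}}
code, body = patch_order(store, "123", "v7", {"status": "SHIPPED"})
stale_code, stale_body = patch_order(store, "123", "v7", {"status": "OPEN"})
```

The second writer gets the current version back in the 409 body, so it can re-read, re-apply its change, and retry instead of silently overwriting.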

4.6 Ignoring Concurrent Operation Safety

In this scenario, an API allows parallel create and delete on the same resource without concurrency safety, or a long-running create can be invoked a second time while the first is still in flight.

Better approach: Document concurrency semantics per operation. For long-running creates: check for an in-progress operation before starting a new one. Use idempotency tokens to prevent parallel retries from compounding.


Section 5: Error Handling Anti-Patterns


5.1 Opaque, Non-Actionable Errors

This anti-pattern occurs with poorly defined errors like: {"error": "Something went wrong"}. An HTML error page from a load balancer served as an API response. The same ValidationException returned for “field missing,” “field too long,” and “field contains invalid characters.”

Better approach: I wrote about better error handling previously at Building Robust Error Handling with gRPC and REST APIs. Seven standard exception types cover nearly all scenarios:

Exception                      HTTP  Retryable
ValidationException            400   No
ServiceQuotaExceededException  402   No (contact support)
AccessDeniedException          403   No
ResourceNotFoundException      404   No
ConflictException              409   No (needs resolution)
ThrottlingException            429   Yes (honor Retry-After)
InternalServerException        500   Yes (with backoff)

Include request_id in every error response for support correlation. Include retry_after_seconds in 429 and 500 responses.
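A minimal sketch of a single error envelope covering these requirements (the field names are illustrative, not a specific framework's API):

```python
RETRYABLE = {"ThrottlingException", "InternalServerException"}

def error_response(exc_type, status, message, request_id, retry_after_seconds=None):
    body = {
        "error_type": exc_type,
        "message": message,        # for humans reading logs, never parsed
        "request_id": request_id,  # support correlation
        "retryable": exc_type in RETRYABLE,
    }
    if retry_after_seconds is not None:
        body["retry_after_seconds"] = retry_after_seconds
    return {"status": status, "body": body}

resp = error_response("ThrottlingException", 429, "rate exceeded",
                      "req-7d2e1a", retry_after_seconds=5)
```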


5.2 Error Messages That Clients Must Parse

This occurs where an API error looks like "ValidationException: The field 'order.items[2].quantity' must be greater than 0." A client parses the string to extract the field path. Major cloud providers have been forced to freeze exact error message phrasing for years because clients parse them. Changing a comma placement breaks production integrations.

Better approach: As described in Building Robust Error Handling with gRPC and REST APIs, error message text is for humans reading logs. Any information a program acts on must be in structured fields, never embedded in the message string.


5.3 Leaking Internal Information in Errors

In this anti-pattern, error messages contain database hostnames, stack traces, SQL fragments, or internal ARNs, such as a 500 that says NullPointerException at com.internal.service.OrderProcessor:237.

Security principle: Return only information applicable to that request and requester. An unauthorized caller asking for a resource that does not exist receives 403 AccessDeniedException, not 404 ResourceNotFoundException: revealing non-existence is as informative as confirming existence.

Better approach: Catch and re-throw all dependency exceptions as service-defined error types. Include only a requestId for support lookup.


5.4 Exception Type Splitting and Proliferation

Splitting ConflictException into ResourceAlreadyExistsException, ConcurrentModificationException, and OptimisticLockException after release. Clients catching ConflictException silently miss the new subtypes.

The rule: Splitting an existing exception type is a breaking change. Adding fields to an existing exception type is always safe. Add new exception types only for genuinely new scenarios triggered by new optional parameters.


Section 6: Resilience & Operations Anti-Patterns


6.1 Missing Retry Safety in the SDK

This occurs when an SDK retries any 5xx response, including non-idempotent POSTs, and applies no jitter, causing synchronized retry storms.

Correct retry policy:

  • Retry only: idempotent operations (GET, PUT, DELETE) OR POST with clientToken
  • Retry on: 429 (honor Retry-After), 500 (if retryable: true), 503
  • Never retry: 400, 401, 403, 404, 409
  • Backoff: base 100ms, 2x multiplier, ±25% jitter, max 10s, max 3 attempts
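The backoff schedule above can be sketched directly (delays only; the transport call and retryability checks are out of scope):

```python
import random

def backoff_delays(base=0.1, multiplier=2.0, jitter=0.25, cap=10.0, attempts=3):
    """Exponential backoff with +/-25% jitter, a capped delay, bounded attempts."""
    delays = []
    for n in range(attempts):
        delay = min(base * (multiplier ** n), cap)
        delays.append(delay * random.uniform(1 - jitter, 1 + jitter))
    return delays

# Roughly [0.1, 0.2, 0.4] seconds, each randomized within +/-25%
delays = backoff_delays()
```

The jitter is what prevents the synchronized retry wave described in the next section: clients that failed together no longer retry together.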

6.2 Retry Storms and Missing Bulkheads

This occurs where all clients receive 429 simultaneously. All back off for exactly 2^n * 100ms. All retry at the same moment. The retry wave is as large as the original spike. I covered effective strategies previously in Robust Retry Strategies for Building Resilient Distributed Systems. For example, Netflix built Hystrix specifically to isolate downstream dependency thread pools. Slow responses in one pool cannot bleed into others. Circuit breakers open when error rates exceed thresholds, failing fast rather than queueing.


6.3 Hard Startup Dependencies

This occurs when a service cannot start unless all dependencies are reachable. During a dependency outage, no new instances can start so the deployment stalls and you cannot deploy fixes when you most need to.

Better approach: I wrote about this previously at Zero-Downtime Services with Lifecycle Management on Kubernetes and Istio, which shows safe startup and shutdown. Start despite all dependencies unavailable. Initialize connectivity lazily. Distinguish not yet ready (503 + Retry-After) from unhealthy (500). Degrade gracefully rather than refuse to start.


6.4 Missing Graceful Shutdown

This is another common anti-pattern, e.g., a pod receives SIGTERM and exits immediately, dropping in-flight requests. I have seen this cause data loss when locally saved data failed to synchronize with the remote server before the pod shut down.

Correct sequence: Stop accepting new connections -> complete in-flight requests (bounded timeout) -> flush async work -> exit. As covered in Zero-Downtime Services with Lifecycle Management, getting any stage wrong produces dropped requests during every deployment.
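A minimal sketch of that sequence (the server object and drain timeout are illustrative; a real service tracks its in-flight count and sets the event when it reaches zero):

```python
import signal
import threading

class Server:
    def __init__(self):
        self.accepting = True
        self.drained = threading.Event()   # set when in-flight count reaches zero

    def handle_sigterm(self, signum, frame):
        self.accepting = False             # 1. stop accepting new connections
        self.drained.wait(timeout=30)      # 2. complete in-flight work, bounded
        self.flush_async_work()            # 3. flush queued async work
        # 4. exit is left to the caller, e.g. sys.exit(0)

    def flush_async_work(self):
        pass  # e.g. sync locally saved data to the remote server

server = Server()
signal.signal(signal.SIGTERM, server.handle_sigterm)
```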


6.5 No Pre-Authentication Throttling

This occurs when throttling is applied only after authentication. An attacker sends millions of requests that exhaust authentication infrastructure before per-account quotas ever apply.

Better approach: Lightweight rate limiting before authentication (source IP / API key prefix) as first-line defense. Per-account throttling after auth. Both layers required. Configuration updatable without deployment.
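A minimal sketch of the pre-auth layer (a fixed-window counter keyed by source IP; production systems typically prefer token buckets and a shared store):

```python
from collections import defaultdict

class PreAuthLimiter:
    """Fixed-window counter keyed by source IP, applied before authentication."""

    def __init__(self, limit_per_window):
        self.limit = limit_per_window
        self.counts = defaultdict(int)

    def allow(self, source_ip, window):
        # window = epoch_seconds // window_size, computed by the caller
        key = (source_ip, window)
        self.counts[key] += 1
        return self.counts[key] <= self.limit

limiter = PreAuthLimiter(limit_per_window=2)
results = [limiter.allow("10.0.0.1", window=0) for _ in range(3)]
```

Because the check needs no credential lookup, it costs almost nothing per request, which is exactly why it can sit in front of the authentication infrastructure.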


6.6 Shallow Health Checks

I have seen companies touting 99.99% availability where their /health returns 200 as long as the HTTP server is running, regardless of whether the database connection pool is exhausted or the cache is unreachable.

Endpoint       Purpose                     Checked by
/health/live   Process alive               Kubernetes liveness probe
/health/ready  Can handle requests         Readiness probe, load balancer
/health/deep   Full end-to-end validation  Deployment pipeline gate

6.7 Insufficient Metrics, SLAs, and Alerting

I wrote From Code to Production: A Checklist for Reliable, Scalable, and Secure Deployments, which shows the metrics and alerting that must be configured for an API deployment. If you track only request count and a binary error rate, without latency percentiles or a defined SLA, diagnosing failures will be hard. For example, alerts fire at 100% error rate, and the entire service is down before anyone is notified.

Better approach: Instrument every operation with request rate, error rate (4xx vs 5xx), latency at P50/P95/P99/P999, and downstream dependency health. Set alert thresholds below your SLA, e.g. if P99 SLA is 500ms, alert at 400ms.


6.8 No “Big Red Button” and Missing Emergency Rollback

This occurs when there is no fast path to revert a bad deployment. Configuration changes require a full deployment to roll back. No tested runbook.

Better approach: Feature flags togglable without deployment (tested weekly). Sub-5-minute rollback pipeline. Pre-tested load shedding with documented decision thresholds. Runbooks practiced in drills, not just read.


6.9 Backup Communication Channels Not Tested

Incident response plans rely on Slack to coordinate a Slack outage. Runbooks stored in Confluence, down when cloud IAM is broken. For example, Google’s 2017 OAuth outage logged 350M users out of devices and services. Teams expected to coordinate via Google Hangouts, which was also down. Incident coordination was hampered by the incident. Recovery took 12 hours.


6.10 Phased Deployment Anti-Patterns and Missing Automation

This occurs when you deploy globally in a single wave. Rollback criteria are “wait and see.” Canary populations are too small. Rollback requires human decision-making at 3 AM. I wrote about Mitigate Production Risks with Phased Deployment, which shows how phased deployment can de-risk production releases. Automated phased deployment:

  1. Deploy 1-5% canary
  2. Run automated integration tests against canary
  3. Monitor SLA metrics for bake period (10 minutes)
  4. Auto-rollback if any threshold breaches without human intervention
  5. Promote to next fault boundary only on clean bake

Section 7: Security, Data Privacy & Lifecycle Anti-Patterns


7.1 Missing Boundary Validation: Specs That Don’t Enforce

In this case, an OpenAPI spec exists but is not enforced at runtime and is documentation only. A proto definition marks fields as optional but the service processes requests where required fields are absent and produces undefined behavior. Input validation is implemented inconsistently in business logic rather than at the API boundary.

Better approach: Enforce the spec at the boundary. For OpenAPI/REST: Use middleware that validates every request against the OpenAPI schema before it reaches business logic. Libraries like express-openapi-validator (Node.js), connexion (Python), or API Gateway request validation do this. Every field type, pattern, range, and required constraint in the spec is automatically enforced.

# openapi.yaml — enforced at runtime, not just documentation
components:
  schemas:
    CreateOrderRequest:
      type: object
      required: [customer_id, items]
      properties:
        client_token:
          type: string
          minLength: 16
          maxLength: 128
        customer_id:
          type: string
          pattern: '^cust-[a-z0-9]{8,}$'
        items:
          type: array
          minItems: 1
          maxItems: 100
          items:
            $ref: '#/components/schemas/OrderItem'

For gRPC/Protobuf: Use protoc-gen-validate (PGV), a protobuf plugin that generates validation code from annotations in your .proto files:

import "validate/validate.proto";

message CreateOrderRequest {
  // clientToken: optional but if present must be 16-128 printable ASCII chars
  optional string client_token = 1 [(validate.rules).string = {
    min_len: 16, max_len: 128
  }];

  // customer_id: required, must match pattern
  string customer_id = 2 [(validate.rules).string = {
    pattern: "^cust-[a-z0-9]{8,}$",
    min_len: 1
  }];

  // items: required, 1-100 items
  repeated OrderItem items = 3 [(validate.rules).repeated = {
    min_items: 1, max_items: 100
  }];
}

message OrderItem {
  string product_id = 1 [(validate.rules).string.min_len = 1];

  // quantity: must be positive
  int32 quantity = 2 [(validate.rules).int32.gt = 0];

  // price: must be non-negative
  double unit_price = 3 [(validate.rules).double.gte = 0.0];
}

This enforces validation at the boundary, before your business logic runs, using the same .proto file that is your source of truth. No duplicate validation code. No inconsistency between the spec and the enforcement.


7.2 PII Data Exposure in APIs

This anti-pattern exposes PII data like full credit card numbers, SSNs, or passport numbers returned in GET responses. Email addresses and phone numbers included in audit logs and error messages. User location data exposed in list endpoints without access controls. Responses cached at the CDN layer with no consideration of the PII they contain.

Better approach: Apply data minimization at the API layer and return only the fields a caller needs and is authorized to receive. I wrote Agentic AI for Automated PII Detection: Building Privacy Guardians with LangChain and Vertex AI to show how schema annotations can mark sensitive fields and how AI agents can detect violations:

import "google/api/field_behavior.proto";

message Customer {
  string customer_id = 1;
  string display_name = 2;

  // Sensitive: only returned to callers with PII_READ permission
  // Masked in logs: shown as "****@example.com"
  string email_address = 3 [
    (google.api.field_behavior) = OPTIONAL,
    // Custom option — your PII classification
    (pii.sensitivity) = HIGH
  ];

  // Never returned in list operations; only in GetCustomer with explicit consent
  string phone_number = 4 [(pii.sensitivity) = HIGH];

  // Tokenized before storage; never returned as plaintext
  string payment_method_token = 5;
}

Operational controls:

  • Never log full request/response bodies; use structured logging with explicit field allowlists
  • Apply response field filtering at the API gateway based on caller permissions
  • Scan API responses in CI/CD pipelines for PII patterns before deployment
  • Ensure pagination tokens do not contain PII
  • Cache keys must never contain PII; cached responses must never contain PII for a different caller
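The allowlist approach from the first control can be sketched in a few lines (the allowlist contents and mask format are illustrative choices):

```python
LOG_ALLOWLIST = {"customer_id", "request_id", "status"}

def mask_email(value):
    local, _, domain = value.partition("@")
    return "****@" + domain if domain else "****"

def loggable(record):
    """Keep only allowlisted fields; mask any email rather than logging it raw."""
    out = {k: v for k, v in record.items() if k in LOG_ALLOWLIST}
    if "email_address" in record:
        out["email_address"] = mask_email(record["email_address"])
    return out

entry = loggable({"customer_id": "cust-1", "email_address": "jo@example.com",
                  "ssn": "123-45-6789", "status": "OK"})
```

An allowlist fails safe: a newly added PII field is dropped from logs by default, whereas a denylist leaks it until someone remembers to add it.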

7.3 Missing Contract Testing

In this case, a service team ships an API. Client teams write integration tests against their own mock servers. The mock servers are written from the documentation, not from the actual service behavior. When the service changes, the mocks stay static. Clients discover the breaking change in production.

Consumer-driven contract testing reverses this: clients publish their expectations (the “contract” of what they call and what they expect back), and the service validates those contracts in its CI/CD pipeline. If the service changes in a way that breaks a client contract, the service’s build fails before the change is deployed.

I built an open-source framework specifically for this, api-mock-service, described in Contract Testing for REST APIs. The framework supports:

  • Recording real API traffic and generating mock contracts from it (no manual mock writing)
  • Replaying recorded responses in test environments
  • Validating that recorded behavior matches the current service
  • Contract assertions that run in CI/CD pipelines to catch regressions before deployment
  • Support for REST, gRPC, and asynchronous APIs

# Contract generated from real traffic — not hand-written
contract:
  name: create_order_success
  method: POST
  path: /v1/orders
  request:
    headers:
      Content-Type: application/json
    body:
      customer_id: "{{non_empty_string}}"
      items:
        - product_id: "{{non_empty_string}}"
          quantity: "{{positive_integer}}"
  response:
    status: 201
    body:
      order_id: "{{non_empty_string}}"
      status: PENDING
      created_at: "{{iso_timestamp}}"
  # This contract runs against the service in CI — if CreateOrder
  # changes its response shape, this test fails before deployment

Spec enforcement + contract testing = full boundary defense:

  • The OpenAPI or proto spec enforces what the service accepts
  • Contract tests verify what the service returns
  • Together they eliminate the “it works in mocks but breaks in production” class of failures

7.4 No API Versioning Strategy

There is no version identifier, or a single v1 with no plan for v2. Or major version bumps so frequent clients cannot keep up. For example, Twitter’s v1.0 deprecation gave clients weeks, not months, and broke thousands of integrations.

Better approach: Version from day one in the URL path (/v1/, /v2/). Run old versions in parallel until usage is zero. Communicate sunset timelines with 12+ months’ notice.


7.5 Poor or Missing Documentation

Documentation covers only the happy path. No failure modes, retry semantics, or idempotency semantics documented. Field descriptions say “the order ID” rather than valid values and behavior when absent.

Documentation is a contract: every field, every failure mode, every error code must be documented. Consumer-driven contract tests are a forcing function.


7.6 Insufficient Rate Limiting and Quota Management

In this scenario, no per-account rate limits exist. Rate limits fixed in code, not configurable without deployment. One client’s traffic starves all others. Throttling responses use 500 instead of 429 Too Many Requests with Retry-After.

GitHub’s rate limiting is a reference implementation. X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers in every response allow clients to implement proactive backoff. 429 with Retry-After when the limit is hit.
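Proactive client-side backoff from those headers can be sketched as follows (the header names are as GitHub documents them; the pause policy and reserve are illustrative choices):

```python
def seconds_to_pause(headers, now, reserve=1):
    """Pause before the limiter does: stop when remaining budget hits the reserve."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    if remaining > reserve:
        return 0.0                       # budget left, keep going
    return max(0.0, reset_at - now)      # wait until the window resets

now = 1_700_000_000.0
pause = seconds_to_pause({"X-RateLimit-Remaining": "0",
                          "X-RateLimit-Reset": str(now + 30)}, now)
```

A client that sleeps for this duration never receives a 429 in the first place, which is cheaper for both sides than reactive retry.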


7.7 Caching Without Security Consideration

Examples of this anti-pattern include a CDN caching responses keyed only on the URL, serving account A’s private data to account B, and a cache storing authorization decisions without accounting for permission revocation.

Better approach: I described caching best practices in When Caching is not a Silver Bullet. Cache keys must include all authorization context. Authorization decisions must have TTLs reflecting how quickly permission changes take effect. Cache poisoning must be in your threat model.


7.8 No API Lifecycle Management and Missing Deprecation Path

This occurs when there is no process for retiring old API versions. Deprecated endpoints have no documented migration path. Or endpoints removed with insufficient notice. For example, Twilio’s classic API deprecation was managed over 18 months with migration guides, compatibility layers, and direct client outreach.

Better approach: Collect per-endpoint, per-client usage metrics before announcing deprecation. Block new clients. Provide migration docs and tooling. 12+ months’ lead time. Monitor until zero usage confirmed.


Quick Reference: Pre-Launch Checklist

API Design Philosophy

  • [ ] Spec written first (OpenAPI or proto) before any implementation code
  • [ ] OpenAPI/proto schema enforced at runtime boundary (PGV, openapi-validator)
  • [ ] API surface is small and composable; no UI-specific endpoints in the core API
  • [ ] Resources organized in a consistent URI hierarchy under namespaces
  • [ ] No bag-of-params / execute pattern; separate operations for separate actions
  • [ ] Standard protocol chosen (REST, gRPC, WebSocket, SSE), no custom RPC
  • [ ] Encoding chosen based on use case (protobuf binary for internal high-throughput)
  • [ ] Streaming APIs use gRPC streaming or WebSocket, not polling or custom framing

Contract & Consistency

  • [ ] Consistent naming vocabulary (nouns, verbs, field names, timestamps)
  • [ ] Correct HTTP verbs with documented semantics
  • [ ] No breaking changes without version bump
  • [ ] Hyrum’s Law review: what observable behaviors exist not in the contract?
  • [ ] Strict input validation on every field, every operation

Pagination & Filtering

  • [ ] Pagination on all list operations before first client, not after
  • [ ] Opaque, versioned, expiring, account-scoped pagination tokens
  • [ ] Filter semantics documented (AND across attributes, OR within values)

Idempotency & Transactions

  • [ ] clientToken on all create operations
  • [ ] Token mismatch returns 409 with conflicting resource ID
  • [ ] Transaction boundaries documented
  • [ ] PATCH implements partial update (field mask)
  • [ ] ETag / version token for optimistic concurrency

Error Handling

  • [ ] Structured error format with machine-readable codes
  • [ ] No internal implementation detail in error messages
  • [ ] Correct HTTP status codes; seven standard exception types
  • [ ] 404 vs 403: resource existence hidden from unauthorized callers

Security & Privacy

  • [ ] PII tagged in schema; data minimization applied per-endpoint
  • [ ] No PII in logs, error messages, or pagination tokens
  • [ ] PII scanning in CI/CD pipeline before deployment
  • [ ] Cache keys include authorization context

Resilience & Operations

  • [ ] Retry logic limited to idempotent or token-protected operations
  • [ ] Exponential backoff with jitter; Retry-After honored
  • [ ] Service starts despite all dependencies unavailable
  • [ ] Graceful shutdown tested (SIGTERM -> drain -> exit)
  • [ ] Pre-auth throttling + per-account quota + 429 with Retry-After
  • [ ] Three-layer health checks: live / ready / deep
  • [ ] Latency SLAs defined; alerts below SLA threshold
  • [ ] Phased deployment with automatic metric-gated rollback
  • [ ] Big Red Button identified, documented, and drill-tested
  • [ ] Backup incident communication channel tested independently

Contract Testing & Lifecycle

  • [ ] Contract tests generated from real traffic, run in CI/CD
  • [ ] API version in URL path (v1, v2) from day one
  • [ ] Documentation covers failure modes, idempotency, retry semantics
  • [ ] Usage metrics collected per endpoint for lifecycle decisions
  • [ ] Deprecation policy documented; sunset timelines published

Closing Thoughts

The anti-patterns above are based on my decades of experience building and operating high-traffic APIs. They share a common thread: they were invisible at design time, or the team assumed fixing them later would be cheaper. An idempotency contract is cheapest to design correctly before the first client. A spec-first approach catches URI design problems before any client builds against the wrong shape. A contract test catches breaking changes before deployment.

The checklist above addresses these as a system because they compound. An unbounded response is worse with no pagination. A missing idempotency token is catastrophic with an aggressive retry policy. A leaky PII field is worse without boundary validation. Two practices matter more than any individual anti-pattern on this list:

  • Spec-first design: write the contract before writing the implementation. Review it with consumers before coding starts. Use it as the source of truth for both server stubs and client SDKs.
  • Contract testing: verify the contract continuously against the live service. Use recorded real traffic, not hand-written mocks. Run it in every CI/CD pipeline.

Further reading from this series:

March 22, 2026

Generative and Agentic AI Design Patterns

Filed under: Computing — admin @ 8:29 pm

Over the past year I’ve built production agentic systems across several domains and shared what I learned along the way: production-grade AI agents with MCP and A2A, a daily minutes assistant with RAG, MCP, and ReAct, rebuilding fintech infrastructure with ReAct and local models, automated PII detection with LangChain and Vertex AI, and API compatibility guardians with LangGraph. I learned a lot building those systems through trial and error. In this post, I will share a set of generative and agentic AI patterns drawn from reading Generative AI Design Patterns and Agentic Design Patterns. I have built hands-on Python examples of these patterns at github.com/bhatti/agentic-patterns, which runs agentic apps locally via Ollama with open-source models (Qwen, DeepSeek, Llama, Mistral). Each pattern in the repo includes a README, working code, real-world use cases, and best practices.


Quick Start

git clone https://github.com/bhatti/agentic-patterns
cd agentic-patterns
pip install -r requirements.txt
ollama pull llama3
cd patterns/logits-masking && python example.py

See SETUP.md for full setup instructions.


Table of Contents


Category 1: Content & Style Control

The first five patterns control and optimize content generation, style, and format:

Pattern 1: Logits Masking

Category: Content & Style Control
Use When: You need to enforce constraints during generation (e.g., valid JSON, banned words)

Problem

When generating structured outputs (like JSON, code, or formatted text), language models can produce invalid sequences that don’t conform to required style rules, schemas, or constraints.

Solution

Logits Masking intercepts the model’s token generation process to enforce constraints during sampling. Three key steps:

  1. Intercept Sampling — Modify logits before token selection
  2. Zero Out Invalid Sequences — Mask invalid tokens (set logits to -inf)
  3. Backtracking — Revert to checkpoint if invalid sequence detected

Use Cases

  • API response generation (ensure valid JSON)
  • Code generation (enforce style guidelines)
  • Content moderation (prevent banned words)
  • Structured data extraction (match specific formats)

Constraints: Requires access to model logits (not available in all APIs). State tracking can be complex for nested structures. Performance overhead from logits processing.

Tradeoffs:

  • ✓ Prevents invalid generation at source
  • ✓ More efficient than post-processing
  • ⚠ More complex than simple validation
  • ⚠ May limit model creativity

Code Snippet

from transformers import LogitsProcessor

class JSONLogitsProcessor(LogitsProcessor):
    """Intercept logits and mask invalid JSON tokens."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, input_ids, scores):
        # STEP 1: Intercept sampling
        current_text = self.tokenizer.decode(input_ids[0])

        # STEP 2: Zero out invalid sequences
        for token_id in range(scores.shape[-1]):
            if not self._is_valid_json_token(token_id, current_text):
                scores[0, token_id] = float('-inf')  # Mask invalid

        return scores

Full Example: patterns/logits-masking/example.py


Pattern 2: Grammar Constrained Generation

Category: Content & Style Control
Use When: You need outputs that conform to formal grammar specifications

Problem

Language models often produce text that doesn’t conform to required formats, schemas, or grammars. Unlike simple masking, grammar-constrained generation ensures outputs follow formal grammar specifications.

Solution

Grammar Constrained Generation uses formal grammar specifications to guide token generation. Three implementation approaches:

  1. Grammar-Constrained Logits Processor — Use EBNF grammar to create processor
  2. Standard Data Format — Leverage JSON/XML with existing validators
  3. User-Defined Schema — Use custom schemas (JSON Schema, Pydantic)

Use Cases

  • API configuration generation (OpenAPI specs)
  • Configuration files (YAML, TOML that must parse)
  • Database queries (SQL with guaranteed syntax)
  • Code generation (must compile/parse)

Constraints: Requires grammar definition or schema. Grammar parsing can be computationally expensive. Complex grammars may limit generation speed.

Tradeoffs:

  • ✓ Guarantees grammatical correctness
  • ✓ Works with existing schema languages
  • ⚠ More complex than simple masking
  • ⚠ May require grammar expertise

Code Snippet

# Option 1: Formal Grammar
grammar = """
root        ::= endpoint_config
endpoint_config ::= "{" ws endpoint_def ws "}"
endpoint_def    ::= '"endpoint"' ws ":" ws endpoint_obj
"""

# Option 2: JSON Schema
schema = {
    "type": "object",
    "required": ["endpoint"],
    "properties": {
        "endpoint": {
            "type": "object",
            "required": ["name", "method", "path"]
        }
    }
}

# Apply grammar constraints during generation
processor = GrammarConstrainedProcessor(grammar, tokenizer)
logits = processor(input_ids, logits)

Full Example: patterns/grammar/example.py


Pattern 3: Style Transfer

Category: Content & Style Control
Use When: You need to transform content from one style to another

Problem

Content often needs to be transformed from one style to another while preserving core information. Manual rewriting is time-consuming and inconsistent.

Solution

Style Transfer uses AI to transform content between styles. Two approaches:

  1. Few-Shot Learning — Use example pairs in prompt (no training)
  2. Model Fine-Tuning — Fine-tune model on style pairs

Use Cases

  • Professional communication (notes to emails)
  • Content adaptation (academic to blog posts)
  • Brand voice (maintain consistent tone)
  • Platform adaptation (different social media styles)

Constraints: Few-shot limited by context window. Fine-tuning requires training data. Style consistency can vary.

Tradeoffs:

  • ✓ Few-shot: Quick, no training needed
  • ✓ Fine-tuning: Better consistency
  • ⚠ Few-shot: May not capture nuances
  • ⚠ Fine-tuning: Requires data collection

Code Snippet

# Option 1: Few-Shot Learning
examples = [
    StyleExample(
        input_text="urgent: need meeting minutes by friday",
        output_text="Subject: Urgent: Meeting Minutes Needed\n\nDear [Recipient],\n\n..."
    )
]

transfer = FewShotStyleTransfer(examples)
result = transfer.transfer_style("quick update: deadline moved")

# Option 2: Fine-Tuning
training_data = [
    {"prompt": "Convert notes to email", "completion": "Professional email..."}
]
fine_tuned_model = fine_tune_model(base_model, training_data)

Full Example: patterns/style-transfer/example.py


Pattern 4: Reverse Neutralization

Category: Content & Style Control
Use When: You need to generate content in a specific personal style that zero-shot can’t capture

Problem

When you need content in a specific, personalized style, zero-shot prompting fails because the model doesn’t know your unique writing style.

Solution

Reverse Neutralization uses a two-stage fine-tuning approach:

  1. Generate Neutral Form — Create content in neutral, standardized format
  2. Fine-Tune Style Converter — Train model to convert neutral → your style
  3. Inference — Use fine-tuned model for style conversion

Use Cases

  • Personal blog writing (technical content to your style)
  • Brand voice (consistent voice across content)
  • Documentation style (match organization’s style guide)
  • Communication templates (your personal email style)

Constraints: Requires fine-tuning. Needs training data (neutral → style pairs). Two-stage process.

Tradeoffs:

  • ✓ Learns your specific style
  • ✓ Consistent results
  • ✓ Captures personal nuances
  • ⚠ Requires data collection and training
  • ⚠ Less flexible (need retraining to change style)

Code Snippet

# Step 1: Generate neutral form
neutral_generator = NeutralGenerator()
neutral = neutral_generator.generate_neutral("API Authentication")

# Step 2-3: Create training dataset and fine-tune
pairs = [
    StylePair(neutral="Technical doc...", styled="Your blog style...")
]
fine_tuned_model = fine_tune_on_preferences(pairs)

# Step 4: Use fine-tuned model
converter = StyleConverter(fine_tuned_model)
styled = converter.convert_to_style(neutral)

Full Example: patterns/reverse-neutralization/example.py


Pattern 5: Content Optimization

Category: Content & Style Control
Use When: You need to optimize content for specific performance goals (e.g., open rates, conversions)

Problem

When creating content for specific purposes, you need to optimize for outcomes. Traditional A/B testing is limited — it’s manual, time-consuming, and doesn’t learn patterns.

Solution

Content Optimization uses preference-based fine-tuning (DPO) to train a model to generate content that wins in comparisons:

  1. Generate Pair — Create two variations from same prompt
  2. Compare — Test and pick winner based on metrics
  3. Create Dataset — Collect preference pairs (prompt, chosen, rejected)
  4. Fine-Tune with DPO — Train model on preferences
  5. Use Optimized Model — Generate better-performing content

Use Cases

  • Email marketing (optimize subject lines for open rates)
  • E-commerce (optimize product descriptions for conversions)
  • Social media (optimize posts for engagement)
  • Landing pages (optimize copy for sign-ups)

Constraints: Requires preference data collection. DPO fine-tuning is computationally intensive. Need clear optimization metrics.

Tradeoffs:

  • ✓ Learns from all comparisons
  • ✓ Scales to many variations
  • ✓ Model internalizes winning patterns
  • ⚠ Requires training data (100+ pairs)
  • ⚠ More complex than A/B testing

Code Snippet

# Step 1: Generate pair
generator = ContentGenerator()
var_a, var_b = generator.generate_pair("New product launch")

# Step 2: Compare and pick winner
comparator = ContentComparator(optimization_goal="open_rate")
pair = comparator.compare(ContentPair(prompt, var_a, var_b))

# Step 3-4: Create dataset and fine-tune
preferences = [PreferenceExample(prompt, chosen, rejected)]
dpo_trainer = PreferenceTuner()
optimized_model = dpo_trainer.fine_tune(preferences)

# Step 5: Use optimized model
optimized_generator = OptimizedContentGenerator(optimized_model)
result = optimized_generator.generate_optimized("Newsletter")
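The preference pairs in step 3 are conventionally stored as (prompt, chosen, rejected) records. A minimal sketch of turning A/B metric results into that shape (field names follow the common DPO convention, not a specific library's API):

```python
from typing import Dict, List, Tuple

def build_preference_dataset(
    results: List[Tuple[str, str, str, float, float]]
) -> List[Dict[str, str]]:
    """Each tuple: (prompt, variant_a, variant_b, metric_a, metric_b).
    The variant with the higher metric becomes 'chosen'."""
    dataset = []
    for prompt, var_a, var_b, metric_a, metric_b in results:
        chosen, rejected = (var_a, var_b) if metric_a >= metric_b else (var_b, var_a)
        dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset
```

Records in this shape feed directly into DPO-style trainers.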

Full Example: patterns/content-optimization/example.py


Category 2: Adding Knowledge / RAG Stack

Patterns 6–12 augment LLMs with external knowledge sources for accessing up-to-date information, private data, and knowledge beyond the model’s training cutoff.


Pattern 6: Basic RAG (Retrieval-Augmented Generation)

Category: Adding Knowledge
Use When: You need to augment LLM responses with external knowledge sources

Problem

LLMs have three key knowledge limitations:

  • Static Knowledge Cutoff — Trained on data up to a specific date
  • Model Capacity Limits — Can’t store all knowledge in parameters
  • Lack of Private Data Access — No access to internal documents or databases

Solution

Basic RAG uses trusted knowledge sources when generating LLM responses. Two pipelines:

Indexing Pipeline (preparatory):

  • Load documents → Chunk into manageable pieces → Store in searchable index

Retrieval-Generation Pipeline (runtime):

  • Retrieve relevant chunks for query → Ground prompt with retrieved context → Generate response using LLM

Use Cases

  • Product documentation (answer questions about features/APIs)
  • Company knowledge base (query internal wikis/policies)
  • Customer support (accurate answers from support docs)
  • Research assistance (search through papers/documents)
  • Legal/compliance (query regulations/guides)

Tradeoffs:

  • ✓ Access to up-to-date and private knowledge
  • ✓ Can handle large knowledge bases
  • ✓ Transparent (can cite sources)
  • ⚠ Requires indexing infrastructure
  • ⚠ Retrieval quality affects response quality

Code Snippet

# INDEXING PIPELINE
loader = DocumentLoader()
documents = loader.load_documents("product_docs")

splitter = TextSplitter(chunk_size=500, chunk_overlap=50)
chunks = []
for doc in documents:
    chunks.extend(splitter.split_document(doc))

index = Index()
index.add_chunks(chunks)

# RETRIEVAL-GENERATION PIPELINE
retriever = Retriever(index, top_k=3)
generator = RAGGenerator(retriever)

result = generator.generate("How do I authenticate with the API?")
# Returns answer with source citations

Full Example: patterns/basic-rag/example.py


Pattern 7: Semantic Indexing

Category: Adding Knowledge
Use When: You need semantic understanding beyond keywords, or have complex content (images, tables, code)

Problem

Traditional keyword-based indexing has limitations:

  • Semantic Understanding — Misses meaning (“car” and “automobile” are different keywords)
  • Complex Content — Struggles with images, tables, code blocks, structured data
  • Context Loss — Fixed-size chunking breaks up related content
  • Multimedia — Can’t effectively index images, videos, or other media

Solution

Semantic Indexing uses embeddings (vector representations) to capture meaning:

  1. Embeddings — Encode text/images into fixed vector representations for semantic meaning
  2. Semantic Chunking — Divide text into meaningful segments based on semantic content
  3. Image/Video Handling — Use OCR or vision models for embedding generation
  4. Table Handling — Organize and extract key information from structured data
  5. Contextual Retrieval — Preserve context with hierarchical chunking
  6. Hierarchical Chunking — Multi-level chunking (document → section → paragraph)

Use Cases

  • Technical documentation (code examples, API docs, tutorials)
  • Research papers (find by concept, not keywords)
  • Product catalogs (search by features, not names)
  • Multimedia content (images, videos with descriptions)

Code Snippet

# CONCEPT 1: EMBEDDINGS
from dataclasses import dataclass
from typing import List, Optional
from sentence_transformers import SentenceTransformer
import math
import re

class EmbeddingGenerator:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def generate_embedding(self, text: str) -> List[float]:
        return self.model.encode(text).tolist()

    def cosine_similarity(self, vec1: List[float], vec2: List[float]) -> float:
        dot_product = sum(a * b for a, b in zip(vec1, vec2))
        magnitude1 = math.sqrt(sum(a * a for a in vec1))
        magnitude2 = math.sqrt(sum(a * a for a in vec2))
        return dot_product / (magnitude1 * magnitude2) if magnitude1 * magnitude2 > 0 else 0.0

# CONCEPT 2: SEMANTIC CHUNKING
@dataclass
class SemanticChunk:
    id: str
    text: str
    embedding: Optional[List[float]] = None
    chunk_type: str = "text"  # text, code, table, image
    parent_id: Optional[str] = None
    children_ids: Optional[List[str]] = None

class SemanticChunker:
    def chunk_by_structure(self, content: str) -> List[SemanticChunk]:
        """Chunk respecting document structure (headers, sections, paragraphs)."""
        chunks = []
        sections = re.split(r'\n(#{2,3}\s+.+?)\n', content)
        current_section = None
        chunk_index = 0
        for i, part in enumerate(sections):
            if part.strip().startswith('#'):
                if current_section:
                    chunks.append(SemanticChunk(id=f"chunk-{chunk_index}", text=current_section))
                    chunk_index += 1
                current_section = part + "\n"
            else:
                current_section = (current_section or "") + part
        if current_section:
            chunks.append(SemanticChunk(id=f"chunk-{chunk_index}", text=current_section))
        return chunks

# CONCEPTS 5 & 6: HIERARCHICAL CHUNKING & CONTEXTUAL RETRIEVAL
class ContextualRetriever:
    def retrieve_with_context(self, query: str, top_k: int = 3,
                              include_context: bool = True) -> List[SemanticChunk]:
        query_embedding = self.embedding_generator.generate_embedding(query)
        scored_chunks = []
        for chunk in self.chunks.values():
            if chunk.embedding:
                similarity = self.embedding_generator.cosine_similarity(
                    query_embedding, chunk.embedding
                )
                scored_chunks.append((similarity, chunk))
        scored_chunks.sort(key=lambda x: x[0], reverse=True)
        top_chunks = [chunk for _, chunk in scored_chunks[:top_k]]
        if include_context:
            contextual_chunks = []
            for chunk in top_chunks:
                contextual_chunks.append(chunk)
                # Add parent for context
                if chunk.parent_id and chunk.parent_id in self.chunks:
                    parent = self.chunks[chunk.parent_id]
                    if parent not in contextual_chunks:
                        contextual_chunks.append(parent)
                # Add children for detail
                for child_id in (chunk.children_ids or []):
                    if child_id in self.chunks:
                        child = self.chunks[child_id]
                        if child not in contextual_chunks:
                            contextual_chunks.append(child)
            return contextual_chunks
        return top_chunks

Full Example: patterns/semantic-indexing/example.py


Pattern 8: Indexing at Scale

Category: Adding Knowledge
Use When: Your RAG system needs to handle large-scale knowledge bases with evolving, time-sensitive information

Problem

RAG systems in production face critical challenges as knowledge bases grow:

  • Data Freshness — Recent findings obsolete old guidelines
  • Contradictory Content — Multiple versions of information cause confusion
  • Outdated Content — Old information remains in index, leading to incorrect answers

Solution

Indexing at Scale uses metadata and temporal awareness:

  1. Document Metadata — Use timestamps, version numbers, source information
  2. Temporal Tagging — Tag chunks with creation/update dates, expiration dates
  3. Contradiction Detection — Identify and prioritize newer over older contradictory content
  4. Outdated Content Management — Automatically deprecate or flag outdated information

Code Snippet

@dataclass
class TemporalMetadata:
    created_at: datetime
    updated_at: datetime
    expires_at: Optional[datetime] = None
    version: str = "1.0"
    source: str = ""
    authority: str = "medium"  # high, medium, low

class ContradictionDetector:
    def _resolve_contradiction(self, chunk_a, chunk_b):
        # Prefer newer date
        if chunk_a.metadata.updated_at > chunk_b.metadata.updated_at:
            return "chunk_a"
        # If same date, prefer higher authority
        if chunk_a.metadata.authority > chunk_b.metadata.authority:
            return "chunk_a"
        return "chunk_b"

# KNOWLEDGE BASE WITH TEMPORAL AWARENESS
kb = HealthcareGuidelinesKB()
kb.add_guideline(
    content="CDC recommends masks required in public",
    source="CDC",
    date=datetime(2021, 7, 15),
    authority="high"
)

result = kb.query("Should I wear a mask?", prefer_recent=True)
# Returns most recent guidelines, flags contradictions
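The prefer_recent behavior can be sketched as a sort key that ranks by date first and by mapped authority second. The dict shape below is an assumption for illustration:

```python
from datetime import datetime
from typing import Dict, List

AUTHORITY_RANK = {"high": 2, "medium": 1, "low": 0}

def rank_guidelines(guidelines: List[Dict]) -> List[Dict]:
    """Newest first; ties broken by authority. Mapping authority strings
    to ranks avoids the alphabetical trap where 'low' > 'high'."""
    return sorted(
        guidelines,
        key=lambda g: (g["date"], AUTHORITY_RANK[g["authority"]]),
        reverse=True,
    )
```

A real knowledge base would also drop chunks whose expires_at has passed before ranking.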

Full Example: patterns/indexing-at-scale/example.py


Pattern 9: Index-Aware Retrieval

Category: Adding Knowledge
Use When: Basic RAG fails due to vocabulary mismatches, fine details, or holistic answers requiring multiple concepts

Problem

Users ask questions in natural language (“How do I log in?”), but your API documentation uses technical terminology (“OAuth 2.0 authentication”, “access token”). Basic RAG fails because “log in” ≠ “authentication” ≠ “OAuth 2.0”.

Solution

Index-Aware Retrieval uses four advanced retrieval techniques:

  1. Hypothetical Document Embedding (HyDE) — Generate hypothetical answer first, then match chunks to that answer
  2. Query Expansion — Translate user terms to technical terms used in chunks
  3. Hybrid Search — Combine keyword (BM25) and semantic (embedding) search with weighted average
  4. GraphRAG — Store documents in graph database, retrieve related chunks after finding initial match

Code Snippet

# TECHNIQUE 1: HYPOTHETICAL DOCUMENT EMBEDDING (HyDE)
class HyDEGenerator:
    def retrieve_with_hyde(self, query: str, chunks: List[DocumentChunk], top_k: int = 3):
        # Step 1: Generate hypothetical answer
        hypothetical_answer = self.generate_hypothetical_answer(query)
        # "To authenticate, use OAuth 2.0 access token..."

        # Step 2: Embed hypothetical answer (not original query)
        hyde_embedding = embedding_generator.generate_embedding(hypothetical_answer)

        # Step 3: Find chunks similar to hypothetical answer
        scored_chunks = []
        for chunk in chunks:
            similarity = cosine_similarity(hyde_embedding, chunk.embedding)
            scored_chunks.append((chunk, similarity))

        return sorted(scored_chunks, key=lambda x: x[1], reverse=True)[:top_k]

# TECHNIQUE 2: QUERY EXPANSION
class QueryExpander:
    def expand_query(self, query: str) -> str:
        term_translations = {
            "log in": ["authentication", "oauth", "access token"],
            "error": ["error code", "status code", "exception"]
        }
        expanded_terms = [query]
        for user_term, tech_terms in term_translations.items():
            if user_term in query.lower():
                expanded_terms.extend(tech_terms)
        return " ".join(expanded_terms)

# TECHNIQUE 3: HYBRID SEARCH (BM25 + Semantic)
class HybridRetriever:
    def retrieve(self, query: str, top_k: int = 5):
        bm25_score = bm25_scorer.score(query, chunk)
        semantic_score = cosine_similarity(query_embedding, chunk.embedding)
        # alpha = 0.4 means 40% BM25, 60% semantic
        hybrid_score = 0.4 * bm25_score + 0.6 * semantic_score
        return sorted_chunks_by_score[:top_k]

# TECHNIQUE 4: GRAPHRAG
class GraphRAG:
    def retrieve_related(self, initial_chunk_id: str, depth: int = 1):
        related_ids = list(graph[initial_chunk_id])
        for _ in range(depth - 1):
            # Flatten the neighbor lists before extending, so ids stay flat
            next_level = [nid for rid in related_ids for nid in graph[rid]]
            related_ids.extend(next_level)
        return [chunks[cid] for cid in related_ids]
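A note on technique 3: BM25 and cosine scores live on different scales, so the weighted average is usually computed after normalizing each score list. A runnable sketch (min-max scaling is one common choice, not the only one):

```python
from typing import List

def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant lists collapse to 0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(bm25_scores: List[float], semantic_scores: List[float],
                alpha: float = 0.4, top_k: int = 5) -> List[int]:
    """Return indices of the top_k chunks by alpha*BM25 + (1-alpha)*semantic."""
    b, s = min_max(bm25_scores), min_max(semantic_scores)
    combined = [alpha * x + (1 - alpha) * y for x, y in zip(b, s)]
    return sorted(range(len(combined)), key=lambda i: -combined[i])[:top_k]
```

Without normalization, raw BM25 scores (often in the tens) would swamp cosine similarities (bounded by 1) regardless of alpha.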

Full Example: patterns/index-aware-retrieval/example.py


Pattern 10: Node Postprocessing

Category: Adding Knowledge
Use When: Retrieved chunks have issues like ambiguous entities, conflicting content, obsolete information, or are too verbose

Problem

Your RAG system retrieves legal document chunks with issues: ambiguous entities (“Apple” could be company or fruit), conflicting interpretations of the same law, obsolete regulations superseded by new ones, and verbose chunks with only small relevant sections.

Solution

Node Postprocessing improves retrieved chunks through a pipeline:

  1. Reranking — Use more accurate models (like BGE) to rerank chunks
  2. Hybrid Search — Combine BM25 and semantic retrieval
  3. Query Expansion and Decomposition — Expand queries and break into sub-queries
  4. Filtering — Remove obsolete, conflicting, or irrelevant chunks
  5. Contextual Compression — Extract only relevant parts from verbose chunks
  6. Disambiguation — Resolve ambiguous entities and clarify context

Code Snippet

# TECHNIQUE 1: RERANKING (BGE-style Cross-Encoder)
# In production: from sentence_transformers import CrossEncoder
# model = CrossEncoder('BAAI/bge-reranker-base')

# TECHNIQUE 5: CONTEXTUAL COMPRESSION
class ContextualCompressor:
    def compress(self, chunk: DocumentChunk, query: str, max_length: int = 200):
        query_words = set(query.lower().split())
        sentences = chunk.content.split('.')
        relevant_sentences = [
            s for s in sentences
            if query_words & set(s.lower().split())
        ]
        compressed_content = '. '.join(relevant_sentences[:3]) + '.'
        return DocumentChunk(id=chunk.id + "_compressed", content=compressed_content[:max_length])

# TECHNIQUE 6: DISAMBIGUATION
class Disambiguator:
    def disambiguate(self, chunks: List[DocumentChunk], query: str):
        entity_contexts = {
            "apple": {
                "company": ["technology", "iphone", "corporate"],
                "fruit": ["nutrition", "eating", "food"]
            }
        }
        query_words = set(query.lower().split())
        for chunk in chunks:
            for entity, contexts in entity_contexts.items():
                if entity in chunk.content.lower():
                    entity_type = determine_from_context(entity, query_words, chunk.content)
                    if entity_type:
                        chunk.entities.append(f"{entity}:{entity_type}")
        return chunks

# COMPLETE POSTPROCESSING PIPELINE
def query_with_postprocessing(question: str):
    expanded = query_processor.expand_query(question)
    candidates = hybrid_retriever.retrieve(expanded, top_k=10)
    filtered = chunk_filter.filter_obsolete([c for c, _ in candidates])
    filtered = chunk_filter.filter_by_relevance(filtered, threshold=0.3)
    reranked = reranker.rerank(question, filtered, top_k=5)
    disambiguated = disambiguator.disambiguate([c for c, _ in reranked], question)
    compressed = [compressor.compress(c, question) for c in disambiguated]
    return compressed

Full Example: patterns/node-postprocessing/example.py


Pattern 11: Trustworthy Generation

Category: Adding Knowledge
Use When: RAG systems need to build user trust by preventing hallucination, providing citations, and detecting out-of-domain queries

Problem

Users lose trust because the system answers questions outside its knowledge domain, answers lack citations, and it provides confident answers when retrieval actually failed.

Solution

Trustworthy Generation builds user trust through multiple mechanisms:

  1. Out-of-Domain Detection — Detect when knowledge base doesn’t contain relevant information
  2. Embedding Distance Checking — Measure similarity between query and retrieved chunks
  3. Citations — Provide source citations for all factual claims
  4. Self-RAG Workflow — 6-step self-reflective process to verify responses
  5. Guardrails — Prevent generation of unsafe or unreliable content

Code Snippet

# OUT-OF-DOMAIN DETECTION
class OutOfDomainDetector:
    def is_out_of_domain(self, query: str, chunks: List[DocumentChunk]) -> Tuple[bool, str]:
        if chunks:
            query_embedding = embedding_generator.generate_embedding(query)
            min_distance = min([
                1 - cosine_similarity(query_embedding, chunk.embedding)
                for chunk in chunks
            ])
            if min_distance > threshold:
                return True, "Query too far from knowledge base"
        if not has_domain_keywords(query):
            return True, "Query lacks domain-specific terminology"
        if not chunks:
            return True, "No relevant chunks found"
        return False, ""

# SELF-RAG WORKFLOW (6 Steps)
class SelfRAGProcessor:
    def process(self, query: str, retrieved_chunks: List[DocumentChunk]):
        # STEP 1: Generate initial response
        initial_response = generate_initial_response(query, retrieved_chunks)
        # STEP 2: Chunk the response
        response_chunks = chunk_response(initial_response)
        # STEP 3: Check whether chunk needs citation
        for chunk in response_chunks:
            chunk.needs_citation = needs_citation(chunk.text)
        # STEP 4: Lookup sources
        for chunk in response_chunks:
            if chunk.needs_citation:
                chunk.sources = lookup_sources(chunk.text, retrieved_chunks)
        # STEP 5: Incorporate citations
        final_response = incorporate_citations(response_chunks)
        # STEP 6: Add warnings
        warnings = generate_warnings(response_chunks)
        return {"response": final_response, "warnings": warnings}

# COMPLETE TRUSTWORTHY GENERATION PIPELINE
def query_with_trustworthiness(question: str):
    is_ood, reason = out_of_domain_detector.is_out_of_domain(question, chunks)
    if is_ood:
        return {"response": f"Cannot answer: {reason}", "out_of_domain": True}
    result = self_rag.process(question, retrieved_chunks)
    passed, reason = guardrails.check(question, result, retrieved_chunks)
    if not passed:
        result["response"] = f"Cannot provide reliable answer: {reason}"
    return result
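The embedding-distance check in OutOfDomainDetector can be made self-contained. The 0.5 threshold below is an illustrative assumption; in practice it is tuned on held-out in-domain and out-of-domain queries:

```python
import math
from typing import List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na * nb else 0.0

def is_out_of_domain(query_emb: List[float],
                     chunk_embs: List[List[float]],
                     threshold: float = 0.5) -> Tuple[bool, str]:
    """Flag a query whose nearest chunk is still far away in embedding space."""
    if not chunk_embs:
        return True, "No relevant chunks found"
    min_distance = min(1 - cosine_similarity(query_emb, e) for e in chunk_embs)
    if min_distance > threshold:
        return True, "Query too far from knowledge base"
    return False, ""
```

Refusing to answer on a True result is what keeps confident-sounding hallucinations out of the response path.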

Full Example: patterns/trustworthy-generation/example.py


Pattern 12: Deep Search

Category: Adding Knowledge
Use When: Complex information needs require iterative retrieval, multi-hop reasoning, or comprehensive research across multiple sources

Problem

Investment analysts need comprehensive research on companies/industries. Basic RAG retrieves a few chunks and provides incomplete answers. They need a system that iteratively explores multiple sources, identifies gaps, and follows up on missing information.

Solution

Deep Search uses an iterative loop that retrieves and thinks until a good enough answer is found or a time/cost budget is exhausted:

Code Snippet

class DeepSearchOrchestrator:
    def __init__(self, budget: Budget):
        self.retriever = MultiSourceRetriever()  # Web, APIs, knowledge bases
        self.reasoner = LLMReasoner()
        self.budget = budget  # Time/cost constraints

    def search(self, query: str, depth: int = 2) -> DeepSearchResult:
        root_section = self._create_section(query)
        sections = [root_section]
        sections_to_expand = [root_section]
        current_depth = 0

        while current_depth < depth:
            current_depth += 1
            exhausted, reason = self.budget.is_exhausted()
            if exhausted:
                break
            next_sections = []
            for section in sections_to_expand:
                gaps = self.reasoner.identify_gaps(query, section.answer, section.sources)
                follow_ups = self.reasoner.generate_follow_ups(query, gaps)
                for follow_up in follow_ups:
                    subsection = self._create_section(follow_up)
                    section.subsections.append(subsection)
                    sections.append(subsection)
                    next_sections.append(subsection)
            sections_to_expand = next_sections
            is_good_enough, quality = self.reasoner.assess_answer_quality(
                query, root_section.answer, sections
            )
            if is_good_enough:
                break

        final_answer = self.reasoner.final_synthesis(query, sections)
        return DeepSearchResult(query, final_answer, sections, self.all_sources)

@dataclass
class Budget:
    max_iterations: int = 5
    max_time_seconds: float = 60.0
    max_cost_dollars: float = 1.0
    # Running usage, updated by the orchestrator as it works
    iterations_used: int = 0
    time_used: float = 0.0
    cost_used: float = 0.0

    def is_exhausted(self) -> Tuple[bool, str]:
        if self.iterations_used >= self.max_iterations:
            return True, "max_iterations"
        if self.time_used >= self.max_time_seconds:
            return True, "max_time"
        if self.cost_used >= self.max_cost_dollars:
            return True, "max_cost"
        return False, ""

# USAGE
analyst = MarketResearchAnalyst()
result = analyst.research(
    query="What factors should I consider when evaluating TechCorp as an investment?",
    max_iterations=10,
    max_time_seconds=30.0
)

Full Example: patterns/deep-search/example.py


Category 3: LLM Reasoning

Patterns 13–16 address reasoning and task specialization: how to get step-by-step or multi-path reasoning from LLMs.


Pattern 13: Chain of Thought (CoT)

Category: LLM Reasoning
Use When: Problems require multistep reasoning, logical deduction, or an auditable reasoning trace

Problem

Foundational models suffer from critical limitations on math, logical deduction, and sequential reasoning:

  • Zero-shot often fails when the problem requires multistep reasoning
  • Black-box answers with no insight into how the conclusion was reached
  • Misinterpretation of rules

Solution

Chain of Thought (CoT) prompts request a step-by-step reasoning process before the final answer. Three variants:

  1. Zero-shot CoT — Append “Think step by step” (no examples)
  2. Few-shot CoT — Provide examples (question → step-by-step reasoning → answer). RAG gives fish; few-shot CoT shows how to fish.
  3. Auto CoT — Sample questions → generate reasoning for each with zero-shot CoT → use as few-shot examples for the actual query

Code Snippet

# VARIANT 1: ZERO-SHOT COT
ZERO_SHOT_COT_SUFFIX = "\n\nThink step by step. Show your reasoning and then state the final conclusion."

def zero_shot_cot(policy: str, case_description: str, question: str, llm=None) -> CoTResult:
    prompt = f"{policy}\n\nCase: {case_description}\n\nQuestion: {question}{ZERO_SHOT_COT_SUFFIX}"
    full_response = llm(prompt)
    return CoTResult(question=question, reasoning=..., conclusion=..., variant="zero_shot")

# VARIANT 2: FEW-SHOT COT — "show how to fish"
FEW_SHOT_EXAMPLES = """
Example 1:
Q: Customer purchased 10 days ago, unopened, has receipt. Eligible for full refund?
A: Step 1: Within 30 days? Yes. Step 2: Unopened? Yes. Step 3: Receipt? Yes.
   Conclusion: Yes, full refund.
"""
def few_shot_cot(policy: str, case_description: str, question: str, llm=None) -> CoTResult:
    prompt = f"{policy}\n\n{FEW_SHOT_EXAMPLES}\n\nNew question:\nQ: {question}\n\nCase: {case_description}\n\nA:"
    return ...

# VARIANT 3: AUTO COT — build few-shot automatically
def auto_cot(policy: str, case_description: str, question: str, num_demos: int = 2, llm=None) -> CoTResult:
    demos = []
    for sample_q in question_pool[:num_demos]:
        response = llm(f"{policy}\n\nQuestion: {sample_q}\n\nThink step by step.")
        demos.append(f"Q: {sample_q}\nA:\n{response}\n")
    prompt = f"{policy}\n\n" + "\n".join(demos) + f"\n\nNew question:\nQ: {question}\n\nCase: {case_description}\n\nA:"
    return ...

# REFUND ELIGIBILITY ADVISOR
advisor = RefundEligibilityAdvisor(policy=REFUND_POLICY)
result = advisor.check_eligibility(case, variant="few_shot")  # zero_shot | few_shot | auto_cot
# result.reasoning, result.conclusion

Full Example: patterns/chain-of-thought/example.py


Pattern 14: Tree of Thoughts (ToT)

Category: LLM Reasoning
Use When: Strategic tasks with multiple plausible paths; single linear CoT is insufficient

Problem

Many tasks that demand strategic thinking cannot be solved by a single multistep reasoning path:

  • Single-path limitation — CoT follows one sequence; if that path is wrong, the answer suffers
  • Branching decisions — Multiple plausible next steps
  • Need for exploration — Best solution often requires exploring several directions

Solution

Tree of Thoughts treats problem-solving as tree search with four components:

  1. Thought generation — From current state, generate N possible next steps
  2. Path evaluation — Score each partial solution (0–1) for promise
  3. Beam search (top K) — Keep only the top K states; prune the rest
  4. Summary generation — Produce a concise summary and answer from the best path

Code Snippet

class TreeOfThoughts:
    def generate_thoughts(self, state: str, step: int, problem: str) -> List[str]:
        """Generate N possible next thoughts from current state."""
        return thoughts

    def evaluate_state(self, state: str, problem: str) -> float:
        """Score path promise (0-1). Correctness, progress, potential."""
        return score

    def solve(self, problem: str) -> ToTResult:
        beam = [(0.5, initial_state, [], 0)]
        for step in range(1, self.max_steps + 1):
            candidates = []
            for score, state, path, _ in beam:
                thoughts = self.generate_thoughts(state, step, problem)
                for thought in thoughts:
                    new_state = state + "\nStep N: " + thought
                    new_score = self.evaluate_state(new_state, problem)
                    candidates.append((new_score, new_state, path + [thought], step))
            beam = sorted(candidates, key=lambda x: -x[0])[:self.beam_width]
        best_state, best_path = beam[0]
        summary = self.generate_summary(problem, best_state)
        return ToTResult(..., solution_summary=summary, reasoning_path=best_path)

# INCIDENT ROOT-CAUSE ANALYZER
analyzer = IncidentRootCauseAnalyzer()
result = analyzer.analyze("API latency spiked; DB, cache, dependencies in use.")
# result.solution_summary, result.reasoning_path

Full Example: patterns/tree-of-thoughts/example.py


Pattern 15: Adapter Tuning

Category: LLM Reasoning
Use When: You need a foundation model to perform a specialized task with a small dataset and want to keep base weights frozen while training only a small adapter (e.g., LoRA)

Problem

Incoming tickets must be routed to billing, technical, sales, or general. Prompt-only classification can be brittle. Adapter tuning trains a small task-specific head on a few hundred labeled tickets while keeping the foundation model frozen.

Solution

Adapter tuning (PEFT) has three key aspects:

  1. Teaches the foundation model a specialized task — Train on input-output pairs
  2. Foundation weights frozen; only a small adapter is updated — LoRA or adapter layers are trained
  3. Training dataset can be smaller — Often a few hundred to a few thousand high-quality pairs suffice

Code Snippet

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class TicketIntentRouter:
    def __init__(self):
        self._pipeline = Pipeline([
            ("foundation", TfidfVectorizer(max_features=2000)),  # frozen after fit
            ("adapter", LogisticRegression(max_iter=500)),       # only this is "trained"
        ])

    def train(self, examples: List[TicketExample]) -> None:
        texts = [ex.text for ex in examples]
        labels = [ex.intent for ex in examples]
        self._pipeline.fit(texts, labels)

    def predict(self, text: str) -> AdapterTuningResult:
        pred = self._pipeline.predict([text])[0]
        probs = self._pipeline.predict_proba([text])[0]
        return AdapterTuningResult(intent=pred, confidence=float(probs.max()))

router = TicketIntentRouter()
router.train(train_examples)  # 200–2000 (text, intent) pairs
result = router.predict("I was charged twice, please refund.")
# result.intent -> "billing"

Full Example: patterns/adapter-tuning/example.py


Pattern 16: Evol-Instruct

Category: LLM Reasoning
Use When: You need to teach a pretrained model new, complex tasks from private data by evolving simple instructions into harder ones, generating answers, and instruction tuning (SFT/LoRA)

Problem

The company wants a model that answers complex policy questions from internal docs under data privacy. Manually creating thousands of hard (question, answer) pairs is expensive.

Solution

Evol-Instruct in four steps:

  1. Evolve instructions — From seed questions, create harder variants: deeper (constraints, hypotheticals), more concrete (“list 3 reasons”), multi-step (combine two questions)
  2. Generate answers — For each instruction, produce a high-quality answer (LLM with access to your private context)
  3. Evaluate and filter — Score each (instruction, answer) 1–5; keep only examples above a threshold
  4. Instruction tuning — SFT on an open-weight model (Llama, Gemma) using the filtered dataset; PEFT/LoRA for efficient training

Code Snippet

# STEP 1: Evolve instructions
def evolve_instructions(seeds: List[str]) -> List[str]:
    evolved: List[str] = []
    for seed in seeds:
        evolved.append(f"{seed} Assume the relevant policy changed last quarter.")  # deeper: constraints/hypotheticals
        evolved.append(f"List 3 reasons: {seed}")                                   # more concrete
    # Multi-step: combine pairs of questions
    evolved.extend(f"{a} Then explain how this relates to: {b}"
                   for a, b in zip(seeds, seeds[1:]))
    return evolved

# STEP 2: Generate answers (LLM + policy context)
qa_pairs = generate_answers(all_instructions)

# STEP 3: Score and filter (LLM or model; 1-5)
scored = [score_instruction_answer(ia) for ia in qa_pairs]
filtered = [ex for ex in scored if ex.score >= 4]

# STEP 4: SFT-ready dataset (chat format) -> then HuggingFace SFT/LoRA
sft_dataset = [{"messages": [{"role": "user", "content": ex.instruction},
                             {"role": "assistant", "content": ex.answer}]}
               for ex in filtered]
# Train with transformers + peft + trl SFTTrainer

Full Example: patterns/evol-instruct/example.py


Category 4: Reliability & Evaluation

Patterns 17–20 focus on evaluation, safety, and reliability: using LLMs to judge quality and guarding against harmful or off-policy outputs.


Pattern 17: LLM as Judge

Category: Reliability
Use When: You need nuanced evaluation of model or human outputs with scores and justifications to drive feedback loops, filtering, or training

Problem

Teams must evaluate thousands of support replies for helpfulness, tone, accuracy, clarity, and completeness. Human review does not scale; simple metrics (length, keyword match) miss nuance.

Solution

LLM as Judge uses an LLM to score and justify outputs against a scoring rubric. Three options:

  1. Prompting — Criteria and instructions in the prompt; LLM returns score (1–5) per criterion and brief justification. Temperature=0 for consistency.
  2. ML — Create rubric, collect historical (item, scores) data, train a classification model to replicate the rubric at scale.
  3. Fine-tuning — Fine-tune a model as a dedicated judge on your rubric and labeled data.

Code Snippet

SUPPORT_REPLY_CRITERIA = """
- Helpfulness: Addresses the customer's question; actionable next steps.
- Tone: Professional, empathetic.
- Accuracy: Factually correct.
- Clarity: Easy to read; no unnecessary jargon.
- Completeness: Covers the main ask.
"""

def build_judge_prompt(item: str, criteria: str) -> str:
    return f"""You are evaluating a customer support reply. Score 1-5 per criterion with brief justification.
Criteria: {criteria}
Reply: --- {item} ---
Scores:"""

# Invoke judge with temperature=0 for consistency
raw = run_judge(build_judge_prompt(reply, SUPPORT_REPLY_CRITERIA))
result = parse_judge_response(raw, reply)
# result.scores -> [CriterionScore(criterion="Helpfulness", score=4, justification="..."), ...]

Full Example: patterns/llm-as-judge/example.py


Pattern 18: Reflection

Category: Reliability
Use When: You invoke the LLM via a stateless API and want it to correct or improve its first response without the user sending a follow-up.

Problem

The API must return a short apology email for a delayed shipment. A single LLM call may omit an order reference, sound generic, or lack a clear next step.

Solution

Reflection: Do not return the first response to the client. (1) First call → initial response. (2) Evaluate: send initial response to an evaluator; get feedback. (3) Modified prompt: original request + initial response + feedback. (4) Second call → revised response. Return the revised response.

Code Snippet

def run_reflection(user_prompt: str) -> ReflectionResult:
    initial_response = generate_initial(user_prompt)       # First call
    feedback, notes = evaluate(user_prompt, initial_response)  # Evaluator
    modified_prompt = (
        f"Original request:\n{user_prompt}\n\n"
        f"Your previous response:\n---\n{initial_response}\n---\n\n"
        f"Feedback to apply:\n{feedback}\n\nProduce an improved version."
    )
    revised_response = generate_revised(modified_prompt)  # Second call
    return ReflectionResult(initial_response, feedback, revised_response)
# Return revised_response to client; initial_response is not sent.

Full Example: patterns/reflection/example.py


Pattern 19: Dependency Injection

Category: Reliability
Use When: GenAI apps are hard to develop and test because LLM output is nondeterministic and models change quickly; keep code LLM-agnostic by injecting LLM and tool calls as dependencies

Problem

Developing and testing is hard: LLM output is nondeterministic, APIs change, and you want CI and local dev without API keys.

Solution

Dependency Injection: Pass LLM and tool calls into the pipeline as dependencies. Production uses real implementations; tests and dev use mocks that return hardcoded, deterministic results.

Code Snippet

# Pipeline accepts dependencies; no direct LLM calls inside
def run_ticket_pipeline(
    ticket_text: str,
    summarize_fn: Callable[[str], str],
    suggest_action_fn: Callable[[str, str], str],
) -> TicketResult:
    summary = summarize_fn(ticket_text)
    suggested_action = suggest_action_fn(ticket_text, summary)
    return TicketResult(summary=summary, suggested_action=suggested_action)

# Production: real implementations
result = run_ticket_pipeline(ticket, real_summarize, real_suggest_action)

# Tests: mocks (hardcoded, deterministic)
result = run_ticket_pipeline(ticket, mock_summarize, mock_suggest_action)
assert result.summary == "Customer reports an issue..."

Full Example: patterns/dependency-injection/example.py


Pattern 20: Prompt Optimization

Category: Reliability
Use When: You want better results from prompt engineering, but changing the foundation model would force repeating every manual trial; use a repeatable optimization loop over a pipeline instead

Solution

Prompt optimization as four components — (1) Pipeline of steps that use the prompt (prompt is a parameter), (2) Dataset to evaluate on, (3) Evaluator that scores each output, (4) Optimizer that proposes candidates and picks the best by score.

Code Snippet

def run_pipeline(prompt_template: str, ticket: str) -> str:
    return generate_fn(prompt_template, ticket)

dataset = get_dataset()

def evaluate_summary(summary: str, ticket: str) -> float:
    return 0.0  # ... length, key-info, or LLM-as-Judge

best_prompt, best_score = optimize_prompt(
    candidate_prompts=["Summarize in one sentence.", "Write a one-line summary.", ...],
    dataset=dataset,
    run_fn=lambda p, t: run_pipeline(p, t),
    eval_fn=evaluate_summary,
)
# When model changes: re-run optimize_prompt with same dataset/evaluator

Full Example: patterns/prompt-optimization/example.py


Category 5: Tools, Agents & Efficiency

Patterns 21–32 extend LLMs with tool calling, code execution, multi-agent collaboration, and production efficiency techniques.


Pattern 21: Tool Calling

Category: Tools & Agents
Use When: You need the model to act by calling APIs, looking up live order status, searching internal systems

Solution

Bind tools to the model; run a LangGraph with an assistant node (LLM) and ToolNode (executes tools). Conditional routing: if the last message has tool_calls, run tools and loop back.

Code Snippet

from langgraph.graph import END, MessagesState, StateGraph
from langgraph.prebuilt import ToolNode
from langchain_core.tools import tool

@tool
def lookup_order_status(order_id: str) -> str:
    """Look up order in OMS."""
    return '{"status":"shipped",...}'

workflow = StateGraph(MessagesState)
workflow.add_node("assistant", call_model)
workflow.add_node("tools", ToolNode([lookup_order_status]))
workflow.set_entry_point("assistant")
workflow.add_conditional_edges("assistant", route_tools_or_end)
workflow.add_edge("tools", "assistant")
app = workflow.compile()

Full Example: patterns/tool-calling/example.py
Dependencies: pip install -r patterns/tool-calling/requirements.txt; Ollama with a tool-capable model (e.g. llama3.2)


Pattern 22: Code Execution

Category: Tools & Agents
Use When: The task needs an artifact (diagram, plot, query); the model emits a DSL and a sandbox runs it

Solution

Code execution: Prompt the model for DSL (low temperature). A sandbox writes temp files, runs dot, python (restricted), or a DB driver with timeouts and allowlists. LangGraph can wire generate_dsl → execute_sandbox as a linear graph.

Code Snippet

from langgraph.graph import END, StateGraph

workflow = StateGraph(CodeExecutionState)
workflow.add_node("generate_dsl", node_generate_dsl)
workflow.add_node("sandbox", execute_in_sandbox)
workflow.set_entry_point("generate_dsl")
workflow.add_edge("generate_dsl", "sandbox")
workflow.add_edge("sandbox", END)
app = workflow.compile()
final = app.invoke({"user_request": "..."})

Full Example: patterns/code-execution/example.py
Dependencies: pip install -r patterns/code-execution/requirements.txt; optional Graphviz (brew install graphviz)


Pattern 23: Multi-Agent Collaboration

Category: Tools & Agents
Use When: Work is multistep, multi-domain, and long-running; a single agent hits cognitive, tool, and tuning limits

Solution

Multi-agent collaboration: Define agents with narrow mandates and clear handoffs. Patterns include hierarchical (planner delegates), prompt chaining (sequential pipelines), peer-to-peer / blackboard (shared store), and parallel execution.

Code Snippet

from langgraph.graph import END, StateGraph

g = StateGraph(MultiAgentState)
g.add_node("plan", node_plan)
g.add_node("technical", node_technical)
g.add_node("compliance", node_compliance)
g.add_node("merge", node_merge)
g.add_node("critic", node_critic)
g.add_node("finalize", node_finalize)
g.set_entry_point("plan")
app = g.compile()
result = app.invoke({"user_request": "..."})

Full Example: patterns/multiagent-collaboration/example.py


Pattern 24: Small Language Model

Category: Efficiency & Deployment
Use When: Frontier models are too large or too expensive to self-host; you want smaller models, distillation, quantization, or faster decoding

Solution

  1. Knowledge distillation — Train a student on teacher soft targets; KL divergence aligns token distributions
  2. Quantization — 4-bit / 8-bit weights (BitsAndBytesConfig, NF4) shrink footprint
  3. Speculative decoding — A small draft model proposes tokens; a large target verifies in parallel

Code Snippet

# Distillation: minimize KL(student || teacher) on teacher softmax + CE on labels
# Quantization: BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", ...)
# Speculative decoding: vLLM speculative_config={"model": draft_id, "num_speculative_tokens": k}

Full Example: patterns/small-language-model/example.py


Pattern 25: Prompt Caching

Category: Efficiency & Deployment
Use When: The same or similar prompts hit your LLM repeatedly and you want lower latency, lower cost, and less load

Solution

  1. Client-side exact cache — Hash (model, params, messages) ? store response
  2. Framework caches — LangChain InMemoryCache / SQLiteCache via set_llm_cache
  3. Semantic cache — Embeddings to match paraphrases; return cached answer if similarity ≥ threshold
  4. Server-side prompt caching — Anthropic / OpenAI may cache eligible long prompts inside the API

Code Snippet

# Exact: sha256(f"{model}\n{prompt}") -> response
# Semantic: cosine(embed(query), embed(cached_prompt)) >= threshold
# LangChain: set_llm_cache(SQLiteCache(database_path="..."))
# Provider: Anthropic cache_control / OpenAI automatic prefix caching (see docs)
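The exact-match variant above can be sketched in plain Python; `ExactCache` and `fake_llm` are illustrative names for this sketch, not a real library API:

```python
import hashlib
from typing import Callable, Dict

class ExactCache:
    """Exact-match prompt cache: sha256(model + prompt) -> stored response."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()

    def get_or_call(self, model: str, prompt: str, call_fn: Callable[[str], str]) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_fn(prompt)        # only pay for the LLM on a miss
        self._store[key] = response
        return response

cache = ExactCache()
fake_llm = lambda p: f"answer to: {p}"    # stand-in for a real model call
a = cache.get_or_call("llama3.2", "Where is my order?", fake_llm)
b = cache.get_or_call("llama3.2", "Where is my order?", fake_llm)  # served from cache
```

In production the key should also hash sampling parameters (temperature, max tokens), since the same prompt with different settings is a different response.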

Full Example: patterns/prompt-caching/example.py


Pattern 26: Inference Optimization

Category: Efficiency & Deployment
Use When: You self-host LLMs and must maximize throughput, cut latency, and control KV-cache memory

Solution

  1. Continuous batching (dynamic batching) — Requests enter and leave at fine granularity; vLLM (PagedAttention) and SGLang reduce padding waste
  2. Speculative decoding — Draft + target models (see Pattern 24)
  3. Prompt compression — Remove redundancy in system + RAG context to shrink KV footprint

Code Snippet

# Continuous batching: use vLLM / SGLang / TensorRT-LLM — not hand-rolled pad batches
# Speculative decoding: vLLM speculative_config={...} (see Pattern 24)
# Prompt compression: dedupe, summarize, or learned compressors before .generate()

Full Example: patterns/inference-optimization/example.py


Pattern 27: Degradation Testing

Category: Efficiency & Deployment
Use When: You need load testing that matches LLM inference behavior with TTFT, end-to-end latency, token throughput, and RPS under rising concurrency

Key Metrics

  • TTFT — Time from request to first token (streaming)
  • EERL — End-to-end request latency (wall time to last token)
  • Output tokens / second — Generation throughput
  • Requests / second — Completed requests per second at a given concurrency

Code Snippet

# Per request: ttft_s, eerl_s, output_tokens -> tok/s = tokens / (eerl_s - ttft_s)
# Aggregate: p95_ttft, p95_eerl, mean tok/s, rps = n / wall_time
# Tools: LLMPerf, LangSmith traces
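The per-request and aggregate math above reduces to a few lines; the sample numbers are hypothetical, and `p95` here is a simple nearest-rank percentile rather than an interpolated one:

```python
import math
import statistics

def tok_per_s(ttft_s: float, eerl_s: float, output_tokens: int) -> float:
    """Generation throughput: tokens emitted over the decode window."""
    return output_tokens / (eerl_s - ttft_s)

def p95(values: list) -> float:
    """Nearest-rank 95th percentile, as used in simple load-test reports."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

# Hypothetical samples from one load run: (ttft_s, eerl_s, output_tokens)
samples = [(0.2, 2.2, 100), (0.3, 2.8, 120), (0.25, 3.25, 90)]
tok_s = [tok_per_s(*s) for s in samples]   # ≈ [50.0, 48.0, 30.0]
p95_ttft = p95([s[0] for s in samples])
p95_eerl = p95([s[1] for s in samples])
mean_tok_s = statistics.mean(tok_s)
```

Degradation testing then repeats this at rising concurrency and watches where p95 TTFT and tok/s start to bend.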

Full Example: patterns/degradation-testing/example.py


Pattern 28: Long-Term Memory

Category: Memory & Agents
Use When: LLM calls are stateless; you need continuity across sessions with working, episodic, procedural, and semantic memory

Solution

  1. Working memory — Recent turns / scratch context (sliding window)
  2. Episodic memory — Dated interactions (“what we did”)
  3. Procedural memory — Playbooks and tool recipes
  4. Semantic memory — Stable facts; typically embedding search (Mem0, custom RAG-over-memories)

Code Snippet

# Mem0: memory.add(messages, user_id=...); memory.search(query, user_id=...)
# Four layers: working (deque), episodic (log), procedural (playbooks), semantic (vector / KV)

Full Example: patterns/long-term-memory/example.py
Dependencies: pip install mem0ai openai chromadb for Mem0-aligned version


Pattern 29: Template Generation

Category: Setting Safeguard
Use When: You need repeatable, reviewable customer-facing text; full free-form generation is too variable or mixes facts with creativity unsafely

Solution

  1. Prompt the model to output a template only, with explicit placeholders ([CUSTOMER_NAME], [ORDER_ID], …)
  2. Human / comms reviews the template (per locale/product), not every send
  3. Fill slots in code; optional second LLM pass only for lint or translation
  4. Few-shot examples in the prompt show approved shapes so new templates stay grounded

Code Snippet

# Low temp + few-shot -> template with [SLOT_NAME]
# validate required [ORDER_ID], [CUSTOMER_NAME] present
# fill_template(template, slots_from_crm)
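Steps 1–3 reduce to slot validation and deterministic filling; a minimal sketch with a hypothetical template and CRM values:

```python
import re

def missing_slots(template: str, required: list) -> list:
    """Return required placeholders that are absent from the template."""
    found = set(re.findall(r"\[([A-Z_]+)\]", template))
    return [slot for slot in required if slot not in found]

def fill_template(template: str, slots: dict) -> str:
    """Deterministic slot filling in code; no LLM touches the facts."""
    for name, value in slots.items():
        template = template.replace(f"[{name}]", value)
    return template

template = "Hi [CUSTOMER_NAME], your order [ORDER_ID] shipped today."   # reviewed once
assert missing_slots(template, ["ORDER_ID", "CUSTOMER_NAME"]) == []     # validate before use
message = fill_template(template, {"CUSTOMER_NAME": "Ana", "ORDER_ID": "A-42"})
```

The review burden moves from every send to every template, which is what makes the pattern cheap at scale.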

Full Example: patterns/template-generation/example.py


Pattern 30: Assembled Reformat

Category: Setting Safeguard
Use When: A full LLM-generated page can hallucinate high-risk attributes (battery chemistry, hazmat, allergens, medical claims)

Solution

  1. Risk registry — chemistry, Wh, hazmat, etc. from structured sources only
  2. Assemble deterministic blocks (specs, shipping, legal)
  3. Optional LLM for tone/SEO only, conditioned on the assembled facts
  4. Validate — banned claims, chemistry contradictions

Code Snippet

# facts = load_pim(sku)  # BatteryChemistry.NIMH, ...
# page = render_compliance_block(facts)  # deterministic
# fluff = llm_marketing(facts)  # constrained; validate_high_risk(page + fluff, facts)

Full Example: patterns/assembled-reformat/example.py


Pattern 31: Self-Check

Category: Setting Safeguard
Use When: You can obtain per-token logprobs from inference and want a statistical signal to flag uncertain or fragile generations for review

Solution

  1. Logits → softmax → probabilities (p_i)
  2. Logprob (log p) for the sampled token (APIs often return this directly)
  3. Flag tokens with low (p) or small margin to the second-best token
  4. Perplexity on a sequence: PPL = exp(-mean(logprobs))

Code Snippet

# p_i = exp(logprob_i); flag if p_i < threshold
# PPL = exp(-mean(logprobs))  # natural-log probs per token
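Steps 1–4 reduce to a few lines; the logprobs below are hypothetical stand-ins for what an inference API would return:

```python
import math

def flag_uncertain_tokens(logprobs: list, threshold: float = 0.5) -> list:
    """Indices whose sampled-token probability p = exp(logprob) falls below threshold."""
    return [i for i, lp in enumerate(logprobs) if math.exp(lp) < threshold]

def perplexity(logprobs: list) -> float:
    """PPL = exp(-mean(logprobs)), with natural-log probs per token."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Hypothetical per-token logprobs from an API response
logprobs = [math.log(0.9), math.log(0.8), math.log(0.2)]
flagged = flag_uncertain_tokens(logprobs)   # the third token looks fragile
ppl = perplexity(logprobs)
```

High perplexity or many flagged tokens is a cheap statistical trigger for routing the generation to review (Pattern 38).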

Full Example: patterns/self-check/example.py


Pattern 32: Guardrails

Category: Setting Safeguard
Use When: You must enforce security, privacy, moderation, and alignment around LLM and RAG systems

Solution

  1. Prebuilt — Gemini safety settings; OpenAI Moderation API; hosted provider filters
  2. Custom — PII redaction, banned topics, allowlists, regex injection detectors, LLM-as-Judge (Pattern 17)
  3. Compose — An apply_guardrails(text, scanners) pipeline; scan the query, then the answer

Code Snippet

# apply_guardrails(user_query, [pii_redact, banned_topic])
# answer = engine.query(sanitized); apply_guardrails(answer, [pii_redact, moderation])
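A minimal sketch of the custom scanner pipeline; the two scanners are toy examples (a regex SSN redactor and a keyword check), not production detectors:

```python
import re
from typing import Callable, List, Tuple

Scanner = Callable[[str], Tuple[str, List[str]]]   # (possibly rewritten text, violations)

def pii_redact(text: str) -> Tuple[str, List[str]]:
    redacted = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED_SSN]", text)
    return redacted, (["pii"] if redacted != text else [])

def banned_topic(text: str) -> Tuple[str, List[str]]:
    return text, (["banned_topic"] if "do something illegal" in text.lower() else [])

def apply_guardrails(text: str, scanners: List[Scanner]) -> Tuple[str, List[str]]:
    """Run scanners in order, threading the text through and collecting violations."""
    violations: List[str] = []
    for scan in scanners:
        text, found = scan(text)
        violations.extend(found)
    return text, violations

safe, flags = apply_guardrails("My SSN is 123-45-6789", [pii_redact, banned_topic])
```

The same pipeline runs twice: once on the user query before retrieval, once on the model answer before it leaves the service.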

Full Example: patterns/guardrails/example.py


Category 6: Agentic Behavior Patterns

Patterns 33–50 align with Agentic Design Patterns (Antonio Gulli): specialized agent roles, orchestration, and production agentic systems.


Pattern 33: Prompt Chaining

Category: Agentic Orchestration
Use When: A task benefits from sequential decomposition where each LLM call has one job, structured output feeds the next step

Solution

Code Snippet

# state = classify(q); state = decompose(state); state = answer(state); state = format(state)
# Or LangGraph: add_node per step, linear edges
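A minimal state-passing sketch of the chain in the comment above, with deterministic stubs standing in for the LLM calls:

```python
from dataclasses import dataclass, field

@dataclass
class ChainState:
    question: str
    category: str = ""
    sub_questions: list = field(default_factory=list)
    answer: str = ""

def classify(state: ChainState) -> ChainState:       # one job: pick a category
    state.category = "billing" if "refund" in state.question.lower() else "general"
    return state

def decompose(state: ChainState) -> ChainState:      # one job: split the question
    state.sub_questions = [f"[{state.category}] {state.question}"]
    return state

def answer(state: ChainState) -> ChainState:         # one job: produce the reply
    state.answer = f"{state.category} reply covering {len(state.sub_questions)} sub-question(s)"
    return state

state = ChainState(question="I need a refund for order 123")
for step in (classify, decompose, answer):           # structured output feeds the next step
    state = step(state)
```

Each step sees only the state it needs, which is what makes the chain testable step by step.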

Full Example: patterns/prompt-chaining/example.py


Pattern 34: Routing

Category: Agentic Orchestration
Use When: You must classify or direct each request to the right handler

Solution

  1. Rule-based routing for deterministic paths
  2. Embedding similarity to handler descriptions or labeled exemplars
  3. LLM routing with JSON schema: route, confidence, optional rationale
  4. ML classifier on features for scale and SLOs

Code Snippet

# RunnableBranch (langchain_core): (predicate, runnable), ..., default
# Or: rules_first = route_rules(text); if conf < 0.9: route_llm(text)
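The rules-first fallback in the comment can be sketched as follows; `route_llm` is a deterministic stub standing in for an LLM router that returns JSON with route and confidence:

```python
def route_rules(text: str):
    """Deterministic rules; returns (route, confidence)."""
    t = text.lower()
    if "refund" in t or "charge" in t:
        return "billing", 0.95
    if "password" in t or "login" in t:
        return "technical", 0.95
    return "general", 0.4

def route_llm(text: str):
    return "sales", 0.7    # stand-in for an LLM call returning {route, confidence}

def route(text: str, threshold: float = 0.9) -> str:
    """Rules first; escalate to the LLM router only when confidence is low."""
    handler, conf = route_rules(text)
    if conf < threshold:
        handler, conf = route_llm(text)
    return handler
```

Cheap deterministic paths handle the bulk of traffic; the expensive router only sees the ambiguous tail.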

Full Example: patterns/routing/example.py


Pattern 35: Parallelization

Category: Agentic Orchestration
Use When: Independent subtasks can run together, such as research fan-out, analytics partitions, and parallel validators

Solution

Code Snippet

# LCEL: RunnableParallel(gather=..., analyze=..., verify=...) | RunnableLambda(merge)
# stdlib: ThreadPoolExecutor; submit each branch; as_completed → dict
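The stdlib variant can be sketched as a fan-out/merge helper; the three branch functions are deterministic stand-ins for LLM or retrieval calls:

```python
from concurrent.futures import ThreadPoolExecutor

def gather(q: str) -> str:   return f"sources for {q}"
def analyze(q: str) -> str:  return f"analysis of {q}"
def verify(q: str) -> str:   return f"checks on {q}"

def fan_out(query: str, branches: dict) -> dict:
    """Run independent branches concurrently; merge their results by name."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in branches.items()}
        return {name: fut.result() for name, fut in futures.items()}

result = fan_out("Q3 churn", {"gather": gather, "analyze": analyze, "verify": verify})
```

Threads suit I/O-bound LLM calls; the merge step is where you reconcile or rank the branch outputs.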

Full Example: patterns/parallelization/example.py


Pattern 36: Learning and Adaptation

Category: Agentic Learning
Use When: Systems must improve from experience such as RL (PPO with clipped surrogate for stability) and preference alignment (DPO without a separate reward model)

Solution

  1. RL agents — collect trajectories → advantage estimates → PPO-style clipped ratio to limit destructive updates
  2. LLM alignment — RLHF path (reward model + PPO) vs DPO (direct policy update from chosen/rejected completions)
  3. Online / memory — replay, regularization, retrieval over past successes

Code Snippet

# PPO: clip ratio r to [1-eps, 1+eps]; surrogate min(r*A, clip(r)*A)
# DPO: preference loss on log pi(y_w) - log pi(y_l) vs reference (see TRL / papers)

Full Example: patterns/learning-adaptation/example.py


Pattern 37: Exception Handling and Recovery

Category: Agentic Reliability
Use When: Agents, chains, and tools must survive failures by detecting and classifying errors, retrying wisely, and falling back to degraded paths

Solution

  1. Detect — Structured errors, validation, guardrails, timeouts
  2. Classify — Transient vs permanent vs policy
  3. Handle — Exponential backoff, circuit breaker, fallback model or cache
  4. Recover — Idempotent retries, compensation, checkpoint resume

Code Snippet

def run_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    is_recoverable: Callable[[BaseException], bool],
) -> T:
    """
    Try ``primary``; on a recoverable exception, invoke ``fallback``.

    Args:
        primary: Preferred code path (e.g. frontier model).
        fallback: Degraded path (e.g. smaller model or cached stub).
        is_recoverable: Whether to use fallback for this exception type.

    Returns:
        Result from primary or fallback.

    Raises:
        Re-raises if the primary fails with a non-recoverable error.
    """
    try:
        return primary()
    except Exception as exc:
        if not is_recoverable(exc):
            raise
        return fallback()
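run_with_fallback covers the fallback half of step 3; the exponential-backoff retry from the same list can be sketched similarly (`sleep_fn` is injected here only so the demo runs without real delays):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> T:
    """Retry ``fn`` with exponential backoff; re-raise after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep_fn(base_delay_s * (2 ** attempt))   # 0.5s, 1.0s, 2.0s, ...
    raise AssertionError("unreachable")

calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient upstream error")
    return "ok"

result = retry_with_backoff(flaky, sleep_fn=lambda s: None)   # no real sleeping in tests
```

Retry only errors your classifier marks transient; retrying a permanent policy error just burns budget.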

Full Example: patterns/exception-handling-recovery/example.py


Pattern 38: Human-in-the-Loop (HITL)

Category: Agentic Safety
Use When: Automation must yield to people for quality, compliance, or risk

Solution

  1. Triggers — Low confidence, high stakes, novel situations, regulatory rules, sampling
  2. Review — Queues, rubrics, SLAs, multi-level approval
  3. Feedback — Labels and edits → datasets, policies, routing
  4. Orchestration — LangGraph interrupt / human nodes; workflow engines with wait states

Code Snippet

# if stakes == HIGH or conf < tau: enqueue(HumanReviewTicket)
# LangGraph: interrupt_before=[human_node]; resume with Command

Full Example: patterns/human-in-the-loop/example.py


Pattern 39: Agentic RAG (Knowledge Retrieval)

Category: Agentic Knowledge
Use When: You need up-to-date, source-grounded answers with embeddings, semantic search, chunking, vector stores, and advanced variants

Solution

  1. Chunk → embed → vector DB; measure relevance via cosine / distance metrics
  2. Hybrid retrieval (dense + sparse) where lexical match matters
  3. Graph RAG for entity-centric queries; agentic RAG for query rewrite, tool retrieval, multi-hop
  4. LangChain LCEL / LangGraph for pipelines and cycles

Code Snippet

# LCEL: RunnablePassthrough.assign(context=retriever) | prompt | llm
# LangGraph: retrieve -> grade_documents -> [rewrite_query | generate]

Full Example: patterns/agentic-rag/example.py
See also Patterns 6–12 for depth implementations of each RAG component.


Pattern 40: Resource-Aware Optimization

Category: Agentic Efficiency
Use When: You must optimize LLM and agent workloads for cost, latency, capacity, and graceful degradation

Solution

  1. Budgets and tiered models — Estimate $ per request
  2. Route by priority and load (Pattern 34)
  3. Prune / summarize context; cache (25); smaller models (24)
  4. Degrade — Fewer tools, shorter answers, async handoff

Code Snippet

# if budget.remaining() < need: summarize(history) or tier = "small"
# if degradation == MINIMAL: tool_gate.disable_heavy_tools()

Full Example: patterns/resource-aware-optimization/example.py


Pattern 41: Reasoning Techniques (Agentic)

Category: Agentic Reasoning
Use When: You need a structured approach to complex Q&A using CoT, ToT, self-correction, PAL / code-aided reasoning, ReAct, RLVR, debates (CoD), deep research

Solution

Use the technique map in patterns/reasoning-techniques/README.md: CoT (13), ToT (14), Reflection (18), Deep Search (12), ReAct / tools (21), PAL-style code (22), multi-agent debates (23), prompt / workflow optimization (20).

def language_agent_tree_search_stub(
    frontier: list[str],
    expand_fn: Callable[[str], list[str]],
    score_fn: Callable[[str], float],
    beam_width: int = 2,
) -> list[tuple[str, float]]:
    """
    Minimal beam-style selection (stand-in for Language Agent Tree Search).

    LATS in the literature expands **language** states/actions, scores children
    with a value model or LLM critic, and prunes—unlike a flat ToT breadth list.

    Args:
        frontier: Current candidate partial solutions or thoughts.
        expand_fn: Callable taking one candidate, returning child strings.
        score_fn: Callable taking a string, returning higher-is-better score.
        beam_width: Max states to keep after scoring.

    Returns:
        Top ``beam_width`` (candidate, score) pairs.
    """
    children: list[tuple[str, float]] = []
    for node in frontier:
        for ch in expand_fn(node):
            children.append((ch, float(score_fn(ch))))
    children.sort(key=lambda x: x[1], reverse=True)
    return children[:beam_width]

Full Example: patterns/reasoning-techniques/example.py


Pattern 42: Evaluation and Monitoring (Agentic)

Category: Agentic Observability
Use When: You need performance tracking, A/B tests, compliance evidence, latency SLOs, token/cost telemetry, custom quality metrics (LLM-as-judge), and multi-agent traces

Solution

Instrument calls, aggregate SLAs, run experiments with guardrail metrics, store audit evidence, trace multi-agent workflows.

Code Snippet

# trace_id + span per LLM/tool; tokens += pt+ct; export to OTLP
# ab_variant(user_key, "exp", ("a","b")); compare judge_score & p95_latency

Full Example: patterns/evaluation-monitoring/example.py


Pattern 43: Prioritization

Category: Agentic Scheduling
Use When: Competing tasks (support queues, cloud jobs, trading paths, security incidents) must be ordered using multi-criteria scores, dynamic re-ranking, and resource-aware scheduling

Solution

Weighted dimensions (urgency, impact, effort, SLA, security), recompute on events, integrate with routing (34) and capacity (40).

Code Snippet

# score = w1*urgency + w2*importance - w3*effort + w4*f(sla) + w5*security
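A minimal sketch of that weighted score; the weights and task fields are hypothetical:

```python
def priority_score(task: dict, weights: dict) -> float:
    """Weighted multi-criteria score; higher runs first, effort counts against."""
    return (weights["urgency"] * task["urgency"]
            + weights["importance"] * task["importance"]
            - weights["effort"] * task["effort"]
            + weights["sla"] * (1.0 if task["sla_breach_soon"] else 0.0)
            + weights["security"] * task["security_risk"])

weights = {"urgency": 3.0, "importance": 2.0, "effort": 1.0, "sla": 5.0, "security": 4.0}
tasks = [
    {"id": "ui-polish", "urgency": 0.2, "importance": 0.4, "effort": 0.5,
     "sla_breach_soon": False, "security_risk": 0.0},
    {"id": "refund-bug", "urgency": 0.9, "importance": 0.8, "effort": 0.3,
     "sla_breach_soon": True, "security_risk": 0.0},
]
# Re-rank on every event (new task, SLA clock tick, incident escalation)
ranked = sorted(tasks, key=lambda t: priority_score(t, weights), reverse=True)
```

Recomputing on events rather than on a timer is what keeps the queue responsive to SLA clocks and escalations.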

Full Example: patterns/prioritization/example.py


Pattern 44: Memory Management

Category: Agentic State
Use When: Agents need short-term context, long-term persistence, episodic retrieval, procedural playbooks, and privacy-aware storage

Solution

Tier memory (working, episodic, procedural, semantic); extract and retrieve selectively; persist orchestrator state with MemorySaver.

Code Snippet

# LangGraph: compile(..., checkpointer=MemorySaver()); thread_id in config
# External: memory.search(query, user_id=...) for semantic / episodic layers

Full Example: patterns/memory-management/example.py


Pattern 45: Planning and Task Decomposition

Category: Agentic Orchestration
Use When: You need explicit task graphs, dependencies, and valid execution order

Decompose goals into a DAG of subtasks with dependencies. The planner agent determines which tasks to run in parallel vs. sequentially based on dependency analysis.

Full Example: patterns/planning-task-decomposition/example.py


Pattern 46: Goal Setting and Monitoring

Category: Agentic Governance
Use When: SMART goals, progress vs. targets, deviation detection, strategy updates

The goal-monitor agent tracks metrics against defined targets, detects when progress deviates from expected trajectories, and adjusts strategy when needed.

Full Example: patterns/goal-setting-monitoring/example.py


Pattern 47: MCP Integration (Agentic)

Category: Tooling / Integration
Use When: You integrate Model Context Protocol servers (discovery, tools/list, tools/call) and need secure composition with Pattern 21

Model Context Protocol (MCP) provides a standardized interface between agents and external resources. Agents discover available tools at runtime through the protocol, call them with structured inputs, and receive structured outputs.

Full Example: patterns/mcp-integration/example.py


Pattern 48: Inter-Agent Communication

Category: Distributed Agents
Use When: You need message envelopes, routing, correlation, and capability discovery (A2A-style), composed with Pattern 23

Agent-to-Agent (A2A) communication defines structured message schemas and communication protocols for inter-agent coordination. Agents send typed messages (task assignments, results, status updates, requests for clarification) through a message bus or shared workspace.

Full Example: patterns/inter-agent-communication/example.py


Pattern 49: Safety Guardian

Category: Safety / Compliance
Use When: Multi-layer defense, risk thresholds, shutdown paths beyond single guardrail scanners (extends Pattern 32)

The safety guardian agent implements three-tier protection: pre-action guardrails (evaluate the proposed action before execution), in-process monitoring (enforce scope and resource constraints during execution), and post-action auditing. Includes prompt injection detection for agents that process external content.

Full Example: patterns/safety-guardian/example.py


Pattern 50: Exploration and Discovery

Category: Search / Learning
Use When: Explore vs. exploit, novel environments, hypothesis cycles (pairs with Patterns 12, 14, 41, 36)

Implements a multi-armed bandit or curiosity-driven strategy that balances exploitation (using known-good approaches) with exploration (trying new approaches to discover if they’re better). Scores update from outcomes, so the agent continuously refines its strategy distribution.
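The bandit strategy described above can be sketched as epsilon-greedy; the arm names and payoffs are hypothetical:

```python
import random

class EpsilonGreedyBandit:
    """Epsilon-greedy explore/exploit over a set of named strategies."""

    def __init__(self, arms: list, epsilon: float = 0.1, seed: int = 0) -> None:
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {arm: 0 for arm in arms}
        self.values = {arm: 0.0 for arm in arms}   # running mean reward per arm

    def select(self) -> str:
        if self.rng.random() < self.epsilon:              # explore
            return self.rng.choice(list(self.counts))
        return max(self.values, key=self.values.get)      # exploit the current best

    def update(self, arm: str, reward: float) -> None:
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedyBandit(["known_good", "novel"])
for _ in range(200):
    arm = bandit.select()
    reward = 0.8 if arm == "known_good" else 0.3   # hypothetical payoffs
    bandit.update(arm, reward)
```

The running means converge toward each arm's true payoff, so the agent's strategy distribution refines as outcomes accumulate.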

Full Example: patterns/exploration-discovery/example.py


Pattern Comparison Matrix

# | Pattern | Complexity | Training Required | Best For
1 | Logits Masking | Medium | No | Valid JSON, banned words
2 | Grammar Constrained Generation | High | No | API configs, schemas
3 | Style Transfer | Low–Medium | Optional | Notes to emails
4 | Reverse Neutralization | High | Yes | Your writing style
5 | Content Optimization | High | Yes | Open rates, conversions
6 | Basic RAG | Medium | No | Documentation, knowledge bases
7 | Semantic Indexing | High | No | Technical docs, multimedia, tables
8 | Indexing at Scale | High | No | Healthcare guidelines, policies
9 | Index-Aware Retrieval | High | No | Technical docs, API docs
10 | Node Postprocessing | High | No | Legal docs, medical records
11 | Trustworthy Generation | High | No | Medical Q&A, legal research
12 | Deep Search | High | No | Market research, due diligence
13 | Chain of Thought | Low–Medium | No | Policy eligibility, math, compliance
14 | Tree of Thoughts | High | No | Root-cause analysis, design exploration
15 | Adapter Tuning | Medium–High | Yes (adapter only) | Intent routing, content moderation
16 | Evol-Instruct | High | Yes (SFT/LoRA) | Policy Q&A, compliance playbooks
17 | LLM as Judge | Low–Medium | No | Support quality, model evaluation
18 | Reflection | Medium | No | Drafts, code, plans
19 | Dependency Injection | Low–Medium | No | Fast deterministic tests with mocks
20 | Prompt Optimization | Medium–High | No | Summarization, copy, classification
21 | Tool Calling | Medium–High | No | APIs, live data, actions (ReAct)
22 | Code Execution | Medium–High | No | Diagrams, plots, SQL
23 | Multi-Agent Collaboration | High | No | Vendor review, incidents, research crews
24 | Small Language Model | Medium–High | Optional | Cost, VRAM, throughput
25 | Prompt Caching | Low–Medium | No | Repeated prompts, long prefixes
26 | Inference Optimization | Medium–High | No | Self-hosted throughput, KV memory
27 | Degradation Testing | Medium–High | No | TTFT, EERL, tok/s, RPS; LLMPerf
28 | Long-Term Memory | Medium–High | No | Stateful assistants, personalization
29 | Template Generation | Low–Medium | No | Transactional email/SMS
30 | Assembled Reformat | Medium–High | No | PDPs with hazmat/battery risk
31 | Self-Check | Medium | No | Logprobs, perplexity, uncertainty triage
32 | Guardrails | Medium–High | No | Security, moderation, PII
33 | Prompt Chaining | Low–Medium | No | Sequential workflows, structured handoffs
34 | Routing | Low–High | Optional | Intent → handler, tools, subgraph
35 | Parallelization | Low–Medium | No | Research, analytics, multimodal
36 | Learning and Adaptation | High | Yes | RL, preferences, online drift
37 | Exception Handling & Recovery | Low–High | No | Agent tools, chains, APIs
38 | Human-in-the-Loop (HITL) | Low–High | No | Moderation, fraud, trading, safety
39 | Agentic RAG | Medium–High | No | Fresh knowledge, multi-hop retrieval
40 | Resource-Aware Optimization | Medium–High | No | Cost/latency budgets, degradation
41 | Reasoning Techniques | Low–Very High | Varies | CoT, ToT, ReAct, PAL, debates
42 | Evaluation and Monitoring | Medium–High | No | LLM-judge metrics, multi-agent spans
43 | Prioritization | Low–High | No | Support, cloud jobs, trading, security
44 | Memory Management | Medium–High | No | LangGraph threads, episodic retrieval
45 | Planning & Task Decomposition | Medium–High | No | DAG tasks, dependencies
46 | Goal Setting & Monitoring | Medium | No | SMART goals, deviation detection
47 | MCP Integration | Medium | No | Tool servers, discovery
48 | Inter-Agent Communication | Medium–High | No | Messages, routing, A2A
49 | Safety Guardian | High | No | Layered safety, shutdown paths
50 | Exploration & Discovery | Medium | No | Explore/exploit, novel domains

Choosing the Right Pattern

If you need… | Use…
Enforce constraints during generation | Pattern 1: Logits Masking
Formal grammar compliance | Pattern 2: Grammar Constrained Generation
Transform content style quickly | Pattern 3: Style Transfer (Few-Shot)
Consistent personal style | Pattern 4: Reverse Neutralization
Optimize for performance metrics | Pattern 5: Content Optimization
External knowledge augmentation | Pattern 6: Basic RAG
Semantic search or complex content | Pattern 7: Semantic Indexing
Large-scale, evolving knowledge with freshness | Pattern 8: Indexing at Scale
Handle vocabulary mismatches | Pattern 9: Index-Aware Retrieval
Ambiguous entities, conflicting or verbose chunks | Pattern 10: Node Postprocessing
Prevent hallucination and build user trust | Pattern 11: Trustworthy Generation
Comprehensive research with multi-hop reasoning | Pattern 12: Deep Search
Multistep reasoning or auditable reasoning trace | Pattern 13: Chain of Thought
Explore multiple strategies or hypotheses | Pattern 14: Tree of Thoughts
Specialize a foundation model with small labeled dataset (100s–1000s pairs) | Pattern 15: Adapter Tuning
Teach a model new tasks from private data | Pattern 16: Evol-Instruct
Scalable, nuanced evaluation with scores and justifications | Pattern 17: LLM as Judge
Self-correction in stateless APIs without user follow-up | Pattern 18: Reflection
Develop and test GenAI pipelines without flaky LLM calls | Pattern 19: Dependency Injection
Find good prompts, re-run when model changes | Pattern 20: Prompt Optimization
Call APIs, live systems, or tools (not only RAG) | Pattern 21: Tool Calling
Diagrams, plots, or queries as DSL executed in sandbox | Pattern 22: Code Execution
Multiple specialized roles, decomposition, parallel work | Pattern 23: Multi-Agent Collaboration
Run on smaller GPUs, cut cost, speed up decoding | Pattern 24: Small Language Model
Avoid recomputing repeated or paraphrased prompts | Pattern 25: Prompt Caching
Higher throughput and lower KV pressure on self-hosted LLMs | Pattern 26: Inference Optimization
Load tests with LLM-native metrics (TTFT, EERL, tok/s, RPS) | Pattern 27: Degradation Testing
Durable user context beyond raw chat history | Pattern 28: Long-Term Memory
On-brand, reviewable customer email/SMS | Pattern 29: Template Generation
Product pages where wrong specs are unacceptable | Pattern 30: Assembled Reformat
Flag uncertain generations using token probabilities | Pattern 31: Self-Check
Policy enforcement (PII, banned topics, moderation) | Pattern 32: Guardrails
Reliable multi-step workflows with structured handoffs | Pattern 33: Prompt Chaining
Pick the right tool, model, or specialist path | Pattern 34: Routing
Run independent tasks concurrently, then merge | Pattern 35: Parallelization
Improve from rewards, preferences, or streaming feedback | Pattern 36: Learning and Adaptation
Agents and chains that survive tool/API failures | Pattern 37: Exception Handling & Recovery
People in the loop for high-stakes decisions | Pattern 38: HITL
Gulli-level RAG with agentic retrieval loops | Pattern 39: Agentic RAG
Cost/latency-aware agents with graceful degradation | Pattern 40: Resource-Aware Optimization
Map of reasoning methods tied to implementations | Pattern 41: Reasoning Techniques
Production observability: latency, tokens, A/B tests | Pattern 42: Evaluation and Monitoring
Rank competing tasks or incidents | Pattern 43: Prioritization
LangGraph-style memory tiers + checkpointing | Pattern 44: Memory Management
Explicit task DAGs and dependency order | Pattern 45: Planning & Task Decomposition
SMART goals and deviation from targets | Pattern 46: Goal Setting & Monitoring
MCP tool servers with discovery and secure calls | Pattern 47: MCP Integration
Agent message fabric / A2A-style coordination | Pattern 48: Inter-Agent Communication
Layered safety beyond I/O scanners | Pattern 49: Safety Guardian
Explore vs. exploit in open-ended search | Pattern 50: Exploration & Discovery

Takeaways

Here are the major takeaways from these agentic patterns:

Enforce constraints early. Logits Masking and Grammar Constrained Generation prevent bad output at the token level. The same logic applies to Guardrails: put them in the runtime layer, not in the system prompt.

RAG is a stack you build layer by layer. Start with Basic RAG. When vocabulary gaps break retrieval, add Semantic Indexing. When contradictions surface, add Indexing at Scale. When queries don’t match chunks, add Index-Aware Retrieval. When retrieved chunks are noisy, add Node Postprocessing. Each pattern fixes the failure mode of the one before.

Structure the reasoning. Chain of Thought, Tree of Thoughts, and ReAct all treat reasoning as something to engineer. Adding “think step by step” costs one line and measurably improves multi-step accuracy. Tree of Thoughts costs more but handles problems where a single reasoning path gets stuck.

You need less data than you think for specialization. Adapter Tuning and Evol-Instruct both produce strong task-specific models from hundreds of examples, not millions. Evolving seed questions into harder variants and filtering by quality gives you a curriculum worth training on. The bottleneck is usually data quality, not quantity.

The operational patterns matter as much as the modeling ones. Prompt Caching, Inference Optimization, and Degradation Testing don’t appear in research papers. They’re the difference between a working demo and a system that holds up under real traffic.


I have added examples for all 50 patterns at: github.com/bhatti/agentic-patterns

Run the setup in ten minutes with SETUP.md, then explore whichever patterns are most relevant to what you’re building.

Technology Stack

  • Ollama — Local model serving
  • LangChain — LLM orchestration
  • CrewAI — Multi-agent systems
  • LangGraph — Stateful agent workflows
  • HuggingFace — Model hub and transformers
  • PyTorch — Direct model access and logits manipulation

March 13, 2026

Load Testing Applications That Actually Scale: A Practitioner’s Guide

Filed under: Computing — admin @ 3:55 pm

In past projects, I saw most engineering teams run load tests before major launches and rarely at any other time. The assumption is that if a code change is small, performance is probably fine. In practice, that assumption fails regularly. A runtime upgrade can change memory allocation patterns, garbage collection behavior, and connection handling in ways that only appear under load. A third-party library upgrade can introduce synchronous blocking where there was none before. A new database index can shift query planner behavior and affect read latency at scale. None of these surface in functional tests. None of them are visible in code review. They show up under load, in production, usually at the worst possible time.

Performance testing isn’t a pre-launch ceremony. It’s part of how you understand and maintain your system’s behavior as your code evolves, your dependencies change, and your traffic grows. This guide covers the full scope: the test types and what each one tells you, how to design meaningful tests, what metrics to collect, which tools to use, how to handle dependencies in your tests, and how to make this a regular part of your development process rather than a one-time event.


Why Performance Testing Gets Skipped

Teams often skip performance testing because of setup time, cost, or slow feedback loops. These constraints are legitimate, but they lead to a familiar outcome: performance problems get discovered in production. Another common pattern I have observed is that many teams don’t have a clear baseline picture of how their application actually behaves. They don’t know their normal memory footprint. They don’t know which code paths are hot. They don’t know at what concurrency level their database connection pool saturates or when their cache hit rate starts degrading. Without a baseline, you can’t detect regressions, you can’t capacity plan accurately, and you can’t tell a normal traffic spike from an actual problem. The goal of performance testing is to know your system well enough to predict how it behaves and catch it when behavior changes unexpectedly.


Performance Testing in the SDLC

The most effective teams don’t treat performance testing as a separate phase; instead, they integrate it into their regular development process at multiple levels.

  • During development: I have found profiling tools like JProbe/YourKit for Java, pprof for Go, the V8 profiler for Node.js, and Xcode Instruments for Swift/Objective-C incredibly useful for finding hot code paths, memory leaks, and concurrency issues.
  • During code review: Another common pattern that I have found useful is flagging changes to caching, database queries, serialization, or hot code paths for load testing before merge.
  • Nightly CI/CD pipelines: Load testing on every commit would be excessive, but a subset of tests can run as part of the nightly build so that regressions are caught and fixed before they reach production.
  • On a regular schedule: Another option is to run full-scale load and soak tests on a defined cadence, such as weekly.
  • Before major releases: Comprehensive tests covering all scenarios (average load, peak load, stress, spikes, soak) can run against a production-representative environment.
  • After significant dependency upgrades: Runtime upgrades, major library version bumps, and infrastructure changes all deserve their own performance test pass.

The Testing Taxonomy

The following are the main types of performance tests:

Profiling

Profiling instruments your application during execution and shows you exactly where time and memory are spent: which functions consume CPU, which allocate the most memory, where goroutines or threads block. You can run profiling locally before code review so that you understand the bottlenecks that already exist in your code. Load testing then tells you how those bottlenecks behave when many users hit them simultaneously. Most runtimes include profiling support (Go’s pprof, Node.js’s built-in CPU and heap profiler, Python’s cProfile), so you can also enable them in a test environment if needed.
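Since the runtimes above all ship with profilers, here is a minimal local-profiling sketch using Python’s cProfile; the `slow_join` function is a made-up hot spot for illustration:

```python
import cProfile
import io
import pstats

def slow_join(n: int) -> str:
    # Deliberately quadratic string building -- a classic hot spot
    out = ""
    for i in range(n):
        out += str(i)
    return out

profiler = cProfile.Profile()
profiler.enable()
slow_join(20000)
profiler.disable()

# Print the top functions sorted by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
print(stream.getvalue())
```

The top entries of the report point straight at `slow_join`, which is exactly the kind of signal worth acting on before a change ever reaches a load test.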

Load Testing

Load testing applies a realistic, expected workload and verifies the system meets defined performance targets. The workload mirrors production traffic such as request distribution, concurrency level, and payload shapes. The goal isn’t to break anything. It’s to confirm the system handles its designed workload within acceptable response times and error rates. Any change that could affect throughput like a code change in a hot path, a dependency upgrade, a configuration change, a schema migration should warrant a load test.

Stress Testing

Stress testing pushes load well beyond expected levels to find where the system breaks and how it breaks. At what point does performance degrade? What component fails first? Does the system fail gracefully, fail catastrophically, or corrupt state? In past projects, I found that a practical target in cloud environments is 10x your expected peak load. This accounts for real-world variability: viral traffic events, bot traffic, cascading retries from upstream services, and faster growth than planned. Stress tests also expose whether your failure modes are safe. When your system can’t keep up, what happens? Does it queue requests until it runs out of memory? Does it reject new connections cleanly with meaningful errors? Does retry behavior from clients amplify load, turning a recoverable spike into a full outage?

Spike Testing

Spike testing applies an abrupt load increase (not a gradual ramp but a sharp jump) so you can learn how the system absorbs and recovers from it. This simulates promotional emails going out, products appearing in the news, scheduled batch jobs triggering thousands of concurrent operations, or a mobile app push notification causing a synchronized rush of API calls. Spike testing can identify problems like cold-start latency when new instances initialize, connection pool exhaustion when concurrency jumps faster than the pool replenishes, cache stampedes when many concurrent requests miss cache simultaneously, and auto-scaling lag when the metric-to-action delay is too long. After the spike, watch recovery. Latency should return to baseline. Resource utilization should drop. If it doesn’t, the system is carrying forward pressure that will degrade subsequent traffic.

Soak Testing

Soak testing runs a moderate, sustained load over an extended period, from several hours to several days. The load level isn’t extreme; the duration is the point, because it uncovers problems that only appear after the system has been running for a long time, such as:

  • Memory leaks: Usage climbs slowly and continuously. A system that runs fine for 30 minutes may run out of heap after 8 hours. This is especially important to test after runtime or library upgrades, which can change allocator behavior.
  • Connection leaks: Database or HTTP connections that aren’t properly released accumulate until the pool is exhausted.
  • Thread accumulation: Background threads that don’t terminate properly compound over time.
  • Disk exhaustion: Log files that aren’t rotated, or temporary files that aren’t cleaned up, fill disk gradually.
  • Cache degradation: Caches misconfigured for their access patterns may perform well initially and degrade as the working set evolves.
  • GC pressure: Garbage collection that runs cleanly initially can become increasingly frequent and pause-heavy as heap fragmentation grows over time.
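To make the memory-leak bullet concrete, here is one way to turn periodic memory samples from a soak test into a trend signal using a least-squares slope; the sampling interval, sizes, and thresholds are all illustrative:

```python
def leak_slope(samples):
    """Least-squares slope of (seconds, MB) samples: MB of growth per second."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

# Memory samples taken every 10 minutes during an 8-hour soak (made-up data)
flat = [(i * 600, 512 + (i % 3)) for i in range(48)]      # hovers around 512 MB
leaking = [(i * 600, 512 + i * 1.5) for i in range(48)]   # climbs steadily

assert abs(leak_slope(flat) * 3600) < 1   # well under 1 MB/hour: healthy
assert leak_slope(leaking) * 3600 > 8     # ~9 MB/hour: investigate before it ships
```

The point is not the statistics but the habit: sample memory on a schedule, fit a trend, and alert on the slope rather than eyeballing a graph at the end of the run.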

Scalability Testing

Scalability testing validates that your system scales up to absorb increasing load and scales back down when load subsides. Cloud infrastructure assumes elastic scaling, so scalability testing verifies that assumption. It checks that:

  • The metric driving scale-up (CPU, request rate, queue depth) actually reaches its threshold under realistic load
  • The scaling event actually reduces the pressure that triggered it
  • Scale-up happens fast enough that users don’t experience degradation during the lag
  • Scale-down doesn’t trigger an immediate scale-up cycle, creating instability

In practice, auto-scaling (especially the first scale event) can take several minutes, so make sure you have some extra capacity available to absorb increased load in the meantime.

Volume Testing

Many performance characteristics change materially as data grows. Index scan times increase. Query planner behavior shifts. Cache hit rates drop as the working set outgrows cache size. Search latency that is acceptable at 50 million records may become unacceptable at 250 million. Test at your current production data volume, then at projected volumes for 1 and 3 years out. The time to address data growth in your architecture is before you reach it.

Recovery Testing

Recovery testing applies an abnormal condition like a dependency failure, a network partition, a resource exhaustion event and measures how long the system takes to return to normal operation. The key questions: does the system recover at all? How long does recovery take? What’s the user-visible impact during the recovery window?


Handling Dependencies in Your Tests

One of the practical decisions in every load test is what to do about dependencies: external APIs, third-party services, internal microservices, payment processors, identity providers, email services, and so on. You have two approaches, and which one you choose depends on what your test is trying to answer.

Mock Dependencies When You’re Focused on Your Own Code

When your goal is to validate your application’s internal performance (memory footprint, CPU usage, throughput of your business logic, efficiency of your data access layer), mocking external dependencies is often the right call. You will, however, need a well-designed mock that returns realistic response payloads with configurable latency. Mocking lets you:

  • Isolate your application’s performance characteristics from the noise of external variability
  • Simulate dependency failure modes (timeouts, errors, slow responses) in a controlled way
  • Run tests without consuming third-party quotas or generating costs in external systems
  • Reproduce specific latency profiles to understand how your code behaves under different dependency performance conditions

Include Real Dependencies When Integration Behavior Matters

When your goal is to validate end-to-end system behavior, including the interaction effects between your system and its dependencies, use real dependencies or realistic stubs deployed under your control. This matters because dependencies behave differently under load than they do at idle. For example, higher latency in dependencies can propagate, creating back-pressure in your system that a mock would never reveal. Dependencies that are slow, throttled, or unavailable under load can:

  • Exhaust your connection pools (connections held open waiting for a slow response)
  • Fill your request queues (new requests queueing behind slow in-flight requests)
  • Trigger retry storms (your retry logic amplifying load on an already-struggling dependency)
  • Surface timeout and circuit-breaker behavior that only activates under real latency conditions

If you include real third-party services in your load test, be explicit about two things: you may consume quota and generate costs, and their performance becomes part of your results. When a dependency is slow, it appears as latency in your own metrics — know what you’re measuring.

A practical middle ground: deploy internal stubs for your external dependencies. A stub is a service you control that returns realistic responses with configurable behavior. Unlike a mock in a test harness, a stub runs as a real service and participates in your actual network topology. It lets you test realistic integration behavior without the unpredictability or cost of real external services.
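As a rough illustration of such a stub, here is a minimal sketch using only the Python standard library. The payload, latency, and failure knob are hypothetical; a real stub would add per-endpoint configuration, a control API, and TLS:

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Configurable behavior: tune these between runs to simulate a fast,
# slow, or flaky dependency.
STUB_LATENCY_SECONDS = 0.05   # added to every response
STUB_ERROR_EVERY_N = 0        # 0 = never fail; 10 = fail every 10th request

class StubHandler(BaseHTTPRequestHandler):
    request_count = 0

    def do_GET(self):
        StubHandler.request_count += 1
        time.sleep(STUB_LATENCY_SECONDS)  # simulate dependency latency
        if STUB_ERROR_EVERY_N and StubHandler.request_count % STUB_ERROR_EVERY_N == 0:
            self.send_response(503)       # simulate an overloaded dependency
            self.end_headers()
            return
        body = json.dumps({"id": "123", "status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep load-test output clean

# Bind an ephemeral port so multiple stubs can run side by side
server = ThreadingHTTPServer(("127.0.0.1", 0), StubHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(f"stub listening on port {server.server_address[1]}")
```

Because the stub is a real network service, connection pools, timeouts, and retries in the system under test all exercise their production code paths against it.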

Watch for Automatic Retry Amplification

Another factor that can skew performance test results is automated retries at various layers when a request fails or times out. Under load, this multiplies traffic. If your application generates 400 write operations per second against a dependency, and that dependency starts returning errors, your client may retry each failed request two or three times, suddenly generating 800 to 1,200 operations per second against an already-struggling system. In your load tests, verify that retry behavior is bounded and doesn’t turn a manageable degradation into a cascading failure. Exponential backoff with jitter, retry budgets, and circuit breakers all exist to prevent this.
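Here is a minimal sketch of bounded retries with exponential backoff and full jitter; the flaky dependency is simulated, and the parameter values are illustrative:

```python
import random
import time

def call_with_backoff(operation, max_retries=3, base=0.1, cap=5.0, sleep=time.sleep):
    """Bounded retries with exponential backoff and full jitter.

    The retry budget is hard: at most 1 + max_retries calls, so a failing
    dependency sees bounded amplification instead of a retry storm.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise  # budget exhausted; surface the failure
            # Full jitter: uniform over [0, min(cap, base * 2^attempt)] so
            # synchronized clients don't retry in lockstep
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# A simulated flaky dependency that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("dependency overloaded")
    return "ok"

result = call_with_backoff(flaky, sleep=lambda s: None)  # skip real sleeps in demo
assert result == "ok" and calls["n"] == 3
```

A useful load-test assertion is the worst case: with `max_retries=3`, a fully failing dependency sees at most 4x the nominal request rate, never an unbounded storm.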


Design Your Load Model

Before writing a test script, model the load you intend to generate. A poorly designed load model produces results that feel meaningful but don’t correspond to anything real.

Use Production Traffic Patterns as Your Starting Point

Study your actual production metrics. Identify:

  • Average requests per second across a normal operating period
  • Peak requests per second during your highest-traffic periods
  • Request distribution across endpoints: what percentage of traffic hits each API? Most services have a small number of high-traffic endpoints and many low-traffic ones.
  • Read/write ratio: most production services are read-heavy; your load model should reflect that
  • Payload characteristics — average request and response sizes
  • User session behavior: are users authenticated? Do requests carry session state? Do later requests in a workflow depend on earlier ones?
  • Geographic distribution: does your traffic come from one region or many?
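Those observations translate directly into a load script. Here is a sketch of a weighted endpoint mix; the endpoints and weights are hypothetical stand-ins for numbers you would pull from your own production access logs:

```python
import random

# Hypothetical distribution from production access logs: a few hot
# endpoints dominate, and reads outnumber writes roughly 9:1.
ENDPOINT_WEIGHTS = {
    "GET /catalog":       0.55,
    "GET /product/{id}":  0.25,
    "GET /cart":          0.10,
    "POST /cart/items":   0.07,
    "POST /checkout":     0.03,
}

def sample_requests(n, rng=random):
    """Draw n requests matching the modeled production distribution."""
    endpoints = list(ENDPOINT_WEIGHTS)
    weights = list(ENDPOINT_WEIGHTS.values())
    return rng.choices(endpoints, weights=weights, k=n)

rng = random.Random(42)  # seeded so the load script is reproducible
mix = sample_requests(100_000, rng)
reads = sum(1 for e in mix if e.startswith("GET")) / len(mix)
print(f"read fraction: {reads:.2f}")
```

With 100,000 samples the read fraction lands very close to the modeled 0.90, so the generated traffic actually exercises the read/write ratio you set out to test.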

Use Stepped Load Progression

Ramp load gradually rather than jumping to peak immediately. A stepped approach produces distinct data points at each level, making it easier to identify where behavior changes.

Hold each step long enough for metrics to stabilize and for any auto-scaling events to complete. If your auto-scaling policy triggers after 5 minutes of sustained high CPU, your steps need to run for at least 7-10 minutes. Steps that are too short produce transient data that doesn’t represent steady-state behavior.
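A stepped profile can be as simple as a table of (duration, target) pairs that the load script consults; the numbers below are illustrative:

```python
# Stepped load profile: (duration_seconds, target_rps) pairs. Each step is
# held for 10 minutes, long enough for metrics and auto-scaling to settle.
STEPS = [
    (600, 100),
    (600, 250),
    (600, 500),
    (600, 750),
    (600, 1000),
]

def target_rps(elapsed_seconds):
    """Target request rate at a given moment in the test; 0 once it ends."""
    remaining = elapsed_seconds
    for duration, rps in STEPS:
        if remaining < duration:
            return rps
        remaining -= duration
    return 0

print([target_rps(t) for t in (0, 650, 1900, 2999, 3000)])  # [100, 250, 750, 1000, 0]
```

Because each step produces a distinct, stable measurement window, you can later plot latency and error rate per step and see exactly which level first bent the curves.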

Model Think Time

Real users don’t send requests as fast as possible. They read pages, fill in forms, wait for results, and make decisions. Think time, the pause between user actions, should be randomized within a realistic range based on observed production behavior. Omitting think time concentrates load artificially, inflates concurrency counts, and produces results that don’t correspond to real user behavior.

Model Transaction Workflows, Not Just Endpoints

A user doesn’t hit /api/checkout. They authenticate, browse products, add items to a cart, enter payment details, and confirm an order. Each step depends on the previous step and carries state forward. Test complete workflows. Measure the whole transaction, not just individual request latency. This reveals which step in the workflow breaks first under load, which is your actual bottleneck. For transactional workflows, count the full transaction as your unit of measurement, not individual requests. A checkout that takes 12 requests and completes in 3 seconds is different from one that requires 12 requests and only completes 60% of the time under load.
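One way to keep the transaction as the unit of measurement is to time and judge the whole workflow rather than each request. A sketch, with hypothetical step names standing in for real authenticated HTTP calls:

```python
import time

def run_workflow(steps):
    """Run a multi-step workflow and report the whole-transaction outcome.

    `steps` is a list of (name, callable) pairs; a callable returning False
    fails the step, and any failed step fails the entire transaction.
    """
    start = time.perf_counter()
    for name, step in steps:
        if not step():
            return {"ok": False, "failed_step": name,
                    "seconds": time.perf_counter() - start}
    return {"ok": True, "failed_step": None,
            "seconds": time.perf_counter() - start}

# Hypothetical checkout workflow; under load, the last step breaks first
steps = [
    ("login",       lambda: True),
    ("browse",      lambda: True),
    ("add_to_cart", lambda: True),
    ("checkout",    lambda: False),
]
result = run_workflow(steps)
print(result["ok"], result["failed_step"])
```

Aggregating these results gives you a transaction completion rate and a per-step failure breakdown, which is exactly the "60% of checkouts complete" signal that per-request latency alone hides.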


The Test Environment

Your test environment is the single largest source of invalid load test results. Get this wrong and every metric, analysis, and conclusion downstream becomes unreliable.

Match Production Infrastructure

The test environment should match production in:

  • Instance types, sizes, and counts
  • Database configuration: connection pool size, cache allocation, index configuration, replica count
  • Caching layers and their sizes (a common miss: a cache sized to 10% of production will warm and evict very differently)
  • Auto-scaling configuration and thresholds
  • Load balancer and network configuration
  • All service configurations that affect throughput or latency

Pay particular attention to cache sizes. Under-sized caches in test environments produce unrealistically high cache miss rates, which increases database load and makes your results look worse than production will be. Over-sized caches make them look better than they are.

Use Representative Data Volumes

Test environments with small datasets produce misleading results. A database with 1 million rows behaves differently from one with 100 million rows in ways that are significant and non-linear. Index performance, query planner behavior, partition routing, and cache hit rates all change with data volume. Populate your test environment with data that reflects realistic production scale before running meaningful performance tests.

Isolate the Test Environment Completely

I have seen a load test take down a production environment because the test environment shared common infrastructure. A test environment that shares any infrastructure with production (databases, message queues, caching clusters, network paths, logging infrastructure) creates two simultaneous problems: invalid test results (because production traffic contaminates your measurements) and potential production incidents (because your load test contaminates production systems). Shared test environments that connect to production message buses, Kafka clusters, or database clusters have caused outages. Enforce complete isolation.

Account for Test Data Accumulation

Load tests generate real data. After many test runs, your test database accumulates records, logs grow, and storage fills. Plan your test data lifecycle from the start, e.g., how you populate data before tests, whether you clean up between runs, and how you prevent accumulated test data from affecting your test environment’s performance over time.

Document Your Environment Specification

Version-control your environment definition alongside your test scripts. When you compare results across time, you need to know that what changed was the system under test, not the test environment. An environment specification that exists only in someone’s memory cannot be reproduced reliably.


Metrics: Collect the Right Things

Load testing generates a lot of data. The teams that extract the most value don’t collect more metrics; they collect the right metrics and actually analyze them.

Latency

Track percentiles, not averages. Averages hide tail behavior that determines user experience.

  • P50 — what the median user experiences
  • P90 — your common-case ceiling; nine in ten requests complete within this
  • P99 — your near-worst case; one in a hundred users waits this long
  • P99.9 — your extreme tail; relevant for high-volume services where 0.1% is still thousands of users

The gap between P50 and P99.9 tells you about consistency. A wide gap means some users experience good performance while others experience unacceptable degradation. Systems under load often hold P50 steady while P99 climbs.
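For reference, here is a simple empirical percentile over raw latency samples (a nearest-index flavor; real load testing tools typically interpolate, so treat this as a sketch with made-up data):

```python
def pctl(sorted_samples, q):
    """Empirical quantile: the value at index floor(q * n), clamped.

    A simple right-continuous definition; production tools usually
    interpolate between neighboring samples.
    """
    n = len(sorted_samples)
    return sorted_samples[min(n - 1, int(q * n))]

# Illustrative latency samples (ms): a healthy median with a heavy tail
latencies = sorted([20] * 900 + [80] * 90 + [400] * 9 + [2500])

p50, p90, p99, p999 = (pctl(latencies, q) for q in (0.50, 0.90, 0.99, 0.999))
print(f"P50={p50}ms P90={p90}ms P99={p99}ms P99.9={p999}ms")
```

The mean of these samples is about 31 ms, which is exactly how averages hide the tail: the median user sees 20 ms while one request in a thousand takes 2.5 seconds.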

Throughput

  • Requests per second: raw system throughput
  • Successful transactions per second: throughput filtered by correctness; throughput with a 20% error rate is not good throughput
  • Throughput per resource unit: requests per CPU core or per GB of memory; useful for capacity planning

Error Rates

  • Fault rate — server-side failures (5xx responses)
  • Error rate — client rejections, throttled requests, timeouts
  • Error distribution — which specific errors, at what load levels

Don’t aggregate errors into a single rate. A 2% error rate composed entirely of timeouts tells you something different from a 2% error rate of connection-refused responses. Decompose your error data and correlate specific error types with the load levels at which they appear.
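A sketch of that decomposition, counting error types per load step; the record shapes and numbers are made up for illustration:

```python
from collections import Counter

# Hypothetical per-request records from a stepped load test:
# (load_step_rps, error_type or None for a success)
records = (
    [(250, None)] * 990 + [(250, "timeout")] * 10 +
    [(500, None)] * 940 + [(500, "timeout")] * 25 + [(500, "conn_refused")] * 35
)

errors_by_step = Counter((rps, err) for rps, err in records if err)
totals_by_step = Counter(rps for rps, _ in records)

for (rps, err), count in sorted(errors_by_step.items()):
    pct = 100 * count / totals_by_step[rps]
    print(f"{rps} RPS: {err} = {pct:.1f}%")
```

Here the aggregate error rate merely doubles between steps, but the decomposition shows a new failure mode (connection refusals) appearing only at 500 RPS, which points at pool or listener saturation rather than slow downstreams.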

Resource Utilization

Collect these for every component (application servers, databases, caches, message queues, load balancers, and the load generators themselves):

  • CPU: overall and per-core; watch for single-threaded bottlenecks where overall CPU looks fine but one core is maxed
  • Memory: heap usage, GC frequency and pause duration, swap usage; track memory over time in soak tests to detect leaks
  • Disk I/O: read and write throughput, queue depth, utilization percentage; relevant for databases and any service that writes logs or temp files
  • Network I/O: ingress and egress bytes per second, connection counts, dropped packets
  • Thread and connection pool utilization: active threads, queued requests, pool exhaustion events

Application-Level Metrics

  • Cache hit/miss/eviction rates: degrading hit rates under load reveal cache sizing or key distribution problems
  • Queue depths: growing queues indicate consumers can’t keep pace with producers
  • Database connection pool saturation: one of the most common failure modes under load
  • GC pause duration and frequency: GC pressure under load causes latency spikes that don’t show up in CPU metrics directly
  • Retry rates: high retry rates indicate a dependency is struggling, and may be amplifying load
  • Circuit breaker state: how often circuit breakers open under load, and what triggers them

Dependency-Level Metrics

When you include real dependencies in your test, monitor them as carefully as your own service:

  • Response latency from each dependency (P50, P99)
  • Error rates from each dependency
  • Dependency-side resource utilization (if you have access)
  • Message bus ingress and egress (if applicable)
  • Partition utilization for distributed storage systems

When a dependency is slow or erroring, that signal propagates through your system as elevated latency and errors in your own metrics. You need dependency-level metrics to trace the source.

Availability

Define availability targets before testing:

  • Service availability: percentage of requests that succeed
  • Per-endpoint availability: some endpoints degrade before others; measure them independently
  • Dependency availability: availability of each system your service calls

Business-Level Metrics

The most important metrics are often furthest from the infrastructure:

  • Orders completed per minute
  • Successful authentication rate
  • Payment processing completion rate
  • Data write confirmation rate

Infrastructure metrics tell you what the system is doing. Business metrics tell you what users are experiencing. A system where P99 latency stays within SLA but checkout completion drops 15% under load has a problem that infrastructure metrics alone won’t reveal clearly.


Tools

Over the years, I have used various commercial and open-source tools, such as LoadRunner, Grinder, and Tsung, that are no longer well maintained. Here are common tools that can be used for load testing today:

For Simple Endpoint Testing

  • ab (Apache Bench) and Hey: Command-line tools that generate load against a single endpoint. No scripting required, fast to start.
  • Vegeta: Generates load at a constant request rate, independent of server response time. This distinction matters: when your server responds slowly, most tools automatically reduce request rate. Vegeta maintains the configured rate as latency climbs, which means you observe back-pressure and degradation accurately.
echo "GET https://api.example.com/users/123" | vegeta attack -rate=500 -duration=60s | vegeta report
  • k6: Scripted in JavaScript, distributed as a single Go binary. k6 handles multi-step scenarios natively, supports parameterized test data, models think time, and exposes rich built-in metrics. It integrates with Prometheus, CloudWatch, and Grafana for analysis, and supports threshold-based pass/fail in CI pipelines.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // ramp to 100 users
    { duration: '5m', target: 100 },   // hold at average load
    { duration: '2m', target: 500 },   // ramp to peak
    { duration: '5m', target: 500 },   // hold at peak
    { duration: '1m', target: 1000 },  // spike
    { duration: '2m', target: 0 },     // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],  // 99% of requests under 500ms
    http_req_failed: ['rate<0.01'],    // less than 1% error rate
  },
};

export default function () {
  // Step 1: authenticate
  const loginRes = http.post('https://api.example.com/auth/login', {
    username: `user_${__VU % 10000}@example.com`,
    password: 'password',
  });
  check(loginRes, { 'login succeeded': (r) => r.status === 200 });

  const token = loginRes.json('token');

  // Step 2: fetch catalog (read operation)
  const catalogRes = http.get('https://api.example.com/catalog?page=1', {
    headers: { Authorization: `Bearer ${token}` },
  });
  check(catalogRes, { 'catalog loaded': (r) => r.status === 200 });

  sleep(Math.random() * 3 + 1); // think time: 1-4 seconds

  // Step 3: place order (write operation)
  const orderRes = http.post(
    'https://api.example.com/orders',
    JSON.stringify({ item_id: Math.floor(Math.random() * 1000), quantity: 1 }),
    { headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' } }
  );
  check(orderRes, { 'order placed': (r) => r.status === 201 });

  sleep(Math.random() * 2 + 1);
}
  • Apache JMeter: JMeter supports complex scenarios through a GUI, handles correlation between requests, has a broad plugin ecosystem, and has extensive enterprise adoption.
  • Locust: Pure Python, code-defined test scenarios (not XML), a built-in web UI for real-time monitoring, distributed mode via a controller/worker model, and trivially scriptable.

For Distributed Load Generation

AWS Distributed Load Testing: When a single machine can’t generate the volume you need, this solution orchestrates load across multiple instances, accepts JMeter scripts as the test definition, and streams results to time-series storage for analysis. Use it when your bandwidth or TPS requirements exceed what a single load generator can produce.

For Observability During Tests

You can use the following monitoring stacks to gather performance metrics:

  • Prometheus + Grafana: commonly used for infrastructure and application metrics; k6 exports directly to Prometheus
  • CloudWatch: native AWS monitoring; integrates with most AWS services and many load testing tools
  • Distributed tracing (Jaeger, Zipkin, AWS X-Ray): essential for understanding latency in distributed systems; propagate correlation IDs through every service boundary so you can trace a slow request to the specific component that caused it

Without distributed tracing, diagnosing latency in a multi-service system under load is largely guesswork.
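The correlation-ID advice above can be sketched in a few lines. Here is a minimal, library-agnostic helper; the header name and function are illustrative, not part of any tool mentioned here:

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # illustrative header name


def with_correlation(headers=None):
    """Return a copy of the outbound headers that carries a correlation ID.

    If the inbound request already had one, reuse it so the trace stays
    stitched together across service boundaries; otherwise mint a new ID.
    """
    headers = dict(headers or {})
    headers.setdefault(CORRELATION_HEADER, str(uuid.uuid4()))
    return headers


# Inbound request carried an ID: propagate it unchanged to downstream calls.
outbound = with_correlation({"X-Correlation-ID": "req-123"})

# Fresh request at the edge: a new UUID is generated.
fresh = with_correlation()
```

If every outbound HTTP call in every service passes its headers through a helper like this, a slow request can be traced end to end by a single ID.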


Execution

  • Warm Up Before Measuring: JIT compilation, connection pool initialization, cache population, and DNS resolution all affect early request latency. Build a ramp-up period into every test. Discard metrics from the warmup phase. Measure steady-state behavior only.
  • Verify Your Load Generator Isn’t the Bottleneck: Before trusting any results, confirm that load generator CPU stays well below saturation (under 70%), network I/O doesn’t approach the bandwidth ceiling, and the tool achieves the TPS you configured, not a lower number due to local resource constraints. If you configure 1,000 TPS but the generator only achieves 600, your results reflect the generator’s limits, not your system’s.
  • Notify Dependent Teams Before Testing: If your test environment shares any infrastructure with other teams, notify them before running high-volume tests. Unexpected load from your tests against a shared component (a database, a message bus, a routing layer) can cause problems for teams who have no idea a load test is running.
  • Run Each Scenario in Isolation First: Test each scenario independently before running combinations. An isolated test that reveals a problem gives you more diagnostic information than a combined test that reveals the same problem buried in noise from other scenarios.
  • Don’t Overwrite Previous Results: Each test run should write to a new, timestamped output file. Overwriting results from a previous run is a common mistake when running iterative tests in a loop. You lose the ability to compare across runs.
  • Pause Between Runs: Allow the system to fully drain between test iterations: connections close, queues clear, and resource utilization returns to baseline. Residual load from one run contaminates the starting conditions of the next.
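The warm-up discard above can be made concrete with a small sketch. The 20% warm-up fraction here is an assumed knob, not a standard; in practice you would pick it based on how long your JIT, caches, and pools take to settle:

```python
import statistics


def steady_state_summary(latencies_ms, warmup_fraction=0.2):
    """Drop warm-up samples, then summarize steady-state latency.

    The first `warmup_fraction` of samples is treated as JIT/cache/
    connection-pool warm-up and excluded from reported metrics.
    """
    cut = int(len(latencies_ms) * warmup_fraction)
    steady = sorted(latencies_ms[cut:])
    p99_index = max(0, int(len(steady) * 0.99) - 1)
    return {
        "samples": len(steady),
        "mean_ms": statistics.mean(steady),
        "p99_ms": steady[p99_index],
    }


# 100 samples at 0..99 ms; the first 20 are discarded as warm-up.
summary = steady_state_summary(list(range(100)))
```

The same filter applies whether the samples come from k6 output, JMeter logs, or your own harness: compute percentiles only over the steady-state window.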

Common Pitfalls

  • Testing a single endpoint and calling it done. A service’s behavior under load isn’t determined by any single endpoint. Test complete workflows, including the paths that matter most to users.
  • Ignoring dependencies. When your dependencies are slow or unavailable, your service appears slow. When your service hammers a dependency with load, the dependency may degrade and create a feedback loop. Model dependency behavior explicitly: mock dependencies when you want to isolate your own code, and use real or realistic stubs when integration behavior matters.
  • Mismatch between test environment and production. Different hardware, different cache sizes, different connection pool limits, different network latency profiles, any of these make test results non-transferable to production. Document your environment specification. Validate that it matches production before trusting results.
  • Small data volumes. A test environment with 1% of production data volume produces optimistic results. Populate test data to realistic scale.
  • Running load tests once. Performance characteristics change with every code change, every dependency upgrade, and every growth milestone. A load test you ran six months ago tells you about a system that no longer exists.
  • Ignoring ramp-down. Verify that resource utilization returns to baseline after load subsides. A system that doesn’t recover cleanly carries forward pressure that degrades subsequent traffic.
  • Not collecting metrics from all layers. Application-level metrics without infrastructure metrics leave you guessing about root cause. Infrastructure metrics without application or business-level metrics leave you unable to quantify user impact. Collect all three.
  • Stopping tests when something goes wrong instead of analyzing the failure mode. When a stress test surfaces a failure, that’s the point. Note what failed, under what conditions, and how the system behaved. Stopping the test immediately loses the degradation data that tells you whether the failure mode is safe or catastrophic.

Analysis

  • Establish a Baseline Before Comparing Anything: Every metric needs a reference point. P99 latency of 300ms is good or bad depending entirely on what P99 looks like at baseline load. Run a baseline test with minimal concurrent users before escalating. Capture that baseline explicitly. Compare every subsequent measurement against it.
  • Separate Signal from Noise: A single high-latency data point is noise. A systematic increase in P99 as concurrency crosses 500 users is signal. Look for the pattern: where does behavior change? At what load level? After what duration? What resource metric correlates with the change?
  • Trace Latency to Its Source: When you observe elevated latency, resist looking first at application CPU. Latency accumulates in many places: network round trips between services, database query execution, lock contention, GC pause accumulation, connection pool queuing, and downstream dependency latency. Distributed tracing lets you follow a slow request through every component it touched and attribute the latency precisely. Fix the actual source, not the nearest visible symptom.
  • Investigate Unexpectedly Good Results: If your system performs better than expected under load, investigate before celebrating. Unexpected improvement often means your test isn’t exercising the paths you intended: caches warming too aggressively, load not reaching the components you think it is, or test data creating unrealistic access patterns. Results you can’t explain aren’t results you can rely on.
  • Generate Comparative Reports: A report listing numbers has limited value. A report comparing those numbers to your baseline, to your previous test run, and to your defined thresholds has significant value. For each metric, capture:
    – Current result
    – Baseline value
    – SLA or target threshold
    – Previous test result (regression or improvement?)
    – Load level at which the metric was captured

Store test results in a queryable format over time.
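The comparative report above is easy to automate. A sketch that builds one report row per metric; the field names are illustrative, not a standard schema:

```python
def compare_run(current, baseline, thresholds):
    """Build a comparative report: current vs. baseline vs. threshold.

    All three inputs map metric name -> value. For each metric, record
    the raw number, its reference points, the percentage change against
    baseline, and whether the defined threshold was met.
    """
    report = {}
    for metric, value in current.items():
        base = baseline.get(metric)
        limit = thresholds.get(metric)
        report[metric] = {
            "current": value,
            "baseline": base,
            "threshold": limit,
            "vs_baseline_pct": (None if not base
                                else round(100.0 * (value - base) / base, 1)),
            "within_threshold": None if limit is None else value <= limit,
        }
    return report


row = compare_run({"p99_ms": 300}, {"p99_ms": 200}, {"p99_ms": 250})["p99_ms"]
# A 50% regression against baseline, and over the 250ms threshold.
```

Persisting these rows per run (keyed by timestamp and commit) gives you the queryable history the next paragraph calls for.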


Building a Continuous Performance Practice

The teams with the most reliable services don’t treat performance testing as a project. They treat it as a discipline with regular cadence.

  • Define performance goals and revisit them annually. Goals should include throughput targets, latency percentiles, error rate limits, resource utilization ceilings, and headroom targets (how much capacity should remain available at peak). As your traffic patterns change, your service evolves, and your SLAs tighten, these goals need updating.
  • Automate pass/fail thresholds in CI. Encode your performance targets as pipeline gates. A change that increases P99 latency by 40% under load should fail the build, the same way a change that breaks a unit test fails the build.
  • Run performance canaries in production. Continuously exercise production endpoints at low volume from monitoring infrastructure. Track latency, error rates, and throughput over time. Detect gradual degradation before users do.
  • Assign a performance owner on each service team. Performance improvements don’t happen without someone watching the metrics, reviewing throttling rules, identifying regressions, and driving improvements.
  • Review results across time for patterns. Look at all your load test results over the past quarter. Which metrics trend in the wrong direction? Which components appear repeatedly in bottleneck analysis? Patterns across multiple tests reveal systemic issues that any individual test misses.
  • Share what you learn. Performance problems and their solutions are valuable organizational knowledge. Document them. Share them across teams. The team dealing with connection pool exhaustion today is probably not the first team to hit that issue.
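The CI-gate bullet above can be as simple as a script whose exit code fails the pipeline stage. A sketch under assumed conventions (the results-file format and the specific limits are illustrative):

```python
import json
import sys


def perf_gate(results_path, max_p99_ms, max_error_rate):
    """Return 0 if results meet thresholds, 1 otherwise (CI exit code)."""
    with open(results_path) as f:
        results = json.load(f)

    failures = []
    if results["p99_ms"] > max_p99_ms:
        failures.append(f"p99 {results['p99_ms']}ms exceeds {max_p99_ms}ms")
    if results["error_rate"] > max_error_rate:
        failures.append(f"error rate {results['error_rate']} exceeds {max_error_rate}")

    for failure in failures:
        print("PERF GATE FAILED:", failure)
    return 1 if failures else 0


if __name__ == "__main__":
    # e.g. `python perf_gate.py results.json` -- nonzero exit fails the build
    sys.exit(perf_gate(sys.argv[1], max_p99_ms=500.0, max_error_rate=0.01))
```

Wired into the pipeline after the load-test step, this makes a 40% P99 regression as loud as a failing unit test.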

The Pre-Test Checklist

Before any load test:

  • [ ] Test objectives and pass/fail thresholds defined in writing before execution
  • [ ] Test environment completely isolated from production
  • [ ] Test environment infrastructure matches production configuration (instance types, cache sizes, connection pools, scaling settings)
  • [ ] Test data populated to realistic production scale
  • [ ] Dependent services decided: mock, stub, or real — with rationale documented
  • [ ] Monitoring dashboards active for all components, including load generators
  • [ ] Dependent team on-call contacts notified
  • [ ] Output file naming prevents overwrites between iterations
  • [ ] Previous test results available for comparison

During execution:

  • [ ] Baseline captured before escalating load
  • [ ] Load generator resource utilization verified (not the bottleneck)
  • [ ] Error rates monitored in real time — abnormal errors trigger a pause for investigation
  • [ ] Each step held long enough for metrics to stabilize
  • [ ] Auto-scaling events logged with timestamps

After execution:

  • [ ] Results compared to defined thresholds and previous runs
  • [ ] Anomalies investigated before conclusions are drawn
  • [ ] Root cause documented for any threshold violations
  • [ ] Action items assigned with owners and deadlines
  • [ ] Test results stored in versioned, queryable storage
  • [ ] Environment cleanup completed (test data, log files, temporary resources)

Putting It Together

Any code change can affect performance. A dependency upgrade, a new index, a configuration tweak, a framework version bump — all of these can change memory footprint, CPU usage, throughput, and latency in ways that don’t appear until you run real load. The only reliable way to catch these changes before they affect users is to make performance testing a routine part of how you build and ship software, not something you do once before a big launch.

Start with profiling to understand where time and memory go in your own code. Add load tests to your CI pipeline to catch regressions early. Run soak tests to find memory and connection leaks. Stress test to 10x your expected peak so you know what your ceiling looks like and how you fail when you hit it. Test with real dependency behavior when integration effects matter, and mock dependencies when you want to isolate your own code.

Collect metrics at every layer (application, infrastructure, and business) so you can connect a latency spike to its root cause and quantify its user impact. Store results over time so you can detect gradual regressions before they become incidents. The goal is to know your system well enough that production behavior matches what you measured in testing.


February 27, 2026

Building Polyglot and Serverless Applications with WebAssembly

Filed under: Computing — admin @ 7:46 pm

Over the years, I have watched distributed services evolve through phases I lived through personally: CORBA, EJB, SOA, REST microservices, and containers. WebAssembly feels different. It compiles code from any language into a universal binary format, runs it in a sandboxed environment, and delivers near-native performance without containers or language-specific runtimes cluttering your production stack.

When I built PlexSpaces for serverless FaaS applications, I designed its polyglot layer on top of WebAssembly and the WASI Component Model. It allows you to write actors in Python, Rust, Go, or TypeScript, compile them to WASM, and deploy them to the same runtime. The framework handles persistence, fault tolerance, supervision, and scaling regardless of programming language. In this post, I’ll walk you through the core WebAssembly concepts, show how PlexSpaces leverages the Component Model for polyglot development, and demonstrate building, testing, and deploying applications in all four languages. I’ll also show a PlexSpaces Application Server model that lets you deploy entire application bundles, like deploying a WAR file to Tomcat, but with the fault tolerance of Erlang/OTP built in.


WebAssembly Introduction

WebAssembly launched in 2017 as a browser technology. I ignored it for years — client-side JavaScript ecosystem drama wasn’t something I wanted to track. The server-side story changed everything.

How WebAssembly Executes Code

WebAssembly is a stack-based virtual machine that executes a compact binary instruction format. Every language that compiles to WASM follows the same pipeline: source code goes through a compiler backend to a portable .wasm binary, which any compliant runtime can load and execute.

The WASM binary format encodes typed functions, a linear memory model, and a set of imports and exports. The runtime validates the binary at load time, then executes it using either just-in-time (JIT) compilation or ahead-of-time (AOT) compilation to native machine code. Three properties make this execution model powerful for distributed systems:

Deterministic execution. Given the same inputs, a WASM module produces the same outputs. This property underpins PlexSpaces’ durable execution, which replays journaled messages through the same WASM binary and arrives at the exact same state.

Memory isolation. Each WASM instance gets its own linear memory. One module cannot read, write, or corrupt another module’s memory. No shared-memory race conditions, no buffer overflows escaping the sandbox. The runtime enforces these boundaries at the hardware level.

Capability-based security. A WASM module starts with zero capabilities. It cannot access the filesystem, the network, or even a clock unless the host explicitly provides each capability through imported functions. PlexSpaces grants actors exactly the capabilities they need: messaging, key-value storage, tuple spaces.
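To see why determinism matters for durable execution, here is a toy sketch (not the PlexSpaces API) of rebuilding state from a journal. Because the handler is a pure function of state and message, replaying the same journal always yields the same state:

```python
def apply_message(state, msg):
    """A pure, deterministic handler: same state + message -> same result."""
    if msg["type"] == "deposit":
        return {**state, "balance": state["balance"] + msg["amount"]}
    if msg["type"] == "withdraw":
        return {**state, "balance": state["balance"] - msg["amount"]}
    return state


def replay(journal):
    """Rebuild actor state from scratch by re-applying journaled messages."""
    state = {"balance": 0}
    for msg in journal:
        state = apply_message(state, msg)
    return state


journal = [{"type": "deposit", "amount": 100},
           {"type": "withdraw", "amount": 30}]

# Replaying twice produces identical state -- the crash-recovery guarantee.
assert replay(journal) == replay(journal) == {"balance": 70}
```

A handler that reads a wall clock or random source mid-message would break this guarantee, which is exactly why the sandbox makes those capabilities explicit host imports.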

The Component Model

Early WebAssembly only understood numbers. You passed integers and floats across the boundary, and that was it. The WebAssembly Component Model fixes this limitation by defining rich, typed interfaces that components use to communicate. You can think of it as an IDL (Interface Definition Language) for WASM but one that works across every language. The key building blocks:

  • WIT (WebAssembly Interface Types): A language for defining typed function signatures across components. A function defined in WIT can accept strings, records, lists, variants, and enums. WIT bridges the type systems of Rust, Python, Go, and TypeScript into a single, shared contract.
  • Components: Self-contained WASM modules that declare their imports (what they need from the host) and exports (what they provide). A Rust component and a Python component that implement the same WIT interface become interchangeable at the binary level.
  • WASI (WebAssembly System Interface): The standardized API that gives WASM modules access to system resources like file I/O, networking, clocks, and random number generation within the sandbox. WASI Preview 2 shipped in 2024 with HTTP, filesystem, and socket support. WASI 0.3, released in February 2026, added native async support for concurrent I/O.

Wasm 3.0 and WasmGC

The WebAssembly ecosystem crossed a critical threshold. Wasm 3.0 became the W3C standard in 2025, standardizing nine production features in a single release:

  • WasmGC: garbage collection support built into the runtime, eliminating the need for languages like Go, Python, and Java to ship their own GC inside the WASM binary. This shrinks binary sizes and improves performance for GC-dependent languages dramatically.
  • Exception handling: structured try/catch at the WASM level, replacing the expensive setjmp/longjmp workarounds that inflated binaries.
  • Tail calls: proper tail call optimization for functional programming patterns without stack overflow.
  • SIMD (Single Instruction, Multiple Data): vector operations for parallel numeric computation, critical for ML inference and scientific workloads.

For PlexSpaces, WasmGC means Go and Python actors run faster with smaller binaries. SIMD means computational actors (n-body simulations, matrix multiplies, genomics pipelines) can process data at near-native throughput inside the sandbox.

What This Means in Practice

You compile a Python actor and a Rust actor to WASM. Both implement the same WIT interface. The runtime loads them identically, calls the same exported functions, and provides the same host capabilities: messaging, key-value storage, tuple spaces, distributed locks. The Python actor handles ML inference; the Rust actor handles high-throughput event processing. They communicate through PlexSpaces message passing without knowing or caring which language sits on the other side.

This is not “Write Once, Run Anywhere” in the old Java sense. This is “Write in Whatever Language Fits, Run Together on the Same Runtime.”


How PlexSpaces Makes It Work

PlexSpaces is a unified distributed actor framework that combines patterns from Erlang/OTP, Orleans, Temporal, and modern serverless architectures into a single abstraction. I described the five foundational pillars in my earlier post: TupleSpace coordination, Erlang/OTP supervision, durable execution, WASM runtime, and Firecracker isolation. Here I focus on the WASM layer and how it enables polyglot development.

Architecture at a Glance

The WIT Contract for Actors

Every actor, regardless of source language, targets the same WIT world. Here is the simplified world that most polyglot actors use:

// wit/plexspaces-simple-actor/world.wit
package plexspaces:simple-actor@0.1.0;

interface actor {
    // Initialize with JSON config string
    init: func(config-json: string) -> string;

    // Handle a message: route by msg-type, return JSON result
    handle: func(from-actor: string, msg-type: string,
                 payload-json: string) -> string;

    // Snapshot state for persistence
    get-state: func() -> string;

    // Restore state from snapshot
    set-state: func(state-json: string) -> string;
}

interface host {
    // Messaging
    send: func(to: string, msg-type: string, payload-json: string) -> string;
    ask: func(to: string, msg-type: string, payload-json: string,
              timeout-ms: u64) -> string;
    spawn: func(module-ref: string, actor-id: string,
                init-config-json: string) -> string;
    stop: func(actor-id: string) -> string;
    self-id: func() -> string;

    // Erlang/OTP-style linking and monitoring
    link: func(actor-id: string) -> string;
    monitor: func(actor-id: string) -> string;

    // Timers
    send-after: func(delay-ms: u64, msg-type: string,
                     payload-json: string) -> string;

    // Process groups
    pg-join: func(group-name: string) -> string;
    pg-broadcast: func(group-name: string, msg-type: string,
                       payload-json: string) -> string;

    // Key-value store
    kv-get: func(key: string) -> string;
    kv-put: func(key: string, value: string) -> string;
    kv-delete: func(key: string) -> string;
    kv-list: func(prefix: string) -> string;

    // TupleSpace (Linda-style coordination)
    ts-write: func(tuple-json: string) -> string;
    ts-read: func(pattern-json: string) -> string;
    ts-take: func(pattern-json: string) -> string;
    ts-read-all: func(pattern-json: string) -> string;

    // Distributed locks
    lock-acquire: func(tenant-id: string, namespace: string,
                       holder-id: string, lock-name: string,
                       lease-duration-secs: u32, timeout-ms: u64) -> string;
    lock-release: func(lock-id: string, tenant-id: string,
                       namespace: string, holder-id: string,
                       lock-version: string) -> string;

    // Blob storage
    blob-upload: func(blob-id: string, data: string,
                      content-type: string) -> string;
    blob-download: func(blob-id: string) -> string;

    // Logging and time
    log: func(level: string, message: string);
    now-ms: func() -> u64;
}

world actor-world {
    import host;
    export actor;
}

The full-featured actor package adds dedicated WIT interfaces for workflows, channels, durability/journaling, registry/service discovery, HTTP client, and cron scheduling. PlexSpaces also defines specialized worlds that import only the capabilities each actor needs:

  • plexspaces-actor: imports all 13 interfaces; for full-featured actors needing every capability
  • simple-actor: imports Messaging + Logging; for lightweight stateless workers
  • durable-actor: imports Messaging + Durability; for actors with crash recovery and journaling
  • coordination-actor: imports Messaging + TupleSpace; for actors coordinating through shared tuple space
  • event-actor: imports Messaging + Channels; for event-driven actors using queues and topics

This design keeps WASM binaries small. A simple actor that only needs messaging imports two interfaces, not thirteen.

Language Toolchains

Each language uses a different compiler to produce WASM, but the output targets the same runtime:

  • Rust: cargo (wasm32-wasip2); 100KB-1MB binaries; excellent performance; best for production and performance-critical paths
  • Go: tinygo; 2-5MB binaries; good performance; balanced performance and fast iteration
  • TypeScript: jco componentize; 500KB-2MB binaries; good performance; web integration and rapid development
  • Python: componentize-py; 30-40MB binaries; moderate performance; ML inference, data processing, and prototyping

With the toolchains in view, let’s set up a development environment and then build something real in each language.


Getting Started

Before diving into the language examples, set up your development environment.

Prerequisites

  • Rust 1.70+ (for building PlexSpaces itself)
  • Docker (optional — for the fastest path to a running node)
  • One or more WASM compilers for your target languages (see below)

Option 1: Docker Quickstart

Pull and run a PlexSpaces node in seconds:

# Pull the official image
docker pull plexobject/plexspaces:latest

# Run a single node with HTTP API on port 8001
docker run -d \
    --name plexspaces-node \
    -p 8000:8000 \
    -p 8001:8001 \
    -e PLEXSPACES_NODE_ID=node1 \
    -e PLEXSPACES_DISABLE_AUTH=1 \
    plexobject/plexspaces:latest

The node exposes a gRPC endpoint on port 8000 and an HTTP/REST gateway on port 8001 with interactive Swagger UI documentation.

Option 2: Build from Source

git clone https://github.com/bhatti/PlexSpaces.git
cd PlexSpaces

./scripts/server.sh

# Or use the Makefile step by step
make build            # Build all crates
make test             # Run all tests

Install Language Compilers

Install the WASM compiler for each language you plan to use:

# Rust (produces the smallest, fastest WASM)
rustup target add wasm32-wasip2

# Go (pragmatic balance of performance and dev speed)
# macOS:
brew install tinygo
# Also need wasm-tools for component creation:
cargo install wasm-tools

# TypeScript (rapid development, web ecosystem)
npm install -g @bytecodealliance/jco

# Python (ML, data processing, prototyping)
pip install componentize-py

# Optional: WASM binary optimizer (shrinks binaries further)
cargo install wasm-opt

Start the Node and Deploy Your First Actor

# Start a PlexSpaces node (from source)
cargo run --release --bin plexspaces -- start \
    --node-id dev-node \
    --listen-addr 0.0.0.0:8000 \
    --release-config release-config.toml


# Deploy a WASM actor (from any language)
curl -X POST http://localhost:8001/api/v1/applications/deploy \
    -F "application_id=my-app" \
    -F "name=my-actor" \
    -F "version=1.0.0" \
    -F "wasm_file=@my_actor.wasm"

# Send it a message
curl -X POST http://localhost:8001/api/v1/actors/my-app/ask \
    -H "Content-Type: application/json" \
    -d '{"message_type": "hello", "payload": {}}'

Now let’s build real actors in each language.


Python: A Calculator Actor with the SDK

Python shines for rapid prototyping and data-heavy workloads. The PlexSpaces Python SDK uses decorators (@actor, @handler, state()) that eliminate boilerplate and let you focus on business logic.

The Actor Code

# calculator_actor.py
from plexspaces import actor, state, handler, init_handler


@actor
class Calculator:
    """Calculator actor implementing basic math operations."""

    # Persistent state fields -- survive crashes via journaling
    last_operation: str = state(default=None)
    last_result: float = state(default=None)
    history: list = state(default_factory=list)

    @init_handler
    def on_init(self, config: dict):
        """Initialize calculator with optional config."""
        if "state" in config:
            saved = config["state"]
            self.last_operation = saved.get("last_operation")
            self.last_result = saved.get("last_result")
            self.history = saved.get("history", [])

    @handler("add")
    def add(self, operands: list = None) -> dict:
        """Add operands together."""
        result = sum(operands or [])
        self._record("add", operands, result)
        return {"result": result, "operation": "add"}

    @handler("subtract")
    def subtract(self, operands: list = None) -> dict:
        """Subtract: first operand minus rest."""
        if not operands or len(operands) < 2:
            return {"error": "Subtract requires at least 2 operands"}
        result = operands[0] - sum(operands[1:])
        self._record("subtract", operands, result)
        return {"result": result, "operation": "subtract"}

    @handler("multiply")
    def multiply(self, operands: list = None) -> dict:
        """Multiply all operands."""
        result = 1
        for op in (operands or []):
            result *= op
        self._record("multiply", operands, result)
        return {"result": result, "operation": "multiply"}

    @handler("divide")
    def divide(self, operands: list = None) -> dict:
        """Divide first operand by second."""
        if not operands or len(operands) < 2:
            return {"error": "Divide requires 2 operands"}
        if operands[1] == 0:
            return {"error": "Division by zero"}
        result = operands[0] / operands[1]
        self._record("divide", operands, result)
        return {"result": result, "operation": "divide"}

    @handler("get_history")
    def get_history(self) -> dict:
        """Return calculation history."""
        return {"history": self.history}

    @handler("call", "get_state")
    def get_state_handler(self) -> dict:
        """Snapshot current state."""
        return {
            "last_operation": self.last_operation,
            "last_result": self.last_result,
            "history": self.history,
        }

    def _record(self, operation, operands, result):
        self.last_operation = operation
        self.last_result = result
        self.history.append({
            "operation": operation,
            "operands": operands,
            "result": result,
        })

Notice how the @actor decorator marks the class, state() declares persistent fields that survive crashes, and each @handler("operation") routes incoming messages to the right method. The SDK handles WIT serialization, state checkpointing, and all the plumbing underneath.

Build and Deploy

# Install the Python SDK
pip install -e "sdks/python/[dev]"

# Build WASM using the SDK CLI
plexspaces-py build calculator_actor.py \
    -o calculator_actor.wasm \
    --wit-dir wit/plexspaces-simple-actor

# Deploy the WASM module
curl -X POST http://localhost:8001/api/v1/applications/deploy \
    -F "application_id=calculator-app" \
    -F "name=calculator" \
    -F "version=1.0.0" \
    -F "wasm_file=@calculator_actor.wasm"

Send a Request

curl -X POST http://localhost:8001/api/v1/actors/calculator-app/ask \
    -H "Content-Type: application/json" \
    -d '{"message_type": "add", "payload": {"operands": [10, 20, 30]}}'

# Response: {"result": 60, "operation": "add"}

The actor processes the request, updates its persistent state, and returns the result. If the node crashes and restarts, the framework replays the journal and restores the calculator’s state.


TypeScript: A Bank Account with Durable State

TypeScript brings type safety and rapid development. The PlexSpaces TypeScript SDK uses an inheritance-based pattern: extend PlexSpacesActor, implement on<Operation>() handlers, and the SDK wires everything to WIT.

The Actor Code

// account_actor.ts
import { PlexSpacesActor } from "@plexspaces/sdk";

interface Transaction {
  type: string;
  amount: number;
  balance_after: number;
}

interface BankAccountState {
  account_id: string;
  balance: number;
  transactions: Transaction[];
}

export class BankAccountActor extends PlexSpacesActor<BankAccountState> {
  getDefaultState(): BankAccountState {
    return { account_id: "", balance: 0, transactions: [] };
  }

  protected override onInit(config: Record<string, unknown>): void {
    this.state.account_id = String(config.account_id ?? "");
    this.state.balance = 0;
    this.state.transactions = [];
  }

  onDeposit(payload: Record<string, unknown>): Record<string, unknown> {
    const amount = Number(payload.amount ?? 0);
    if (amount <= 0) return { error: "invalid_amount" };
    this.state.balance += amount;
    this.state.transactions.push({
      type: "deposit", amount, balance_after: this.state.balance,
    });
    return { status: "ok", balance: this.state.balance };
  }

  onWithdraw(payload: Record<string, unknown>): Record<string, unknown> {
    const amount = Number(payload.amount ?? 0);
    if (amount <= 0) return { error: "invalid_amount" };
    if (amount > this.state.balance) {
      return { error: "insufficient_funds", balance: this.state.balance };
    }
    this.state.balance -= amount;
    this.state.transactions.push({
      type: "withdraw", amount, balance_after: this.state.balance,
    });
    return { status: "ok", balance: this.state.balance };
  }

  onHistory(payload: Record<string, unknown>): Record<string, unknown> {
    const count = Math.min(
      Number(payload.count ?? 5), this.state.transactions.length
    );
    return { transactions: this.state.transactions.slice(-count) };
  }

  onReplay(): Record<string, unknown> {
    let rebuilt = 0;
    for (const tx of this.state.transactions) {
      if (tx.type === "deposit") rebuilt += tx.amount;
      else if (tx.type === "withdraw") rebuilt -= tx.amount;
    }
    return {
      replayed: this.state.transactions.length,
      rebuilt_balance: rebuilt,
      current_balance: this.state.balance,
    };
  }
}

// WIT actor export -- bridges TypeScript class to the WIT interface
const instance = new BankAccountActor();
export const actor = {
  init: (c: string) => instance.init(c),
  handle: (from: string, msg: string, payload: string) =>
    instance.handle(from, msg, payload),
  getState: () => instance.getState(),
  setState: (s: string) => instance.setState(s),
};

The BankAccountActor manages deposits, withdrawals, and transaction history with full durability. The onReplay() handler rebuilds the balance from the transaction log, demonstrating event-sourcing patterns that the framework makes trivial.

Build and Deploy

The TypeScript build uses a three-step pipeline: compile TypeScript, bundle with esbuild, then create a WASM component with jco:

# Install dependencies (SDK is a file: dependency)
npm install

# Compile TypeScript -> JavaScript -> ESM bundle -> WASM component
npm run build              # tsc + esbuild bundle
jco componentize actor_bundle.mjs \
    --wit wit/plexspaces-simple-actor \
    -o account_actor.wasm \
    --disable all

# Deploy to PlexSpaces
curl -X POST http://localhost:8001/api/v1/applications/deploy \
    -F "application_id=bank-app" \
    -F "name=bank-account" \
    -F "version=1.0.0" \
    -F "wasm_file=@account_actor.wasm"

Interact with the Accounts

# Deposit into Alice's account
curl -X POST http://localhost:8001/api/v1/actors/account-alice/ask \
    -H "Content-Type: application/json" \
    -d '{"message_type": "deposit", "payload": {"amount": 1000}}'
# Response: {"status": "ok", "balance": 1000}

# Withdraw from Alice's account
curl -X POST http://localhost:8001/api/v1/actors/account-alice/ask \
    -H "Content-Type: application/json" \
    -d '{"message_type": "withdraw", "payload": {"amount": 250}}'
# Response: {"status": "ok", "balance": 750}

# Check transaction history
curl -X POST http://localhost:8001/api/v1/actors/account-alice/ask \
    -H "Content-Type: application/json" \
    -d '{"message_type": "history", "payload": {"count": 10}}'

Go: An Erlang/OTP-Style Rate Limiter

Go delivers a pragmatic balance between performance and developer productivity. The PlexSpaces Go SDK uses an interface-based pattern: implement the Actor interface, embed BaseActor for automatic state serialization, and register your actor for WASM export via plexspaces.Register().

The Actor Code

This example implements a sliding-window rate limiter, the kind you find inside API gateways like NGINX, Kong, or Envoy. Each client gets an independent window with configurable limits:

// rate_limiter.go
package main

import (
    "encoding/json"
    "fmt"
    "github.com/plexobject/plexspaces/sdks/go/plexspaces"
)

type SlidingWindowLimiter struct {
    plexspaces.BaseActor

    WindowSizeMs uint64                    `json:"window_size_ms"`
    MaxRequests  int                       `json:"max_requests"`
    Clients      map[string]*ClientWindow  `json:"clients"`
    TotalChecks  int                       `json:"total_checks"`
    TotalAllowed int                       `json:"total_allowed"`
    TotalDenied  int                       `json:"total_denied"`
}

type ClientWindow struct {
    Timestamps []uint64 `json:"timestamps"`
    Allowed    int      `json:"allowed"`
    Denied     int      `json:"denied"`
}

var host = plexspaces.NewHost()

func NewSlidingWindowLimiter() *SlidingWindowLimiter {
    a := &SlidingWindowLimiter{
        WindowSizeMs: 60000,
        MaxRequests:  100,
        Clients:      make(map[string]*ClientWindow),
    }
    a.SetSelf(a) // enables automatic JSON state serialization
    return a
}

func (s *SlidingWindowLimiter) Init(configJSON string) string {
    var config struct {
        ActorID string         `json:"actor_id"`
        Args    map[string]any `json:"args"`
    }
    json.Unmarshal([]byte(configJSON), &config)

    if args := config.Args; args != nil {
        if v, ok := args["window_size_ms"]; ok {
            s.WindowSizeMs = uint64(v.(float64))
        }
        if v, ok := args["max_requests"]; ok {
            s.MaxRequests = int(v.(float64))
        }
    }

    host.Info(fmt.Sprintf("RateLimiter: window=%dms, max=%d req/window",
        s.WindowSizeMs, s.MaxRequests))
    return ""
}

func (s *SlidingWindowLimiter) Handle(from, msgType, payloadJSON string) string {
    switch msgType {
    case "check_rate":
        return s.checkRate(payloadJSON)
    case "stats":
        return s.getStats()
    default:
        data, _ := json.Marshal(map[string]any{"error": "unknown: " + msgType})
        return string(data)
    }
}

func (s *SlidingWindowLimiter) checkRate(payloadJSON string) string {
    var req struct { ClientID string `json:"client_id"` }
    json.Unmarshal([]byte(payloadJSON), &req)

    window, exists := s.Clients[req.ClientID]
    if !exists {
        window = &ClientWindow{Timestamps: make([]uint64, 0)}
        s.Clients[req.ClientID] = window
    }

    now := host.NowMs()
    cutoff := now - s.WindowSizeMs

    // Slide the window: remove expired timestamps
    var active []uint64
    for _, ts := range window.Timestamps {
        if ts > cutoff { active = append(active, ts) }
    }
    window.Timestamps = active

    // Check the limit
    allowed := len(window.Timestamps) < s.MaxRequests
    if allowed {
        window.Timestamps = append(window.Timestamps, now)
        window.Allowed++; s.TotalAllowed++
    } else {
        window.Denied++; s.TotalDenied++
    }
    s.TotalChecks++

    remaining := s.MaxRequests - len(window.Timestamps)
    if remaining < 0 { remaining = 0 }

    data, _ := json.Marshal(map[string]any{
        "allowed": allowed, "remaining": remaining,
        "limit": s.MaxRequests, "client_id": req.ClientID,
    })
    return string(data)
}

// Register the actor for WASM export -- runs during _initialize,
// before the host calls any exported functions.
func init() {
    plexspaces.Register(NewSlidingWindowLimiter())
}

func main() {}

The Go SDK pattern uses plexspaces.NewHost() to access all host functions (messaging, KV, tuple space, etc.) and plexspaces.Register() in the init() function to wire the actor to the WASM export interface. The comparison to Erlang/OTP maps directly:

Erlang/OTP                 | PlexSpaces Go
gen_server:start_link/3    | Supervisor in app-config.toml
handle_call/3              | Handle(from, msgType, payload)
#state{} record            | Go struct with JSON tags
gen_server:call(Pid, Msg)  | host.Ask(actorID, msgType, data)
application:start/2        | app-config.toml

Build and Deploy

The Go build uses a three-step TinyGo pipeline: compile to core WASM, embed WIT metadata, then create a WASM component with a WASI adapter:

# Step 1: Compile Go to core WASM
tinygo build -target=wasi -o rate_limiter_core.wasm .

# Step 2: Embed WIT metadata
wasm-tools component embed wit/plexspaces-simple-actor \
    -w actor-world rate_limiter_core.wasm -o rate_limiter_embed.wasm

# Step 3: Create WASM component with WASI adapter
wasm-tools component new rate_limiter_embed.wasm \
    --adapt wasi_snapshot_preview1.reactor.wasm \
    -o rate_limiter.wasm

# Deploy
curl -X POST http://localhost:8001/api/v1/applications/deploy \
    -F "application_id=rate-limiter-app" \
    -F "name=rate-limiter" \
    -F "version=1.0.0" \
    -F "wasm_file=@rate_limiter.wasm"

Each Go example includes a build.sh script that automates this pipeline and resolves the WASI adapter automatically.

Test Rate Limiting

# Check if a client request is allowed
curl -X POST http://localhost:8001/api/v1/actors/rate-limiter/ask \
    -H "Content-Type: application/json" \
    -d '{"message_type": "check_rate", "payload": {"client_id": "api-client-1"}}'
# Response: {"allowed": true, "remaining": 99, "limit": 100, "client_id": "api-client-1"}

# After 100 requests within the window:
# Response: {"allowed": false, "remaining": 0, "limit": 100, "client_id": "api-client-1"}

Rust: A Calculator with Maximum Performance

Rust produces the smallest, fastest WASM binaries. When you need every microsecond (high-frequency trading, real-time event processing, computational pipelines), Rust actors deliver near-native performance with binary sizes under 1MB.

The Actor Code

This calculator uses #![no_std] to eliminate the standard library entirely, producing a tiny, self-contained WASM module:

// lib.rs
#![no_std]
extern crate alloc;

use alloc::vec::Vec;
use core::slice;
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "PascalCase")]
pub enum Operation { Add, Subtract, Multiply, Divide }

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CalculatorState {
    calculation_count: u64,
    last_result: Option<f64>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CalculationRequest {
    operation: Operation,
    operands: Vec<f64>,
}

static mut STATE: CalculatorState = CalculatorState {
    calculation_count: 0, last_result: None,
};

/// Initialize actor with optional persisted state
#[no_mangle]
pub extern "C" fn init(state_ptr: *const u8, state_len: usize) -> i32 {
    unsafe {
        if state_len == 0 {
            STATE = CalculatorState { calculation_count: 0, last_result: None };
            return 0;
        }
        let state_bytes = slice::from_raw_parts(state_ptr, state_len);
        match serde_json::from_slice::<CalculatorState>(state_bytes) {
            Ok(state) => { STATE = state; 0 }
            Err(_) => -1,
        }
    }
}

/// Handle incoming calculation requests
#[no_mangle]
pub extern "C" fn handle_message(
    _from_ptr: *const u8, _from_len: usize,
    type_ptr: *const u8, type_len: usize,
    payload_ptr: *const u8, payload_len: usize,
) -> *const u8 {
    unsafe {
        let msg_type = core::str::from_utf8(
            slice::from_raw_parts(type_ptr, type_len)
        ).unwrap_or("");

        match msg_type {
            "calculate" => {
                let payload = slice::from_raw_parts(payload_ptr, payload_len);
                if let Ok(req) = serde_json::from_slice::<CalculationRequest>(payload) {
                    if let Ok(result) = execute(&req) {
                        STATE.calculation_count += 1;
                        STATE.last_result = Some(result);
                    }
                }
                core::ptr::null()
            }
            _ => core::ptr::null(),
        }
    }
}

fn execute(req: &CalculationRequest) -> Result<f64, &'static str> {
    // Guard against malformed requests before indexing into operands
    if req.operands.len() < 2 {
        return Err("Expected two operands");
    }
    let (a, b) = (req.operands[0], req.operands[1]);
    match req.operation {
        Operation::Add      => Ok(a + b),
        Operation::Subtract => Ok(a - b),
        Operation::Multiply => Ok(a * b),
        Operation::Divide   => {
            if b == 0.0 { Err("Division by zero") } else { Ok(a / b) }
        }
    }
}

Build and Deploy

rustup target add wasm32-wasip2
cargo build --target wasm32-wasip2 --release

# Optimize the binary further
wasm-opt -Oz --strip-debug \
    target/wasm32-wasip2/release/calculator_wasm_actor.wasm \
    -o calculator_actor.wasm

# Deploy
curl -X POST http://localhost:8001/api/v1/applications/deploy \
    -F "application_id=rust-calc" \
    -F "name=calculator" \
    -F "version=1.0.0" \
    -F "wasm_file=@calculator_actor.wasm"

The resulting binary? Under 200KB. Compare that to a Python actor at 30-40MB or even a TypeScript actor at 1-2MB. When you deploy hundreds of actors per node, those size differences translate directly into memory savings and faster cold starts.


Deploying Applications

One pattern I find compelling, and one the serverless world has largely neglected, is deploying whole applications, not just individual functions. If you have used Tomcat or JBoss, you understand what I mean. You package your application, hand it to the server, and the server takes care of running it: managing the process lifecycle, enforcing security policies, routing requests, collecting metrics, and handling restarts. You focus on business logic; the server handles the infrastructure cross-cuts. PlexSpaces brings this same model to WASM actors, but with Erlang/OTP’s supervision philosophy underneath. I call this the PlexSpaces Application Server model.

The Application Manifest

Instead of deploying actors one by one via API calls, you define an application bundle — a single manifest that describes your entire application topology: which actors to run, how they supervise each other, what resources they need, and what security policies apply to them.

[supervisor]
strategy = "one_for_one"
max_restarts = 10
max_restart_window_seconds = 60

# ChatRoom actor (Durable Object: one per room)
[[supervisor.children]]
id = "chat-room"
type = "worker"
restart = "permanent"
shutdown_timeout_seconds = 10

[supervisor.children.args]
max_history = "100"

# RateLimiter actor (Durable Object: per-user rate limiting)
[[supervisor.children]]
id = "rate-limiter"
type = "worker"
restart = "permanent"
shutdown_timeout_seconds = 5

[supervisor.children.args]
max_tokens = "5"
refill_rate_ms = "1000"

The runtime validates every WASM module against its declared WIT world, starts the supervision tree from the root down, and begins enforcing all security and resource policies before your first actor processes its first message. PlexSpaces takes care of most cross-cutting concerns (auth token validation, rate limiting, structured logging, trace context propagation, circuit breakers) so that you can focus on the business logic.

Supervision and restarts. The manifest’s supervision tree is live. If an actor crashes, the supervisor restarts it according to the declared strategy. If the actor exceeds max_restarts within max_restart_window_seconds, the supervisor escalates to its parent. This is exactly how Erlang/OTP supervisors work.

Comparing Deployment Models

Capability       | Traditional Microservices    | AWS Lambda                | PlexSpaces App Server
Deployment unit  | Container image per service  | Function zip per Lambda   | Single .psa bundle for entire app
Supervision      | Kubernetes restarts pods     | None                      | Erlang-style supervision tree
Auth enforcement | API gateway / middleware     | Custom authorizers        | Runtime-level, declarative in manifest
Observability    | Manual instrumentation       | CloudWatch + X-Ray        | Auto-instrumented, zero actor code
Resource limits  | Container CPU/mem requests   | Timeout + memory settings | Per-actor WASM-level enforcement
Multi-language   | Per-container runtimes       | Per-function runtimes     | All actors in one WASM runtime
State            | External (Redis/DB)          | External                  | Built-in durable actor state
Cold start       | Seconds                      | 100ms–10s                 | ~50µs (WASM)

FaaS and Serverless

Here is where PlexSpaces bridges the worlds of actor systems and serverless platforms. Every actor you deploy in any language doubles as a serverless function that you invoke over plain HTTP. No client SDK required. No message queue setup. Just HTTP.

HTTP Invocation Model

PlexSpaces exposes a FaaS-style API that routes HTTP requests to actors using a simple URL pattern:

/api/v1/actors/{tenant}/{namespace}/{actor_type}

The HTTP method determines the invocation pattern:

HTTP Method | Pattern                | Behavior
GET         | Request-reply (ask)    | Sends query params as payload, waits for response
POST        | Unicast message (tell) | Sends JSON body, returns immediately
PUT         | Unicast message (tell) | Same as POST, for update semantics
DELETE      | Request-reply (ask)    | Sends query params, waits for confirmation
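To make the routing concrete, here is a tiny client-side helper that assembles the invocation path. This is a hypothetical illustration, not part of the PlexSpaces SDK:

```python
def actor_path(tenant: str, namespace: str, actor_type: str) -> str:
    """Build the FaaS-style invocation path for an actor.

    A GET to this path behaves as ask (request-reply);
    a POST behaves as tell (fire-and-forget).
    """
    return f"/api/v1/actors/{tenant}/{namespace}/{actor_type}"

assert actor_path("acme-corp", "webhooks", "webhook_handler") == \
    "/api/v1/actors/acme-corp/webhooks/webhook_handler"
```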

FaaS in Action

This Rust example shows a FaaS-style webhook handler that receives HTTP POST payloads and stores delivery history — the kind of thing you would build on AWS Lambda or Cloudflare Workers, but here using PlexSpaces SDK annotations:

// Using PlexSpaces Rust SDK annotations (like Python @actor, @handler)
#[gen_server_actor]
struct WebhookHandler {
    deliveries: Vec<WebhookDelivery>,
    total_received: u64,
}

#[plexspaces_handlers]
impl WebhookHandler {
    #[handler("deliver")]
    async fn deliver(&mut self, ctx: &ActorContext, msg: &Message)
        -> Result<Value, BehaviorError>
    {
        let delivery = WebhookDelivery::new(
            ulid::Ulid::new().to_string(), &msg.payload,
        );
        self.deliveries.push(delivery);
        self.total_received += 1;
        Ok(json!({ "status": "received", "total": self.total_received }))
    }

    #[handler("list")]
    async fn list_deliveries(&self, _ctx: &ActorContext, _msg: &Message)
        -> Result<Value, BehaviorError>
    {
        Ok(json!({ "deliveries": self.deliveries, "total": self.total_received }))
    }
}

Invoke this actor over HTTP — no SDK, no message queue, just curl:

# POST a webhook delivery (fire-and-forget)
curl -X POST http://localhost:8001/api/v1/actors/acme-corp/webhooks/webhook_handler \
    -H "Content-Type: application/json" \
    -d '{"event": "order.completed", "order_id": "ORD-12345"}'

# GET recent deliveries (request-reply)
curl "http://localhost:8001/api/v1/actors/acme-corp/webhooks/webhook_handler?action=list"

Multi-Tenant Isolation

The URL path embeds tenant and namespace for built-in multi-tenant isolation. Tenant acme-corp cannot access tenant globex-inc’s actors. The framework enforces this boundary at the routing layer with JWT-based authentication:

# Tenant A's rate limiter
curl -X POST http://localhost:8001/api/v1/actors/acme-corp/api/rate-limiter \
    -d '{"client_id": "user-123"}'

# Tenant B's rate limiter -- completely isolated state
curl -X POST http://localhost:8001/api/v1/actors/globex-inc/api/rate-limiter \
    -d '{"client_id": "user-456"}'

How PlexSpaces Compares to Traditional FaaS

The critical difference: PlexSpaces actors retain state between invocations. Traditional FaaS platforms treat functions as stateless — you manage state externally in DynamoDB, Redis, or S3. PlexSpaces actors carry durable state inside the actor, persisted via journaling and checkpointing. This eliminates the “stateless function + external state store” tax that adds latency and complexity to every serverless application.

Capability   | AWS Lambda            | Cloudflare Workers | PlexSpaces FaaS
Cold start   | 100ms–10s             | ~5ms               | ~50µs (WASM)
State        | External (DynamoDB)   | External (KV/D1)   | Built-in (durable actors)
Polyglot     | Per-runtime images    | JS/WASM only       | Rust, Go, TS, Python on same runtime
Coordination | SQS/Step Functions    | Durable Objects    | TupleSpace, process groups, workflows
Supervision  | None                  | None               | Erlang-style supervision trees
Isolation    | Container/Firecracker | V8 isolates        | WASM sandbox + optional Firecracker

PlexSpaces includes migration examples that show how to port existing Lambda functions, Step Functions workflows, Azure Durable Functions, Cloudflare Workers, and Orleans grains (See examples).


What WebAssembly Gives You

Let me address the obvious question: “Why not just use Docker?” Containers solve many problems well. But as Solomon Hykes, Docker’s creator, said in 2019 when WASI was first announced:

“If WASM+WASI existed in 2008, we wouldn’t have needed to create Docker. That’s how important it is. WebAssembly on the server is the future of computing. A standardized system interface was the missing link. Let’s hope WASI is up to the task!” — Solomon Hykes, March 2019

WebAssembly solves some problems better:

  • Startup time. A WASM module instantiates in microseconds. A container takes seconds. When you auto-scale actors in response to load spikes, microsecond cold starts mean your users never notice.
  • Memory footprint. A Rust WASM actor uses ~200KB. The equivalent Docker container starts at 50MB minimum (Alpine base image alone). On a single node, you run thousands of WASM actors where you might run dozens of containers.
  • Security isolation. WASM sandboxing is capability-based. A module cannot access the filesystem, network, or memory outside its sandbox unless the host explicitly grants each capability through WASI. Containers share a kernel and rely on namespace isolation — a fundamentally larger attack surface.
  • True polyglot. With containers, each language gets its own image, runtime, dependency tree, and deployment pipeline. With WASM, all languages produce the same artifact type, run on the same runtime, and share the same deployment pipeline.
  • Composability. The Component Model lets you link WASM modules from different languages into a single process. No network calls. No serialization overhead. Direct function invocation across language boundaries. Try that with Docker.

PlexSpaces actually supports both: WASM sandboxing for lightweight actors and Firecracker microVMs for workloads that need full hardware-level isolation. You pick the isolation model per workload, and the framework handles the rest.


Where This Is Heading

The WASM Ecosystem Roadmap

The ecosystem moves fast. Here are the milestones that matter:

  • Wasm 3.0 became the W3C standard in September 2025, standardizing nine production features including WasmGC, exception handling, tail calls, and SIMD
  • WASI 0.3 shipped in February 2026 with native async support — actors can now handle concurrent I/O without blocking
  • WASI 1.0 is on track for late 2026 or early 2027, providing the stability guarantees that enterprise adopters require
  • Wasmtime leads the runtime ecosystem with full Component Model and WASI 0.2 support
  • Wasmer 6.0 achieved ~95% of native speed on benchmarks
  • Docker now runs WASM components alongside containers in Docker Desktop and Docker Engine

The FaaS-Actor Convergence

The most consequential trend is the convergence of serverless FaaS platforms and stateful actor systems. Today these exist as separate categories — AWS Lambda handles stateless functions, Temporal handles durable workflows, Orleans handles virtual actors, and Erlang/OTP handles fault-tolerant supervision. PlexSpaces unifies them into a single abstraction. This convergence accelerates along three axes:

  • HTTP-native invocation. Every PlexSpaces actor is already a serverless function, callable over HTTP with automatic routing, multi-tenant isolation, and load balancing. As the WASM ecosystem matures, the cold start advantage (microseconds vs. seconds) makes WASM actors a compelling replacement for traditional Lambda functions, especially at the edge.
  • Durable serverless. Traditional FaaS treats functions as stateless. PlexSpaces combines serverless invocation with durable execution — actors retain state, the framework journals every message, and crash recovery replays the journal to restore exact state. This eliminates the “Lambda + DynamoDB + Step Functions” stack that every non-trivial serverless application ends up building.
  • Edge-native polyglot. WASM runs everywhere: cloud servers, edge nodes, IoT devices, even browsers. PlexSpaces actors compiled to WASM deploy to any environment that runs wasmtime. A Python ML model runs at the edge. A Rust event processor runs in the cloud. A TypeScript API actor runs in the CDN. All three communicate through the same framework, sharing state through tuple spaces and coordinating through process groups.

Get Started

PlexSpaces is open source. Clone the repository and start building:

git clone https://github.com/bhatti/PlexSpaces.git
cd PlexSpaces

# Quick setup (installs tools, builds, tests)
./scripts/setup.sh

# Or use Docker for the fastest path
docker pull plexobject/plexspaces:latest
docker run -d -p 8000:8000 -p 8001:8001 \
    -e PLEXSPACES_NODE_ID=node1 \
    plexobject/plexspaces:latest

# Explore the examples
ls examples/python/apps/     # calculator, bank_account, chat_room, nbody, ...
ls examples/typescript/apps/  # bank_account, migrating_cloudflare_workers, migrating_orleans
ls examples/go/apps/          # migrating_erlang_otp, migrating_cloudflare_workers, ...
ls examples/rust/apps/        # calculator, nbody, session_manager, ...

# Build and test everything
make all

Each example includes its own app-config.toml, build.sh script, and test instructions. The examples/ directory also contains migration guides from 24+ frameworks like Erlang/OTP, Temporal, Ray, Cloudflare Workers, Orleans, Restate, Azure Durable Functions, AWS Step Functions, wasmCloud, Dapr, and more.

I spent decades wrestling with the same distributed systems problems under different names on different stacks (see my previous blog). Fault tolerance, state management, multi-language support, coordination, serverless invocation, scaling. These problems never change, only the acronyms do. WebAssembly makes the polyglot piece real. The Component Model makes it composable. The application server model makes it deployable in a way that finally lets you focus on what you actually came to write: business logic.


PlexSpaces is available at github.com/bhatti/PlexSpaces. Give it a try and let me know what you think.

February 9, 2026

Building PlexSpaces: Decades of Distributed Systems Distilled Into One Framework

Filed under: Agentic AI,Computing — admin @ 10:31 pm

I previously shared my experience with distributed systems over the last three decades, spanning IBM mainframes, BSD sockets, Sun RPC, CORBA, Java RMI, SOAP, Erlang actors, service meshes, gRPC, serverless functions, and more. Over the years, I kept solving the same problems in different languages, on different platforms, with different tooling. Each of these frameworks taught me something essential, but each also left something on the table. PlexSpaces pulls those lessons together into a single open-source framework: a polyglot application server that handles microservices, serverless functions, durable workflows, AI workloads, and high-performance computing using one unified actor abstraction. You write actors in Python, Rust, Go, or TypeScript, compile them to WebAssembly, deploy them on-premises or in the cloud, and the framework handles persistence, fault tolerance, observability, and scaling. No service mesh. No vendor lock-in. Same binary on your laptop and in production.


Why Now?

Three things converged over the last few years that made this the right moment to build PlexSpaces:

  • WebAssembly matured. The WebAssembly ecosystem is still evolving, but WASI has stabilized enough to run real server workloads. Java promised “Write Once, Run Anywhere” — WASM actually delivers it. Docker’s creator Solomon Hykes captured it in 2019: “If WASM+WASI existed in 2008, we wouldn’t have needed to create Docker.” Today that future has arrived.
  • AI agents exploded. Every AI agent is fundamentally an actor: it maintains state (conversation history), processes messages (user queries), calls tools (side effects), and needs fault tolerance (LLM APIs fail). The actor model maps naturally to agent orchestration but existing frameworks either lack durability, lock you to one language, or require separate infrastructure.
  • Multi-cloud pressure intensified. I’ve watched teams at multiple companies build on AWS in production but struggle to develop locally. Bugs surface only after deployment because Lambda, DynamoDB, and SQS behave differently from their local mocks/simulators. Modern enterprises need code that runs identically on a developer’s laptop, on-premises, and in any cloud.

PlexSpaces addresses all three: polyglot via WASM, actor-native for AI workloads, and local-first by design.


The Lessons That Shaped PlexSpaces

Every era of distributed computing burned a lesson into my thinking. Here’s what stuck and how I applied each lesson to PlexSpaces.

  • Efficiency runs deep: When I programmed BSD sockets in C, I controlled every byte on the wire. That taught me to respect the transport layer.
    Applied: PlexSpaces uses gRPC and Protocol Buffers for binary communication not because JSON is bad, but because high-throughput systems deserve binary protocols with proper schemas.
  • Contracts prevent chaos: Sun RPC introduced me to XDR and rpcgen: define a contract, generate the code. CORBA reinforced this with IDL. I have seen countless teams sprinkle Swagger annotations on code and assume they have an API, and those APIs keep growing without standards, consistency, or any real developer experience.
    Applied: PlexSpaces follows a proto-first philosophy – every API lives in Protocol Buffers, every contract generates typed stubs across languages (See OpenAPI specs for grpc/http services).
  • Parallelism needs multiple primitives: During my PhD research, I built JavaNow – a parallel computing framework that combined Linda-style tuple spaces, MPI collective operations, and actor-based concurrency on networks of workstations. That research taught me something frameworks keep forgetting: different coordination problems need different primitives. You can’t force everything through message passing alone.
    Applied: PlexSpaces provides actors and tuple spaces and channels and process groups because real systems need all of them.
  • Developer experience decides adoption: Java RMI made remote objects feel local. JINI added service discovery. Then J2EE and EJB buried developers under XML configuration.
    Applied: PlexSpaces SDK provides decorator-based development (Python), inheritance-based development (TypeScript), and annotation-based development (Rust) to eliminate boilerplate.
  • Simplicity defeats complexity every time: With SOAP, WSDL, EJB, J2EE, I watched the Java enterprise ecosystem collapse under its own weight. REST won not because it was more powerful, but because it was simpler.
    Applied: One actor abstraction with composable capabilities beats a zoo of specialized types.
  • Cross-cutting concerns belong in the platform: Spring and AOP taught me to handle observability, security, and throttling consistently. But microservices in polyglot environments broke that model. Service meshes like Istio and Dapr tried to fix it with sidecar proxies, but that adds another network hop and another layer of YAML to debug.
    Applied: PlexSpaces bakes these concerns directly into the runtime. No service mesh. No extra hops.
  • Serverless is the right idea with the wrong execution: AWS Lambda showed me the future: auto-scaling, built-in observability, zero server management. But Lambda also showed me the problem: vendor lock-in, cold starts, and the inability to run locally.
    Applied: PlexSpaces delivers serverless semantics that run identically on your laptop and in the cloud.
  • Application servers got one thing right: Despite all the complexity of J2EE, I loved one idea: the application server that hosts multiple applications. You deployed WAR files to Tomcat, and it handled routing, lifecycle, and shared services. That model survived even after EJB died.
    Applied: PlexSpaces revives this concept for the polyglot serverless era where you can deploy Python ML models, TypeScript webhooks, and Rust performance-critical code to the same node.

I also built formicary, a framework for durable executions with graph-based workflow processing. That experience directly shaped PlexSpaces’ workflow and durability abstractions.


What PlexSpaces Actually Does

PlexSpaces combines five foundational pillars into a unified distributed computing platform:

  1. TupleSpace Coordination (Linda Model): Decouples producers and consumers through associative memory. Actors write tuples, read them by pattern, and never need to know who’s on the other side.
  2. Erlang/OTP Philosophy: Supervision trees restart failed actors. Behaviors define message-handling patterns.
  3. Durable Execution: Every actor operation gets journaled. When a node crashes, the framework replays the journal and restores state exactly. Side effects get cached during replay, so external calls don’t fire twice. Inspired by Restate and my earlier work on formicary.
  4. WASM Runtime: Actors compile to WebAssembly and run in a sandboxed environment. Python, TypeScript, Rust with same deployment model, same security guarantees.
  5. Firecracker Isolation: For workloads that need hardware-level isolation, PlexSpaces supports Firecracker microVMs alongside WASM sandboxing.
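The durable-execution pillar is easier to grasp with a toy model. The sketch below shows the two mechanisms in miniature: a journal of processed messages, and a side-effect cache keyed by step ID so that replay returns cached results instead of re-firing external calls. All names here (DurableContext, side_effect) are hypothetical, not the actual PlexSpaces runtime API:

```python
class DurableContext:
    """Toy model of durable execution: journal messages, cache side effects."""

    def __init__(self):
        self.journal = []       # every message the actor has processed
        self.effect_cache = {}  # results of external calls, keyed by step id

    def record(self, msg):
        """Append a processed message to the journal for later replay."""
        self.journal.append(msg)

    def side_effect(self, step_id, fn):
        """Run fn once; on replay, return the cached result instead."""
        if step_id not in self.effect_cache:
            self.effect_cache[step_id] = fn()
        return self.effect_cache[step_id]


ctx = DurableContext()
calls = []
charge = lambda: calls.append("charged") or "ok"  # simulated external call

assert ctx.side_effect("charge-1", charge) == "ok"  # first run: executes
assert ctx.side_effect("charge-1", charge) == "ok"  # replay: served from cache
assert calls == ["charged"]                         # the charge fired exactly once
```

This is why crash recovery can safely replay the whole journal: state transitions are re-derived, but payments, emails, and other external effects do not happen twice.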

Core Abstractions: Actors, Behaviors, and Facets

One Actor to Rule Them All

PlexSpaces follows a design principle I arrived at after years of watching frameworks proliferate actor types: one powerful abstraction with composable capabilities beats multiple specialized types. Every actor in PlexSpaces maintains private state, processes messages sequentially (eliminating race conditions), operates transparently across local and remote boundaries, and recovers automatically through supervision.
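Sequential message processing is what eliminates the races: each actor drains a private mailbox one message at a time, so state mutations never interleave. A minimal single-threaded sketch of that discipline in Python (hypothetical names, not the SDK):

```python
from collections import deque


class Mailbox:
    """One actor, one queue: messages are handled strictly in arrival order."""

    def __init__(self, handler):
        self.queue = deque()
        self.handler = handler

    def tell(self, msg):
        """Enqueue a message without waiting for a reply."""
        self.queue.append(msg)

    def drain(self):
        """Process every queued message, one at a time."""
        while self.queue:
            self.handler(self.queue.popleft())


state = {"count": 0}
box = Mailbox(lambda msg: state.update(count=state["count"] + msg))
for n in (1, 2, 3):
    box.tell(n)
box.drain()
assert state["count"] == 6  # no locks needed: one message at a time
```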

Actor Lifecycle

Actors move through a well-defined lifecycle — one of the details that distinguishes PlexSpaces from simpler actor frameworks:

Virtual actors (via VirtualActorFacet, inspired by the Orleans actor model) leverage this lifecycle automatically: they activate on the first message, deactivate after an idle timeout, and reactivate transparently on the next message. No manual lifecycle management.
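The activation dance can be modeled as a keyed cache with idle eviction. A toy Python sketch of the idea (hypothetical names, not the VirtualActorFacet API; a real implementation would persist state before evicting):

```python
import time


class VirtualActorTable:
    """Toy model: activate on first message, deactivate after idle_timeout seconds."""

    def __init__(self, factory, idle_timeout=60.0):
        self.factory = factory
        self.idle_timeout = idle_timeout
        self.active = {}  # actor_id -> (instance, last_used)

    def send(self, actor_id, msg):
        instance, _ = self.active.get(actor_id, (None, None))
        if instance is None:
            instance = self.factory(actor_id)  # transparent (re)activation
        self.active[actor_id] = (instance, time.monotonic())
        return instance(msg)

    def evict_idle(self, now=None):
        now = now if now is not None else time.monotonic()
        for actor_id, (_, last_used) in list(self.active.items()):
            if now - last_used > self.idle_timeout:
                del self.active[actor_id]  # state would be persisted first


table = VirtualActorTable(lambda aid: (lambda msg: f"{aid}:{msg}"), idle_timeout=0.0)
assert table.send("counter-1", "inc") == "counter-1:inc"   # first message activates
table.evict_idle(now=time.monotonic() + 1)                 # idle timeout passes
assert "counter-1" not in table.active                     # deactivated
assert table.send("counter-1", "inc") == "counter-1:inc"   # reactivates transparently
```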

Tell vs Ask: Two Message Patterns

PlexSpaces supports two fundamental communication patterns:

  • Tell (asynchronous): The sender dispatches a message and moves on. Use this for events, notifications, and one-way commands.
  • Ask (request-reply): The sender dispatches a request and waits for a response with a timeout. Use this for queries and operations that need confirmation.
from plexspaces import actor, handler, host

@actor
class OrderService:
    @handler("place_order")
    def place_order(self, order: dict) -> dict:
        # Tell: fire-and-forget notification to analytics
        host.tell("analytics-actor", "order_placed", order)
        
        # Ask: request-reply to inventory service (5s timeout)
        inventory = host.ask("inventory-actor", "check_stock", 
                            {"sku": order["sku"]}, timeout_ms=5000)
        
        if inventory["available"]:
            return {"status": "confirmed", "order_id": order["id"]}
        return {"status": "out_of_stock"}

Behaviors: Compile-Time Patterns

Behaviors define how an actor processes messages. You choose a behavior at compile time:

Behavior  | Annotation         | Pattern               | Best For
Default   | @actor             | Message-based         | General purpose
GenServer | @gen_server_actor  | Request-reply         | Stateful services, CRUD
GenEvent  | @event_actor       | Fire-and-forget       | Event processing, logging
GenFSM    | @fsm_actor         | State machine         | Order processing, approval flows
Workflow  | @workflow_actor    | Durable orchestration | Long-running processes

Facets: Runtime Capabilities

Facets attach dynamic capabilities to actors without changing the actor type. I wrote about the pattern of dynamic facets and runtime composition previously. This combines dynamic composition through facets with Erlang’s static behavior model. Think of facets as middleware that wraps your actor. They execute in priority order: security facets fire first, then logging, then metrics, then your business logic, then persistence.

Available facets include:

  • Infrastructure: VirtualActorFacet (Orleans-style auto-activation), DurabilityFacet (persistence + replay), MobilityFacet (actor migration)
  • Storage: KeyValueFacet, BlobStorageFacet, LockFacet
  • Communication: ProcessGroupFacet (Erlang pg2-style groups), RegistryFacet
  • Scheduling: TimerFacet (transient), ReminderFacet (durable)
  • Observability: MetricsFacet, TracingFacet, LoggingFacet
  • Security: AuthenticationFacet, AuthorizationFacet
  • Events: EventEmitterFacet (reactive patterns)

Facets compose freely, e.g., add facets=["durability", "timer", "metrics"] and your actor gains persistence, scheduled execution, and Prometheus metrics with zero additional code.
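To make the priority-ordered execution concrete, here is a minimal, framework-independent sketch of the middleware idea in plain Python. The class names, the `before` hook, and the `dispatch` helper are my own illustrations, not the PlexSpaces API; the only point carried over from the text is that lower-priority facets run first and any facet can veto the call.

```python
# Illustrative sketch of facet composition: facets wrap the actor like
# middleware and run in ascending priority order. Not the PlexSpaces API.

class Facet:
    priority = 100
    def before(self, method, payload):  # a facet may veto by raising
        pass

class SecurityFacet(Facet):
    priority = 10                       # fires first
    def before(self, method, payload):
        if not payload.get("token"):
            raise PermissionError("unauthenticated")

class LoggingFacet(Facet):
    priority = 50                       # fires after security
    def __init__(self):
        self.calls = []
    def before(self, method, payload):
        self.calls.append(method)

def dispatch(facets, actor_logic, method, payload):
    # Lower priority number runs earlier, mirroring the ordering above.
    for facet in sorted(facets, key=lambda f: f.priority):
        facet.before(method, payload)
    return actor_logic(method, payload)

log = LoggingFacet()
result = dispatch([log, SecurityFacet()], lambda m, p: {"ok": True},
                  "place_order", {"token": "abc"})
print(result, log.calls)  # {'ok': True} ['place_order']
```

A real facet like FraudDetectionFacet (shown later in the post) slots into the same chain by picking a priority between security and the business logic.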

Custom Facets: Extending the Framework

The facet system is open for extension. You can build domain-specific facets and register them with the framework:

use async_trait::async_trait;
use plexspaces_core::{Facet, FacetError, InterceptResult};

pub struct FraudDetectionFacet {
    threshold: f64,
}

#[async_trait]
impl Facet for FraudDetectionFacet {
    fn name(&self) -> &str { "fraud_detection" }
    fn priority(&self) -> u32 { 200 } // Run after security, before domain logic

    async fn before_method(
        &mut self, method: &str, payload: &[u8]
    ) -> Result<InterceptResult, FacetError> {
        let score = self.score_transaction(payload).await?;
        if score > self.threshold {
            return Err(FacetError::Custom("fraud_detected".into()));
        }
        Ok(InterceptResult::Continue)
    }
}

Register it once, attach it to any actor by name. This extensibility distinguishes PlexSpaces from frameworks with fixed capability sets.


Hands-On: Building Actors in Three Languages

Let me show you how PlexSpaces works in practice across all three SDKs.

Python: Decorator-Based Development

from plexspaces import actor, state, handler

@actor
class CounterActor:
    count: int = state(default=0)

    @handler("increment")
    def increment(self, amount: int = 1) -> dict:
        self.count += amount
        return {"count": self.count}  # => {"count": 5}

    @handler("get")
    def get(self) -> dict:
        return {"count": self.count}  # => {"count": 5}

The SDK eliminates over 100 lines of WASM boilerplate. You declare state with state(), mark handlers with @handler, and return dictionaries. The framework handles serialization, lifecycle, and state management.

TypeScript: Inheritance-Based Development

import { PlexSpacesActor } from "@plexspaces/sdk";

interface CounterState { count: number; }

export class CounterActor extends PlexSpacesActor<CounterState> {
  getDefaultState(): CounterState { return { count: 0 }; }

  onIncrement(payload: Record<string, unknown>) {
    const amount = Number(payload.amount ?? 1);
    this.state.count += amount;
    return { count: this.state.count };  // => {"count": 5}
  }

  onGet() { return { count: this.state.count }; }
}

Rust: Annotation-Based Development

use plexspaces_sdk::{gen_server_actor, plexspaces_handlers, handler, json};

#[gen_server_actor]
struct Counter { count: i32 }

#[plexspaces_handlers]
impl Counter {
    #[handler("increment")]
    async fn increment(&mut self, _ctx: &ActorContext, msg: &Message)
        -> Result<serde_json::Value, BehaviorError> {
        let payload: serde_json::Value = serde_json::from_slice(&msg.payload)?;
        self.count += payload["amount"].as_i64().unwrap_or(1) as i32;
        Ok(json!({ "count": self.count }))  // => {"count": 5}
    }
}

Building, Deploying, and Invoking

# Build Python actor to WebAssembly
plexspaces-py build counter_actor.py -o counter.wasm

# Deploy to a running node
curl -X POST http://localhost:8094/api/v1/deploy \
  -F "namespace=default" \
  -F "actor_type=counter" \
  -F "wasm=@counter.wasm"

# Invoke via HTTP — FaaS-style (POST = tell, GET = ask)
curl -X POST "http://localhost:8080/api/v1/actors/default/default/counter" \
  -H "Content-Type: application/json" \
  -d '{"action":"increment","amount":5}'

# Request-reply on GET
curl "http://localhost:8080/api/v1/actors/default/default/counter" \
  -H "Content-Type: application/json"
# => {"count": 5}

That’s it. No Kubernetes manifests. No Terraform. No sidecar containers. Deploy a WASM module, invoke it over HTTP. The same endpoint works as an AWS Lambda Function URL.


Durable Execution: Crash and Recover Without Losing State

Durable execution solves a problem I’ve encountered at every company I’ve worked for: what happens when a node crashes mid-operation?

PlexSpaces journals every actor operation: messages received, side effects executed, state changes applied. When a node crashes and restarts, the framework loads the latest checkpoint and replays journal entries from that point. Side effects return cached results during replay, so external API calls don’t fire twice.
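The replay-with-cached-side-effects idea can be shown with a toy journal in plain Python. The `Journal` class and `effect` method are illustrative names of mine, not the framework’s internals; the real system journals at the message level and persists to storage, but the core invariant is the same: a side effect recorded in the journal is served from it during replay instead of re-executing.

```python
# Toy sketch of journaled execution: record each side effect's result on
# first run, and serve it from the journal during replay so external
# calls never fire twice. Illustrative only.

class Journal:
    def __init__(self):
        self.entries = []   # results of side effects, in execution order
        self.cursor = 0

    def effect(self, fn):
        if self.cursor < len(self.entries):   # replaying after a crash
            result = self.entries[self.cursor]
        else:                                 # first (live) execution
            result = fn()
            self.entries.append(result)
        self.cursor += 1
        return result

calls = []
def charge_card():
    calls.append("charge")
    return "txn-1"

journal = Journal()
assert journal.effect(charge_card) == "txn-1"   # real external call

journal.cursor = 0                              # simulate crash + replay
assert journal.effect(charge_card) == "txn-1"   # cached: no second charge
assert calls == ["charge"]
```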

Example: A Durable Bank Account

from plexspaces import actor, state, handler

@actor(facets=["durability"])
class BankAccount:
    balance: int = state(default=0)
    transactions: list = state(default_factory=list)

    @handler("deposit")
    def deposit(self, amount: int = 0) -> dict:
        self.balance += amount
        self.transactions.append({
            "type": "deposit", "amount": amount,
            "balance_after": self.balance
        })
        return {"status": "ok", "balance": self.balance}

    @handler("withdraw")
    def withdraw(self, amount: int = 0) -> dict:
        if amount > self.balance:
            return {"status": "insufficient_funds", "balance": self.balance}
        self.balance -= amount
        self.transactions.append({
            "type": "withdraw", "amount": amount,
            "balance_after": self.balance
        })
        return {"status": "ok", "balance": self.balance}

    @handler("replay")
    def replay_transactions(self) -> dict:
        """Rebuild balance from transaction log to verify consistency."""
        rebuilt = 0
        for tx in self.transactions:
            rebuilt += tx["amount"] if tx["type"] == "deposit" else -tx["amount"]
        return {
            "replayed": len(self.transactions),
            "rebuilt_balance": rebuilt,
            "current_balance": self.balance,
            "consistent": rebuilt == self.balance
        }

Adding facets=["durability"] activates journaling and checkpointing. If the node crashes after processing ten deposits, the framework restores all ten: no data loss, no duplicate charges. Periodic checkpoints accelerate recovery by 90%+ because the framework loads the latest snapshot and replays only the entries recorded since.


Data-Parallel Actors: Worker Pools and Scatter-Gather

When I built JavaNow during my PhD, I implemented MPI-style scatter-gather and parallel map operations. PlexSpaces brings these patterns to production through ShardGroups, data-parallel actor pools inspired by the DPA paper. A ShardGroup partitions data across multiple actor shards and supports three core operations:

  • Bulk Update: Routes writes to the correct shard based on a partition key (hash, consistent hash, or range)
  • Parallel Map: Queries all shards simultaneously and collects results
  • Scatter-Gather: Broadcasts a query and aggregates responses with fault tolerance

Example: Data-Parallel Worker Pool with Scatter-Gather

This pattern comes from the PlexSpaces examples. Each worker actor in the ShardGroup holds a partition of state and processes tasks independently; the framework handles routing, fan-out, and aggregation:

#[gen_server_actor]
pub struct WorkerActor {
    worker_id: String,
    state: Arc<RwLock<HashMap<String, Value>>>,
    tasks_processed: u64,
    total_processing_time_ms: u64,
}

#[plexspaces_handlers]
impl WorkerActor {
    #[handler("*")]
    async fn process(&mut self, _ctx: &ActorContext, msg: &Message)
        -> Result<Value, BehaviorError> {
        let payload: Value = serde_json::from_slice(&msg.payload)?;
        match payload["action"].as_str().unwrap_or("unknown") {
            "set" => {
                let key = payload["key"].as_str().unwrap_or("default");
                self.state.write().await.insert(key.to_string(), payload["value"].clone());
                self.tasks_processed += 1;
                Ok(json!({ "action": "set", "key": key, "worker_id": self.worker_id }))
            }
            "get_total_count" => {
                let state = self.state.read().await;
                let total: u64 = state.values().filter_map(|v| v.as_u64()).sum();
                Ok(json!({
                    "total": total, "worker_id": self.worker_id,
                    "keys_processed": state.len()
                }))
            }
            "stats" => {
                let avg_time = if self.tasks_processed > 0 {
                    self.total_processing_time_ms / self.tasks_processed
                } else { 0 };
                Ok(json!({
                    "worker_id": self.worker_id,
                    "tasks_processed": self.tasks_processed,
                    "avg_processing_time_ms": avg_time,
                    "keys_in_state": self.state.read().await.len()
                }))
            }
            other => Err(BehaviorError::ProcessingError(format!("unknown action: {other}")))
        }
    }
}

The #[handler("*")] wildcard routes all messages to a single dispatch method — the worker decides what to do based on the action field. Each worker tracks its own processing statistics, so you can identify hot shards or slow workers.

The orchestration code shows all three data-parallel operations in sequence (bulk update, parallel map, parallel reduce):

// Create a pool of 20 workers with hash-based partitioning
let pool_id = client.create_worker_pool(
    "worker-pool-1", "worker", 20,
    PartitionStrategy::PartitionStrategyHash,
    HashMap::new(),
).await?;

// Bulk update: route 10,000 messages to the right shard by key
let mut updates = HashMap::new();
for i in 0..10_000 {
    let key = format!("key-{:05}", i);
    updates.insert(key.clone(), json!({ "action": "set", "key": key, "value": i }));
}
client.parallel_update(&pool_id, updates,
    ConsistencyLevel::ConsistencyLevelEventual, false).await?;

// Parallel map: query every worker simultaneously
let results = client.parallel_map(&pool_id,
    json!({ "action": "get_total_count" })).await?;
// => 20 responses, one per worker, each with its partition's total

// Parallel reduce: aggregate stats across all workers
let stats = client.parallel_reduce(&pool_id,
    json!({ "action": "stats" }),
    ShardGroupAggregationStrategy::ShardGroupAggregationConcat, 20).await?;
// => Combined stats: tasks_processed, avg_processing_time_ms per worker

parallel_update routes each key to its shard via consistent hashing: 10,000 messages fan out across 20 workers without the caller managing any routing logic. parallel_map broadcasts a query to every shard and collects results. parallel_reduce does the same but aggregates the responses using a configurable strategy (concat, sum, merge). This maps directly to distributed ML (partition model parameters across shards, push gradient updates through parallel_update, collect the full parameter set via parallel_map) or any workload that benefits from partitioned state with scatter-gather queries.
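The key-to-shard routing described above can be sketched without the framework. This uses simple modulo hashing over a SHA-256 digest for clarity; the framework also offers consistent hashing and range partitioning, and `shard_for` is my own illustrative helper, not a PlexSpaces function.

```python
# Sketch of hash-based shard routing: each key maps deterministically to
# one of N shards, so 10,000 writes fan out across 20 workers without
# the caller keeping any routing table. Illustrative helper, not the
# PlexSpaces API.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Route 10,000 keys across 20 workers, as in the bulk-update example.
shards = {}
for i in range(10_000):
    shards.setdefault(shard_for(f"key-{i:05}", 20), []).append(i)

assert len(shards) == 20                          # every worker gets work
assert sum(len(v) for v in shards.values()) == 10_000
```

Because the mapping is deterministic, a later read for the same key lands on the shard that owns it; that is what lets `parallel_update` skip caller-side routing logic entirely.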


TupleSpace: Linda’s Associative Memory for Coordination

During my PhD work on JavaNow, I was blown away by the simplicity of Linda’s tuple space model for writing dataflow-based applications that coordinate different actors. While actors communicate through direct message passing, tuple spaces provide associative shared memory where producers write tuples and consumers read or take them with blocking or non-blocking patterns. This decouples components in three dimensions: spatial (actors don’t need references to each other), temporal (producers and consumers don’t need to run simultaneously), and pattern-based (consumers retrieve data by structure, not by address).

from plexspaces import actor, handler, host
import json

@actor
class OrderProducer:
    @handler("create_order")
    def create_order(self, order_id: str, items: list) -> dict:
        # Write a tuple — any consumer can pick it up
        host.ts_write(json.dumps(["order", order_id, "pending", items]))
        return {"status": "created", "order_id": order_id}

@actor
class OrderProcessor:
    @handler("process_next")
    def process_next(self) -> dict:
        # Take the next pending order (destructive read — removes from space)
        pattern = json.dumps(["order", None, "pending", None])  # Wildcards
        result = host.ts_take(pattern)
        if result:
            data = json.loads(result)
            order_id = data[1]
            # Process order, then write completion tuple
            host.ts_write(json.dumps(["order", order_id, "completed", data[3]]))
            return {"processed": order_id}
        return {"status": "no_pending_orders"}

I use TupleSpace heavily for dataflow pipelines: each stage writes results as tuples, and downstream stages pick them up by pattern. Stages can run at different speeds, on different nodes, in different languages. The tuple space absorbs the mismatch.
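The matching semantics used in the example above (None as a wildcard, take as a destructive read) can be captured in a few lines of plain Python. This `TupleSpace` class is a toy of my own for illustration, not the PlexSpaces implementation, which is distributed and supports blocking reads.

```python
# Minimal in-memory tuple space: None is a wildcard in patterns, and
# take() removes the matched tuple so only one consumer can claim it.
# Illustrative toy, not the PlexSpaces implementation.

class TupleSpace:
    def __init__(self):
        self.tuples = []

    def write(self, tup):
        self.tuples.append(tup)

    def _matches(self, pattern, tup):
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup))

    def take(self, pattern):          # destructive read
        for tup in self.tuples:
            if self._matches(pattern, tup):
                self.tuples.remove(tup)
                return tup
        return None                   # non-blocking: nothing matched

ts = TupleSpace()
ts.write(("order", "o-1", "pending", ["sku-9"]))
match = ts.take(("order", None, "pending", None))   # wildcard id/items
assert match == ("order", "o-1", "pending", ["sku-9"])
assert ts.take(("order", None, "pending", None)) is None  # already taken
```

The destructive take is what makes the producer/consumer pair in the example safe: once OrderProcessor claims an order, no other processor can see it.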


Batteries Included: Everything You Need, Built In

At every company I’ve worked at, the first three months after adopting a framework go to integrating storage, messaging, and locks. PlexSpaces ships all of these as built-in services in the same codebase, no extra infrastructure, no service mesh.

What’s in the Box

Service           | Backends                                       | What It Does
Key-Value Store   | SQLite, PostgreSQL, Redis, DynamoDB            | Distributed KV storage with TTL
Blob Storage      | MinIO/S3, GCS, Azure Blob                      | Large object storage with presigned URLs
Distributed Locks | SQLite, PostgreSQL, Redis, DynamoDB            | Lease-based mutual exclusion
Process Groups    | Built-in                                       | Erlang pg2-style group messaging and pub/sub
Channels          | InMemory, Redis, Kafka, NATS, SQS, SQLite, UDP | Queue and topic messaging
Object Registry   | SQLite, PostgreSQL, DynamoDB                   | Service discovery with TTL + gossip
Observability     | Built-in                                       | Metrics (Prometheus), tracing (OpenTelemetry), structured logging
Security          | Built-in                                       | JWT auth (HTTP), mTLS (gRPC), tenant isolation, secret masking

PlexSpaces uses an adapter pattern to plug in different implementations of channels, object registry, and tuple space based on config. For example, it auto-selects the best available channel backend from a priority chain (Kafka -> SQS -> NATS -> ProcessGroup -> UDP Multicast -> InMemory). Start developing with in-memory channels, then deploy to production with Kafka without code changes. Actors using non-memory channels also support graceful shutdown: they stop accepting new messages but complete in-progress work.
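The priority-chain selection is simple enough to sketch directly. The `PRIORITY` list mirrors the chain named above; the function and set-based availability check are my own illustration of the idea, not the framework’s config API.

```python
# Sketch of backend auto-selection: walk the priority chain and pick the
# first backend that is actually available in this environment.
# Illustrative names, not the PlexSpaces config API.

PRIORITY = ["kafka", "sqs", "nats", "process_group",
            "udp_multicast", "in_memory"]

def select_channel_backend(available: set) -> str:
    for backend in PRIORITY:
        if backend in available:
            return backend
    return "in_memory"   # always available as the last resort

assert select_channel_backend({"nats", "in_memory"}) == "nats"
assert select_channel_backend({"in_memory"}) == "in_memory"
```

This is what makes the dev-to-prod story work: the same actor code calls the channel abstraction, and only the availability set changes between laptop and cluster.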

Multi-Tenancy: Enterprise-Grade Isolation

PlexSpaces enforces two-level tenant isolation. The tenant_id comes from JWT tokens (HTTP) or mTLS certificates (gRPC). The namespace provides sub-tenant isolation for environments/applications. All queries filter by tenant automatically at the repository layer. This gives you secure multi-tenant deployments without trusting application code to enforce boundaries.

Example: Payment Processing with Built-In Services

from plexspaces import actor, handler, host
import json

@actor(facets=["durability", "metrics"])
class PaymentProcessor:
    @handler("process_refund")
    def process_refund(self, tx_id: str, amount: int) -> dict:
        # Distributed lock prevents duplicate refunds
        lock_version = host.lock_acquire(f"refund:{tx_id}", 5000)
        if not lock_version:
            return {"error": "could_not_acquire_lock"}

        try:
            # Store refund record in built-in key-value store
            host.kv_put(f"refund:{tx_id}", json.dumps({
                "amount": amount, "status": "processed"
            }))
            return {"status": "refunded", "amount": amount}
        finally:
            host.lock_release(f"refund:{tx_id}", lock_version)

No Redis cluster to manage. No DynamoDB table to provision. The framework handles it.

Process Groups: Erlang pg2-Style Communication

Process groups provide distributed pub/sub and group messaging, which is one of Erlang’s most powerful patterns. Here’s a chat room that demonstrates joining, broadcasting, and member queries:

from plexspaces import actor, handler, host

@actor
class ChatRoom:
    @handler("join")
    def join_room(self, room_name: str) -> dict:
        actor_id = host.get_actor_id()
        host.process_groups.join(room_name, actor_id)
        return {"status": "joined", "room": room_name}

    @handler("send")
    def send_message(self, room_name: str, text: str) -> dict:
        host.process_groups.publish(room_name, {"text": text})
        return {"status": "sent"}

    @handler("members")
    def get_members(self, room_name: str) -> dict:
        members = host.process_groups.get_members(room_name)
        return {"room": room_name, "members": members}

Groups support topic-based subscriptions within a group and are scoped automatically by tenant_id and namespace.


Polyglot Development: One Server, Many Languages

A single PlexSpaces node hosts actors written in different languages simultaneously: Python ML models, TypeScript webhook handlers, and Rust performance-critical paths sharing the same actor runtime, storage services, and observability stack.

Same WASM module deploys anywhere: no Docker images, no container registries, no “it works on my machine”:

# Build and deploy to on-premises
plexspaces-py build ml_model.py -o ml_model.wasm
curl -X POST http://on-prem:8094/api/v1/deploy \
  -F "namespace=prod" -F "actor_type=ml_model" -F "wasm=@ml_model.wasm"

# Deploy to cloud — same command, same binary
curl -X POST http://cloud:8094/api/v1/deploy \
  -F "namespace=prod" -F "actor_type=ml_model" -F "wasm=@ml_model.wasm"

Common Patterns

Over three decades, I’ve watched the same architectural patterns emerge at every company and every scale. PlexSpaces supports the most important ones natively.

Durable Workflows with Signals and Queries

Long-running processes with automatic recovery, external signals, and read-only queries — think order fulfillment, onboarding flows, or CI/CD pipelines:

from plexspaces import workflow_actor, state, run_handler, signal_handler, query_handler

@workflow_actor(facets=["durability"])
class OrderWorkflow:
    order_id: str = state(default="")
    status: str = state(default="pending")
    steps_completed: list = state(default_factory=list)

    @run_handler
    def run(self, input_data: dict) -> dict:
        """Main execution — exclusive, one at a time."""
        self.order_id = input_data.get("order_id", "")
        self.status = "validating"
        self.steps_completed.append("validation")
        self.status = "charging"
        self.steps_completed.append("payment")
        self.status = "shipping"
        self.steps_completed.append("shipment")
        self.status = "completed"
        return {"status": "completed", "order_id": self.order_id}

    @signal_handler("cancel")
    def on_cancel(self, data: dict) -> None:
        """External signals can alter workflow state."""
        self.status = "cancelled"

    @query_handler("status")
    def get_status(self) -> dict:
        """Read-only queries can run concurrently with execution."""
        return {"order_id": self.order_id, "status": self.status,
                "steps": self.steps_completed}

Staged Event-Driven Architecture (SEDA)

Chain processing stages through channels. Each stage runs at its own pace, and channels provide natural backpressure.
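The staging idea can be sketched with bounded standard-library queues standing in for channels and threads standing in for actors. This is my own illustration of the pattern, not PlexSpaces code: a full queue blocks the upstream producer, which is exactly the backpressure the text describes.

```python
# SEDA sketch: each stage consumes from an inbound queue and writes to an
# outbound one. Bounded queues (maxsize) make a slow stage push back on
# its producer. Threads stand in for actors; queues stand in for channels.
import queue
import threading

def stage(fn, inbound, outbound):
    while True:
        item = inbound.get()
        if item is None:            # shutdown sentinel flows downstream
            outbound.put(None)
            break
        outbound.put(fn(item))      # blocks if the next stage is behind

q1, q2, q3 = (queue.Queue(maxsize=10) for _ in range(3))
threading.Thread(target=stage, args=(lambda x: x * 2, q1, q2)).start()
threading.Thread(target=stage, args=(lambda x: x + 1, q2, q3)).start()

for i in range(5):
    q1.put(i)
q1.put(None)                        # signal end of input

results = []
while (item := q3.get()) is not None:
    results.append(item)
print(results)   # [1, 3, 5, 7, 9]
```

In PlexSpaces the queues would be channels (in-memory in dev, Kafka or SQS in production) and each stage would be an actor, but the flow-control behavior is the same.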

Leader Election

Distributed locks elect a leader with lease-based failover. The leader holds a lock and renews it periodically. If the leader crashes, the lease expires and another candidate acquires leadership:

from plexspaces import actor, state, handler, host
import json

@actor
class LeaderElection:
    candidate_id: str = state(default="")
    lock_version: str = state(default="")

    @handler("try_lead")
    def try_lead(self, candidate_id: str = None) -> dict:
        holder_id = candidate_id or self.candidate_id
        result = host.lock_acquire("", "leader-election", holder_id, "leader", 30, 0)
        if result and not result.startswith("ERROR"):
            self.lock_version = json.loads(result).get("version", result)
            return {"leader": True, "candidate_id": holder_id}
        return {"leader": False}

Resource-Based Affinity

Label actors with hardware requirements (gpu: true, memory: high) and PlexSpaces schedules them on matching nodes. This maps naturally to ML training pipelines where different stages need different hardware.

Cellular Architecture

PlexSpaces organizes nodes into cells using the SWIM protocol (gossip-based node discovery). Cells provide fault isolation, geographic distribution, and low-latency routing to the nearest cell. Nodes within a cell share channels via the cluster_name configuration, enabling UDP multicast for low-latency cluster-wide messaging.


How PlexSpaces Compares

PlexSpaces doesn’t replace any single framework; it unifies patterns from many. Here’s what it borrows from each, and which limitation of each it addresses:

Framework               | What PlexSpaces Borrows                | Limitation PlexSpaces Addresses
Erlang/OTP              | GenServer, supervision, “let it crash” | BEAM-only; no polyglot WASM
Akka                    | Actor model, message passing           | No longer open source; JVM-only
Orleans                 | Virtual actors, grain lifecycle        | .NET-only; no tuple spaces or HPC
Temporal                | Durable workflows, replay              | Requires separate server infrastructure
Restate                 | Durable execution, journaling          | No full actor model; no HPC patterns
Ray                     | Distributed ML, parameter servers      | Python-centric; no durable execution
AWS Lambda              | Serverless invocation, auto-scaling    | Vendor lock-in; no local dev parity
Azure Durable Functions | Durable orchestration                  | Azure-only; limited language support
Golem Cloud             | WASM-based durability                  | No built-in storage/messaging/locks
Dapr                    | Sidecar service mesh, virtual actors   | Extra networking hop; state management limits

Key Differentiators

  • No service mesh: Built-in observability, security, and throttling eliminate the extra networking hop
  • Local-first: Same code runs on your laptop and in production. No cloud-only surprises.
  • Polyglot via WASM: Write actors in Python, Rust, TypeScript. Same deployment model.
  • Batteries included: KV store, blob storage, locks, channels, process groups — all built in
  • One abstraction: Composable facets on a unified actor, not a zoo of specialized types
  • Application server model: Deploy multiple polyglot applications to a single node
  • Research-grade + production-ready: Linda tuple spaces, MPI patterns, and Erlang supervision in a single framework

Getting Started

Install and Run

# Docker (fastest)
docker run -p 8080:8080 -p 8000:8000 -p 8001:8001 plexobject/plexspaces:latest

# From source
git clone https://github.com/bhatti/PlexSpaces.git
cd PlexSpaces && make build

Write -> Build -> Deploy -> Invoke

# greeter.py
from plexspaces import actor, state, handler

@actor
class GreeterActor:
    greetings_count: int = state(default=0)

    @handler("greet")
    def greet(self, name: str = "World") -> dict:
        self.greetings_count += 1
        return {"message": f"Hello, {name}!", "total": self.greetings_count}

plexspaces-py build greeter.py -o greeter.wasm
curl -X POST http://localhost:8094/api/v1/deploy \
  -F "namespace=default" -F "actor_type=greeter" -F "wasm=@greeter.wasm"
curl -X POST "http://localhost:8080/api/v1/actors/default/default/greeter?invocation=call" \
  -H "Content-Type: application/json" -d '{"action":"greet","name":"PlexSpaces"}'
# => {"message": "Hello, PlexSpaces!", "total": 1}

Explore more in the examples directory: bank accounts with durability, task queues with distributed locks, leader election, chat rooms with process groups, and more.


Lessons Learned

After decades of distributed systems, I keep returning to the same truths:

  • Efficiency matters. Respect the transport layer. Binary protocols with schemas outperform JSON for high-throughput systems.
  • Contracts prevent chaos. Define APIs before implementations. Generate code from schemas.
  • Simplicity defeats complexity. Every framework that collapsed (EJB, SOAP, CORBA) did so under the weight of accidental complexity. One powerful abstraction beats ten specialized ones.
  • Developer experience decides adoption. If your framework requires 100 lines of boilerplate for a counter, developers will choose the one that needs 15.
  • Local and production must match. Every bug I’ve seen that “only happens in production” stemmed from environmental differences.
  • Cross-cutting concerns belong in the platform. Scatter them across codebases and you get inconsistency. Centralize them in a service mesh and you get latency. Build them in.
  • Multiple coordination primitives solve multiple problems. Actors handle request-reply. Channels handle pub/sub. Tuple spaces handle coordination. Process groups handle broadcast. Real systems need all of them.

The distributed systems landscape keeps changing: WASM is maturing, AI agents are creating new coordination challenges, and enterprises are pushing back on vendor lock-in harder than ever. I believe the next generation of frameworks will converge on the patterns PlexSpaces brings together: polyglot runtimes, durable actors, built-in infrastructure, and local-first deployment. PlexSpaces distills years of lessons into a single framework. It’s the framework I wished existed at every company I’ve worked for: one that handles the infrastructure so I can focus on the problem.


PlexSpaces is open source at github.com/bhatti/PlexSpaces. Try the counter example and provide your feedback.

December 11, 2025

50+ Languages in Three Decades

Filed under: Computing — admin @ 10:53 am

Here are the programming languages I’ve used over the last three decades. From BASIC in the late 80s to Rust today, each one taught me something about solving problems with code.


Late 1980s – Early 1990s

  • I learned coding with BASIC/QuickBASIC on an Atari and later an IBM XT in the late 1980s.
  • I learned other languages in college or on my own, including C, Pascal, Prolog, Lisp, FORTRAN, and Perl.
  • In college, I used Icon to build compilers.
  • My first job was mainframe work and I used COBOL and CICS for applications, JCL and REXX for scripting and SAS for data processing.
  • Later at a physics lab, I used C/C++, Fortran for applications and Python for scripting and glue language.
  • I used a number of 4GL languages like dBase, FoxPro, Paradox, Delphi. Later I used Visual Basic and PowerBuilder for building client applications.
  • I used SQL and PL/SQL throughout my career for relational databases.

Web Era Mid/Late 1990s

  • The web era introduced a number of new languages like HTML, Javascript, CSS, ColdFusion, and Java.
  • I used XML/XSLT/XPath/XQuery, PHP, VBScript and ActionScript.
  • I used RSS/SPARQL/RDF for building semantic web applications.
  • I used IDL/CORBA for building distributed systems.

Mobile/Services Era 2000s

  • I used Ruby for building web applications, Erlang/Elixir for building concurrent applications.
  • I used Groovy for writing tests and R for data analysis.
  • When iOS was released, I used Objective-C to build mobile applications.
  • In this era, functional languages gained popularity and I used Scala/Haskell/Clojure for some projects.

New Languages Era Mid 2010s

  • I started using Go for networking/concurrent applications.
  • I started using Swift for iOS applications and Kotlin for Android apps.
  • I initially used Flow language from Facebook but then started using TypeScript instead of JavaScript.
  • I used Dart for Flutter applications.
  • I used GraphQL for some of client friendly backend APIs.
  • I used Solidity for Ethereum smart contracts.
  • I used Lua as a glue language with Redis, HAProxy and other similar systems.
  • I started using Rust, and it became my go-to language for highly performant applications.

What Three Decades of Languages Taught Me

  1. Every language is a bet on what matters most: Safety vs. speed vs. expressiveness vs. ecosystem vs. hiring.
  2. Languages don’t die, they fade: I still see COBOL in production. I still debug Perl scripts. Legacy is measured in decades.
  3. The fundamentals never change: Whether it’s BASIC or Rust, you’re still managing state, controlling flow, and abstracting complexity.
  4. Polyglotism is a superpower: Each language teaches you a different way to think. Functional programming makes you better at OOP. Systems programming makes you better at scripting.
  5. The best language is the one your team can maintain: I’ve seen beautiful Scala codebases become liabilities and ugly PHP applications become billion-dollar businesses.

What’s Next?

I’m watching Zig (Rust without the complexity?), and it’s next on my list of languages to learn.

Summary

#  | Language         | Era                      | Primary Use
1  | BASIC/QuickBASIC | Late 1980s               | Learning
2  | C                | Late 1980s – Early 1990s | College/Applications
3  | Pascal           | Late 1980s – Early 1990s | College
4  | Prolog           | Late 1980s – Early 1990s | College
5  | Lisp             | Late 1980s – Early 1990s | College
6  | FORTRAN          | Late 1980s – Early 1990s | College/Physics Lab
7  | Perl             | Late 1980s – Early 1990s | Self-taught
8  | Icon             | Late 1980s – Early 1990s | Compiler Building
9  | COBOL            | Early 1990s              | Mainframe Applications
10 | CICS             | Early 1990s              | Mainframe Applications
11 | JCL              | Early 1990s              | Mainframe Scripting
12 | REXX             | Early 1990s              | Mainframe Scripting
13 | SAS              | Early 1990s              | Data Processing
14 | C++              | Early 1990s              | Physics Lab Applications
15 | Python           | Early 1990s              | Scripting/Glue Language
16 | dBase            | Early 1990s              | 4GL Applications
17 | FoxPro           | Early 1990s              | 4GL Applications
18 | Paradox          | Early 1990s              | 4GL Applications
19 | Delphi           | Early 1990s              | 4GL Applications
20 | Visual Basic     | Early 1990s              | Client Applications
21 | PowerBuilder     | Early 1990s              | Client Applications
22 | SQL              | Early 1990s – Present    | Relational Databases
23 | PL/SQL           | Early 1990s – Present    | Relational Databases
24 | HTML             | Mid/Late 1990s           | Web Development
25 | JavaScript       | Mid/Late 1990s           | Web Development
26 | CSS              | Mid/Late 1990s           | Web Styling
27 | ColdFusion       | Mid/Late 1990s           | Web Development
28 | Java             | Mid/Late 1990s           | Web Development
29 | XML              | Mid/Late 1990s           | Data/Configuration
30 | XSLT             | Mid/Late 1990s           | Data Transformation
31 | XPath            | Mid/Late 1990s           | XML Querying
32 | XQuery           | Mid/Late 1990s           | XML Querying
33 | PHP              | Mid/Late 1990s           | Web Development
34 | VBScript         | Mid/Late 1990s           | Scripting
35 | ActionScript     | Mid/Late 1990s           | Flash Development
36 | RSS              | Mid/Late 1990s           | Semantic Web
37 | SPARQL           | Mid/Late 1990s           | Semantic Web
38 | RDF              | Mid/Late 1990s           | Semantic Web
39 | IDL              | Mid/Late 1990s           | Distributed Systems
40 | CORBA            | Mid/Late 1990s           | Distributed Systems
41 | Ruby             | 2000s                    | Web Applications
42 | Erlang           | 2000s                    | Concurrent Applications
43 | Elixir           | 2000s                    | Concurrent Applications
44 | Groovy           | 2000s                    | Testing
45 | R                | 2000s                    | Data Analysis
46 | Objective-C      | 2000s                    | iOS Applications
47 | Scala            | 2000s                    | Functional Programming
48 | Haskell          | 2000s                    | Functional Programming
49 | Clojure          | 2000s                    | Functional Programming
50 | Go               | Mid 2010s                | Networking/Concurrent Apps
51 | Swift            | Mid 2010s                | iOS Applications
52 | Kotlin           | Mid 2010s                | Android Applications
53 | Flow             | Mid 2010s                | Type Checking
54 | TypeScript       | Mid 2010s                | JavaScript Alternative
55 | Dart             | Mid 2010s                | Flutter Applications
56 | GraphQL          | Mid 2010s                | Backend APIs
57 | Solidity         | Mid 2010s                | Smart Contracts
58 | Lua              | Mid 2010s                | Glue Language
59 | Rust             | Mid 2010s – Present      | High Performance Apps

December 3, 2025

Building Production-Grade AI Agents with MCP & A2A: A Complete Guide from the Trenches

Filed under: Computing,Uncategorized — admin @ 12:36 pm

Problem Statement

I’ve spent the last year building AI agents in enterprise environments. During this time, I’ve extensively applied emerging standards like Model Context Protocol (MCP) from Anthropic and the more recent Agent-to-Agent (A2A) Protocol for agent communication and coordination. What I’ve learned: there’s a massive gap between building a quick proof-of-concept with these protocols and deploying a production-grade system. The concerns that get overlooked in production deployments are exactly what will take you down at 3 AM:

  • Multi-tenant isolation with row-level security (because one leaked document = lawsuit)
  • JWT-based authentication across microservices (no shared sessions, fully stateless)
  • Real-time observability of agent actions (when agents misbehave, you need to know WHY)
  • Cost tracking and budgeting per user and model (because OpenAI bills compound FAST)
  • Hybrid search combining BM25 and vector embeddings (keyword matching + semantic understanding)
  • Graceful degradation when embeddings aren’t available (real data is messy)
  • Integration testing against real databases (mocks lie to you)

Disregarding security concerns can lead to incidents like the Salesloft breach where their AI chatbot inadvertently stored authentication tokens for hundreds of services, which exposed customer data across multiple platforms. More recently in October 2025, Filevine (a billion-dollar legal AI platform) exposed 100,000+ confidential legal documents through an unauthenticated API endpoint that returned full admin tokens to their Box filesystem. No authentication required, just a simple API call. I’ve personally witnessed security issues from inadequate AuthN/AuthZ controls and cost overruns exceeding hundreds of thousands of dollars, which are preventable with proper security and budget enforcement.

The good news is that MCP and A2A protocols provide the foundation to solve these problems. Most articles treat these as competing standards but they are complementary. In this guide, I’ll show you exactly how to combine MCP and A2A to build a system that handles real production concerns: multi-tenancy, authentication, cost control, and observability.

Reference Implementation

To demonstrate these concepts in action, I’ve built a reference implementation that showcases production-ready patterns.

Architecture Philosophy:

Three principles guided every decision:

  1. Go for servers, Python for workflows – Use the right tool for each job. Go handles high-throughput protocol servers. Python handles AI workflows.
  2. Database-level security – Multi-tenancy enforced via PostgreSQL row-level security (RLS), not application code. Impossible to bypass accidentally.
  3. Stateless everything – Every service can scale horizontally. No sticky sessions, no shared state, no single points of failure.

All containerized, fully tested, and ready for production deployment.

Tech Stack Summary:

  • Go 1.22 (protocol servers)
  • Python 3.11 (AI workflows)
  • PostgreSQL 16 + pgvector (vector search with RLS)
  • Ollama (local LLM)
  • Docker Compose (local development)
  • Kubernetes manifests (production deployment)

GitHub: Complete implementation available

But before we dive into the implementation, let’s understand the fundamental problem these protocols solve and why you need both.


Part 1: Understanding MCP and A2A

The Core Problem: Integration Chaos

Prior to the MCP protocol in 2024, you had to build custom integrations with LLM providers, data sources, and AI frameworks. Every AI application had to reinvent authentication, data access, and orchestration, which doesn’t scale. MCP and A2A emerged to solve different aspects of this chaos:

The MCP Side: Standardized Tool Execution

Think of MCP as a standardized toolbox for AI models. Instead of every AI application writing custom integrations for databases, APIs, and file systems, MCP provides a JSON-RPC 2.0 protocol that models use to:

  • Call tools (search documents, retrieve data, update records)
  • Access resources (files, databases, APIs)
  • Send prompts (inject context into model calls)

From the MCP vs A2A comparison:

“MCP excels at synchronous, stateless tool execution. It’s perfect when you need an AI model to retrieve information, execute a function, and return results immediately.”

Here’s what MCP looks like in practice:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "hybrid_search",
    "arguments": {
      "query": "machine learning best practices",
      "limit": 5,
      "bm25_weight": 0.5,
      "vector_weight": 0.5
    }
  }
}

The server executes the tool and returns results. Simple, stateless, fast.

Why JSON-RPC 2.0? Because it’s:

  • Language-agnostic – Works with any language that speaks HTTP
  • Batch-capable – Multiple requests in one HTTP call
  • Error-standardized – Consistent error codes across implementations
  • Widely adopted – 20+ years of production battle-testing
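To make the envelope concrete, here is a minimal JSON-RPC 2.0 request builder in a few lines of Python. This is an illustrative sketch, not part of the reference implementation; the method and tool names mirror the example request above:

```python
import itertools
import json

# JSON-RPC ids must be unique per session; a counter is the simplest source
_ids = itertools.count(1)

def make_request(method: str, params: dict) -> str:
    """Build a JSON-RPC 2.0 request envelope like the one MCP sends."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": method,
        "params": params,
    })

def make_batch(requests: list) -> str:
    """Batch several serialized requests into a single HTTP payload."""
    return json.dumps([json.loads(r) for r in requests])

req = make_request("tools/call", {
    "name": "hybrid_search",
    "arguments": {"query": "machine learning best practices", "limit": 5},
})
```

Batching is just a JSON array of request objects, which is what makes the protocol’s batch capability nearly free to support.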

The A2A Side: Stateful Workflow Orchestration

A2A handles what MCP doesn’t: multi-step, stateful workflows where agents collaborate. From the A2A Protocol docs:

“A2A is designed for asynchronous, stateful orchestration of complex tasks that require multiple steps, agent coordination, and long-running processes.”

A2A provides:

  • Task creation and management with persistent state
  • Real-time streaming of progress updates (Server-Sent Events)
  • Agent coordination across multiple services
  • Artifact management for intermediate results

Why Both Protocols Matter

Here’s a real scenario from my fintech work that illustrates why you need both:

Use Case: Compliance analyst needs to research a company across 10,000 documents, verify regulatory compliance, cross-reference with SEC filings, and generate an audit-ready report.

With MCP alone:

  • ✗ No way to track multi-step progress
  • ✗ Can’t coordinate multiple tools
  • ✗ No intermediate result storage
  • ✗ Client must orchestrate everything

With A2A alone:

  • ✗ Every tool is custom-integrated
  • ✗ No standardized data access
  • ✗ Reinventing authentication per tool
  • ✗ Coupling agent logic to data sources

With MCP + A2A:

  • ✓ A2A orchestrates the multi-step workflow
  • ✓ MCP provides standardized tool execution
  • ✓ Real-time progress via SSE
  • ✓ Stateful coordination with stateless tools
  • ✓ Authentication handled once (JWT in MCP)
  • ✓ Intermediate results stored as artifacts

As noted in OneReach’s guide:

“Use MCP when you need fast, stateless tool execution. Use A2A when you need complex, stateful orchestration. Use both when building production systems.”


Part 2: Architecture

System Overview

Key Design Decisions

Protocol Servers (Go):

  • MCP Server – Secure document retrieval with pgvector and hybrid search. Go’s concurrency model handles 5,000+ req/sec, and its type safety catches integration bugs at compile time (not at runtime).
  • A2A Server – Multi-step workflow orchestration with Server-Sent Events for real-time progress tracking. Stateless design enables horizontal scaling.

AI Workflows (Python):

  • LangGraph Workflows – RAG, research, and hybrid pipelines. Python was the right choice here because the AI ecosystem (LangChain, embeddings, model integrations) lives in Python.

User Interface & Database:

  • Streamlit UI – Production-ready authentication, search interface, cost tracking dashboard, and real-time task streaming
  • PostgreSQL with pgvector – Multi-tenant document storage with row-level security policies enforced at the database level (not application level)
  • Ollama – Local LLM inference for development and testing (no OpenAI API keys required)

Database Security:

Application-level tenant filtering is not enough, so tenant isolation is enforced with row-level security policies:

// ✗ BAD: Application-level filtering (can be bypassed)
func GetDocuments(tenantID string) ([]Document, error) {
    query := "SELECT * FROM documents WHERE tenant_id = ?"
    // What if someone forgets the WHERE clause?
    // What if there's a SQL injection?
    // What if a bug skips this check?
}

-- ✓ GOOD: Database-level Row-Level Security (impossible to bypass)
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON documents
    USING (tenant_id = current_setting('app.current_tenant_id')::uuid);

Every query automatically filters by tenant so there is no way to accidentally leak data. Even if your application has a bug, the database enforces isolation.

JWT Authentication

MCP server and UI share RSA keys for token verification, which provides:

  • Asymmetric: MCP server only needs public key (can’t forge tokens)
  • Rotation: Rotate private key without redeploying services
  • Auditability: Know which key signed which token
  • Standard: Widely supported, well-understood

// mcp-server/internal/auth/jwt.go
func (v *JWTValidator) ValidateToken(tokenString string) (*Claims, error) {
    token, err := jwt.ParseWithClaims(tokenString, &Claims{}, func(token *jwt.Token) (interface{}, error) {
        if _, ok := token.Method.(*jwt.SigningMethodRSA); !ok {
            return nil, fmt.Errorf("unexpected signing method: %v", token.Header["alg"])
        }
        return v.publicKey, nil
    })

    if err != nil {
        return nil, fmt.Errorf("failed to parse token: %w", err)
    }

    claims, ok := token.Claims.(*Claims)
    if !ok || !token.Valid {
        return nil, fmt.Errorf("invalid token claims")
    }

    return claims, nil
}

Tokens are validated on every request—no session state, fully stateless.

Hybrid Search

In some of my past RAG implementations, I used vector search alone, which is not enough for production RAG.

Why hybrid search matters:

| Scenario | BM25 (Keyword) | Vector (Semantic) | Hybrid |
|---|---|---|---|
| Exact term: “GDPR Article 17” | ✓ Perfect | ✗ Misses | ✓ Perfect |
| Concept: “right to be forgotten” | ✗ Misses | ✓ Good | ✓ Perfect |
| Legal citation: “Smith v. Jones 2024” | ✓ Perfect | ✗ Poor | ✓ Perfect |
| Misspelling: “machien learning” | ✗ Misses | ✓ Finds | ✓ Finds |

Real-world example from my fintech work:

Query: "SEC disclosure requirements GDPR data breach"

Vector-only results:
1. "Privacy Policy" (0.87 similarity)
2. "Data Protection Guide" (0.84 similarity)  
3. "General Security Practices" (0.81 similarity)
✗ Missed: Actual SEC regulation text

Hybrid results (0.5 BM25 + 0.5 Vector):
1. "SEC Rule 10b-5 Disclosure Requirements" (0.92 combined)
2. "GDPR Article 33 Breach Notification" (0.89 combined)
3. "Cross-Border Regulatory Compliance" (0.85 combined)
✓ Found: Exactly what we needed

The reference implementation (hybrid_search.go) uses PostgreSQL’s full-text search (BM25-like) combined with pgvector:

// Hybrid search query using Reciprocal Rank Fusion
query := `
    WITH bm25_results AS (
        SELECT
            id,
            ts_rank_cd(
                to_tsvector('english', title || ' ' || content),
                plainto_tsquery('english', $1)
            ) AS bm25_score,
            ROW_NUMBER() OVER (
                ORDER BY ts_rank_cd(
                    to_tsvector('english', title || ' ' || content),
                    plainto_tsquery('english', $1)
                ) DESC
            ) AS bm25_rank
        FROM documents
        WHERE to_tsvector('english', title || ' ' || content) @@ plainto_tsquery('english', $1)
    ),
    vector_results AS (
        SELECT
            id,
            1 - (embedding <=> $2) AS vector_score,
            ROW_NUMBER() OVER (ORDER BY embedding <=> $2) AS vector_rank
        FROM documents
        WHERE embedding IS NOT NULL
    ),
    combined AS (
        SELECT
            COALESCE(b.id, v.id) AS id,
            -- Reciprocal Rank Fusion score
            (
                COALESCE(1.0 / (60 + b.bm25_rank), 0) * $3 +
                COALESCE(1.0 / (60 + v.vector_rank), 0) * $4
            ) AS combined_score
        FROM bm25_results b
        FULL OUTER JOIN vector_results v ON b.id = v.id
    )
    SELECT * FROM combined
    ORDER BY combined_score DESC
    LIMIT $5
`

Why Reciprocal Rank Fusion (RRF)? Because:

  • Score normalization: BM25 scores and vector similarities aren’t comparable
  • Rank-based: Uses position, not raw scores
  • Research-backed: Used by search engines (Elasticsearch, Vespa)
  • Tunable: Adjust k parameter (60 in our case) for different behaviors
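The RRF formula in that query is easy to verify in isolation. Here is a standalone sketch (illustrative, pure Python, with the same k = 60 and COALESCE-to-zero behavior as the SQL above) that fuses two rank lists:

```python
def rrf_fuse(bm25_ranked, vector_ranked, bm25_w=0.5, vector_w=0.5, k=60):
    """Reciprocal Rank Fusion: score(d) = w1/(k + rank1) + w2/(k + rank2).

    A document missing from one list contributes 0 for that list,
    mirroring COALESCE(1.0 / (60 + rank), 0) in the SQL."""
    bm25_rank = {doc: i + 1 for i, doc in enumerate(bm25_ranked)}
    vec_rank = {doc: i + 1 for i, doc in enumerate(vector_ranked)}
    scores = {}
    for doc in set(bm25_rank) | set(vec_rank):
        score = 0.0
        if doc in bm25_rank:
            score += bm25_w / (k + bm25_rank[doc])
        if doc in vec_rank:
            score += vector_w / (k + vec_rank[doc])
        scores[doc] = score
    # Highest fused score first, just like ORDER BY combined_score DESC
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["sec_rule", "gdpr_33", "privacy"],
                 ["gdpr_33", "privacy", "sec_rule"])
```

Note that the raw BM25 and cosine scores never enter the formula, only the ranks, which is exactly why RRF sidesteps the score-normalization problem.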

Part 3: The MCP Server – Secure Document Retrieval

Understanding JSON-RPC 2.0

Before we dive into implementation, let’s understand why MCP chose JSON-RPC 2.0.

JSON-RPC 2.0 Request Structure:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "hybrid_search",
    "arguments": {"query": "machine learning", "limit": 10}
  }
}

JSON-RPC 2.0 Response Structure:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": [{
      "type": "text",
      "text": "[{\"doc_id\": \"123\", \"title\": \"ML Guide\", ...}]"
    }],
    "isError": false
  }
}

Error Response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "error": {
    "code": -32602,
    "message": "Invalid params",
    "data": {"field": "query", "reason": "required"}
  }
}

Standard Error Codes:

  • -32700: Parse error (invalid JSON)
  • -32600: Invalid request (missing required fields)
  • -32601: Method not found
  • -32602: Invalid params
  • -32603: Internal error

Custom MCP Error Codes:

  • -32001: Authentication required
  • -32002: Authorization failed
  • -32003: Rate limit exceeded
  • -32004: Resource not found
  • -32005: Validation error
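A server can keep these codes honest by mapping them to spec-compliant error objects in one place. A sketch (the code table comes from the lists above; the helper itself is illustrative):

```python
import json

JSONRPC_ERRORS = {
    # Standard JSON-RPC 2.0 codes
    -32700: "Parse error",
    -32600: "Invalid request",
    -32601: "Method not found",
    -32602: "Invalid params",
    -32603: "Internal error",
    # Custom MCP range
    -32001: "Authentication required",
    -32002: "Authorization failed",
    -32003: "Rate limit exceeded",
    -32004: "Resource not found",
    -32005: "Validation error",
}

def error_response(request_id, code, data=None):
    """Build a JSON-RPC 2.0 error response with a standard message."""
    err = {"code": code, "message": JSONRPC_ERRORS.get(code, "Unknown error")}
    if data is not None:
        err["data"] = data
    return json.dumps({"jsonrpc": "2.0", "id": request_id, "error": err})

resp = error_response(1, -32602, {"field": "query", "reason": "required"})
```

Keeping the table in one module means every handler returns the same message for the same failure, which matters when clients branch on error codes.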

MCP Tool Implementation

MCP tools follow a standard interface:

// mcp-server/internal/tools/tool.go
type Tool interface {
    Definition() protocol.ToolDefinition
    Execute(ctx context.Context, args map[string]interface{}) (protocol.ToolCallResult, error)
}

Here’s the complete hybrid search tool (hybrid_search.go) implementation with detailed comments:

// mcp-server/internal/tools/hybrid_search.go
type HybridSearchTool struct {
    db database.Store
}

func (t *HybridSearchTool) Execute(ctx context.Context, args map[string]interface{}) (protocol.ToolCallResult, error) {
    // 1. AUTHENTICATION: Extract tenant from JWT claims
    //    This happens at middleware level, but we verify here
    tenantID, ok := ctx.Value(auth.ContextKeyTenantID).(string)
    if !ok {
        return protocol.ToolCallResult{IsError: true}, fmt.Errorf("tenant ID not found in context")
    }

    // 2. PARAMETER PARSING: Extract and validate arguments
    query, _ := args["query"].(string)
    if query == "" {
        return protocol.ToolCallResult{IsError: true}, fmt.Errorf("query is required")
    }
    
    limit, _ := args["limit"].(float64)
    if limit <= 0 {
        limit = 10 // default
    }
    if limit > 50 {
        limit = 50 // max cap
    }
    
    bm25Weight, _ := args["bm25_weight"].(float64)
    vectorWeight, _ := args["vector_weight"].(float64)
    
    // 3. WEIGHT NORMALIZATION: Ensure weights sum to 1.0
    if bm25Weight == 0 && vectorWeight == 0 {
        bm25Weight = 0.5
        vectorWeight = 0.5
    }

    // 4. EMBEDDING GENERATION: Using Ollama for query embedding
    var embedding []float32
    if vectorWeight > 0 {
        embedding = generateEmbedding(query) // Calls Ollama API
    }

    // 5. DATABASE QUERY: Execute hybrid search with RLS
    params := database.HybridSearchParams{
        Query:        query,
        Embedding:    embedding,
        Limit:        int(limit),
        BM25Weight:   bm25Weight,
        VectorWeight: vectorWeight,
    }

    results, err := t.db.HybridSearch(ctx, tenantID, params)
    if err != nil {
        return protocol.ToolCallResult{IsError: true}, err
    }

    // 6. RESPONSE FORMATTING: Convert to JSON for client
    jsonData, _ := json.Marshal(results)
    return protocol.ToolCallResult{
        Content: []protocol.ContentBlock{{Type: "text", Text: string(jsonData)}},
        IsError: false,
    }, nil
}

The NULL Embedding Problem

Real-world data is messy. Not every document has an embedding. Here’s what happened:

Initial Implementation (Broken):

// ✗ This crashes with NULL embeddings
var embedding pgvector.Vector

err = tx.QueryRow(ctx, query, docID).Scan(
    &doc.ID,
    &doc.TenantID,
    &doc.Title,
    &doc.Content,
    &doc.Metadata,
    &embedding, // CRASH: can't scan <nil> into pgvector.Vector
    &doc.CreatedAt,
    &doc.UpdatedAt,
)

Error:

can't scan into dest[5]: unsupported data type: <nil>

The Fix (Correct):

// ✓ Use pointer types for nullable fields
var embedding *pgvector.Vector // Pointer allows NULL

err = tx.QueryRow(ctx, query, docID).Scan(
    &doc.ID,
    &doc.TenantID,
    &doc.Title,
    &doc.Content,
    &doc.Metadata,
    &embedding, // Can be NULL now
    &doc.CreatedAt,
    &doc.UpdatedAt,
)

// Handle NULL embeddings gracefully
if embedding != nil && embedding.Slice() != nil {
    doc.Embedding = embedding.Slice()
} else {
    doc.Embedding = nil // Explicitly set to nil
}

return doc, nil

Hybrid search handles this elegantly—documents without embeddings get vector_score = 0 but still appear in results if they match BM25:

-- Hybrid search handles NULL embeddings gracefully
WITH bm25_results AS (
    SELECT id, ts_rank(to_tsvector('english', content), query) AS bm25_score
    FROM documents
    WHERE to_tsvector('english', content) @@ query
),
vector_results AS (
    SELECT id, 1 - (embedding <=> $1) AS vector_score
    FROM documents
    WHERE embedding IS NOT NULL  -- ✓ Skip NULL embeddings
)
SELECT
    d.*,
    COALESCE(b.bm25_score, 0) AS bm25_score,
    COALESCE(v.vector_score, 0) AS vector_score,
    ($2 * COALESCE(b.bm25_score, 0) + $3 * COALESCE(v.vector_score, 0)) AS combined_score
FROM documents d
LEFT JOIN bm25_results b ON d.id = b.id
LEFT JOIN vector_results v ON d.id = v.id
WHERE COALESCE(b.bm25_score, 0) > 0 OR COALESCE(v.vector_score, 0) > 0
ORDER BY combined_score DESC
LIMIT $4;

Why this matters:

  • ✓ Documents without embeddings still searchable (BM25)
  • ✓ New documents usable immediately (embeddings generated async)
  • ✓ System degrades gracefully (not all-or-nothing)
  • ✓ Zero downtime for embedding model updates
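The same degradation can be applied on the query side: if the embedding service is unreachable, fall back to BM25-only weights instead of failing the search. A minimal sketch (the helper and its signature are illustrative, not part of the reference implementation):

```python
def choose_weights(query_embedding, bm25_weight=0.5, vector_weight=0.5):
    """Fall back to keyword-only search when no query embedding is available."""
    if query_embedding is None:
        return 1.0, 0.0  # put all weight on BM25; search still succeeds
    return bm25_weight, vector_weight

# Embedding service reachable: use the requested hybrid weights
hybrid = choose_weights([0.12, 0.98])
# Embedding unavailable: degrade to pure keyword search instead of erroring
degraded = choose_weights(None)
```

The caller never sees an error, only slightly worse semantic recall until the embedding service recovers.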

Tenant Isolation in Action

Every MCP request sets the tenant context at the database transaction level:

// mcp-server/internal/database/postgres.go
func (db *DB) SetTenantContext(ctx context.Context, tx pgx.Tx, tenantID string) error {
    // Note: SET commands don't support parameter binding
    // TenantID is validated as UUID by JWT validator, so this is safe
    query := fmt.Sprintf("SET LOCAL app.current_tenant_id = '%s'", tenantID)
    _, err := tx.Exec(ctx, query)
    return err
}

Combined with RLS policies, this ensures complete tenant isolation at the database level.
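The Sprintf in SetTenantContext is only safe because the tenant ID has already been validated as a UUID. The same guard, sketched in Python for illustration (the function name is hypothetical):

```python
import uuid

def set_tenant_sql(tenant_id: str) -> str:
    """Build the SET LOCAL statement, refusing anything that is not a UUID.

    SET commands don't support parameter binding, so validating the value
    before interpolation is what makes the string formatting safe."""
    uuid.UUID(tenant_id)  # raises ValueError on anything malformed
    return f"SET LOCAL app.current_tenant_id = '{tenant_id}'"

stmt = set_tenant_sql("12345678-1234-5678-1234-567812345678")
```

Any injection attempt fails the UUID parse and never reaches the database.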

Real-world security test:

// Integration test: Verify tenant isolation
func TestTenantIsolation(t *testing.T) {
    // Create documents for two tenants
    tenant1Doc := createDocument(t, db, "tenant-1", "Secret Data A")
    tenant2Doc := createDocument(t, db, "tenant-2", "Secret Data B")
    
    // Query as tenant-1
    ctx1 := contextWithTenant(ctx, "tenant-1")
    results1, _ := db.ListDocuments(ctx1, "tenant-1", ListParams{Limit: 100})
    
    // Query as tenant-2
    ctx2 := contextWithTenant(ctx, "tenant-2")
    results2, _ := db.ListDocuments(ctx2, "tenant-2", ListParams{Limit: 100})
    
    // Assertions
    assert.Contains(t, results1, tenant1Doc)
    assert.NotContains(t, results1, tenant2Doc) // ✓ Cannot see other tenant
    
    assert.Contains(t, results2, tenant2Doc)
    assert.NotContains(t, results2, tenant1Doc) // ✓ Cannot see other tenant
}

Part 4: The A2A Server – Workflow Orchestration

Task Lifecycle

A2A manages stateful tasks through their entire lifecycle:

Server-Sent Events for Real-Time Updates

Why SSE instead of WebSockets?

| Feature | SSE | WebSocket |
|---|---|---|
| Unidirectional | ✓ Yes (server→client) | ✗ No (bidirectional) |
| HTTP/2 multiplexing | ✓ Yes | ✗ No |
| Automatic reconnection | ✓ Built-in | ✗ Manual |
| Firewall-friendly | ✓ Yes (HTTP) | ⚠ Sometimes blocked |
| Complexity | ✓ Simple | ✗ Complex |
| Browser support | ✓ All modern | ✓ All modern |

SSE is perfect for agent progress updates because:

  • One-way communication (server pushes updates)
  • Simple implementation
  • Automatic reconnection
  • Works through corporate firewalls

SSE provides real-time streaming without WebSocket complexity:

// a2a-server/internal/handlers/tasks.go
func (h *TaskHandler) StreamEvents(w http.ResponseWriter, r *http.Request) {
    taskID := chi.URLParam(r, "taskId")

    // Set SSE headers
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")
    w.Header().Set("Connection", "keep-alive")
    w.Header().Set("Access-Control-Allow-Origin", "*")

    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "Streaming not supported", http.StatusInternalServerError)
        return
    }

    // Stream task events
    for {
        event := h.taskManager.GetNextEvent(taskID)
        if event == nil {
            break // Task complete
        }

        // Format as SSE event
        data, _ := json.Marshal(event)
        fmt.Fprintf(w, "event: task_update\n")
        fmt.Fprintf(w, "data: %s\n\n", data)
        flusher.Flush()

        if event.Status == "completed" || event.Status == "failed" {
            break
        }
    }
}
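The wire format those Fprintf calls produce is plain text, so round-tripping a frame needs nothing but string handling. A stdlib-only sketch (illustrative):

```python
def format_sse(event: str, data: str) -> str:
    """Produce one Server-Sent Events frame: event and data lines, blank-line terminated."""
    return f"event: {event}\ndata: {data}\n\n"

def parse_sse(frame: str) -> dict:
    """Parse a single SSE frame back into its named fields."""
    fields = {}
    for line in frame.strip().splitlines():
        key, _, value = line.partition(": ")
        fields[key] = value
    return fields

frame = format_sse("task_update", '{"status": "running"}')
```

The blank line is the frame delimiter, which is why the Go handler writes two trailing newlines before flushing.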

Client-side consumption is trivial:

# streamlit-ui/pages/3_?_A2A_Tasks.py
def stream_task_events(task_id: str):
    url = f"{A2A_BASE_URL}/tasks/{task_id}/events"

    with requests.get(url, stream=True) as response:
        for line in response.iter_lines():
            if line.startswith(b'data:'):
                data = json.loads(line[5:])
                st.write(f"Update: {data['message']}")
                yield data

LangGraph Workflow Integration

LangGraph workflows call MCP tools through the A2A server:

# orchestration/workflows/rag_workflow.py
class RAGWorkflow:
    def __init__(self, mcp_url: str):
        self.mcp_client = MCPClient(mcp_url)
        self.workflow = self.build_workflow()

    def build_workflow(self) -> StateGraph:
        workflow = StateGraph(RAGState)

        # Define workflow steps
        workflow.add_node("search", self.search_documents)
        workflow.add_node("rank", self.rank_results)
        workflow.add_node("generate", self.generate_answer)
        workflow.add_node("verify", self.verify_sources)

        # Define edges (workflow graph)
        workflow.add_edge(START, "search")
        workflow.add_edge("search", "rank")
        workflow.add_edge("rank", "generate")
        workflow.add_edge("generate", "verify")
        workflow.add_edge("verify", END)

        return workflow.compile()

    def search_documents(self, state: RAGState) -> RAGState:
        """Search for relevant documents using MCP hybrid search"""
        # This is where MCP and A2A integrate!
        results = self.mcp_client.hybrid_search(
            query=state["query"],
            limit=10,
            bm25_weight=0.5,
            vector_weight=0.5
        )

        state["documents"] = results
        state["progress"] = f"Found {len(results)} documents"
        
        # Emit progress event via A2A
        emit_progress_event(state["task_id"], "search_complete", state["progress"])
        
        return state

    def rank_results(self, state: RAGState) -> RAGState:
        """Rank results by combined score"""
        docs = sorted(
            state["documents"],
            key=lambda x: x["score"],
            reverse=True
        )[:5]

        state["ranked_docs"] = docs
        state["progress"] = "Ranked top 5 documents"
        
        emit_progress_event(state["task_id"], "ranking_complete", state["progress"])
        
        return state

    def generate_answer(self, state: RAGState) -> RAGState:
        """Generate answer using retrieved context"""
        context = "\n\n".join([
            f"Document: {doc['title']}\n{doc['content']}"
            for doc in state["ranked_docs"]
        ])

        prompt = f"""Based on the following documents, answer the question.

Context:
{context}

Question: {state['query']}

Answer:"""

        # Call Ollama for local inference
        response = ollama.generate(
            model="llama3.2",
            prompt=prompt
        )

        state["answer"] = response["response"]
        state["progress"] = "Generated final answer"
        
        emit_progress_event(state["task_id"], "generation_complete", state["progress"])
        
        return state
        
    def verify_sources(self, state: RAGState) -> RAGState:
        """Verify sources are accurately cited"""
        # Check each cited document exists in ranked_docs
        cited_docs = extract_citations(state["answer"])
        verified = all(doc in state["ranked_docs"] for doc in cited_docs)
        
        state["verified"] = verified
        state["progress"] = "Verified sources" if verified else "Source verification failed"
        
        emit_progress_event(state["task_id"], "verification_complete", state["progress"])
        
        return state

The workflow executes as a multi-step pipeline, with each step:

  1. Calling MCP tools for data access
  2. Updating state
  3. Emitting progress events via A2A
  4. Handling errors gracefully

Part 5: Production-Grade Features

1. Authentication & Security

JWT Token Generation (Streamlit UI):

# streamlit-ui/pages/1_?_Authentication.py
def generate_jwt_token(tenant_id: str, user_id: str, ttl: int = 3600) -> str:
    """Generate RS256 JWT token with proper claims"""
    now = datetime.now(timezone.utc)

    payload = {
        "tenant_id": tenant_id,
        "user_id": user_id,
        "iat": now,              # Issued at
        "exp": now + timedelta(seconds=ttl),  # Expiration
        "nbf": now,              # Not before
        "jti": str(uuid.uuid4()), # JWT ID (for revocation)
        "iss": "mcp-demo-ui",    # Issuer
        "aud": "mcp-server"      # Audience
    }

    # Sign with RSA private key
    with open("/app/certs/private_key.pem", "rb") as f:
        private_key = serialization.load_pem_private_key(
            f.read(),
            password=None
        )

    token = jwt.encode(payload, private_key, algorithm="RS256")
    return token

Token Validation (MCP Server):

// mcp-server/internal/middleware/auth.go
func AuthMiddleware(validator *auth.JWTValidator) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // 1. Extract token from Authorization header
            authHeader := r.Header.Get("Authorization")
            if authHeader == "" {
                http.Error(w, "missing authorization header", http.StatusUnauthorized)
                return
            }

            tokenString := strings.TrimPrefix(authHeader, "Bearer ")
            
            // 2. Validate token signature and claims
            claims, err := validator.ValidateToken(tokenString)
            if err != nil {
                log.Printf("Token validation failed: %v", err)
                http.Error(w, "invalid token", http.StatusUnauthorized)
                return
            }

            // 3. Check token expiration
            if claims.ExpiresAt.Before(time.Now()) {
                http.Error(w, "token expired", http.StatusUnauthorized)
                return
            }

            // 4. Check token not used before nbf
            if claims.NotBefore.After(time.Now()) {
                http.Error(w, "token not yet valid", http.StatusUnauthorized)
                return
            }

            // 5. Verify audience (prevent token reuse across services)
            if claims.Audience != "mcp-server" {
                http.Error(w, "invalid token audience", http.StatusUnauthorized)
                return
            }

            // 6. Add claims to context for downstream handlers
            ctx := context.WithValue(r.Context(), auth.ContextKeyTenantID, claims.TenantID)
            ctx = context.WithValue(ctx, auth.ContextKeyUserID, claims.UserID)
            ctx = context.WithValue(ctx, auth.ContextKeyJTI, claims.JTI)

            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }
}

Key Security Features:

  • ✓ RSA-256 signatures (asymmetric cryptography – server can’t forge tokens)
  • ✓ Short-lived tokens (1-hour default, reduces replay attack window)
  • ✓ JWT ID (jti) for token revocation
  • ✓ Audience claim prevents token reuse across services
  • ✓ Tenant and user context in every request
  • ✓ Database-level isolation via RLS
  • ✓ No session state (fully stateless, scales horizontally)
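The middleware’s claim checks (steps 3–5) reduce to a few comparisons once the signature is verified. A stdlib-only sketch with plain dicts standing in for parsed claims (illustrative; real code must verify the RSA signature first, as the Go middleware does):

```python
import time

def check_claims(claims, audience, now=None):
    """Return an error string for bad claims, or None when they pass.

    Mirrors the Go middleware: expiration, not-before, then audience."""
    now = time.time() if now is None else now
    if claims["exp"] < now:
        return "token expired"
    if claims["nbf"] > now:
        return "token not yet valid"
    if claims["aud"] != audience:
        return "invalid token audience"
    return None

claims = {"exp": 2_000_000_000, "nbf": 0, "aud": "mcp-server"}
```

Passing a fixed `now` makes the checks deterministic and trivially testable, a useful property for auth code.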

2. Cost Tracking & Budgeting

You can avoid unexpected costs from AI usage by tracking spend per user, model, and request:

# streamlit-ui/pages/4_?_Cost_Tracking.py
class CostTracker:
    def __init__(self):
        self.costs = []
        self.pricing = {
            # Local models (Ollama)
            "llama3.2": 0.0001,      # $0.0001 per 1K tokens
            "mistral": 0.0001,
            
            # OpenAI models
            "gpt-4": 0.03,           # $0.03 per 1K tokens
            "gpt-3.5-turbo": 0.002,  # $0.002 per 1K tokens
            
            # Anthropic models
            "claude-3": 0.015,       # $0.015 per 1K tokens
            "claude-3-haiku": 0.0025,
        }

    def track_request(self, user_id: str, model: str, 
                     input_tokens: int, output_tokens: int,
                     metadata: dict = None):
        """Track a single request with detailed token breakdown"""
        
        # Calculate costs
        input_cost = (input_tokens / 1000) * self.pricing.get(model, 0)
        output_cost = (output_tokens / 1000) * self.pricing.get(model, 0)
        total_cost = input_cost + output_cost

        # Store record
        self.costs.append({
            "timestamp": datetime.now(),
            "user_id": user_id,
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "input_cost": input_cost,
            "output_cost": output_cost,
            "total_cost": total_cost,
            "metadata": metadata or {}
        })
        
        return total_cost

    def check_budget(self, user_id: str, budget: float) -> tuple[bool, float]:
        """Check if user is within budget"""
        user_costs = [
            c["total_cost"] for c in self.costs
            if c["user_id"] == user_id
        ]

        total_spent = sum(user_costs)
        remaining = budget - total_spent
        
        return remaining > 0, remaining

    def get_usage_by_model(self, user_id: str) -> dict:
        """Get cost breakdown by model"""
        model_costs = {}
        
        for cost in self.costs:
            if cost["user_id"] == user_id:
                model = cost["model"]
                if model not in model_costs:
                    model_costs[model] = {
                        "requests": 0,
                        "total_tokens": 0,
                        "total_cost": 0.0
                    }
                
                model_costs[model]["requests"] += 1
                model_costs[model]["total_tokens"] += cost["input_tokens"] + cost["output_tokens"]
                model_costs[model]["total_cost"] += cost["total_cost"]
        
        return model_costs
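The cost math itself is simple enough to verify by hand. A condensed sketch using rates from the pricing table above (note the tracker applies the same per-1K-token rate to input and output tokens, a simplification, since most providers price the two differently):

```python
# Per-1K-token rates, taken from the pricing table above
PRICING = {"gpt-4": 0.03, "gpt-3.5-turbo": 0.002, "llama3.2": 0.0001}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at a flat per-1K-token rate; unknown models cost 0."""
    rate = PRICING.get(model, 0.0)
    return (input_tokens + output_tokens) / 1000 * rate

# The same 2,000-token call is 15x more expensive on GPT-4 than gpt-3.5-turbo
gpt4_cost = request_cost("gpt-4", 1500, 500)
turbo_cost = request_cost("gpt-3.5-turbo", 1500, 500)
```

Small per-request numbers are exactly why budgets matter: at 0.03 per 1K tokens, a single chatty agent loop can burn dollars per minute.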

Budget Overview Dashboard:

The UI shows:

  • Budget remaining per user
  • Cost distribution by model (pie chart)
  • 7-day spending trend (line chart)
  • Alerts when approaching budget limits
  • Export to CSV/JSON for accounting

Real-world budget tiers:

# Budget enforcement by user tier
BUDGET_TIERS = {
    "free": {
        "monthly_budget": 0.50,      # $0.50/month
        "rate_limit": 10,            # 10 req/min
        "models": ["llama3.2"]       # Local only
    },
    "pro": {
        "monthly_budget": 25.00,     # $25/month
        "rate_limit": 100,           # 100 req/min
        "models": ["llama3.2", "gpt-3.5-turbo", "claude-3-haiku"]
    },
    "enterprise": {
        "monthly_budget": 500.00,    # $500/month
        "rate_limit": 1000,          # 1000 req/min
        "models": ["*"]              # All models
    }
}
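Enforcing a tier then reduces to two checks before each request, sketched here (the tier table matches the one above; the `authorize` helper is illustrative):

```python
BUDGET_TIERS = {
    "free": {"monthly_budget": 0.50, "rate_limit": 10, "models": ["llama3.2"]},
    "pro": {"monthly_budget": 25.00, "rate_limit": 100,
            "models": ["llama3.2", "gpt-3.5-turbo", "claude-3-haiku"]},
    "enterprise": {"monthly_budget": 500.00, "rate_limit": 1000, "models": ["*"]},
}

def authorize(tier: str, model: str, spent_this_month: float) -> bool:
    """Allow a request only if the model is in-tier and budget remains."""
    cfg = BUDGET_TIERS[tier]
    model_allowed = "*" in cfg["models"] or model in cfg["models"]
    return model_allowed and spent_this_month < cfg["monthly_budget"]
```

Running this check before the LLM call (not after) is what turns cost tracking into cost enforcement.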

3. Observability

Production-grade observability requires visibility into both infrastructure and LLM behavior. I implemented a dual instrumentation approach:

  • OpenTelemetry: Service-to-service tracing, infrastructure metrics, distributed traces
  • Langfuse: LLM-specific observability (prompts, tokens, costs)

OpenTelemetry excels at infrastructure observability but lacks LLM-specific context. Langfuse provides deep LLM insights but doesn’t trace service-to-service calls. Together, they provide complete visibility.

Example: End-to-End Trace

Python Workflow (OpenTelemetry + Langfuse):

from opentelemetry import trace
from langfuse.decorators import observe

class RAGWorkflow:
    def __init__(self):
        # OTel for distributed tracing
        self.tracer = setup_otel_tracing("rag-workflow")
        # Langfuse for LLM tracking
        self.langfuse = Langfuse(...)

    @observe(name="search_documents")  # Langfuse tracks this
    def _search_documents(self, state):
        # OTel: Create span for MCP call
        with self.tracer.start_as_current_span("mcp.hybrid_search") as span:
            span.set_attribute("search.query", state["query"])
            span.set_attribute("search.top_k", 5)

            # HTTP request auto-instrumented, propagates trace context
            result = self.mcp_client.hybrid_search(
                query=state["query"],
                limit=5
            )

            span.set_attribute("search.result_count", len(result))
        return state

MCP Client (W3C Trace Context Propagation):

from opentelemetry.propagate import inject

def _make_request(self, method: str, params: Any = None):
    headers = {'Content-Type': 'application/json'}

    # Inject trace context into HTTP headers
    inject(headers)  # Adds 'traceparent' header

    response = self.session.post(
        f"{self.base_url}/mcp",
        json=payload,
        headers=headers  # Trace continues in Go server
    )

Go MCP Server (Receives Trace Context):

func (tm *TracingMiddleware) Handler(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Extract trace context from headers (W3C Trace Context)
        ctx := r.Context()
        propagator := otel.GetTextMapPropagator()
        ctx = propagator.Extract(ctx, propagation.HeaderCarrier(r.Header))

        // Start span - continues the distributed trace
        ctx, span := tm.telemetry.Tracer.Start(ctx, "http.request",
            trace.WithAttributes(
                attribute.String("http.method", r.Method),
                attribute.String("http.url", r.URL.Path),
            ),
        )
        defer span.End()

        // Process request with trace context
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

Configuration:

# docker-compose.yml
services:
  mcp-server:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: jaeger:4318
      OTEL_TRACES_SAMPLER_ARG: "1.0"  # 100% sampling
      OTEL_ENABLE_TRACING: "true"
      OTEL_ENABLE_METRICS: "true"

Key Metrics Exposed:

  • mcp.request.count, mcp.request.duration – Request latency
  • mcp.tool.execution.duration – Tool performance by name
  • mcp.db.query.duration – Database query performance
  • a2a.cost.total, a2a.tokens.total – LLM cost tracking by model
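
Conceptually each of these metrics is a name plus a set of labels and a stream of samples. A dependency-free sketch of the recording pattern (the real servers use OTel instruments; this toy recorder just makes the shape concrete):

```python
import time
from collections import defaultdict

class MetricRecorder:
    """Toy stand-in for an OTel meter: stores duration samples per (metric, labels) pair."""
    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, name: str, value: float, **labels):
        key = (name, tuple(sorted(labels.items())))
        self.samples[key].append(value)

    def timed(self, name: str, **labels):
        recorder = self
        class _Timer:
            def __enter__(self):
                self.start = time.perf_counter()
            def __exit__(self, *exc):
                recorder.record(name, (time.perf_counter() - self.start) * 1000, **labels)
        return _Timer()

metrics = MetricRecorder()
with metrics.timed("mcp.tool.execution.duration", tool="hybrid_search"):
    pass  # tool body would run here
```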

4. Rate Limiting

Rate limiting protects servers from abuse and runaway retry loops:

// mcp-server/internal/middleware/ratelimit.go
import "golang.org/x/time/rate"

func RateLimitMiddleware(limiter *rate.Limiter) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if !limiter.Allow() {
                http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }
}

// Usage: global limit of 100 req/sec with a burst of 200
// (this limiter is shared; see the Redis version below for per-tenant limits)
limiter := rate.NewLimiter(100, 200)

Per-tenant rate limiting with Redis:

// mcp-server/internal/middleware/ratelimit_redis.go
type RedisRateLimiter struct {
    client *redis.Client
    limit  int
    window time.Duration
}

func (r *RedisRateLimiter) Allow(ctx context.Context, tenantID string) (bool, error) {
    key := fmt.Sprintf("ratelimit:tenant:%s", tenantID)
    
    // Increment counter
    count, err := r.client.Incr(ctx, key).Result()
    if err != nil {
        return false, err
    }
    
    // Set expiration on first request. Note: if the process dies between
    // Incr and Expire, the key never expires; a Lua script that pairs
    // INCR+EXPIRE atomically avoids this.
    if count == 1 {
        r.client.Expire(ctx, key, r.window)
    }
    
    // Check limit
    return count <= int64(r.limit), nil
}

Part 6: Testing

Unit tests with mocks aren’t enough. You need integration tests against real databases to catch:

  • NULL value handling in PostgreSQL
  • Row-level security policies
  • Concurrent access patterns
  • Real embedding operations with pgvector
  • JSON-RPC protocol edge cases
  • JWT token validation
  • Rate limiting behavior

Integration Test Suite

Here’s what I built:

// mcp-server/internal/database/postgres_integration_test.go
func TestGetDocument_WithNullEmbedding(t *testing.T) {
    db := setupTestDB(t)
    defer db.Close()

    ctx := context.Background()

    // Insert document WITHOUT embedding (common in real world)
    testDoc := &Document{
        TenantID:  testTenantID,
        Title:     "Test Document Without Embedding",
        Content:   "This document has no embedding vector",
        Metadata:  map[string]interface{}{"test": true},
        Embedding: nil, // Explicitly no embedding
    }

    err := db.InsertDocument(ctx, testTenantID, testDoc)
    require.NoError(t, err)

    // Retrieve - should NOT fail with NULL scan error
    retrieved, err := db.GetDocument(ctx, testTenantID, testDoc.ID)
    require.NoError(t, err)
    assert.NotNil(t, retrieved)
    assert.Nil(t, retrieved.Embedding) // Embedding is NULL
    assert.Equal(t, testDoc.Title, retrieved.Title)
    assert.Equal(t, testDoc.Content, retrieved.Content)

    // Cleanup
    db.DeleteDocument(ctx, testTenantID, testDoc.ID)
}

func TestHybridSearch_HandlesNullEmbeddings(t *testing.T) {
    db := setupTestDB(t)
    defer db.Close()

    ctx := context.Background()

    // Insert documents with and without embeddings
    docWithEmbedding := createDocumentWithEmbedding(t, db, testTenantID, "AI Guide")
    docWithoutEmbedding := createDocumentWithoutEmbedding(t, db, testTenantID, "ML Tutorial")

    // Create query embedding
    queryEmbedding := make([]float32, 1536)
    for i := range queryEmbedding {
        queryEmbedding[i] = 0.1
    }

    params := HybridSearchParams{
        Query:        "artificial intelligence machine learning",
        Embedding:    queryEmbedding,
        Limit:        10,
        BM25Weight:   0.5,
        VectorWeight: 0.5,
    }

    // Should work even with NULL embeddings
    results, err := db.HybridSearch(ctx, testTenantID, params)
    require.NoError(t, err)
    assert.NotNil(t, results)
    assert.Greater(t, len(results), 0)

    // Documents without embeddings get vector_score = 0
    for _, result := range results {
        if result.Document.Embedding == nil {
            assert.Equal(t, 0.0, result.VectorScore)
            assert.Greater(t, result.BM25Score, 0.0) // But BM25 should work
        }
    }
}

func TestTenantIsolation_CannotAccessOtherTenant(t *testing.T) {
    db := setupTestDB(t)
    defer db.Close()

    tenant1ID := "tenant-1-" + uuid.New().String()
    tenant2ID := "tenant-2-" + uuid.New().String()

    // Create documents for both tenants
    doc1 := createDocument(t, db, tenant1ID, "Tenant 1 Secret Data")
    doc2 := createDocument(t, db, tenant2ID, "Tenant 2 Secret Data")

    // Query as tenant-1
    ctx1 := context.Background()
    results1, err := db.ListDocuments(ctx1, tenant1ID, ListParams{Limit: 100})
    require.NoError(t, err)

    // Query as tenant-2
    ctx2 := context.Background()
    results2, err := db.ListDocuments(ctx2, tenant2ID, ListParams{Limit: 100})
    require.NoError(t, err)

    // Verify isolation
    assert.Contains(t, results1, doc1)
    assert.NotContains(t, results1, doc2) // Cannot see other tenant

    assert.Contains(t, results2, doc2)
    assert.NotContains(t, results2, doc1) // Cannot see other tenant
}

func TestConcurrentRetrievals_NoRaceConditions(t *testing.T) {
    db := setupTestDB(t)
    defer db.Close()

    // Create test documents
    docs := make([]*Document, 50)
    for i := 0; i < 50; i++ {
        docs[i] = createDocument(t, db, testTenantID, fmt.Sprintf("Document %d", i))
    }

    // Concurrent retrievals
    var wg sync.WaitGroup
    errors := make(chan error, 500)

    for worker := 0; worker < 10; worker++ {
        wg.Add(1)
        go func() {
            defer wg.Done()

            for i := 0; i < 50; i++ {
                doc := docs[i]
                retrieved, err := db.GetDocument(context.Background(), testTenantID, doc.ID)
                if err != nil {
                    errors <- err
                    return
                }
                if retrieved.ID != doc.ID {
                    errors <- fmt.Errorf("document mismatch: got %s, want %s", retrieved.ID, doc.ID)
                    return
                }
            }
        }()
    }

    wg.Wait()
    close(errors)

    // Check for errors
    for err := range errors {
        t.Error(err)
    }
}

Test Coverage:

  • GetDocument with/without embeddings (NULL handling)
  • ListDocuments with mixed states
  • SearchDocuments with NULL embeddings
  • HybridSearch graceful degradation
  • Tenant isolation enforcement (security)
  • Concurrent access (10 workers, 50 requests each)
  • All 10 sample documents retrievable
  • JSON-RPC protocol validation
  • JWT token validation
  • Rate limiting behavior

Running Tests

# Unit tests (fast, no dependencies)
cd mcp-server
go test -v ./...

# Integration tests (requires PostgreSQL)
./scripts/run-integration-tests.sh

The integration test script:

  1. Checks if PostgreSQL is running
  2. Waits for database ready
  3. Runs all integration tests
  4. Reports coverage
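
Steps 1–2 boil down to a bounded retry loop. A sketch (the probe function is a placeholder for whatever readiness check you use, e.g. `pg_isready`):

```python
import time

def wait_for_ready(probe, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll probe() until it reports ready or the attempt budget is exhausted."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```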

Output:

Running MCP Server Integration Tests
========================================
PostgreSQL is ready

Running integration tests...

=== RUN   TestGetDocument_WithNullEmbedding
--- PASS: TestGetDocument_WithNullEmbedding (0.05s)
=== RUN   TestGetDocument_WithEmbedding
--- PASS: TestGetDocument_WithEmbedding (0.04s)
=== RUN   TestHybridSearch_HandlesNullEmbeddings
--- PASS: TestHybridSearch_HandlesNullEmbeddings (0.12s)
=== RUN   TestTenantIsolation
--- PASS: TestTenantIsolation (0.08s)
=== RUN   TestConcurrentRetrievals
--- PASS: TestConcurrentRetrievals (2.34s)

PASS
coverage: 95.3% of statements
ok  	github.com/bhatti/mcp-a2a-go/mcp-server/internal/database	3.456s

Integration tests completed!

Part 7: Real-World Use Cases

Use Case 1: Enterprise RAG Search

Scenario: Consulting firm managing 50,000+ contract documents across multiple clients. Each client (tenant) must have complete data isolation. Legal team needs to:

  • Search with exact terms (case citations, contract clauses)
  • Find semantically similar clauses (non-obvious connections)
  • Track who accessed what (audit trail)
  • Enforce budget limits per client matter

Solution: Hybrid search combining BM25 (keywords) and vector similarity (semantics).

# Client code
results = mcp_client.hybrid_search(
    query="data breach notification requirements GDPR Article 33",
    limit=10,
    bm25_weight=0.6,  # Favor exact keyword matches for legal terms
    vector_weight=0.4  # But include semantic similarity
)

for result in results:
    print(f"Document: {result['title']}")
    print(f"BM25 Score: {result['bm25_score']:.2f}")
    print(f"Vector Score: {result['vector_score']:.2f}")
    print(f"Combined: {result['score']:.2f}")
    print(f"Tenant: {result['tenant_id']}")
    print("---")

Output:

Document: GDPR Compliance Framework - Article 33 Analysis
BM25 Score: 0.89  (matched "GDPR", "Article 33", "notification")
Vector Score: 0.76  (understood "data breach requirements")
Combined: 0.84
Tenant: client-acme-legal

Document: Data Breach Response Procedures
BM25 Score: 0.45  (matched "data breach", "notification")
Vector Score: 0.91  (strong semantic match)
Combined: 0.65
Tenant: client-acme-legal

Document: SEC Disclosure Requirements
BM25 Score: 0.78  (matched "requirements", "notification")
Vector Score: 0.52  (weak semantic match)
Combined: 0.67
Tenant: client-acme-legal
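
The combined score is just a weighted sum of the two normalized scores. Checking the first result above with the 0.6/0.4 weights from the query:

```python
def combined_score(bm25: float, vector: float,
                   bm25_weight: float, vector_weight: float) -> float:
    """Weighted linear combination used to rank hybrid search results."""
    return bm25_weight * bm25 + vector_weight * vector

score = combined_score(0.89, 0.76, bm25_weight=0.6, vector_weight=0.4)
# 0.6 * 0.89 + 0.4 * 0.76 = 0.838, which rounds to the 0.84 shown above
```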

Benefits:

  • Finds documents with exact terms (“GDPR”, “Article 33”)
  • Surfaces semantically similar docs (“privacy breach”, “data protection”)
  • Tenant isolation ensures Client A can’t see Client B’s contracts
  • Audit trail via structured logging
  • Cost tracking per client matter

Use Case 2: Multi-Step Research Workflows

Scenario: Investment analyst needs to research a company across multiple data sources:

  1. Company filings (10-K, 10-Q, 8-K)
  2. Competitor analysis
  3. Market trends
  4. Financial metrics
  5. Regulatory filings
  6. News sentiment

Traditional RAG: Query each source separately, manually synthesize results.

With A2A + MCP: Orchestrate multi-step workflow with progress tracking.

# orchestration/workflows/research_workflow.py
class ResearchWorkflow:
    def build_workflow(self):
        workflow = StateGraph(ResearchState)

        # Define research steps
        workflow.add_node("search_company", self.search_company_docs)
        workflow.add_node("search_competitors", self.search_competitors)
        workflow.add_node("search_financials", self.search_financial_data)
        workflow.add_node("analyze_trends", self.analyze_market_trends)
        workflow.add_node("verify_facts", self.verify_with_sources)
        workflow.add_node("generate_report", self.generate_final_report)

        # Define workflow graph
        workflow.add_edge(START, "search_company")
        workflow.add_edge("search_company", "search_competitors")
        workflow.add_edge("search_competitors", "search_financials")
        workflow.add_edge("search_financials", "analyze_trends")
        workflow.add_edge("analyze_trends", "verify_facts")
        workflow.add_edge("verify_facts", "generate_report")
        workflow.add_edge("generate_report", END)

        return workflow.compile()
    
    def search_company_docs(self, state: ResearchState) -> ResearchState:
        """Step 1: Search company documents via MCP"""
        company = state["company_name"]
        
        # Call MCP hybrid search
        results = self.mcp_client.hybrid_search(
            query=f"{company} business operations revenue products",
            limit=20,
            bm25_weight=0.5,
            vector_weight=0.5
        )
        
        state["company_docs"] = results
        state["progress"] = f"Found {len(results)} company documents"
        
        # Emit progress via A2A SSE
        emit_progress("search_company_complete", state["progress"])
        
        return state
    
    def search_competitors(self, state: ResearchState) -> ResearchState:
        """Step 2: Identify and search competitors"""
        company = state["company_name"]
        
        # Extract competitors from company docs
        competitors = self.extract_competitors(state["company_docs"])
        
        # Search each competitor
        competitor_data = {}
        for competitor in competitors:
            results = self.mcp_client.hybrid_search(
                query=f"{competitor} market share products revenue",
                limit=10
            )
            competitor_data[competitor] = results
        
        state["competitors"] = competitor_data
        state["progress"] = f"Analyzed {len(competitors)} competitors"
        
        emit_progress("search_competitors_complete", state["progress"])
        
        return state
    
    def search_financial_data(self, state: ResearchState) -> ResearchState:
        """Step 3: Extract financial metrics"""
        company = state["company_name"]
        
        # Search for financial documents
        results = self.mcp_client.hybrid_search(
            query=f"{company} revenue earnings profit margin cash flow",
            limit=15,
            bm25_weight=0.7,  # Favor exact financial terms
            vector_weight=0.3
        )
        
        # Extract key metrics
        metrics = self.extract_financial_metrics(results)
        
        state["financials"] = metrics
        state["progress"] = f"Extracted {len(metrics)} financial metrics"
        
        emit_progress("search_financials_complete", state["progress"])
        
        return state
    
    def verify_facts(self, state: ResearchState) -> ResearchState:
        """Step 5: Verify all facts with sources"""
        # Check each claim has supporting document
        claims = self.extract_claims(state["report_draft"])
        
        verified_claims = []
        for claim in claims:
            sources = self.find_supporting_docs(claim, state)
            if sources:
                verified_claims.append({
                    "claim": claim,
                    "sources": sources,
                    "verified": True
                })
        
        state["verified_claims"] = verified_claims
        state["progress"] = f"Verified {len(verified_claims)} claims"
        
        emit_progress("verification_complete", state["progress"])
        
        return state

Benefits:

  • Multi-step orchestration with state management
  • Real-time progress via SSE (analyst sees each step)
  • Intermediate results saved as artifacts
  • Each step calls MCP tools for data retrieval
  • Final report with verified sources
  • Cost tracking across all steps

Use Case 3: Budget-Controlled AI Assistance

Scenario: A SaaS company (e.g., a document management platform) offers AI features based on subscription tier. Without budget control, a customer on the free tier can make 10,000 queries in one day.

With budget control:

# Before each request
tier = get_user_tier(user_id)
budget = BUDGET_TIERS[tier]["monthly_budget"]
allowed, remaining = cost_tracker.check_budget(user_id, budget)

if not allowed:
    raise BudgetExceededError(
        f"Monthly budget of ${budget} exceeded. "
        f"Upgrade to {next_tier} for higher limits."
    )

# Track the request
response = llm.generate(prompt)
cost = cost_tracker.track_request(
    user_id=user_id,
    model="llama3.2",
    input_tokens=len(prompt.split()),    # word count as a rough token proxy
    output_tokens=len(response.split())  # use the provider's tokenizer in production
)

# Alert when approaching limit
if remaining < 5.0:  # $5 remaining
    send_alert(user_id, f"Budget alert: ${remaining:.2f} remaining")

Real-world budget enforcement:

# streamlit-ui/pages/4_?_Cost_Tracking.py
def enforce_budget_limits():
    """Check budget before task creation"""
    
    user_tier = st.session_state.get("user_tier", "free")
    budget = BUDGET_TIERS[user_tier]["monthly_budget"]
    
    # Calculate current spend
    spent = cost_tracker.get_total_cost(user_id)
    remaining = budget - spent
    
    # Display budget status
    col1, col2, col3 = st.columns(3)
    
    with col1:
        st.metric("Budget", f"${budget:.2f}")
    
    with col2:
        st.metric("Spent", f"${spent:.2f}", 
                 delta=f"-${spent:.2f}", delta_color="inverse")
    
    with col3:
        progress = (spent / budget) * 100
        st.metric("Remaining", f"${remaining:.2f}")
        st.progress(min(progress / 100, 1.0))  # clamp in case spend exceeds budget
    
    # Block if exceeded
    if remaining <= 0:
        st.error("Monthly budget exceeded. Upgrade to continue.")
        st.button("Upgrade to Pro ($25/month)", on_click=upgrade_tier)
        return False
    
    # Warn if close
    if remaining < 5.0:
        st.warning(f"Budget alert: Only ${remaining:.2f} remaining this month")
    
    return True

Benefits:

  • Prevent cost overruns per customer
  • Fair usage enforcement across tiers
  • Export data for billing/accounting
  • Different limits per tier
  • Automatic alerts before limits
  • Graceful degradation (local models for free tier)

Part 8: Deployment & Operations

Docker Compose Setup

Everything runs in containers with health checks:

# docker-compose.yml
version: '3.8'

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: mcp_db
      POSTGRES_USER: mcp_user
      POSTGRES_PASSWORD: ${DB_PASSWORD:-mcp_secure_pass}
    volumes:
      - ./scripts/init-db.sql:/docker-entrypoint-initdb.d/init.sql
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mcp_user -d mcp_db"]
      interval: 5s
      timeout: 5s
      retries: 5

  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 10s
      timeout: 5s
      retries: 3

  mcp-server:
    build:
      context: ./mcp-server
      dockerfile: Dockerfile
    environment:
      DB_HOST: postgres
      DB_PORT: 5432
      DB_USER: mcp_user
      DB_PASSWORD: ${DB_PASSWORD:-mcp_secure_pass}
      DB_NAME: mcp_db
      JWT_PUBLIC_KEY_PATH: /app/certs/public_key.pem
      OLLAMA_URL: http://ollama:11434
      LOG_LEVEL: ${LOG_LEVEL:-info}
    ports:
      - "8080:8080"
    depends_on:
      postgres:
        condition: service_healthy
      ollama:
        condition: service_healthy
    volumes:
      - ./certs:/app/certs:ro
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  a2a-server:
    build:
      context: ./a2a-server
      dockerfile: Dockerfile
    environment:
      MCP_SERVER_URL: http://mcp-server:8080
      OLLAMA_URL: http://ollama:11434
      LOG_LEVEL: ${LOG_LEVEL:-info}
    ports:
      - "8082:8082"
    depends_on:
      - mcp-server
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8082/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  streamlit-ui:
    build:
      context: ./streamlit-ui
      dockerfile: Dockerfile
    environment:
      MCP_SERVER_URL: http://mcp-server:8080
      A2A_SERVER_URL: http://a2a-server:8082
    ports:
      - "8501:8501"
    volumes:
      - ./certs:/app/certs:ro
    depends_on:
      - mcp-server
      - a2a-server

volumes:
  postgres_data:
  ollama_data:

Startup & Verification

# Start all services
docker compose up -d

# Check status
docker compose ps

# Expected output:
# NAME              STATUS        PORTS
# postgres          Up (healthy)  0.0.0.0:5432->5432/tcp
# ollama            Up (healthy)  0.0.0.0:11434->11434/tcp
# mcp-server        Up (healthy)  0.0.0.0:8080->8080/tcp
# a2a-server        Up (healthy)  0.0.0.0:8082->8082/tcp
# streamlit-ui      Up            0.0.0.0:8501->8501/tcp

# View logs
docker compose logs -f mcp-server
docker compose logs -f a2a-server

# Run health checks
curl http://localhost:8080/health  # MCP server
curl http://localhost:8082/health  # A2A server

# Pull Ollama model
docker compose exec ollama ollama pull llama3.2

# Initialize database with sample data
docker compose exec postgres psql -U mcp_user -d mcp_db -f /docker-entrypoint-initdb.d/init.sql

Production Considerations

1. Environment Variables (Don’t Hardcode Secrets)

# .env.production
DB_PASSWORD=$(openssl rand -base64 32)
JWT_PRIVATE_KEY_PATH=/secrets/jwt_private_key.pem
JWT_PUBLIC_KEY_PATH=/secrets/jwt_public_key.pem
LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY}
LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY}
OLLAMA_URL=http://ollama:11434
LOG_LEVEL=info
SENTRY_DSN=${SENTRY_DSN}

2. Database Migrations

Use golang-migrate for schema management:

# Install migrate
curl -L https://github.com/golang-migrate/migrate/releases/download/v4.16.2/migrate.linux-amd64.tar.gz | tar xvz
mv migrate /usr/local/bin/

# Create migration
migrate create -ext sql -dir db/migrations -seq add_embeddings_index

# Apply migrations
migrate -path db/migrations \
        -database "postgresql://user:pass@localhost:5432/db?sslmode=disable" \
        up

# Rollback if needed
migrate -path db/migrations \
        -database "postgresql://user:pass@localhost:5432/db?sslmode=disable" \
        down 1

3. Kubernetes Deployment

The repository includes Kubernetes manifests:

# k8s/mcp-server-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
  namespace: mcp-a2a
spec:
  replicas: 3  # High availability
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
      - name: mcp-server
        image: ghcr.io/bhatti/mcp-server:latest
        ports:
        - containerPort: 8080
        env:
        - name: DB_HOST
          value: postgres-service
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
        - name: JWT_PUBLIC_KEY_PATH
          value: /certs/public_key.pem
        volumeMounts:
        - name: certs
          mountPath: /certs
          readOnly: true
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 3
      volumes:
      - name: certs
        secret:
          secretName: jwt-certs

Deploy to Kubernetes:

# Create namespace
kubectl create namespace mcp-a2a

# Apply secrets
kubectl create secret generic db-credentials \
  --from-literal=password=$(openssl rand -base64 32) \
  -n mcp-a2a

kubectl create secret generic jwt-certs \
  --from-file=public_key.pem=./certs/public_key.pem \
  --from-file=private_key.pem=./certs/private_key.pem \
  -n mcp-a2a

# Apply manifests
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/postgres.yaml
kubectl apply -f k8s/mcp-server.yaml
kubectl apply -f k8s/a2a-server.yaml
kubectl apply -f k8s/streamlit-ui.yaml

# Check pods
kubectl get pods -n mcp-a2a

# View logs
kubectl logs -f deployment/mcp-server -n mcp-a2a

# Scale up
kubectl scale deployment mcp-server --replicas=5 -n mcp-a2a

4. Monitoring & Alerts

Add Prometheus metrics:

// mcp-server/internal/metrics/prometheus.go
var (
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "mcp_request_duration_seconds",
            Help: "MCP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "status"},
    )

    activeRequests = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "mcp_active_requests",
            Help: "Number of active MCP requests",
        },
    )
    
    hybridSearchQueries = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "mcp_hybrid_search_queries_total",
            Help: "Total number of hybrid search queries",
        },
        []string{"tenant_id"},
    )
    
    budgetExceeded = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "mcp_budget_exceeded_total",
            Help: "Number of requests blocked due to budget limits",
        },
        []string{"user_id", "tier"},
    )
)

func init() {
    prometheus.MustRegister(requestDuration)
    prometheus.MustRegister(activeRequests)
    prometheus.MustRegister(hybridSearchQueries)
    prometheus.MustRegister(budgetExceeded)
}

Alert rules (Prometheus):

# prometheus/alerts.yml
groups:
- name: mcp_alerts
  rules:
  - alert: HighErrorRate
    expr: rate(mcp_request_duration_seconds_count{status="error"}[5m]) > 0.1
    for: 5m
    annotations:
      summary: "High error rate on MCP server"
      description: "Error rate is {{ $value }} errors/sec"
  
  - alert: BudgetExceededRate
    expr: rate(mcp_budget_exceeded_total[1h]) > 100
    annotations:
      summary: "High budget exceeded rate"
      description: "{{ $value }} users hitting budget limits per hour"
  
  - alert: DatabaseLatency
    expr: mcp_request_duration_seconds{method="hybrid_search"} > 1.0
    for: 2m
    annotations:
      summary: "Slow hybrid search queries"
      description: "Hybrid search taking {{ $value }}s (should be <1s)"

5. Backup & Recovery

Automated PostgreSQL backups:

#!/bin/bash
# scripts/backup-database.sh

BACKUP_DIR="/backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DB_NAME="mcp_db"
DB_USER="mcp_user"

# Create backup directory
mkdir -p ${BACKUP_DIR}

# Dump database
docker compose exec -T postgres pg_dump -U ${DB_USER} ${DB_NAME} | \
    gzip > ${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz

# Upload to S3 (optional)
aws s3 cp ${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz \
    s3://my-backups/mcp-db/

# Keep last 7 days locally
find ${BACKUP_DIR} -name "${DB_NAME}_*.sql.gz" -mtime +7 -delete

echo "Backup completed: ${DB_NAME}_${TIMESTAMP}.sql.gz"

Part 9: Performance & Scalability

Benchmarks (Single Instance)

MCP Server (Go):

Benchmark: Hybrid Search (10 results, 1536-dim embeddings)
- Requests/sec: 5,247
- P50 latency: 12ms
- P95 latency: 45ms
- P99 latency: 89ms
- Memory: 52MB baseline, 89MB under load
- CPU: 23% average (4 cores)

Database (PostgreSQL + pgvector):

Benchmark: Vector search (cosine similarity)
- Documents: 100,000
- Embedding dimensions: 1536
- Index: HNSW (m=16, ef_construction=64)
- Query time: <100ms (P95)
- Throughput: 150 queries/sec (single connection)
- Concurrent queries: 100+ simultaneous

Why these numbers matter:

  • 5,000+ req/sec means 432 million requests/day per instance
  • <100ms search means interactive UX
  • 52MB memory means cost-effective scaling
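
The daily-capacity figure is straightforward arithmetic, using 5,000 req/sec as a conservative floor under the benchmark above:

```python
req_per_sec = 5_000              # conservative floor from the benchmark
seconds_per_day = 24 * 60 * 60   # 86,400
daily = req_per_sec * seconds_per_day
print(daily)  # prints 432000000 -> 432 million requests/day per instance
```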

Load Testing Results

# Using hey (HTTP load generator)
hey -n 10000 -c 100 -m POST \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"hybrid_search","arguments":{"query":"machine learning","limit":10}}}' \
    http://localhost:8080/mcp

Summary:
  Total:        19.8421 secs
  Slowest:      0.2847 secs
  Fastest:      0.0089 secs
  Average:      0.1974 secs
  Requests/sec: 503.98
  
  Status code distribution:
    [200]	10000 responses

Latency distribution:
  10% in 0.0234 secs
  25% in 0.0456 secs
  50% in 0.1842 secs
  75% in 0.3123 secs
  90% in 0.4234 secs
  95% in 0.4867 secs
  99% in 0.5634 secs

Scaling Strategy

Horizontal Scaling:

  1. MCP and A2A servers are stateless—scale with container replicas
  2. Database read replicas for read-heavy workloads (search queries)
  3. Redis cache for frequently accessed queries (30-second TTL)
  4. Load balancer distributes requests (sticky sessions not needed)
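
The 30-second query cache in step 3 is a read-through cache with per-entry expiry. An in-memory sketch of the same logic (Redis would replace the dict; the injectable clock is just for testability):

```python
import time

class TTLCache:
    """Minimal read-through cache with per-entry expiry (in-memory Redis stand-in)."""
    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]          # fresh hit: skip the expensive query
        value = compute()            # miss or expired: recompute and cache
        self._store[key] = (now, value)
        return value
```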

Vertical Scaling:

  1. Increase PostgreSQL resources for larger datasets
  2. Add pgvector HNSW indexes for faster vector search
  3. Tune connection pool sizes (PgBouncer)

When to scale what:

  Symptom                  Solution
  High MCP server CPU      Add more MCP replicas
  Slow database queries    Add read replicas
  High memory on MCP       Check for memory leaks, add replicas
  Cache misses             Increase Redis memory, tune TTL
  Slow embeddings          Deploy dedicated embedding service

Part 10: Lessons Learned & Best Practices

1. Go for Protocol Servers

Go’s performance and type safety make it a strong fit for protocol servers in production AI deployments.

2. PostgreSQL Row-Level Security

Database-level tenant isolation is non-negotiable for enterprise. Application-level filtering is too easy to screw up. With RLS, even if your application has a bug, the database enforces isolation.

3. Integration Tests Against Real Databases

Unit tests with mocks didn’t catch the NULL embedding issues. Integration tests did. Test against production-like environments.

4. Optional Langfuse

Making Langfuse optional (try/except imports) lets developers run locally without complex setup while enabling full observability in production.
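
The try/except pattern reduces to a no-op decorator when Langfuse is absent, so the same code runs with or without it. A sketch:

```python
try:
    from langfuse.decorators import observe  # real decorator when installed
except ImportError:
    def observe(name=None, **kwargs):
        """Fallback: return functions unchanged when Langfuse is unavailable."""
        def decorator(fn):
            return fn
        return decorator

@observe(name="search_documents")
def search_documents(query: str):
    return [query]  # stand-in for the real search
```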

5. Comprehensive Documentation

Document your design and testing process from day one.

6. Structured Logging

Add structured logging (JSON format):

// Structured logging
log.Info().
    Str("tenant_id", tenantID).
    Str("user_id", userID).
    Int("results_count", len(results)).
    Int64("duration_ms", duration.Milliseconds()).
    Msg("hybrid search completed")

Benefits of structured logging:

  • Easy filtering: jq '.tenant_id == "acme-corp"' logs.json
  • Metrics extraction: jq -r '.duration_ms' logs.json | stats
  • Correlation: Trace requests across services
  • Alerting: Monitor error patterns
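
The same pattern on the Python side is a JSON formatter over the stdlib logger, so every line is a `jq`-filterable object. A minimal sketch (the `fields` convention is mine):

```python
import json
import logging

class JSONFormatter(logging.Formatter):
    """Emit one JSON object per log line so jq can filter on individual fields."""
    def format(self, record):
        payload = {"level": record.levelname, "msg": record.getMessage()}
        payload.update(getattr(record, "fields", {}))  # structured extras
        return json.dumps(payload)

logger = logging.getLogger("mcp")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("hybrid search completed",
            extra={"fields": {"tenant_id": "acme-corp", "duration_ms": 12}})
```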

7. Rate Limiting Per Tenant (Not Global)

Implement per-tenant rate limiting using Redis or a similar shared store:

// ✅ Per-tenant rate limiting (fixed one-minute window)
type RedisRateLimiter struct {
    client *redis.Client
}

func (r *RedisRateLimiter) Allow(ctx context.Context, tenantID string, limit int) (bool, error) {
    key := fmt.Sprintf("ratelimit:tenant:%s", tenantID)
    
    pipe := r.client.Pipeline()
    incr := pipe.Incr(ctx, key)
    // NX: set the expiry only when the key has none, so the first request
    // starts the window and steady traffic cannot keep extending it forever
    pipe.ExpireNX(ctx, key, time.Minute)
    _, err := pipe.Exec(ctx)
    if err != nil {
        return false, err
    }
    
    count, err := incr.Result()
    if err != nil {
        return false, err
    }
    
    return count <= int64(limit), nil
}

Why this matters:

  • One tenant can’t DoS the system
  • Fair resource allocation
  • Tiered pricing based on limits
  • Tenant-specific SLAs

8. Embedding Generation Service

Ollama works, but a dedicated embedding service (e.g., sentence-transformers FastAPI service) would be:

  • Faster: Batch processing
  • More reliable: Health checks, retries
  • Scalable: Independent scaling
# embeddings-service/app.py (what I should have built)
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer('all-MiniLM-L6-v2')

@app.post("/embed")
async def embed(texts: list[str]):
    embeddings = model.encode(texts, batch_size=32)
    return {"embeddings": embeddings.tolist()}

9. Circuit Breaker Pattern

When Ollama is down, the entire system hangs waiting for embeddings, so implement a circuit breaker with a graceful fallback strategy:

// ✅ Circuit breaker pattern (single-goroutine sketch; guard with a mutex for concurrent use)
type CircuitBreaker struct {
    maxFailures int
    timeout     time.Duration
    failures    int
    lastFail    time.Time
    state       string  // "closed", "open", "half-open"
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    if cb.state == "open" {
        if time.Since(cb.lastFail) > cb.timeout {
            cb.state = "half-open"
        } else {
            return fmt.Errorf("circuit breaker open")
        }
    }
    
    err := fn()
    if err != nil {
        cb.failures++
        cb.lastFail = time.Now()
        
        if cb.failures >= cb.maxFailures {
            cb.state = "open"
        }
        return err
    }
    
    cb.failures = 0
    cb.state = "closed"
    return nil
}

10. Dual Observability

Use both Langfuse and OpenTelemetry. OTel traces service flow, while Langfuse tracks LLM behavior. They complement rather than replace each other.

  • OpenTelemetry for infrastructure: Trace context propagation across Python → Go → Database gave complete visibility into request flow. The traceparent header auto-propagation through requests/httpx made it seamless.
  • Langfuse for LLM calls: Token counts, costs, and prompt tracking. Essential for budget control and debugging LLM behavior.
  • Prometheus + Jaeger: Prometheus for metrics dashboards (query “What’s our P95 latency?”), Jaeger for debugging specific slow traces (“Why was this request slow?”).

Testing tip:

# Verify trace propagation
curl -X POST http://localhost:8080/mcp \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'

# Check Jaeger UI for end-to-end trace
open http://localhost:16686

# Verify metrics
curl http://localhost:9090/api/v1/query \
  --data-urlencode 'query=mcp_request_duration_bucket'

Production Checklist

Before going live, ensure you have:

Security:

  • ✅ JWT authentication with RSA keys
  • ✅ Row-level security enforced at database
  • ✅ Secrets in environment variables (not hardcoded)
  • ✅ HTTPS/TLS certificates
  • ✅ API key rotation policy
  • ✅ Audit logging for sensitive operations

Scalability:

  • ✅ Stateless servers (can scale horizontally)
  • ✅ Database connection pooling (PgBouncer)
  • ✅ Read replicas for query workloads
  • ✅ Caching layer (Redis)
  • ✅ Load balancer configured
  • ✅ Auto-scaling rules defined

Observability:

  • ✅ Structured logging (JSON format)
  • ✅ Distributed tracing (Jaeger/Zipkin)
  • ✅ Metrics collection (Prometheus)
  • ✅ Dashboards (Grafana)
  • ✅ Alerting rules configured
  • ✅ On-call rotation defined

Reliability:

  • ✅ Health check endpoints (/health)
  • ✅ Graceful shutdown handlers
  • ✅ Rate limiting implemented
  • ✅ Budget enforcement active
  • ✅ Circuit breakers for external services
  • ✅ Backup strategy automated

Testing:

  • ✅ Integration tests passing (95%+ coverage)
  • ✅ Load testing completed
  • ✅ Security testing (pen test)
  • ✅ Disaster recovery tested
  • ✅ Rollback procedure documented

Operations:

  • ✅ Deployment automation (CI/CD)
  • ✅ Monitoring alerts configured
  • ✅ Runbooks for common issues
  • ✅ Incident response plan
  • ✅ Backup and recovery tested
  • ✅ Capacity planning done

Conclusion: MCP + A2A = Production-Grade AI

Here’s what we built:

✅ MCP Server – Secure, multi-tenant document retrieval (5,000+ req/sec)
✅ A2A Server – Stateful workflow orchestration with SSE streaming
✅ LangGraph Workflows – Multi-step RAG and research pipelines
✅ 200+ Tests – 95% coverage with integration tests against real databases
✅ Production Ready – Auth, observability, cost tracking, rate limiting, K8s deployment

But here’s the uncomfortable truth: none of this was in the MCP or A2A specifications. The protocols are just 10% of the work.

MCP defines:

  • ✅ JSON-RPC 2.0 message format
  • ✅ Tool call/response structure
  • ✅ Resource access patterns

A2A defines:

  • ✅ Task lifecycle states
  • ✅ Agent card format
  • ✅ SSE event structure

What they DON’T define:

  • ❌ Authentication and authorization
  • ❌ Multi-tenant isolation
  • ❌ Rate limiting and cost control
  • ❌ Observability and tracing
  • ❌ Circuit breakers and timeouts
  • ❌ Encryption and compliance
  • ❌ Disaster recovery

This is by design—protocols define interfaces, not implementations. But it means every production deployment must solve these problems independently.

Why Default Implementations Are Dangerous

Reference implementations are educational tools, not deployment blueprints. Here’s what’s missing:

# ❌ Typical MCP tutorial
def handle_request(request):
    tool = request["params"]["name"]
    args = request["params"]["arguments"]
    return execute_tool(tool, args)  # No auth, no validation, no limits

// ✅ Production reality
func (h *MCPHandler) handleToolsCall(ctx context.Context, req *protocol.Request) (*protocol.Response, error) {
    // 1. Authenticate (JWT validation)
    // 2. Authorize (check permissions)
    // 3. Rate limit (per-tenant quotas)
    // 4. Validate input (prevent injection)
    // 5. Inject tenant context (RLS)
    // 6. Trace request (observability)
    // 7. Track cost (budget enforcement)
    // 8. Circuit breaker (fail fast)
    // 9. Retry logic (handle transients)
    // 10. Audit log (compliance)
    toolReq := parseToolRequest(req) // unmarshal params into tool name/arguments
    
    return h.toolRegistry.Execute(ctx, toolReq.Name, toolReq.Arguments)
}

That’s 10 layers of production concerns. Miss one, and you have a security incident waiting to happen.

Distributed Systems Lessons Apply Here

AI agents are distributed systems. The problems from microservices apply, because agents make autonomous decisions with potentially unbounded costs. From my fault tolerance article, these patterns are essential:

Without timeouts:

embedding = ollama.embed(text)  # Ollama down → hangs forever → system freezes

With timeouts:

ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
embedding, err := ollama.Embed(ctx, text)
if err != nil {
    return db.BM25Search(ctx, query)  // Degrade gracefully, skip embeddings
}

Without circuit breakers:

for task in tasks:
    result = external_api.call(task)  # Fails 1000 times, wastes time/money

With circuit breakers:

if circuitBreaker.IsOpen() {
    return cachedResult  // Fail fast, don't waste resources
}

Without rate limiting:

Tenant A: 10,000 req/sec → Database crashes → ALL tenants down

With rate limiting:

if !rateLimiter.Allow(tenantID) {
    return ErrRateLimitExceeded  // Other tenants unaffected
}

The Bottom Line

MCP and A2A are excellent protocols. They solve real problems:

  • ✅ MCP standardizes tool execution
  • ✅ A2A standardizes agent coordination

But protocols are not products. Building on MCP/A2A is like building on HTTP—the protocol is solved, but you still need web servers, frameworks, security layers, and monitoring tools.

This repository shows the other 90%:

  • Real authentication (not “TODO: add auth”)
  • Real multi-tenancy (database RLS, not app filtering)
  • Real observability (Langfuse integration, not “we should add logging”)
  • Real testing (integration tests, not just mocks)
  • Real deployment (K8s manifests, not “works on my laptop”)

Get Started

git clone https://github.com/bhatti/mcp-a2a-go
cd mcp-a2a-go
docker compose up -d
./scripts/run-integration-tests.sh
open http://localhost:8501


November 22, 2025

Testing Distributed Systems Failures with Interactive Simulators

Filed under: Computing — admin @ 10:31 pm

Introduction

Building distributed systems means confronting failure modes that are nearly impossible to reproduce in development or testing environments. How do you test for metastable failures that only emerge under specific load patterns? How do you validate that your quorum-based system actually maintains consistency during network partitions? How do you catch cross-system interaction bugs when both systems work perfectly in isolation? Integration testing, performance testing, and chaos engineering all help, but they have limitations. For the past few years, I’ve been using simulation to validate boundary conditions that are hard to test in real environments. Interactive simulators let you tweak parameters, trigger failure scenarios, and see the consequences immediately through metrics and visualizations.

In this post, I will share four simulators I’ve built to explore the failure modes and consistency challenges that are hardest to test in real systems:

  1. Metastable Failure Simulator: Demonstrates how retry storms create self-sustaining collapse
  2. CAP/PACELC Consistency Simulator: Shows the real tradeoffs between consistency, availability, and latency
  3. CRDT Simulator: Explores conflict-free convergence without coordination
  4. Cross-System Interaction (CSI) Failure Simulator: Reveals how correct systems fail through their interactions

Each simulator is built on research findings and real-world incidents. The goal isn’t just to understand these failure modes intellectually, but to develop intuition through experimentation. All simulators are available at: https://github.com/bhatti/simulators.


Part 1: Metastable Failures

The Problem: When Systems Attack Themselves

Metastable failures are particularly insidious because the initial trigger can be small and transient, but the system remains degraded long after the trigger is gone. Research into metastable failures has shown that traditional fault tolerance mechanisms don’t protect against metastability because the failure is self-sustaining through positive feedback loops in retry logic and coordination overhead. The mechanics are deceptively simple:

  1. A transient issue (network blip, brief CPU spike) causes some requests to slow down
  2. Slow requests start timing out
  3. Clients retry timed-out requests, adding more load
  4. Additional load increases coordination overhead (locks, queues, resource contention)
  5. Higher overhead increases latency further
  6. More timeouts trigger more retries
  7. The system is now in a stable degraded state, even though the original trigger is gone
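This feedback loop can be captured in a few lines. The sketch below is a toy fixed-point model of my own (not the SimPy simulator): latency grows with load, requests that exceed the timeout are retried, and retries feed back into load.

```python
# Toy model of the retry feedback loop: latency grows with load,
# timeouts convert load into retries, retries add more load.
def steady_state_load(base_rps, base_latency, slope, timeout, max_retries, steps=50):
    load = base_rps
    for _ in range(steps):
        latency = base_latency + load * slope          # coordination cost
        timing_out = latency > timeout                 # crude all-or-nothing cutoff
        retries = base_rps * max_retries if timing_out else 0
        load = base_rps + retries                      # retries feed back as load
    return load

# Below the tipping point: latency stays under the timeout, no amplification
assert steady_state_load(100, 0.1, 0.001, 1.0, 3) == 100
# Above it: load settles at (1 + max_retries) x baseline and stays there
assert steady_state_load(100, 0.1, 0.02, 1.0, 3) == 400
```

Once latency crosses the timeout, retry amplification sustains itself: that elevated steady state is the metastable signature.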

For example, AWS Kinesis experienced a 7+ hour outage in 2020 where a transient metadata mismatch triggered retry storms across the fleet. Even after the original issue was fixed, the retry behavior kept the system degraded. The recovery required externally rate-limiting client retries.

How the Simulator Works

The metastable failure simulator models this feedback loop using discrete event simulation (SimPy). Here’s what it simulates:

Server Model:

  • Base latency: Time to process a request with no contention
  • Concurrency slope: Additional latency per concurrent request (coordination cost)
  • Capacity: Maximum concurrent requests before queueing
# Latency grows linearly with active requests
def current_latency(self):
    return self.base_latency + (self.active_requests * self.concurrency_slope)

Client Model:

  • Timeout threshold: When to give up on a request
  • Max retries: How many times to retry
  • Backoff strategy: Exponential backoff with jitter (configurable)

Load Patterns:

  • Constant: Steady baseline load
  • Spike: Sudden increase for a duration, then back to baseline
  • Ramp: Gradual increase and decrease
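With the linear latency model above, you can compute the tipping point directly by solving base_latency + n × concurrency_slope > timeout for n (the parameter values here are illustrative, chosen from the typical ranges below):

```python
# Concurrency level at which server latency crosses the client timeout.
def tipping_point(base_latency, concurrency_slope, timeout):
    return (timeout - base_latency) / concurrency_slope

# 0.3s base latency, 0.01s/request coordination cost, 2s client timeout
n = tipping_point(0.3, 0.01, 2.0)
assert round(n) == 170  # beyond ~170 concurrent requests, requests start timing out
```

Past that concurrency level, every timeout spawns retries that push concurrency even higher, which is exactly the loop the simulator lets you watch.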

Key Parameters to Experiment With:

Parameter         | What It Tests                                | Typical Values
server_capacity   | How many concurrent requests before queueing | 20-100
base_latency      | Processing time without contention           | 0.1-1.0s
concurrency_slope | Coordination overhead per request            | 0.001-0.05s
timeout           | When clients give up                         | 1-10s
max_retries       | Retry attempts before failure                | 0-5
backoff_enabled   | Whether to add jitter and delays             | True/False

What You Can Learn:

  1. Trigger a metastable failure: Set spike load high, timeout low, disable backoff → watch P99 latency stay high after spike ends
  2. See recovery with backoff: Same scenario but enable exponential backoff → system recovers when spike ends
  3. Understand the tipping point: Gradually increase concurrency slope → observe when retry amplification begins
  4. Test admission control: Set low server capacity → see benefit of failing fast vs queueing

The simulator tracks success rate, retry count, timeout count, and latency percentiles over time, letting you see exactly when the system tips into metastability and whether it recovers. With this simulator you can validate various prevention strategies such as:

  • Exponential backoff with jitter spreads retries over time
  • Adaptive retry budgets limit total fleet-wide retries
  • Circuit breakers detect patterns and stop retry storms
  • Load shedding rejects requests before queues explode
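The first strategy is cheap to implement. Here is a minimal "full jitter" backoff sketch (the function name and default values are my own):

```python
import random

# "Full jitter" backoff: sleep a random amount up to the exponential cap,
# so synchronized clients spread their retries instead of stampeding together.
def backoff_with_jitter(attempt, base=0.1, cap=10.0):
    exp = min(cap, base * (2 ** attempt))
    return random.uniform(0, exp)

delays = [backoff_with_jitter(a) for a in range(10)]
assert all(0 <= d <= 10.0 for d in delays)   # never exceeds the cap
assert backoff_with_jitter(2) <= 0.4         # attempt 2 is capped at base * 2^2
```

The jitter matters as much as the exponent: without it, all clients that timed out together retry together, recreating the load spike at each backoff interval.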

Part 2: CAP and PACELC

The CAP theorem correctly states that during network partitions, you must choose between consistency and availability. However, as Daniel Abadi and others have pointed out, this only addresses partition scenarios. Most systems spend 99.99% of their time in normal operation, where the real tradeoff is between latency and consistency. This is where PACELC comes in:

  • If Partition happens: choose Availability or Consistency
  • Else (normal operation): choose Latency or Consistency

PACELC provides a more complete framework for understanding real-world distributed databases:

PA/EL Systems (DynamoDB, Cassandra, Riak):

  • Partition → Choose Availability (serve stale data)
  • Normal → Choose Latency (1-2ms reads from any replica)
  • Use when: Shopping carts, session stores, high write throughput needed

PC/EC Systems (Google Spanner, VoltDB, HBase):

  • Partition → Choose Consistency (reject operations)
  • Normal → Choose Consistency (5-100ms for quorum coordination)
  • Use when: Financial transactions, inventory, anything that can’t be wrong

PA/EC Systems (MongoDB):

  • Partition → Choose Availability (with caveats – unreplicated writes go to rollback)
  • Normal → Choose Consistency (strong reads/writes in baseline)
  • Use when: Mixed workloads with mostly consistent needs

PC/EL Systems (PNUTS):

  • Partition → Choose Consistency
  • Normal → Choose Latency (async replication)
  • Use when: Read-heavy with timeline consistency acceptable

Quorum Consensus: Strong Consistency with Coordination

When R + W > N (read quorum + write quorum > total replicas), the read and write sets must overlap in at least one node. This overlap ensures that any read sees at least one node with the latest write, providing linearizability.

Example with N=5, R=3, W=3:

  • Write to replicas {1, 2, 3}
  • Read from replicas {2, 3, 4}
  • Overlap at {2, 3} guarantees we see the latest value
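For small clusters, the overlap property can be verified exhaustively (a quick check of my own, not part of the simulator):

```python
from itertools import combinations

# For every choice of W write replicas and R read replicas out of N nodes,
# check that the two sets share at least one node.
def quorums_always_overlap(n, r, w):
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

assert quorums_always_overlap(5, 3, 3)      # R + W = 6 > 5: every pair overlaps
assert not quorums_always_overlap(5, 2, 3)  # R + W = 5: disjoint quorums exist
```

The second assertion is the failure mode: with R=2, W=3, a write to {1, 2, 3} and a read from {4, 5} never meet, so the read can return stale data.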

Critical Nuances:

  • R + W > N alone is NOT sufficient for linearizability in practice. You need additional mechanisms: readers must perform read repair synchronously before returning results, and writers must read the latest state from a quorum before writing.
  • “Last write wins” based on wall-clock time breaks linearizability due to clock skew.
  • Sloppy quorums like those used in Dynamo are NOT linearizable because the nodes in the quorum can change during failures.
  • Even R = W = N doesn’t guarantee consistency if cluster membership changes.
  • Google Spanner uses atomic clocks and GPS to achieve strong consistency globally, with the TrueTime API providing less than 1ms clock uncertainty at the 99th percentile as of 2023.

How the Simulator Works

The CAP/PACELC simulator lets you explore these tradeoffs by configuring different consistency models and observing their behavior during normal operation and network partitions.

System Model:

  • N replica nodes, each with local storage
  • Configurable schema for data (to test compatibility)
  • Network latency between nodes (WAN vs LAN)
  • Optional partition mode (splits cluster)

Consistency Levels:

  1. Strong (R+W>N): Quorum reads and writes, linearizable
  2. Linearizable (R=W=N): All nodes must respond, highest consistency
  3. Weak (R=1, W=1): Single node, eventual consistency
  4. Eventual: Async replication, high availability

def get_quorum_size(self, operation_type):
    if self.consistency_level == ConsistencyLevel.STRONG:
        return (self.n_nodes // 2) + 1  # Majority
    elif self.consistency_level == ConsistencyLevel.LINEARIZABLE:
        return self.n_nodes  # All nodes
    elif self.consistency_level == ConsistencyLevel.WEAK:
        return 1  # Any node
    return 1  # Eventual: acknowledge from any node, replicate asynchronously

Key Parameters:

Parameter         | What It Tests        | Impact
n_nodes           | Replica count        | More nodes = more fault tolerance but higher coordination cost
consistency_level | Strong/Eventual/etc  | Directly controls latency vs consistency tradeoff
base_latency      | Node processing time | Baseline performance
network_latency   | Inter-node delay     | WAN (50-150ms) vs LAN (1-10ms) dramatically affects quorum cost
partition_active  | Network partition    | Tests CAP behavior (A vs C during partition)
write_ratio       | Read/write mix       | Write-heavy shows coordination bottleneck

What You Can Learn:

  1. Latency cost of consistency:
    • Run with Strong (R=3,W=3) at network_latency=5ms → ~15ms operations
    • Same at network_latency=100ms → ~300ms operations
    • Switch to Weak (R=1,W=1) → single-digit milliseconds regardless
  2. CAP during partitions:
    • Enable partition with Strong consistency → operations fail (choosing C over A)
    • Enable partition with Eventual → stale reads but available (choosing A over C)
  3. Quorum size tradeoffs:
    • Linearizable (R=W=N) → single node failure breaks everything
    • Strong (R=W=3 of N=5) → can tolerate 2 node failures
    • Measure failure rate vs consistency guarantees
  4. Geographic distribution:
    • Network latency 10ms (same datacenter) → quorum cost moderate
    • Network latency 150ms (cross-continent) → quorum cost severe
    • Observe when you should use eventual consistency for geo-distribution

The simulator tracks write/read latencies, inconsistent reads, failed operations, and success rates, giving you quantitative data on the tradeoffs.

Key Insights from Simulation

The simulator reveals that most architectural decisions are driven by normal operation latency, not partition handling. If you’re building a global system with 150ms cross-region latency, strong consistency means every operation takes 150ms+ for quorum coordination. That’s often unacceptable for user-facing features. This is why hybrid approaches are becoming standard: use strong consistency for critical invariants (financial transactions, inventory), eventual consistency for everything else (user profiles, preferences).


Part 3: CRDTs

CRDTs (Conflict-Free Replicated Data Types) provide strong eventual consistency (SEC) through mathematical guarantees, not probabilistic convergence. They work without coordination, consensus, or concurrency control. CRDTs rely on operations being commutative (order doesn’t matter), merge functions being associative and idempotent (forming a semilattice), and updates being monotonic according to a partial order.

Example: G-Counter (Grow-Only Counter)

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count
    
    def increment(self, amount=1):
        # Each replica tracks its own increments
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount
    
    def value(self):
        # Total is sum of all replicas
        return sum(self.counts.values())
    
    def merge(self, other):
        # Take max of each replica's count
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

Why this works:

  • Each replica only increments its own counter (no conflicts)
  • Merge takes max (idempotent: max(a,a) = a)
  • Order doesn’t matter: max(max(a,b),c) = max(a,max(b,c))
  • Eventually all replicas see all increments → convergence
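Putting the pieces together, here is a self-contained demo (the same G-Counter as above, with the replica id stored in the constructor) showing commutative, idempotent merges converging:

```python
# Self-contained G-Counter demo: two replicas increment independently,
# merge in either order, and converge to the same total.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Take max of each replica's count: associative, commutative, idempotent
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)   # a now sees both replicas' increments
b.merge(a)   # merge order doesn't matter
assert a.value() == b.value() == 5
a.merge(b)   # merging again changes nothing (idempotent)
assert a.value() == 5
```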

CRDT Types

There are two main approaches: State-based CRDTs (CvRDTs) send full local state and require merge functions to be commutative, associative, and idempotent. Operation-based CRDTs (CmRDTs) transmit only update operations and require reliable delivery in causal order. Delta-state CRDTs combine the advantages by transmitting compact deltas.

Four CRDTs in the Simulator:

  1. G-Counter: Increment only, perfect for metrics
  2. PN-Counter: Increment and decrement (two G-Counters)
  3. OR-Set: Add/remove elements, concurrent add wins
  4. LWW-Map: Last-write-wins with timestamps
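A PN-Counter really is just two grow-only maps glued together, one for increments and one for decrements. A minimal sketch of my own (not the simulator's implementation):

```python
# PN-Counter: increments and decrements tracked in separate grow-only maps;
# the value is the difference of their sums.
class PNCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.incs = {}  # replica_id -> total increments
        self.decs = {}  # replica_id -> total decrements

    def increment(self, n=1):
        self.incs[self.replica_id] = self.incs.get(self.replica_id, 0) + n

    def decrement(self, n=1):
        self.decs[self.replica_id] = self.decs.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.incs.values()) - sum(self.decs.values())

    def merge(self, other):
        # Merge each G-Counter independently by taking per-replica max
        for rid, c in other.incs.items():
            self.incs[rid] = max(self.incs.get(rid, 0), c)
        for rid, c in other.decs.items():
            self.decs[rid] = max(self.decs.get(rid, 0), c)

a, b = PNCounter("a"), PNCounter("b")
a.increment(10)
b.decrement(4)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 6
```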

Production systems using CRDTs include Redis Enterprise (CRDBs), Riak, Azure Cosmos DB for distributed data types, and Automerge/Yjs for collaborative editing like Google Docs. SoundCloud uses CRDTs in their audio distribution platform.

Important Limitations

CRDTs only provide eventual consistency, NOT strong consistency or linearizability. Different replicas can see concurrent operations in different orders temporarily. Not all operations are naturally commutative, and CRDTs cannot solve problems requiring atomic coordination like preventing double-booking without additional mechanisms.

The “Shopping Cart Problem”: You can use an OR-Set for shopping cart items, but if two clients concurrently remove the same item, your naive implementation might remove both. The CRDT guarantees convergence to a consistent state, but that state might not match user expectations.
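The add-wins behavior comes from unique tags plus tombstones: a remove deletes only the tags it has observed, so a concurrent add with a fresh tag survives. A minimal state-based OR-Set sketch, simplified from the approach the simulator uses:

```python
import uuid

# OR-Set: every add gets a unique tag; remove tombstones only observed tags.
class ORSet:
    def __init__(self):
        self.adds = {}       # element -> set of tags ever added
        self.removed = set() # tombstoned tags

    def add(self, element):
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        # Tombstone only the tags this replica has observed so far
        self.removed |= self.adds.get(element, set())

    def contains(self, element):
        return bool(self.adds.get(element, set()) - self.removed)

    def merge(self, other):
        for element, tags in other.adds.items():
            self.adds.setdefault(element, set()).update(tags)
        self.removed |= other.removed

a, b = ORSet(), ORSet()
a.add("book")
b.merge(a)          # both replicas observe "book"
b.remove("book")    # B tombstones the tag it observed
a.add("book")       # A concurrently re-adds with a fresh tag
a.merge(b); b.merge(a)
assert a.contains("book") and b.contains("book")  # the concurrent add wins
```

Note the tombstone set only grows, which is exactly the garbage-collection problem mentioned later.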

Byzantine fault tolerance is also a concern as traditional CRDTs assume all devices are trustworthy. Malicious devices can create permanent inconsistencies.

How the Simulator Works

The CRDT simulator demonstrates convergence through gossip-based replication. You can watch replicas diverge and converge as they exchange state.

Simulation Model:

  • Multiple replica nodes, each with independent CRDT state
  • Operations applied to random replicas (simulating distributed clients)
  • Periodic “merges” (gossip protocol) with probability merge_probability
  • Network delay between merges
  • Tracks convergence: do all replicas have identical state?

CRDT Implementations: Each CRDT type has its own semantics:

# G-Counter: Each replica has its own count, merge takes max
def merge(self, other):
    for replica_id, count in other.counts.items():
        self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

# OR-Set: Elements have unique tags, add always beats remove
def add(self, element, unique_tag):
    self.elements[element].add(unique_tag)

def remove(self, element, observed_tags):
    self.elements[element] -= observed_tags  # Only remove what was observed

# LWW-Map: Latest timestamp wins
def set(self, key, value, timestamp):
    current = self.entries.get(key)
    if current is None or timestamp > current[1]:
        self.entries[key] = (value, timestamp, self.replica_id)

Key Parameters:

Parameter         | What It Tests                   | Values
crdt_type         | Different convergence semantics | G-Counter, PN-Counter, OR-Set, LWW-Map
n_replicas        | Number of nodes                 | 2-8
n_operations      | Total updates                   | 10-100
merge_probability | Gossip frequency                | 0.0-1.0
network_delay     | Time for state exchange         | 0.0-2.0s

What You Can Learn:

  1. Convergence speed:
    • Set merge_probability=0.1 → slow convergence, replicas stay diverged
    • Set merge_probability=0.8 → fast convergence
    • Understand gossip frequency vs consistency window tradeoff
  2. OR-Set semantics:
    • Watch concurrent add/remove → add wins
    • See how unique tags prevent unintended deletions
    • Compare with naive set implementation
  3. LWW-Map data loss:
    • Two replicas set same key concurrently with different values
    • One value “wins” based on timestamp (or replica ID tie-break)
    • Data loss is possible – not suitable for all use cases
  4. Network partition tolerance:
    • Low merge probability simulates partition
    • Replicas diverge but operations still succeed (AP in CAP)
    • After “partition heals” (merges resume), all converge
    • No coordination needed, no operations failed

The simulator visually shows replica states over time and convergence status, making abstract CRDT theory concrete.

Key Insights from Simulation

CRDTs trade immediate consistency for availability and partition tolerance. The theoretical guarantees are proven: if all replicas receive all updates (eventual delivery), they will converge to the same state (strong convergence).

But the simulator reveals the practical challenges:

  • Merge semantics don’t always match user intent (LWW can lose data)
  • Tombstones can grow indefinitely (OR-Set needs garbage collection)
  • Causal ordering adds complexity (need vector clocks for some CRDTs)
  • Not suitable for operations requiring coordination (uniqueness constraints, atomic updates)

When to use CRDTs:

  • High-write distributed counters (page views, analytics)
  • Collaborative editing (where eventual consistency is acceptable)
  • Offline-first applications (sync when online)
  • Shopping carts (with careful semantic design)

When NOT to use CRDTs:

  • Bank account balances (need atomic transactions)
  • Inventory (can’t prevent overselling without coordination)
  • Unique constraints (usernames, reservation systems)
  • Access control (need immediate consistency)

Part 4: Cross-System Interaction (CSI) Failures

Research from EuroSys 2023 found that 20% of catastrophic cloud incidents and 37% of failures in major open-source distributed systems are CSI failures – where both systems work correctly in isolation but fail when connected. This is the NASA Mars Climate Orbiter problem: one team used metric units, another used imperial. Both systems worked perfectly. The spacecraft burned up in Mars’s atmosphere because of their interaction.

Why CSI Failures Are Different

Not dependency failures: The downstream system is available, it just can’t process what upstream sends.

Not library bugs: Libraries are single-address-space and well-tested. CSI failures cross system boundaries where testing is expensive.

Not component failures: Each system passes its own test suite. The bug only emerges through interaction.

CSI failures manifest across three planes: Data plane (51% – schema/metadata mismatches), Management plane (32% – configuration incoherence), and Control plane (17% – API semantic violations).

For example, a study of Apache Spark–Hive integration found 15 distinct discrepancies in simple write-read testing. Hive stored timestamps as long (milliseconds since epoch), while Spark expected a Timestamp type. Both worked in isolation but failed when integrated. Kafka and Flink hit an encoding mismatch: Kafka set compression.type=lz4, but Flink couldn’t decompress due to an old LZ4 library. The configuration was silently ignored in Flink, leading to data corruption for two weeks before detection.

Why Testing Doesn’t Catch CSI Failures

Analysis of Spark found only 6% of integration tests actually test cross-system interaction. Most “integration tests” test multiple components of the same system. Cross-system testing is expensive and often skipped. The problem compounds with modern architectures:

  • Microservices: More system boundaries to test
  • Multi-cloud: Different clouds with different semantics
  • Serverless: Fine-grained composition increases interaction surface area

How the Simulator Works

The CSI failure simulator models two systems exchanging data, with configurable discrepancies in schemas, encodings, and configurations.

System Model:

  • Two systems (upstream → downstream)
  • Each has its own schema definition (field types, encoding, nullable fields)
  • Each has its own configuration (timeouts, retry counts, etc.)
  • Data flows from System A to System B with potential conversion failures

Failure Scenarios:

  1. Metadata Mismatch (Hive/Spark):
    • System A: timestamp: long
    • System B: timestamp: Timestamp
    • Failure: Type coercion fails ~30% of the time
  2. Schema Conflict (Producer/Consumer):
    • System A: encoding: latin-1
    • System B: encoding: utf-8
    • Failure: Silent data corruption
  3. Configuration Incoherence (ServiceA/ServiceB):
    • System A: max_retries=3, timeout=30s
    • System B expects: max_retries=5, timeout=60s
    • Failure: ~40% of requests fail due to premature timeout
  4. API Semantic Violation (Upstream/Downstream):
    • Upstream assumes: synchronous, thread-safe
    • Downstream is: asynchronous, not thread-safe
    • Failure: Race conditions, out-of-order processing
  5. Type Confusion (SystemA/SystemB):
    • System A: amount: float
    • System B: amount: decimal
    • Failure: Precision loss in financial calculations

Implementation Details:

class DataSchema:
    def __init__(self, schema_id, fields, encoding, nullable_fields):
        self.fields = fields  # field_name -> type
        self.encoding = encoding
        
    def is_compatible(self, other):
        # Check field types and encoding
        return (self.fields == other.fields and 
                self.encoding == other.encoding)

class DataRecord:
    def serialize(self, target_schema):
        # Attempt type coercion field by field
        for field, value in self.data.items():
            expected_type = target_schema.fields[field]
            actual_type = self.schema.fields[field]
            
            if expected_type != actual_type:
                # 30% failure on type mismatch (simulating real world)
                if random.random() < 0.3:
                    return None  # Serialization failure
        
        # Check encoding compatibility
        if self.schema.encoding != target_schema.encoding:
            if random.random() < 0.2:  # 20% silent corruption
                return None
        
        return self.data  # Compatible: hand the record to the downstream system

Key Parameters:

Parameter        | What It Tests
failure_scenario | Type of CSI failure (metadata, schema, config, API, type)
duration         | Simulation length
request_rate     | Load (requests per second)

The simulator doesn’t have many tunable parameters because CSI failures are about specific incompatibilities, not gradual degradation. Each scenario models a real-world pattern.

What You Can Learn:

  1. Failure rates: CSI failures often manifest in 20-40% of requests (not 100%)
    • Some requests happen to have compatible data
    • Makes debugging harder (intermittent failures)
  2. Failure location:
    • Research shows 69% of CSI fixes go in the upstream system, often in connector modules that are less than 5% of the codebase
    • Simulator shows which system fails (usually downstream)
  3. Silent vs loud failures:
    • Type mismatches often crash (loud, easy to detect)
    • Encoding mismatches corrupt silently (hard to detect)
    • Config mismatches cause intermittent timeouts
  4. Prevention effectiveness:
    • Schema registry eliminates metadata mismatches
    • Configuration validation catches config incoherence
    • Contract testing prevents API semantic violations

Key Insights from Simulation

The simulator demonstrates that cross-system integration testing is essential but often skipped. Unit tests of each system won’t catch these failures.

Prevention strategies validated by simulation:

  1. Write-Read Testing: Write with System A, read with System B, verify integrity
  2. Schema Registry: Single source of truth for data schemas, enforced across systems
  3. Configuration Coherence Checking: Validate that shared configs match
  4. Contract Testing: Explicit, machine-checkable API contracts
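A schema registry, the second strategy above, can be sketched in a few lines. This is a hypothetical minimal version (the class and method names are illustrative, not the simulator's API): every producer registers its schema with a single authority, which rejects changes that would break existing readers.

```python
# Hypothetical sketch of a minimal schema registry; names are illustrative.

class SchemaRegistry:
    """Single source of truth: producers and consumers register schemas here."""
    def __init__(self):
        self._schemas = {}  # topic -> list of schema versions

    def register(self, topic, fields):
        versions = self._schemas.setdefault(topic, [])
        if versions and not self._backward_compatible(versions[-1], fields):
            raise ValueError(f"incompatible schema for {topic}")
        versions.append(dict(fields))
        return len(versions)  # version number assigned to this schema

    @staticmethod
    def _backward_compatible(old, new):
        # A new schema may add fields but not change or drop existing ones.
        return all(new.get(name) == ftype for name, ftype in old.items())

registry = SchemaRegistry()
registry.register("orders", {"id": "int", "amount": "float"})
registry.register("orders", {"id": "int", "amount": "float", "currency": "str"})  # ok
try:
    registry.register("orders", {"id": "str"})  # type change: rejected at write time
except ValueError:
    pass
```

The point is that the incompatibility is caught when the schema is published, not weeks later when a downstream reader hits a record it cannot decode.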

Hybrid Consistency Models

Modern systems increasingly use mixed consistency. RedBlue Consistency (2012) marks operations as needing strong consistency (red) or eventual consistency (blue). Replicache (2024) has the server assign a final total order while clients make optimistic local updates and rebase. For example, consider a calendar application:

# Strong consistency for room reservations (prevent double-booking)
def book_conference_room(room_id, time_slot):
    with transaction(consistency='STRONG'):
        room = get_room(room_id)  # look up the room record
        if room.is_available(time_slot):
            room.book(time_slot)
            return True
        return False

# CRDTs for collaborative editing (participant lists, notes)
def update_meeting_notes(meeting_id, notes):
    # LWW-Map CRDT, eventual consistency
    meeting.notes.merge(notes)

# Eventual consistency for preferences
def update_user_calendar_color(user_id, color):
    # Who cares if this propagates slowly?
    user_prefs[user_id] = color

Recent theoretical work on the CALM theorem proves that coordination-free consistency is achievable for certain problem classes. Research in 2025 provided mathematical definitions of when coordination is and isn’t required, separating coordination from computation.

What the Simulators Teach Us

Running all four simulators reveals the consistency spectrum:

No “best” consistency model exists:

  • Quorums are best when you need linearizability and can tolerate latency
  • CRDTs are best when you need high availability and can tolerate eventual consistency
  • Neither approach “bypasses” CAP – they make different tradeoffs
  • Real systems use hybrid models with different consistency for different operations

Practical Lessons

1. Design for Recovery, Not Just Prevention

The metastable failure simulator shows you can’t prevent all failures. Your retry logic, backoff strategy, and circuit breakers are more important than your happy path code. Validated strategies include:

  • Exponential backoff with jitter (spread retries over time)
  • Adaptive retry budgets (limit total fleet-wide retries)
  • Circuit breakers (detect patterns, stop storms)
  • Load shedding (fail fast rather than queue to death)
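The first two strategies above fit in a few lines of Python. This is a sketch, not the simulator's implementation: the "full jitter" backoff spreads retries uniformly over a growing window, and the retry budget caps fleet-wide retries as a fraction of recent requests so a storm fails fast instead of amplifying itself.

```python
import random

def backoff_with_jitter(attempt, base=0.1, cap=10.0):
    """Full jitter: sleep a random duration in [0, min(cap, base * 2^attempt)].
    Randomizing the delay prevents synchronized retry waves."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class RetryBudget:
    """Cap retries at a fraction of observed requests (e.g. 10%).
    When the budget is spent, callers fail fast instead of retrying."""
    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False

budget = RetryBudget(ratio=0.1)
for _ in range(100):
    budget.record_request()
# Only ~10 of the next retry attempts are allowed; the rest fail fast.
```

A circuit breaker is the same idea applied per-dependency: trip open when the error rate crosses a threshold, probe occasionally, and close again when the dependency recovers.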

2. Understand the Consistency Spectrum

The CAP/PACELC simulator demonstrates that consistency is not binary. You need to understand:

  • What consistency level do you actually need? (Most operations don’t need linearizability)
  • What’s the latency cost? (Quorum reads in cross-region deployment can be 100x slower)
  • What happens during partitions? (Can you sacrifice availability or must you serve stale data?)

Decision framework:

  • Use strong consistency for: money, inventory, locks, compliance
  • Use eventual consistency for: feeds, catalogs, analytics, caches
  • Use hybrid models for: most real-world applications

3. Test Cross-System Interactions

The CSI failure simulator reveals that the majority of CSI fixes land in connector modules that make up less than 5% of your codebase. This is where bugs hide. Essential tests include:

  • Write-read tests (write with System A, read with System B)
  • Round-trip tests (serialize/deserialize across boundaries)
  • Version compatibility matrix (test combinations)
  • Schema validation (machine-checkable contracts)

4. Leverage CRDTs Where Appropriate

The CRDT simulator shows that conflict-free convergence is possible for specific problem types. But you need to:

  • Understand the semantic limitations (LWW can lose data)
  • Design merge behavior carefully (does it match user intent?)
  • Handle garbage collection (tombstones, vector clocks)
  • Accept eventual consistency (not suitable for all use cases)
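The "LWW can lose data" caveat is easy to see concretely. Here is a minimal last-writer-wins register sketch (illustrative, not the simulator's CRDT implementation): merge is commutative, associative, and idempotent, so replicas converge, but convergence is achieved by silently discarding one of two concurrent writes.

```python
class LWWRegister:
    """Last-writer-wins register: highest timestamp wins on merge."""
    def __init__(self, value=None, timestamp=0):
        self.value, self.timestamp = value, timestamp

    def set(self, value, timestamp):
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

    def merge(self, other):
        # Commutative, associative, idempotent: replicas converge.
        if other.timestamp > self.timestamp:
            self.value, self.timestamp = other.value, other.timestamp

# Two replicas accept concurrent writes while partitioned:
a = LWWRegister(); a.set("draft by Alice", timestamp=100)
b = LWWRegister(); b.set("draft by Bob", timestamp=101)

a.merge(b); b.merge(a)
assert a.value == b.value == "draft by Bob"  # converged -- and Alice's write is gone
```

Whether that loss is acceptable is a product question, not a technical one: fine for a calendar color, unacceptable for meeting notes, which is why richer CRDTs (sets, sequences) exist.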

5. Monitor for Sustaining Effects

Metastability, retry storms, and goodput collapse are self-sustaining failure modes. They persist after the trigger is gone. Critical metrics include:

  • P99 latency vs timeout threshold (approaching timeout = danger)
  • Retry rate vs success rate (high retries = storm risk)
  • Queue depth (unbounded growth = admission control needed)
  • Goodput vs throughput (doing useful work vs spinning)
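The last metric deserves a concrete definition. In this sketch (field names are illustrative), a request that succeeds after the client has already timed out counts toward throughput but not goodput; a falling ratio while raw throughput holds steady is the classic metastable signature.

```python
def goodput_ratio(requests, client_timeout):
    """Fraction of completed requests the caller could actually use."""
    completed = [r for r in requests if r["status"] == "ok"]
    useful = [r for r in completed if r["latency"] <= client_timeout]
    return len(useful) / len(completed) if completed else 0.0

requests = [
    {"status": "ok", "latency": 0.2},
    {"status": "ok", "latency": 1.5},   # succeeded, but the caller already gave up
    {"status": "error", "latency": 0.1},
    {"status": "ok", "latency": 0.4},
]
assert goodput_ratio(requests, client_timeout=1.0) == 2 / 3
```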

Using the Simulators

All four simulators are available at: https://github.com/bhatti/simulators

Installation

git clone https://github.com/bhatti/simulators
cd simulators
pip install -r requirements.txt

Requirements:

  • Python 3.7+
  • streamlit (web UI)
  • simpy (discrete event simulation)
  • plotly (interactive visualizations)
  • numpy, pandas (data analysis)

Running Individual Simulators

# Metastable failure simulator
streamlit run metastable_simulator.py

# CAP/PACELC consistency simulator
streamlit run cap_consistency_simulator.py

# CRDT simulator
streamlit run crdt_simulator.py

# CSI failure simulator
streamlit run csi_failure_simulator.py

Running All Simulators

python run_all_simulators.py

Conclusion

Building distributed systems means confronting failure modes that are expensive or impossible to reproduce in real environments:

  • Metastable failures require specific load patterns and timing
  • Consistency tradeoffs need multi-region deployments to observe
  • CRDT convergence requires orchestrating concurrent operations across replicas
  • CSI failures need exact schema/config mismatches that don’t exist in test environments

Simulators bridge the gap between theoretical understanding and practical intuition:

  1. Cheaper than production testing: No cloud costs, no multi-region setup, instant feedback
  2. Safer than production experiments: Crash the simulator, not your service
  3. More complete than unit tests: See emergent behaviors, not just component correctness
  4. Faster iteration: Tweak parameters, re-run in seconds, build intuition through experimentation

What You Can’t Learn Without Simulation

  • When does retry amplification tip into metastability? (Depends on coordination slope, timeout, backoff)
  • How much does quorum coordination actually cost? (Depends on network latency, replica count, workload)
  • Do your CRDT semantics match user expectations? (Depends on merge behavior, conflict resolution)
  • Will your schema changes break integration? (Depends on type coercion, encoding, version skew)

The goal isn’t to prevent all failures; that’s impossible. The goal is to understand, anticipate, and recover from the failures that will inevitably occur.


References

Key research papers and resources used in this post:

  1. AWS Metastability Research (HotOS 2025) – Sustaining effects and goodput collapse
  2. Marc Brooker on DSQL – Practical distributed SQL considerations
  3. James Hamilton on Reliable Systems – Large-scale system design
  4. CSI Failures Study (EuroSys 2023) – Cross-system interaction failures
  5. PACELC Framework – Beyond CAP theorem
  6. Marc Brooker on CAP – CAP theorem revisited
  7. Anna CRDT Database – Autoscaling with CRDTs
  8. Linearizability Paper – Herlihy & Wing’s foundational work
  9. Designing Data-Intensive Applications by Martin Kleppmann
  10. Distributed Systems Reading Group – MIT CSAIL
  11. Jepsen.io – Kyle Kingsbury’s consistency testing
  12. Aphyr’s blog – Distributed systems deep dives

November 7, 2025

Three Decades of Remote Calls: My Journey from COBOL Mainframes to AI Agents

Filed under: Computing,Web Services — admin @ 9:50 pm

Introduction

I started writing network code in the early 1990s on IBM mainframes, armed with nothing but Assembly and COBOL. Today, I build distributed AI agents using gRPC, RAG pipelines, and serverless functions. Between these worlds lie decades of technological evolution and an uncomfortable realization: we keep relearning the same lessons. Over the years, I’ve seen simple ideas triumph over complex ones. The technology keeps changing, but the problems stay the same. Network latency hasn’t gotten faster relative to CPU speed. Distributed systems are still hard. Complexity still kills projects. And every new generation has to learn that abstractions leak. I’ll show you the technologies I’ve used, the mistakes I’ve made, and most importantly, what the past teaches us about building better systems in the future.

The Mainframe Era

CICS and 3270 Terminals

I started my career on IBM mainframes running CICS, which was used to build online applications accessed through 3270 “green screen” terminals. It used LU6.2 (Logical Unit 6.2) protocol, part of IBM’s Systems Network Architecture (SNA) to provide peer-to-peer communication. Here’s what a typical CICS application looked like in COBOL:

IDENTIFICATION DIVISION.
PROGRAM-ID. CUSTOMER-INQUIRY.

DATA DIVISION.
WORKING-STORAGE SECTION.
01  CUSTOMER-REC.
    05  CUST-ID        PIC 9(8).
    05  CUST-NAME      PIC X(30).
    05  CUST-BALANCE   PIC 9(7)V99.

LINKAGE SECTION.
01  DFHCOMMAREA.
    05  COMM-CUST-ID   PIC 9(8).

PROCEDURE DIVISION.
    EXEC CICS
        RECEIVE MAP('CUSTMAP')
        MAPSET('CUSTSET')
        INTO(CUSTOMER-REC)
    END-EXEC.
    
    EXEC CICS
        READ FILE('CUSTFILE')
        INTO(CUSTOMER-REC)
        RIDFLD(COMM-CUST-ID)
    END-EXEC.
    
    EXEC CICS
        SEND MAP('RESULTMAP')
        MAPSET('CUSTSET')
        FROM(CUSTOMER-REC)
    END-EXEC.
    
    EXEC CICS RETURN END-EXEC.

The CICS environment handled all the complexity—transaction management, terminal I/O, file access, and inter-system communication. For the user interface, I used Basic Mapping Support (BMS), which was notoriously finicky. You had to define screen layouts in a rigid format specifying exactly where each field appeared on the 24×80 character grid:

CUSTMAP  DFHMSD TYPE=&SYSPARM,                                    X
               MODE=INOUT,                                        X
               LANG=COBOL,                                        X
               CTRL=FREEKB
         DFHMDI SIZE=(24,80)
CUSTID   DFHMDF POS=(05,20),                                      X
               LENGTH=08,                                         X
               ATTRB=(UNPROT,NUM),                                X
               INITIAL='________'
CUSTNAME DFHMDF POS=(07,20),                                      X
               LENGTH=30,                                         X
               ATTRB=PROT

This was so painful that I wrote my own tool to convert simple text-based UI templates into BMS format. Looking back, this was my first foray into creating developer tools. The key lesson I learned from the mainframe era was that developer experience matters: cumbersome tools slow down development and introduce errors.

Moving to UNIX

Berkeley Sockets

After working on mainframes for a couple of years, I could see they were already in decline, so I transitioned to C and UNIX systems, which I had studied in college. I learned Berkeley Sockets, which was far more powerful and gave you complete control over the network. Here’s a simple TCP server in C using Berkeley Sockets:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PORT 8080
#define BUFFER_SIZE 1024

int main() {
    int server_fd, client_fd;
    struct sockaddr_in server_addr, client_addr;
    socklen_t client_len = sizeof(client_addr);
    char buffer[BUFFER_SIZE];
    
    // Create socket
    server_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (server_fd < 0) {
        perror("socket failed");
        exit(EXIT_FAILURE);
    }
    
    // Set socket options to reuse address
    int opt = 1;
    if (setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, 
                   &opt, sizeof(opt)) < 0) {
        perror("setsockopt failed");
        exit(EXIT_FAILURE);
    }
    
    // Bind to address
    memset(&server_addr, 0, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_addr.s_addr = INADDR_ANY;
    server_addr.sin_port = htons(PORT);
    
    if (bind(server_fd, (struct sockaddr *)&server_addr, 
             sizeof(server_addr)) < 0) {
        perror("bind failed");
        exit(EXIT_FAILURE);
    }
    
    // Listen for connections
    if (listen(server_fd, 10) < 0) {
        perror("listen failed");
        exit(EXIT_FAILURE);
    }
    
    printf("Server listening on port %d\n", PORT);
    
    while (1) {
        // Accept connection
        client_fd = accept(server_fd, 
                          (struct sockaddr *)&client_addr, 
                          &client_len);
        if (client_fd < 0) {
            perror("accept failed");
            continue;
        }
        
        // Read request
        ssize_t bytes_read = recv(client_fd, buffer, 
                                  BUFFER_SIZE - 1, 0);
        if (bytes_read > 0) {
            buffer[bytes_read] = '\0';
            printf("Received: %s\n", buffer);
            
            // Send response
            const char *response = "Message received\n";
            send(client_fd, response, strlen(response), 0);
        }
        
        close(client_fd);
    }
    
    close(server_fd);
    return 0;
}

As you can see, you had to track a lot of housekeeping: socket creation, binding, listening, accepting, reading, writing, and meticulous error handling at every step. Resource management was entirely manual—forget to close() a file descriptor and you’d leak resources; make a mistake with recv() buffer sizes and you’d overflow memory. I also experimented with Fast Sockets from UC Berkeley, which used kernel bypass techniques for lower latency and offered better performance.

The key lesson I learned was that low-level control comes at a steep cost. The cognitive load of managing these details makes it nearly impossible to focus on business logic.

Sun RPC and XDR

While working for a physics lab whose large computing facility consisted of Sun workstations, Solaris, and SPARC processors, I discovered Sun RPC (Remote Procedure Call) with XDR (External Data Representation). XDR solved a critical problem: how do you exchange data between machines with different architectures? A SPARC processor uses big-endian byte ordering, while x86 uses little-endian. XDR provided a canonical, architecture-neutral format for representing data. Here’s an XDR definition file (types.x):

/* Define a structure for customer data */
struct customer {
    int customer_id;
    string name<30>;
    float balance;
};

/* Define the RPC program */
program CUSTOMER_PROG {
    version CUSTOMER_VERS {
        int ADD_CUSTOMER(customer) = 1;
        customer GET_CUSTOMER(int) = 2;
    } = 1;
} = 0x20000001;

You’d run rpcgen on this file:

$ rpcgen types.x

This generated the client stub, server stub, and XDR serialization code automatically. Here’s what the server implementation looked like:

#include "types.h"

int *add_customer_1_svc(customer *cust, struct svc_req *rqstp) {
    static int result;
    
    // Add customer to database
    printf("Adding customer: %s (ID: %d)\n", 
           cust->name, cust->customer_id);
    
    result = 1;  // Success
    return &result;
}

customer *get_customer_1_svc(int *cust_id, struct svc_req *rqstp) {
    static customer result;
    
    // Fetch from database
    result.customer_id = *cust_id;
    result.name = strdup("John Doe");
    result.balance = 1000.50;
    
    return &result;
}

And the client:

#include "types.h"

int main(int argc, char *argv[]) {
    CLIENT *clnt;
    customer cust;
    int *result;
    
    clnt = clnt_create("localhost", CUSTOMER_PROG, 
                       CUSTOMER_VERS, "tcp");
    if (clnt == NULL) {
        clnt_pcreateerror("localhost");
        exit(1);
    }
    
    // Call remote procedure
    cust.customer_id = 123;
    cust.name = "Alice Smith";
    cust.balance = 5000.00;
    
    result = add_customer_1(&cust, clnt);
    if (result == NULL) {
        clnt_perror(clnt, "call failed");
    }
    
    clnt_destroy(clnt);
    return 0;
}

This was my first introduction to Interface Definition Languages (IDL) and I found that defining the contract once and generating code automatically reduces errors. This pattern would reappear in CORBA, Protocol Buffers, and gRPC.

Parallel Computing

During my graduate and post-graduate studies in the mid-1990s, while working full time, I did research in parallel and distributed computing. I worked with MPI (Message Passing Interface) and IBM’s MPL on SP1/SP2 systems. MPI provided collective operations like broadcast, scatter, gather, and reduce (a predecessor of Hadoop-style map/reduce). Here’s a simple MPI example that computes the sum of an array in parallel:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ARRAY_SIZE 1000

int main(int argc, char** argv) {
    int rank, size;
    int data[ARRAY_SIZE];
    int local_sum = 0, global_sum = 0;
    int chunk_size, start, end;
    
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    
    // Initialize data on root
    if (rank == 0) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            data[i] = i + 1;
        }
    }
    
    // Broadcast data to all processes
    MPI_Bcast(data, ARRAY_SIZE, MPI_INT, 0, MPI_COMM_WORLD);
    
    // Each process computes sum of its chunk
    chunk_size = ARRAY_SIZE / size;
    start = rank * chunk_size;
    end = (rank == size - 1) ? ARRAY_SIZE : start + chunk_size;
    
    for (int i = start; i < end; i++) {
        local_sum += data[i];
    }
    
    // Reduce all local sums to global sum
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, 
               MPI_SUM, 0, MPI_COMM_WORLD);
    
    if (rank == 0) {
        printf("Global sum: %d\n", global_sum);
    }
    
    MPI_Finalize();
    return 0;
}

For my post-graduate project, I built JavaNOW (Java on Networks of Workstations), which was inspired by Linda’s tuple spaces and MPI’s collective operations, but implemented in pure Java for portability. The key innovation was our Actor-inspired model. Instead of heavyweight processes communicating through message passing, I used lightweight Java threads with an Entity Space (distributed associative memory) where “actors” could put and get entities asynchronously. Here’s a simple example:

public class SumTask extends ActiveEntity {
    public Object execute(Object arg, JavaNOWAPI api) {
        Integer myId = (Integer) arg;
        EntitySpace workspace = new EntitySpace("RESULTS");
        
        // Compute partial sum
        int partialSum = 0;
        for (int i = myId * 100; i < (myId + 1) * 100; i++) {
            partialSum += i;
        }
        
        // Store result in EntitySpace
        return new Integer(partialSum);
    }
}

// Main application
public class ParallelSum extends JavaNOWApplication {
    public void master() {
        EntitySpace workspace = new EntitySpace("RESULTS");
        
        // Spawn parallel tasks
        for (int i = 0; i < 10; i++) {
            ActiveEntity task = new SumTask(new Integer(i));
            getJavaNOWAPI().eval(workspace, task, new Integer(i));
        }
        
        // Collect results
        int totalSum = 0;
        for (int i = 0; i < 10; i++) {
            Entity result = getJavaNOWAPI().get(
                workspace, new Entity(new Integer(i)));
            totalSum += ((Integer)result.getEntityValue()).intValue();
        }
        
        System.out.println("Total sum: " + totalSum);
    }
    
    public void slave(int id) {
        // Slave nodes wait for work
    }
}

Since then, the Actor model has gained wide adoption. For example, today’s serverless functions (AWS Lambda, Azure Functions, Google Cloud Functions) and modern frameworks like Akka, Orleans, and Dapr all embrace Actor-inspired patterns.

Novell and CGI

I also briefly worked with Novell’s IPX (Internetwork Packet Exchange) protocol, which had painful APIs. Here’s a taste of IPX socket programming (simplified):

#include <nwcalls.h>
#include <nwipxspx.h>

int main() {
    IPXAddress server_addr;
    IPXPacket packet;
    WORD socket_number = 0x4000;
    
    // Open IPX socket
    IPXOpenSocket(socket_number, 0);
    
    // Setup address
    memset(&server_addr, 0, sizeof(IPXAddress));
    memcpy(server_addr.network, target_network, 4);
    memcpy(server_addr.node, target_node, 6);
    server_addr.socket = htons(socket_number);
    
    // Send packet
    packet.packetType = 4;  // IPX packet type
    memcpy(packet.data, "Hello", 5);
    IPXSendPacket(socket_number, &server_addr, &packet);
    
    IPXCloseSocket(socket_number);
    return 0;
}

Early Web Development with CGI

When the web emerged in the early 1990s, I built applications using CGI (Common Gateway Interface) with Perl and C. I deployed these on Apache HTTP Server, which was the first production-quality open source web server and quickly became the dominant web server of the 1990s. Apache used process-driven concurrency: it forked a new process for each request or maintained a pool of pre-forked processes. CGI was conceptually simple: the web server launched a new UNIX process for every request, passing input via stdin and receiving output via stdout. Here’s a simple Perl CGI script:

#!/usr/bin/perl
use strict;
use warnings;
use CGI;

my $cgi = CGI->new;

print $cgi->header('text/html');
print "<html><body>\n";
print "<h1>Hello from CGI!</h1>\n";

my $name = $cgi->param('name') || 'Guest';
print "<p>Welcome, $name!</p>\n";

# Simulate database query
my $user_count = 42;
print "<p>Total users: $user_count</p>\n";

print "</body></html>\n";

And in C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    char *query_string = getenv("QUERY_STRING");
    
    printf("Content-Type: text/html\n\n");
    printf("<html><body>\n");
    printf("<h1>CGI in C</h1>\n");
    
    if (query_string) {
        printf("<p>Query string: %s</p>\n", query_string);
    }
    
    printf("</body></html>\n");
    return 0;
}

Later, I migrated to more performant servers: Tomcat for Java servlets, Jetty as an embedded server, and Netty for building custom high-performance network applications. These servers used asynchronous I/O and lightweight threads (or even non-blocking event loops in Netty‘s case).

The key lesson I learned was that scalability matters. The CGI model’s inability to maintain persistent connections or share state made it unsuitable for modern web applications. The shift from process-per-request to thread pools, and then to async I/O, represented fundamental improvements in how we handle concurrency.
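To make that last shift concrete, here is a minimal event-loop server using Python’s asyncio (a sketch for contrast, not one of the servers named above): one process, one thread, and the loop interleaves thousands of connections because every I/O call yields instead of blocking.

```python
import asyncio

async def handle_client(reader, writer):
    # Non-blocking read: the event loop runs other connections while we wait.
    data = await reader.readline()
    writer.write(b"Message received: " + data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    # One process serves many concurrent clients -- no fork per request.
    server = await asyncio.start_server(handle_client, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

# asyncio.run(main())
```

Compare this with the CGI model above, where each of those connections would have cost a full process fork, interpreter startup included.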

Java Adoption

When Java was released in 1995, I adopted it wholeheartedly. It freed developers from manual memory management and the endless malloc()/free() debugging that came with it. Network programming became far more approachable:

import java.io.*;
import java.net.*;

public class SimpleServer {
    public static void main(String[] args) throws IOException {
        int port = 8080;
        
        try (ServerSocket serverSocket = new ServerSocket(port)) {
            System.out.println("Server listening on port " + port);
            
            while (true) {
                try (Socket clientSocket = serverSocket.accept();
                     BufferedReader in = new BufferedReader(
                         new InputStreamReader(clientSocket.getInputStream()));
                     PrintWriter out = new PrintWriter(
                         clientSocket.getOutputStream(), true)) {
                    
                    String request = in.readLine();
                    System.out.println("Received: " + request);
                    
                    out.println("Message received");
                }
            }
        }
    }
}

Java Threads

I had previously used pthreads in C, which were hard to use; Java’s threading model was far simpler:

public class ConcurrentServer {
    public static void main(String[] args) throws IOException {
        ServerSocket serverSocket = new ServerSocket(8080);
        
        while (true) {
            Socket clientSocket = serverSocket.accept();
            
            // Spawn thread to handle client
            new Thread(new ClientHandler(clientSocket)).start();
        }
    }
    
    static class ClientHandler implements Runnable {
        private Socket socket;
        
        public ClientHandler(Socket socket) {
            this.socket = socket;
        }
        
        public void run() {
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()));
                 PrintWriter out = new PrintWriter(
                     socket.getOutputStream(), true)) {
                
                String request = in.readLine();
                // Process request
                out.println("Response");
                
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try { socket.close(); } catch (IOException e) {}
            }
        }
    }
}

Java’s synchronized keyword simplified thread-safe programming:

public class ThreadSafeCounter {
    private int count = 0;
    
    public synchronized void increment() {
        count++;
    }
    
    public synchronized int getCount() {
        return count;
    }
}

This was so much easier than managing mutexes, condition variables, and semaphores in C!

Java RMI: Remote Objects Made Practical

When Java added RMI (1997), distributed objects became practical. You could invoke methods on objects running on remote machines almost as if they were local. Define a remote interface:

import java.rmi.Remote;
import java.rmi.RemoteException;

public interface Calculator extends Remote {
    int add(int a, int b) throws RemoteException;
    int multiply(int a, int b) throws RemoteException;
}

Implement it:

import java.rmi.server.UnicastRemoteObject;
import java.rmi.RemoteException;

public class CalculatorImpl extends UnicastRemoteObject 
                            implements Calculator {
    
    public CalculatorImpl() throws RemoteException {
        super();
    }
    
    public int add(int a, int b) throws RemoteException {
        return a + b;
    }
    
    public int multiply(int a, int b) throws RemoteException {
        return a * b;
    }
}

Server:

import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;

public class Server {
    public static void main(String[] args) {
        try {
            LocateRegistry.createRegistry(1099);
            Calculator calc = new CalculatorImpl();
            Naming.rebind("Calculator", calc);
            System.out.println("Server ready");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Client:

import java.rmi.Naming;

public class Client {
    public static void main(String[] args) {
        try {
            Calculator calc = (Calculator) Naming.lookup(
                "rmi://localhost/Calculator");
            
            int result = calc.add(5, 3);
            System.out.println("5 + 3 = " + result);
            
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

I found RMI constraining: everything had to extend Remote, and you were stuck with Java-to-Java communication. The key lesson I learned was that abstractions that feel natural to developers get adopted.

JINI: RMI with Service Discovery

At a travel booking company in the mid 2000s, I used JINI, which Sun Microsystems pitched as “RMI on steroids.” JINI extended RMI with automatic service discovery, leasing, and distributed events. The core idea: services could join a network, advertise themselves, and be discovered by clients without hardcoded locations. Here’s a JINI service interface and registration:

import net.jini.core.lookup.ServiceRegistrar;
import net.jini.discovery.LookupDiscovery;
import net.jini.lease.LeaseRenewalManager;
import java.rmi.Remote;
import java.rmi.RemoteException;

// Service interface
public interface BookingService extends Remote {
    String searchFlights(String origin, String destination) 
        throws RemoteException;
    boolean bookFlight(String flightId, String passenger) 
        throws RemoteException;
}

// Service provider
public class BookingServiceProvider implements DiscoveryListener {
    // Renews leases in the background so registrations stay alive
    private final LeaseRenewalManager leaseManager = new LeaseRenewalManager();

    public void discovered(DiscoveryEvent event) {
        ServiceRegistrar[] registrars = event.getRegistrars();
        
        for (ServiceRegistrar registrar : registrars) {
            try {
                BookingService service = new BookingServiceImpl();
                Entry[] attributes = new Entry[] {
                    new Name("FlightBookingService")
                };
                
                ServiceItem item = new ServiceItem(null, service, attributes);
                ServiceRegistration reg = registrar.register(
                    item, Lease.FOREVER);
                
                // Auto-renew lease
                leaseManager.renewUntil(reg.getLease(), Lease.FOREVER, null);
                
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

Client discovery and usage:

public class BookingClient implements DiscoveryListener {
    
    public void discovered(DiscoveryEvent event) {
        ServiceRegistrar[] registrars = event.getRegistrars();
        
        for (ServiceRegistrar registrar : registrars) {
            try {
                ServiceTemplate template = new ServiceTemplate(
                    null, new Class[] { BookingService.class }, null);
                
                ServiceItem item = registrar.lookup(template);
                
                if (item != null) {
                    BookingService booking = (BookingService) item.service;
                    String flights = booking.searchFlights("SFO", "NYC");
                    booking.bookFlight("FL123", "John Smith");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

JINI provided automatic discovery, leasing, and location transparency, but it was too complex and supported only the Java ecosystem. The ideas were sound, though, and reappeared later in service registries (Consul, Eureka) and Kubernetes service discovery. I learned that service discovery is essential for dynamic systems, but the implementation must be simple.

CORBA

I used CORBA (Common Object Request Broker Architecture) for many years in the 1990s while building intelligent traffic systems. CORBA promised language-independent, platform-independent distributed objects. You could write a service in C++, invoke it from Java, and have clients in Python, all sharing the same IDL. Here’s a simple CORBA IDL definition:

module TrafficMonitor {
    struct SensorData {
        long sensor_id;
        float speed;
        long timestamp;
    };
    
    typedef sequence<SensorData> SensorDataList;
    
    interface TrafficService {
        void reportData(in SensorData data);
        SensorDataList getRecentData(in long minutes);
        float getAverageSpeed();
    };
};

Run the IDL compiler:

$ idl traffic.idl

This generated client stubs and server skeletons for your target language. I built a message-oriented middleware (MOM) system with CORBA that collected traffic data from road sensors and provided real-time traffic information.

C++ server implementation:

#include "TrafficService_impl.h"
#include <iostream>
#include <vector>

class TrafficServiceImpl : public POA_TrafficMonitor::TrafficService {
private:
    std::vector<TrafficMonitor::SensorData> data_store;
    
public:
    void reportData(const TrafficMonitor::SensorData& data) {
        data_store.push_back(data);
        std::cout << "Received data from sensor " 
                  << data.sensor_id << std::endl;
    }
    
    TrafficMonitor::SensorDataList* getRecentData(CORBA::Long minutes) {
        TrafficMonitor::SensorDataList* result = 
            new TrafficMonitor::SensorDataList();
        
        // Filter data from last N minutes
        time_t cutoff = time(NULL) - (minutes * 60);
        for (const auto& entry : data_store) {
            if (entry.timestamp >= cutoff) {
                result->length(result->length() + 1);
                (*result)[result->length() - 1] = entry;
            }
        }
        return result;
    }
    
    CORBA::Float getAverageSpeed() {
        if (data_store.empty()) return 0.0;
        
        float sum = 0.0;
        for (const auto& entry : data_store) {
            sum += entry.speed;
        }
        return sum / data_store.size();
    }
};

Java client:

import org.omg.CORBA.*;
import TrafficMonitor.*;

public class TrafficClient {
    public static void main(String[] args) {
        try {
            // Initialize ORB
            ORB orb = ORB.init(args, null);
            
            // Get reference to service
            org.omg.CORBA.Object obj = 
                orb.string_to_object("corbaname::localhost:1050#TrafficService");
            TrafficService service = TrafficServiceHelper.narrow(obj);
            
            // Report sensor data
            SensorData data = new SensorData();
            data.sensor_id = 101;
            data.speed = 65.5f;
            data.timestamp = (int)(System.currentTimeMillis() / 1000);
            
            service.reportData(data);
            
            // Get average speed
            float avgSpeed = service.getAverageSpeed();
            System.out.println("Average speed: " + avgSpeed + " mph");
            
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

However, the CORBA specification was massive, and different ORB (Object Request Broker) implementations like Orbix, ORBacus, and TAO couldn’t reliably interoperate despite claiming CORBA compliance. The binary protocol, IIOP, had subtle incompatibilities. CORBA did introduce valuable concepts:

  • Interceptors for cross-cutting concerns (authentication, logging, monitoring)
  • IDL-first design that forced clear interface definitions
  • Language-neutral protocols that actually worked (sometimes)

I learned that standards designed by committee are often over-engineered. CORBA and SOAP tried to solve every problem for everyone and ended up optimal for no one.

SOAP and WSDL

I used SOAP (Simple Object Access Protocol) and WSDL (Web Services Description Language) on a number of projects in the early 2000s, when they emerged as the standard for web services. The pitch: XML-based, platform-neutral, and “simple.” Here’s a WSDL definition:

<?xml version="1.0"?>
<definitions name="CustomerService"
   targetNamespace="http://example.com/customer"
   xmlns="http://schemas.xmlsoap.org/wsdl/"
   xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
   xmlns:tns="http://example.com/customer"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   
   <types>
      <xsd:schema targetNamespace="http://example.com/customer">
         <xsd:complexType name="Customer">
            <xsd:sequence>
               <xsd:element name="id" type="xsd:int"/>
               <xsd:element name="name" type="xsd:string"/>
               <xsd:element name="balance" type="xsd:double"/>
            </xsd:sequence>
         </xsd:complexType>
      </xsd:schema>
   </types>
   
   <message name="GetCustomerRequest">
      <part name="customerId" type="xsd:int"/>
   </message>
   
   <message name="GetCustomerResponse">
      <part name="customer" type="tns:Customer"/>
   </message>
   
   <portType name="CustomerPortType">
      <operation name="getCustomer">
         <input message="tns:GetCustomerRequest"/>
         <output message="tns:GetCustomerResponse"/>
      </operation>
   </portType>
   
   <binding name="CustomerBinding" type="tns:CustomerPortType">
      <soap:binding transport="http://schemas.xmlsoap.org/soap/http"/>
      <operation name="getCustomer">
         <soap:operation soapAction="getCustomer"/>
         <input>
            <soap:body use="literal"/>
         </input>
         <output>
            <soap:body use="literal"/>
         </output>
      </operation>
   </binding>
   
   <service name="CustomerService">
      <port name="CustomerPort" binding="tns:CustomerBinding">
         <soap:address location="http://example.com/customer"/>
      </port>
   </service>
</definitions>

A SOAP request looked like this:

<?xml version="1.0"?>
<soap:Envelope 
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:cust="http://example.com/customer">
  <soap:Header>
    <cust:Authentication>
      <cust:username>john</cust:username>
      <cust:password>secret</cust:password>
    </cust:Authentication>
  </soap:Header>
  <soap:Body>
    <cust:getCustomer>
      <cust:customerId>12345</cust:customerId>
    </cust:getCustomer>
  </soap:Body>
</soap:Envelope>

The response:

<?xml version="1.0"?>
<soap:Envelope 
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:cust="http://example.com/customer">
  <soap:Body>
    <cust:getCustomerResponse>
      <cust:customer>
        <cust:id>12345</cust:id>
        <cust:name>John Smith</cust:name>
        <cust:balance>5000.00</cust:balance>
      </cust:customer>
    </cust:getCustomerResponse>
  </soap:Body>
</soap:Envelope>

Look at all that XML overhead! A simple request became hundreds of bytes of markup. Because SOAP was designed by committee (IBM, Oracle, Microsoft), it tried to solve every possible enterprise problem: transactions, security, reliability, routing, orchestration. I learned that simplicity beats features; SOAP collapsed under its own weight.

Java Servlets and Filters

Java 1.1 added support for Servlets, which provided a much better model than CGI. Instead of spawning a process per request, servlets were Java classes instantiated once and reused across requests:

import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;

public class CustomerServlet extends HttpServlet {
    
    protected void doGet(HttpServletRequest request, 
                        HttpServletResponse response)
            throws ServletException, IOException {
        
        String customerId = request.getParameter("id");
        
        response.setContentType("application/json");
        PrintWriter out = response.getWriter();
        
        // Fetch customer data
        Customer customer = getCustomerFromDatabase(customerId);
        
        if (customer != null) {
            out.println(String.format(
                "{\"id\": \"%s\", \"name\": \"%s\", \"balance\": %.2f}",
                customer.getId(), customer.getName(), customer.getBalance()
            ));
        } else {
            response.setStatus(HttpServletResponse.SC_NOT_FOUND);
            out.println("{\"error\": \"Customer not found\"}");
        }
    }
    
    protected void doPost(HttpServletRequest request, 
                         HttpServletResponse response)
            throws ServletException, IOException {
        
        BufferedReader reader = request.getReader();
        StringBuilder json = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            json.append(line);
        }
        
        // Parse JSON and create customer
        Customer customer = parseJsonToCustomer(json.toString());
        saveCustomerToDatabase(customer);
        
        response.setStatus(HttpServletResponse.SC_CREATED);
        response.setContentType("application/json");
        PrintWriter out = response.getWriter();
        out.println(json.toString());
    }
}

Servlet Filters

The Filter API in Java Servlets was quite powerful: it implemented a chain-of-responsibility pattern for handling cross-cutting concerns:

import javax.servlet.*;
import javax.servlet.http.*;
import java.io.IOException;

public class AuthenticationFilter implements Filter {
    
    public void doFilter(ServletRequest request, 
                        ServletResponse response,
                        FilterChain chain) 
            throws IOException, ServletException {
        
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        HttpServletResponse httpResponse = (HttpServletResponse) response;
        
        // Check for authentication token
        String token = httpRequest.getHeader("Authorization");
        
        if (token == null || !isValidToken(token)) {
            httpResponse.setStatus(HttpServletResponse.SC_UNAUTHORIZED);
            httpResponse.getWriter().println("{\"error\": \"Unauthorized\"}");
            return;
        }
        
        // Pass to next filter or servlet
        chain.doFilter(request, response);
    }
    
    private boolean isValidToken(String token) {
        // Validate token
        return token.startsWith("Bearer ") && 
               validateJWT(token.substring(7));
    }
}

Configuration in web.xml:

<filter>
    <filter-name>AuthenticationFilter</filter-name>
    <filter-class>com.example.AuthenticationFilter</filter-class>
</filter>

<filter-mapping>
    <filter-name>AuthenticationFilter</filter-name>
    <url-pattern>/api/*</url-pattern>
</filter-mapping>

You could chain filters for compression, logging, transformation, and rate limiting, with clean separation of concerns and without touching business logic. I had previously used CORBA interceptors to inject cross-cutting logic, and the filter pattern solved a similar problem. This pattern would reappear in service meshes and API gateways.
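The chain-of-responsibility shape is easy to express outside the Servlet API. Here is a minimal, illustrative Python sketch (all names are made up) of filters wrapping a handler, each deciding whether to pass the request down the chain:

```python
def auth_filter(request, next_handler):
    # Reject requests without a token; otherwise pass down the chain.
    if "Authorization" not in request:
        return {"status": 401, "body": "Unauthorized"}
    return next_handler(request)

def logging_filter(request, next_handler):
    # Runs around the rest of the chain, like a Servlet filter.
    response = next_handler(request)
    response["logged"] = True  # stand-in for writing an access log
    return response

def handler(request):
    return {"status": 200, "body": "ok"}

def build_chain(filters, endpoint):
    # Fold the filters around the endpoint, outermost filter first.
    chain = endpoint
    for f in reversed(filters):
        chain = (lambda flt, nxt: lambda req: flt(req, nxt))(f, chain)
    return chain

app = build_chain([logging_filter, auth_filter], handler)
assert app({})["status"] == 401
ok = app({"Authorization": "Bearer t"})
assert ok["status"] == 200 and ok["logged"] is True
```

Business logic in `handler` stays untouched while concerns are layered around it, which is the same property the `web.xml` filter mappings gave you.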

Enterprise Java Beans

I used Enterprise Java Beans (EJB) in the late 1990s and early 2000s, which attempted to make distributed objects transparent. The key idea: write regular Java objects and let the application server handle all the distribution, persistence, transactions, and security. Here’s what an EJB 2.x entity bean looked like:

// Remote interface
public interface Customer extends EJBObject {
    String getName() throws RemoteException;
    void setName(String name) throws RemoteException;
    double getBalance() throws RemoteException;
    void setBalance(double balance) throws RemoteException;
}

// Home interface
public interface CustomerHome extends EJBHome {
    Customer create(Integer id, String name) throws CreateException, RemoteException;
    Customer findByPrimaryKey(Integer id) throws FinderException, RemoteException;
}

// Bean implementation
public class CustomerBean implements EntityBean {
    private Integer id;
    private String name;
    private double balance;
    
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public double getBalance() { return balance; }
    public void setBalance(double balance) { this.balance = balance; }
    
    // Container callbacks
    public void ejbActivate() {}
    public void ejbPassivate() {}
    public void ejbLoad() {}
    public void ejbStore() {}
    public void setEntityContext(EntityContext ctx) {}
    public void unsetEntityContext() {}
    
    public Integer ejbCreate(Integer id, String name) {
        this.id = id;
        this.name = name;
        this.balance = 0.0;
        return null;
    }
    
    public void ejbPostCreate(Integer id, String name) {}
}

The N+1 Selects Problem and Network Fallacy

The fatal flaw: EJB pretended network calls were free. I watched teams write code like this:

CustomerHome home = // ... lookup
Customer customer = home.findByPrimaryKey(customerId);

// Each getter is a remote call!
String name = customer.getName();        // Network call
double balance = customer.getBalance();  // Network call

Worse, I saw code that made remote calls in loops:

Collection customers = home.findAll();
double totalBalance = 0.0;
for (Object obj : customers) {
    Customer customer = (Customer) obj;  // EJB 2.x finders return a raw Collection
    // Remote call for EVERY iteration!
    totalBalance += customer.getBalance();
}

This code falls for the first two Fallacies of Distributed Computing: the network is reliable, and latency is zero. What looked like simple property access actually made remote calls over the network. I had previously built distributed and parallel applications, so I understood network latency. But it blindsided most developers because EJB deliberately hid it.

I learned that you can’t hide distribution. Network calls are fundamentally different from local calls. Latency, failure modes, and semantics are different. Transparency is a lie.
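The cost is easy to make visible. Below is a hypothetical Python sketch that counts simulated network round trips: fetching a field per object costs N+1 "remote" calls, while a batched query costs one. The class and method names are invented for illustration:

```python
class FakeRemoteServer:
    """Counts round trips to make the N+1 problem visible."""

    def __init__(self, balances):
        self.balances = balances
        self.round_trips = 0

    def find_all_ids(self):
        self.round_trips += 1
        return list(self.balances)

    def get_balance(self, customer_id):
        self.round_trips += 1  # one network hop per getter
        return self.balances[customer_id]

    def get_all_balances(self):
        self.round_trips += 1  # one hop for the whole batch
        return list(self.balances.values())

server = FakeRemoteServer({"a": 10.0, "b": 20.0, "c": 30.0})

# EJB-style "transparent" access: 1 find + N getters = N+1 round trips.
total = sum(server.get_balance(cid) for cid in server.find_all_ids())
assert total == 60.0
assert server.round_trips == 4

# Batched access: one round trip for the same answer.
server.round_trips = 0
assert sum(server.get_all_balances()) == 60.0
assert server.round_trips == 1
```

At even a millisecond per hop, the difference between 4 round trips and 10,001 is the difference between a fast page and a timeout, which is exactly what transparency hid.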

REST Standard

Before REST became mainstream, I experimented with “Plain Old XML” (POX) over HTTP by just sending XML documents via HTTP POST without all the SOAP ceremony:

import requests
import xml.etree.ElementTree as ET

# Create XML request
root = ET.Element('getCustomer')
ET.SubElement(root, 'customerId').text = '12345'
xml_data = ET.tostring(root, encoding='utf-8')

# Send HTTP POST
response = requests.post(
    'http://api.example.com/customer',
    data=xml_data,
    headers={'Content-Type': 'application/xml'}
)

# Parse response
response_tree = ET.fromstring(response.content)
name = response_tree.find('name').text

This was simpler than SOAP, but still ad hoc. Then REST (Representational State Transfer), based on Roy Fielding’s 2000 dissertation, offered a principled approach:

  • Use HTTP methods semantically (GET, POST, PUT, DELETE)
  • Resources have URLs
  • Stateless communication
  • Hypermedia as the engine of application state (HATEOAS)

Here’s a RESTful API in Python with Flask:

from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory data store
customers = {
    '12345': {'id': '12345', 'name': 'John Smith', 'balance': 5000.00}
}

@app.route('/customers/<customer_id>', methods=['GET'])
def get_customer(customer_id):
    customer = customers.get(customer_id)
    if customer:
        return jsonify(customer), 200
    return jsonify({'error': 'Customer not found'}), 404

@app.route('/customers', methods=['POST'])
def create_customer():
    data = request.get_json()
    customer_id = data['id']
    customers[customer_id] = data
    return jsonify(data), 201

@app.route('/customers/<customer_id>', methods=['PUT'])
def update_customer(customer_id):
    if customer_id not in customers:
        return jsonify({'error': 'Customer not found'}), 404
    
    data = request.get_json()
    customers[customer_id].update(data)
    return jsonify(customers[customer_id]), 200

@app.route('/customers/<customer_id>', methods=['DELETE'])
def delete_customer(customer_id):
    if customer_id in customers:
        del customers[customer_id]
        return '', 204
    return jsonify({'error': 'Customer not found'}), 404

if __name__ == '__main__':
    app.run(debug=True)

Client code became trivial:

import requests

# GET customer
response = requests.get('http://localhost:5000/customers/12345')
if response.status_code == 200:
    customer = response.json()
    print(f"Customer: {customer['name']}")

# Create new customer
new_customer = {
    'id': '67890',
    'name': 'Alice Johnson',
    'balance': 3000.00
}
response = requests.post(
    'http://localhost:5000/customers',
    json=new_customer
)

# Update customer
update_data = {'balance': 3500.00}
response = requests.put(
    'http://localhost:5000/customers/67890',
    json=update_data
)

# Delete customer
response = requests.delete('http://localhost:5000/customers/67890')

Hypermedia and HATEOAS

True REST embraced hypermedia—responses included links to related resources:

{
  "id": "12345",
  "name": "John Smith",
  "balance": 5000.00,
  "_links": {
    "self": {"href": "/customers/12345"},
    "orders": {"href": "/customers/12345/orders"},
    "transactions": {"href": "/customers/12345/transactions"}
  }
}

In practice, most APIs called “REST” weren’t truly RESTful: they didn’t implement HATEOAS or use HTTP status codes correctly. But even “REST-ish” APIs were far simpler than SOAP. The key lesson I learned was that REST succeeded because it built on HTTP, something every platform already supported. No new protocols, no complex tooling. Just URLs, HTTP verbs, and JSON.

JSON Replaces XML

With the adoption of REST, I saw XML web services (JAX-WS) decline, and I moved to JAX-RS for REST services with JSON payloads. XML required verbose markup:

<?xml version="1.0"?>
<customer>
    <id>12345</id>
    <name>John Smith</name>
    <balance>5000.00</balance>
    <orders>
        <order>
            <id>001</id>
            <date>2024-01-15</date>
            <total>99.99</total>
        </order>
        <order>
            <id>002</id>
            <date>2024-02-20</date>
            <total>149.50</total>
        </order>
    </orders>
</customer>

The same data in JSON:

{
  "id": "12345",
  "name": "John Smith",
  "balance": 5000.00,
  "orders": [
    {
      "id": "001",
      "date": "2024-01-15",
      "total": 99.99
    },
    {
      "id": "002",
      "date": "2024-02-20",
      "total": 149.50
    }
  ]
}

JSON does have limitations. It doesn’t natively support references or circular structures, making recursive relationships awkward:

{
  "id": "A",
  "children": [
    {
      "id": "B",
      "parent_id": "A"
    }
  ]
}

You have to encode references manually, unlike some XML schemas that support IDREF.
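One common workaround is to flatten the graph into a list of nodes with ids, then rebuild the object references after parsing. A small Python sketch under that assumption:

```python
import json

doc = json.loads("""
{"nodes": [
   {"id": "A", "parent_id": null},
   {"id": "B", "parent_id": "A"},
   {"id": "C", "parent_id": "A"}
]}
""")

# Rebuild in-memory references from the manually encoded ids.
by_id = {node["id"]: node for node in doc["nodes"]}
for node in doc["nodes"]:
    node["parent"] = by_id.get(node["parent_id"])

assert doc["nodes"][1]["parent"] is by_id["A"]
assert by_id["A"]["parent"] is None
```

The wire format stays plain JSON; the graph only exists after a resolution pass on the client, which is the manual bookkeeping that XML's IDREF handled for you.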

Erlang/OTP

I learned about the actor model in college and built a framework based on actors and the Linda memory model. In the mid-2000s, I encountered Erlang, which used actors for building distributed systems. Erlang was designed in the 1980s at Ericsson for building telecom switches, around the following principles:

  • “Let it crash” philosophy
  • No shared memory between processes
  • Lightweight processes (not OS threads—Erlang processes)
  • Supervision trees for fault recovery
  • Hot code swapping for zero-downtime updates

Here’s what an Erlang actor (process) looks like:

-module(customer_server).
-export([start/0, init/0, get_customer/1, update_balance/2]).

% Start the server
start() ->
    Pid = spawn(customer_server, init, []),
    register(customer_server, Pid),
    Pid.

% Initialize with empty state
init() ->
    State = #{},  % Empty map
    loop(State).

% Main loop - handle messages
loop(State) ->
    receive
        {get_customer, CustomerId, From} ->
            Customer = maps:get(CustomerId, State, not_found),
            From ! {customer, Customer},
            loop(State);
        
        {update_balance, CustomerId, NewBalance, From} ->
            Customer = maps:get(CustomerId, State),
            UpdatedCustomer = Customer#{balance => NewBalance},
            NewState = maps:put(CustomerId, UpdatedCustomer, State),
            From ! {ok, updated},
            loop(NewState);
        
        {add_customer, CustomerId, Customer, From} ->
            NewState = maps:put(CustomerId, Customer, State),
            From ! {ok, added},
            loop(NewState);
        
        stop ->
            ok;
        
        _ ->
            loop(State)
    end.

% Client functions
get_customer(CustomerId) ->
    customer_server ! {get_customer, CustomerId, self()},
    receive
        {customer, Customer} -> Customer
    after 5000 ->
        timeout
    end.

update_balance(CustomerId, NewBalance) ->
    customer_server ! {update_balance, CustomerId, NewBalance, self()},
    receive
        {ok, updated} -> ok
    after 5000 ->
        timeout
    end.

Erlang made concurrency simple through message passing between actors.

The Supervision Tree

A key innovation of Erlang was supervision trees. You organized processes in a hierarchy, and supervisors would restart crashed children:

-module(customer_supervisor).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    % Supervisor strategy
    SupFlags = #{
        strategy => one_for_one,  % Restart only failed child
        intensity => 5,            % Max 5 restarts
        period => 60               % Per 60 seconds
    },
    
    % Child specifications
    ChildSpecs = [
        #{
            id => customer_server,
            start => {customer_server, start, []},
            restart => permanent,   % Always restart
            shutdown => 5000,
            type => worker,
            modules => [customer_server]
        },
        #{
            id => order_server,
            start => {order_server, start, []},
            restart => permanent,
            shutdown => 5000,
            type => worker,
            modules => [order_server]
        }
    ],
    
    {ok, {SupFlags, ChildSpecs}}.

If a process crashed, the supervisor automatically restarted it and the system self-healed. A key lesson I learned from the actor model and Erlang was that shared mutable state is the enemy. Message passing with isolated state is simpler, more reliable, and easier to reason about. Today, AWS Lambda, Azure Durable Functions, and frameworks like Akka all embrace the actor model.
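The same shape translates to any language with threads and queues. A minimal, illustrative Python actor that owns its state and can only be reached through its mailbox, echoing the Erlang `loop/1` above:

```python
import queue
import threading

def customer_actor(mailbox):
    """Owns its state; the only way in or out is a message."""
    state = {}
    while True:
        msg = mailbox.get()
        if msg[0] == "add":
            _, cid, customer = msg
            state[cid] = customer
        elif msg[0] == "get":
            _, cid, reply_to = msg
            # Reply on the caller-supplied queue, like `From ! {...}`.
            reply_to.put(state.get(cid, "not_found"))
        elif msg[0] == "stop":
            return

mailbox = queue.Queue()
t = threading.Thread(target=customer_actor, args=(mailbox,))
t.start()

mailbox.put(("add", "12345", {"name": "John Smith", "balance": 5000.0}))
reply = queue.Queue()
mailbox.put(("get", "12345", reply))
assert reply.get(timeout=5)["name"] == "John Smith"
mailbox.put(("stop",))
t.join()
```

No lock is needed anywhere because only the actor thread ever touches `state`; the mailbox serializes all access.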

Distributed Erlang

Erlang made distributed computing almost trivial. Processes on different nodes communicated identically to local processes:

% On node1@host1
RemotePid = spawn('node2@host2', module, function, [args]),
RemotePid ! {message, data}.

% On node2@host2 - receives the message
receive
    {message, Data} -> 
        io:format("Received: ~p~n", [Data])
end.

The VM handled all the complexity of node discovery, connection management, and message routing. Today’s serverless functions are actors, and Kubernetes pods are supervised processes.

Asynchronous Messaging

As systems grew more complex, asynchronous messaging became essential. I worked extensively with Oracle Tuxedo, IBM MQSeries, WebLogic JMS, and WebSphere MQ, and later with ActiveMQ, ZeroMQ, RabbitMQ, and MQTT/AMQP-based brokers, primarily for inter-service communication and asynchronous processing. Here’s a JMS producer in Java:

import javax.jms.*;
import javax.naming.*;

public class OrderProducer {
    public static void main(String[] args) throws Exception {
        Context ctx = new InitialContext();
        ConnectionFactory factory = 
            (ConnectionFactory) ctx.lookup("ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("OrderQueue");
        
        Connection connection = factory.createConnection();
        Session session = connection.createSession(
            false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = session.createProducer(queue);
        
        // Create message
        TextMessage message = session.createTextMessage();
        message.setText("{ \"orderId\": \"12345\", " +
                       "\"customerId\": \"67890\", " +
                       "\"amount\": 99.99 }");
        
        // Send message
        producer.send(message);
        System.out.println("Order sent: " + message.getText());
        
        connection.close();
    }
}

JMS consumer:

import javax.jms.*;
import javax.naming.*;

public class OrderConsumer implements MessageListener {
    
    public static void main(String[] args) throws Exception {
        Context ctx = new InitialContext();
        ConnectionFactory factory = 
            (ConnectionFactory) ctx.lookup("ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("OrderQueue");
        
        Connection connection = factory.createConnection();
        Session session = connection.createSession(
            false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(queue);
        
        consumer.setMessageListener(new OrderConsumer());
        connection.start();
        
        System.out.println("Waiting for messages...");
        Thread.sleep(Long.MAX_VALUE);  // Keep running
    }
    
    public void onMessage(Message message) {
        try {
            TextMessage textMessage = (TextMessage) message;
            System.out.println("Received order: " + 
                             textMessage.getText());
            
            // Process order
            processOrder(textMessage.getText());
            
        } catch (JMSException e) {
            e.printStackTrace();
        }
    }
    
    private void processOrder(String orderJson) {
        // Business logic here
    }
}

Asynchronous messaging is essential for building resilient, scalable systems. It decouples producers from consumers, provides natural backpressure, and enables event-driven architectures.
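The decoupling and backpressure can be demonstrated with nothing more than a bounded queue. An illustrative Python sketch: when the consumer falls behind, the producer blocks (or fails fast) instead of overwhelming it:

```python
import queue

orders = queue.Queue(maxsize=2)  # bounded buffer between services

# Producer: enqueue until the buffer is full, then feel backpressure.
orders.put({"orderId": "1"})
orders.put({"orderId": "2"})
try:
    orders.put({"orderId": "3"}, timeout=0.05)  # buffer full
    overflowed = False
except queue.Full:
    overflowed = True  # producer is throttled, not the consumer crushed
assert overflowed

# Consumer: draining the queue relieves the pressure.
assert orders.get()["orderId"] == "1"
orders.put({"orderId": "3"}, timeout=0.05)  # now there is room
assert orders.qsize() == 2
```

A real broker adds durability, acknowledgements, and routing on top, but the core contract is the same: the queue absorbs bursts and the bound propagates slowdown back to producers.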

Spring Framework and Aspect-Oriented Programming

In the early 2000s, I used aspect-oriented programming (AOP) to inject cross-cutting concerns such as logging, security, and monitoring. Here is a typical example:

@Aspect
@Component
public class LoggingAspect {
    
    private static final Logger logger = 
        LoggerFactory.getLogger(LoggingAspect.class);
    
    @Before("execution(* com.example.service.*.*(..))")
    public void logBefore(JoinPoint joinPoint) {
        logger.info("Executing: " + 
                   joinPoint.getSignature().getName());
    }
    
    @AfterReturning(
        pointcut = "execution(* com.example.service.*.*(..))",
        returning = "result")
    public void logAfterReturning(JoinPoint joinPoint, Object result) {
        logger.info("Method " + 
                   joinPoint.getSignature().getName() + 
                   " returned: " + result);
    }
    
    @Around("@annotation(com.example.Monitored)")
    public Object measureTime(ProceedingJoinPoint joinPoint) 
            throws Throwable {
        long start = System.currentTimeMillis();
        Object result = joinPoint.proceed();
        long time = System.currentTimeMillis() - start;
        logger.info(joinPoint.getSignature().getName() + 
                   " took " + time + " ms");
        return result;
    }
}

I later adopted the Spring Framework, which revolutionized Java development with dependency injection and AOP:

// Spring configuration
@Configuration
public class AppConfig {
    
    @Bean
    public CustomerService customerService() {
        return new CustomerServiceImpl(customerRepository());
    }
    
    @Bean
    public CustomerRepository customerRepository() {
        return new DatabaseCustomerRepository(dataSource());
    }
    
    @Bean
    public DataSource dataSource() {
        DriverManagerDataSource ds = new DriverManagerDataSource();
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        ds.setUrl("jdbc:mysql://localhost/mydb");
        return ds;
    }
}

// Service class
@Service
public class CustomerServiceImpl implements CustomerService {
    private final CustomerRepository repository;
    
    @Autowired
    public CustomerServiceImpl(CustomerRepository repository) {
        this.repository = repository;
    }
    
    @Transactional
    public void updateBalance(String customerId, double newBalance) {
        Customer customer = repository.findById(customerId);
        customer.setBalance(newBalance);
        repository.save(customer);
    }
}

Spring Remoting

Spring added its own remoting protocols. HTTP Invoker serialized Java objects over HTTP:

// Server configuration
@Configuration
public class ServerConfig {
    
    @Bean
    public HttpInvokerServiceExporter customerService() {
        HttpInvokerServiceExporter exporter = 
            new HttpInvokerServiceExporter();
        exporter.setService(customerServiceImpl());
        exporter.setServiceInterface(CustomerService.class);
        return exporter;
    }
}

// Client configuration
@Configuration
public class ClientConfig {
    
    @Bean
    public HttpInvokerProxyFactoryBean customerService() {
        HttpInvokerProxyFactoryBean proxy = 
            new HttpInvokerProxyFactoryBean();
        proxy.setServiceUrl("http://localhost:8080/customer");
        proxy.setServiceInterface(CustomerService.class);
        return proxy;
    }
}

I learned that AOP addressed cross-cutting concerns elegantly for monoliths. But in microservices, these concerns moved to the infrastructure layer: service meshes, API gateways, and sidecars.

Proprietary Protocols

When working for large companies like Amazon, I encountered Amazon Coral, which is a proprietary RPC framework influenced by CORBA. Coral used an IDL to define service interfaces and supported multiple languages:

// Coral IDL
namespace com.amazon.example

structure CustomerData {
    1: required integer customerId
    2: required string name
    3: optional double balance
}

service CustomerService {
    CustomerData getCustomer(1: integer customerId)
    void updateCustomer(1: CustomerData customer)
    list<CustomerData> listCustomers()
}

The IDL compiler generated client and server code for Java, C++, and other languages. Coral handled serialization, versioning, and service discovery. When I later worked for AWS, I used Smithy, Coral’s successor, which Amazon open-sourced. Here is a similar contract in Smithy:

namespace com.example

service CustomerService {
    version: "2024-01-01"
    operations: [
        GetCustomer
        UpdateCustomer
        ListCustomers
    ]
}

@readonly
operation GetCustomer {
    input: GetCustomerInput
    output: GetCustomerOutput
    errors: [CustomerNotFound]
}

structure GetCustomerInput {
    @required
    customerId: String
}

structure GetCustomerOutput {
    @required
    customer: Customer
}

structure Customer {
    @required
    customerId: String
    
    @required
    name: String
    
    balance: Double
}

@error("client")
structure CustomerNotFound {
    @required
    message: String
}

I learned that IDL-first design remains valuable; Smithy built on lessons from CORBA, Protocol Buffers, and Thrift.
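The practical payoff of contract-first design is that validation becomes mechanical on both sides of the wire. A toy Python sketch of that idea, using an invented mini-schema that mirrors the `@required` shapes above (this is not Smithy's actual format):

```python
# Invented mini-schema loosely mirroring the Smithy Customer structure.
CUSTOMER_SCHEMA = {
    "customerId": {"type": str, "required": True},
    "name": {"type": str, "required": True},
    "balance": {"type": float, "required": False},
}

def validate(payload, schema):
    """Check a request payload against the contract, collecting errors."""
    errors = []
    for field, rules in schema.items():
        if field not in payload:
            if rules["required"]:
                errors.append(f"missing required field: {field}")
        elif not isinstance(payload[field], rules["type"]):
            errors.append(f"wrong type for {field}")
    return errors

assert validate({"customerId": "c1", "name": "John Smith"}, CUSTOMER_SCHEMA) == []
assert validate({"customerId": "c1"}, CUSTOMER_SCHEMA) == ["missing required field: name"]
```

Real IDL toolchains go further and generate typed clients, servers, and documentation from the one contract, so the schema cannot drift from the code.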

Long Polling, WebSockets, and Real-Time

In the late 2000s, I built real-time applications for streaming financial charts and technical data. I used long polling, where the client made a request that the server held open until data was available:

// Client-side long polling
function pollServer() {
    fetch('/api/events')
        .then(response => response.json())
        .then(data => {
            console.log('Received event:', data);
            updateUI(data);
            
            // Immediately poll again
            pollServer();
        })
        .catch(error => {
            console.error('Polling error:', error);
            // Retry after delay
            setTimeout(pollServer, 5000);
        });
}

pollServer();

Server-side (Node.js):

const express = require('express');
const app = express();

let pendingRequests = [];

app.get('/api/events', (req, res) => {
    // Hold request open
    pendingRequests.push(res);
    
    // Timeout after 30 seconds
    setTimeout(() => {
        const index = pendingRequests.indexOf(res);
        if (index !== -1) {
            pendingRequests.splice(index, 1);
            res.json({ type: 'heartbeat' });
        }
    }, 30000);
});

// When an event occurs
function broadcastEvent(event) {
    pendingRequests.forEach(res => {
        res.json(event);
    });
    pendingRequests = [];
}

WebSockets

I also used WebSockets for real-time applications that needed true bidirectional communication. However, earlier browsers didn’t fully support them, so I fell back to long polling when WebSockets were unavailable:

// Server (Node.js with ws library)
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws) => {
    console.log('Client connected');
    
    // Send initial data
    ws.send(JSON.stringify({
        type: 'INIT',
        data: getInitialData()
    }));
    
    // Handle messages
    ws.on('message', (message) => {
        const msg = JSON.parse(message);
        
        if (msg.type === 'SUBSCRIBE') {
            subscribeToSymbol(ws, msg.symbol);
        }
    });
    
    ws.on('close', () => {
        console.log('Client disconnected');
        unsubscribeAll(ws);
    });
});

// Stream live data
function streamPriceUpdate(symbol, price) {
    wss.clients.forEach((client) => {
        if (client.readyState === WebSocket.OPEN) {
            if (isSubscribed(client, symbol)) {
                client.send(JSON.stringify({
                    type: 'PRICE_UPDATE',
                    symbol: symbol,
                    price: price,
                    timestamp: Date.now()
                }));
            }
        }
    });
}

Client:

function connectWebSocket() {
    const ws = new WebSocket('ws://localhost:8080');

    ws.onopen = () => {
        console.log('Connected to server');

        // Subscribe to symbols
        ws.send(JSON.stringify({
            type: 'SUBSCRIBE',
            symbol: 'AAPL'
        }));
    };

    ws.onmessage = (event) => {
        const message = JSON.parse(event.data);

        switch (message.type) {
            case 'INIT':
                initializeChart(message.data);
                break;
            case 'PRICE_UPDATE':
                updateChart(message.symbol, message.price);
                break;
        }
    };

    ws.onerror = (error) => {
        console.error('WebSocket error:', error);
    };

    ws.onclose = () => {
        console.log('Disconnected, attempting reconnect...');
        setTimeout(connectWebSocket, 1000);
    };
}

connectWebSocket();

I learned that different problems need different protocols. REST works for request-response. WebSockets excel at real-time bidirectional communication.
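The reconnect logic in the client above retries on a fixed delay; in production I added exponential backoff with jitter so that a flapping server is not hammered by thousands of clients reconnecting in lockstep. A minimal sketch of the backoff schedule (pure logic; the function name and defaults are illustrative):

```python
import random

def backoff_delays(base=1.0, cap=30.0, attempts=6, jitter=0.0):
    """Yield capped exponential backoff delays: base * 2^n, never above cap.

    `jitter` adds up to that fraction of random spread on top of each delay,
    which spreads out reconnect storms (the "thundering herd" problem).
    """
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        yield delay + delay * jitter * random.random()

# With jitter=0 the schedule is deterministic:
print(list(backoff_delays()))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

A client would sleep for each successive delay between reconnect attempts, and reset the schedule after a successful connection.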

Vert.x and Hazelcast for High-Performance Streaming

For a production streaming chart system handling high-volume market data, I used Vert.x with Hazelcast. Vert.x is a reactive toolkit built on Netty that excels at handling thousands of concurrent connections with minimal resources. Hazelcast provided distributed caching and coordination across multiple Vert.x instances. Market data flowed into Hazelcast distributed topics; Vert.x instances subscribed to these topics and pushed updates to connected WebSocket clients. If WebSocket wasn’t supported, we fell back to long polling automatically.

import io.vertx.core.Vertx;
import io.vertx.core.http.HttpServer;
import io.vertx.core.http.HttpServerRequest;
import io.vertx.core.http.ServerWebSocket;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.ITopic;
import com.hazelcast.core.Message;
import com.hazelcast.core.MessageListener;
import java.util.concurrent.ConcurrentHashMap;
import java.util.Set;

public class MarketDataServer {
    private final Vertx vertx;
    private final HazelcastInstance hazelcast;
    private final ConcurrentHashMap<String, Set<ServerWebSocket>> subscriptions;
    private final ConcurrentHashMap<String, Set<PollHandler>> pollHandlers;
    
    public MarketDataServer() {
        this.vertx = Vertx.vertx();
        this.hazelcast = Hazelcast.newHazelcastInstance();
        this.subscriptions = new ConcurrentHashMap<>();
        this.pollHandlers = new ConcurrentHashMap<>();
        
        // Subscribe to market data topic
        ITopic<MarketData> topic = hazelcast.getTopic("market-data");
        topic.addMessageListener(new MessageListener<MarketData>() {
            public void onMessage(Message<MarketData> message) {
                broadcastToSubscribers(message.getMessageObject());
            }
        });
    }
    
    public void start() {
        HttpServer server = vertx.createHttpServer();
        
        server.webSocketHandler(ws -> {
            String path = ws.path();
            
            if (path.startsWith("/stream/")) {
                String symbol = path.substring(8);
                handleWebSocketConnection(ws, symbol);
            } else {
                ws.reject();
            }
        });
        
        // Long polling fallback
        server.requestHandler(req -> {
            if (req.path().startsWith("/poll/")) {
                String symbol = req.path().substring(6);
                handleLongPolling(req, symbol);
            }
        });
        
        server.listen(8080, result -> {
            if (result.succeeded()) {
                System.out.println("Market data server started on port 8080");
            }
        });
    }
    
    private void handleWebSocketConnection(ServerWebSocket ws, String symbol) {
        subscriptions.computeIfAbsent(symbol, k -> ConcurrentHashMap.newKeySet())
                     .add(ws);
        
        ws.closeHandler(v -> {
            Set<ServerWebSocket> sockets = subscriptions.get(symbol);
            if (sockets != null) {
                sockets.remove(ws);
            }
        });
        
        // Send initial snapshot from Hazelcast cache
        IMap<String, MarketData> cache = hazelcast.getMap("market-snapshot");
        MarketData data = cache.get(symbol);
        if (data != null) {
            ws.writeTextMessage(data.toJson());
        }
    }
    
    private void handleLongPolling(HttpServerRequest req, String symbol) {
        String lastEventId = req.getParam("lastEventId");
        
        // Hold request until data available or timeout
        long timerId = vertx.setTimer(30000, id -> {
            req.response()
               .putHeader("Content-Type", "application/json")
               .end("{\"type\":\"heartbeat\"}");
        });
        
        // Register a one-time poll handler, tracked separately from the
        // WebSocket subscriber sets (assumes a ConcurrentHashMap<String,
        // Set<PollHandler>> pollHandlers field; PollHandler wraps req + timerId)
        pollHandlers.computeIfAbsent(symbol,
            k -> ConcurrentHashMap.newKeySet())
            .add(new PollHandler(req, timerId));
    }
    
    private void broadcastToSubscribers(MarketData data) {
        String symbol = data.getSymbol();
        
        // WebSocket subscribers
        Set<ServerWebSocket> sockets = subscriptions.get(symbol);
        if (sockets != null) {
            String json = data.toJson();
            sockets.forEach(ws -> {
                if (!ws.isClosed()) {
                    ws.writeTextMessage(json);
                }
            });
        }
        
        // Update Hazelcast cache for new subscribers
        IMap<String, MarketData> cache = hazelcast.getMap("market-snapshot");
        cache.put(symbol, data);
    }
    
    public static void main(String[] args) {
        new MarketDataServer().start();
    }
}

Publishing market data to Hazelcast from data feed:

public class MarketDataPublisher {
    private final HazelcastInstance hazelcast;
    
    public void publishUpdate(String symbol, double price, long volume) {
        MarketData data = new MarketData(symbol, price, volume, 
                                         System.currentTimeMillis());
        
        // Publish to topic - all Vert.x instances receive it
        ITopic<MarketData> topic = hazelcast.getTopic("market-data");
        topic.publish(data);
    }
}

This architecture provided:

  • Vert.x Event Loop: Non-blocking I/O handled 10,000+ concurrent WebSocket connections per instance
  • Hazelcast Distribution: Market data shared across multiple Vert.x instances without a central message broker
  • Horizontal Scaling: Adding Vert.x instances automatically joined the Hazelcast cluster
  • Low Latency: Sub-millisecond message propagation within the cluster
  • Automatic Fallback: Clients detected WebSocket support; older browsers used long polling

Facebook Thrift and Google Protocol Buffers

I experimented with Facebook Thrift and Google Protocol Buffers, which provided IDL-based RPC with pluggable transports and serialization formats. Here is an example using Protocol Buffers:

syntax = "proto3";

package customer;

message Customer {
    int32 customer_id = 1;
    string name = 2;
    double balance = 3;
}

service CustomerService {
    rpc GetCustomer(GetCustomerRequest) returns (Customer);
    rpc UpdateBalance(UpdateBalanceRequest) returns (UpdateBalanceResponse);
    rpc ListCustomers(ListCustomersRequest) returns (CustomerList);
}

message GetCustomerRequest {
    int32 customer_id = 1;
}

message UpdateBalanceRequest {
    int32 customer_id = 1;
    double new_balance = 2;
}

message UpdateBalanceResponse {
    bool success = 1;
}

message ListCustomersRequest {}

message CustomerList {
    repeated Customer customers = 1;
}

Python server with gRPC (which uses Protocol Buffers):

import grpc
from concurrent import futures
import customer_pb2
import customer_pb2_grpc

class CustomerServicer(customer_pb2_grpc.CustomerServiceServicer):
    
    def GetCustomer(self, request, context):
        return customer_pb2.Customer(
            customer_id=request.customer_id,
            name="John Doe",
            balance=5000.00
        )
    
    def UpdateBalance(self, request, context):
        print(f"Updating balance for {request.customer_id} " +
              f"to {request.new_balance}")
        return customer_pb2.UpdateBalanceResponse(success=True)
    
    def ListCustomers(self, request, context):
        customers = [
            customer_pb2.Customer(customer_id=1, name="Alice", balance=1000),
            customer_pb2.Customer(customer_id=2, name="Bob", balance=2000),
        ]
        return customer_pb2.CustomerList(customers=customers)

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    customer_pb2_grpc.add_CustomerServiceServicer_to_server(
        CustomerServicer(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    print("Server started on port 50051")
    server.wait_for_termination()

if __name__ == '__main__':
    serve()

I learned that binary protocols offer significant efficiency gains. JSON is human-readable and convenient for debugging, but in high-performance scenarios, binary protocols like Protocol Buffers reduce payload size and serialization overhead.
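The size difference is easy to demonstrate. The sketch below packs the same Customer record with Python’s stdlib struct module (standing in for a real Protocol Buffers encoding, which is even more compact thanks to varints and tag-based fields) and compares it with the JSON form:

```python
import json
import struct

customer = {"customer_id": 12345, "name": "John Doe", "balance": 5000.00}

# JSON: human-readable, but field names are repeated in every message
json_bytes = json.dumps(customer).encode("utf-8")

# Binary: a fixed layout both sides agree on up front (like an IDL contract):
# little-endian int32 id, uint8 name length + name bytes, float64 balance
name = customer["name"].encode("utf-8")
bin_bytes = struct.pack(f"<iB{len(name)}sd",
                        customer["customer_id"], len(name), name,
                        customer["balance"])

print(len(json_bytes), len(bin_bytes))  # binary is roughly a third the size
```

The binary form also decodes without parsing text, which is where much of the serialization overhead goes in high-throughput services.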

Serverless and Lambda: Functions as a Service

Around 2015, AWS Lambda introduced serverless computing where you wrote functions, and AWS handled all the infrastructure:

// Lambda function (Node.js)
exports.handler = async (event) => {
    const customerId = event.queryStringParameters.customerId;
    
    // Query DynamoDB
    const AWS = require('aws-sdk');
    const dynamodb = new AWS.DynamoDB.DocumentClient();
    
    const result = await dynamodb.get({
        TableName: 'Customers',
        Key: { customerId: customerId }
    }).promise();
    
    if (result.Item) {
        return {
            statusCode: 200,
            body: JSON.stringify(result.Item)
        };
    } else {
        return {
            statusCode: 404,
            body: JSON.stringify({ error: 'Customer not found' })
        };
    }
};

Serverless was powerful: no servers to manage, automatic scaling, pay-per-invocation pricing. It felt like the Actor model I’d worked with in my research: small, stateless, event-driven functions.

However, I also encountered several problems with serverless:

  • Cold starts: First invocation could be slow (though it has improved with recent updates)
  • Timeouts: Functions had maximum execution time (15 minutes for Lambda)
  • State management: Functions were stateless; you needed external state stores
  • Orchestration: Coordinating multiple functions was complex

The ping-pong anti-pattern emerged where Lambda A calls Lambda B, which calls Lambda C, which calls Lambda D. This created hard-to-debug systems with unpredictable costs. AWS Step Functions and Azure Durable Functions addressed orchestration:

{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ValidateOrder",
      "Next": "CheckInventory"
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CheckInventory",
      "Next": "ChargeCustomer"
    },
    "ChargeCustomer": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ChargeCustomer",
      "Catch": [{
        "ErrorEquals": ["PaymentError"],
        "Next": "PaymentFailed"
      }],
      "Next": "ShipOrder"
    },
    "ShipOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ShipOrder",
      "End": true
    },
    "PaymentFailed": {
      "Type": "Fail",
      "Cause": "Payment processing failed"
    }
  }
}
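A state machine definition like the one above is just data, which is what makes the execution model easy to reason about: start at StartAt, run each Task, follow Next until a state has End or is a Fail, and follow Catch on matching errors. A toy interpreter (not the AWS runtime; local functions stand in for the Lambda invocations):

```python
def run_state_machine(machine, tasks):
    """Walk a Step Functions-style definition, calling a local function per Task."""
    state_name = machine["StartAt"]
    visited = []
    while True:
        state = machine["States"][state_name]
        visited.append(state_name)
        if state["Type"] == "Fail":
            return visited, "FAILED"
        try:
            tasks[state_name]()            # stand-in for invoking the Lambda
        except Exception as err:
            for catcher in state.get("Catch", []):
                if type(err).__name__ in catcher["ErrorEquals"]:
                    state_name = catcher["Next"]   # error routed like a transition
                    break
            else:
                raise                      # no catcher matched
            continue
        if state.get("End"):
            return visited, "SUCCEEDED"
        state_name = state["Next"]

machine = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {"Type": "Task", "Next": "ChargeCustomer"},
        "ChargeCustomer": {"Type": "Task",
                           "Catch": [{"ErrorEquals": ["PaymentError"],
                                      "Next": "PaymentFailed"}],
                           "Next": "ShipOrder"},
        "ShipOrder": {"Type": "Task", "End": True},
        "PaymentFailed": {"Type": "Fail"},
    },
}

class PaymentError(Exception): pass

ok = {"ValidateOrder": lambda: None, "ChargeCustomer": lambda: None,
      "ShipOrder": lambda: None}
print(run_state_machine(machine, ok))
# (['ValidateOrder', 'ChargeCustomer', 'ShipOrder'], 'SUCCEEDED')
```

The point is that orchestration lives in one declarative document instead of being scattered across functions calling functions.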

gRPC: Modern RPC

In early 2020s, I started using gRPC extensively. It combined the best ideas from decades of RPC evolution:

  • Protocol Buffers for IDL
  • HTTP/2 for transport (multiplexing, header compression, flow control)
  • Strong typing with code generation
  • Streaming support (unary, server streaming, client streaming, bidirectional)

Here’s a gRPC service definition:

syntax = "proto3";

package customer;

service CustomerService {
    rpc GetCustomer(GetCustomerRequest) returns (Customer);
    rpc UpdateCustomer(Customer) returns (UpdateResponse);
    rpc StreamOrders(StreamOrdersRequest) returns (stream Order);
    rpc BidirectionalChat(stream ChatMessage) returns (stream ChatMessage);
}

message Customer {
    int32 customer_id = 1;
    string name = 2;
    double balance = 3;
}

message GetCustomerRequest {
    int32 customer_id = 1;
}

message UpdateResponse {
    bool success = 1;
    string message = 2;
}

message StreamOrdersRequest {
    int32 customer_id = 1;
}

message Order {
    int32 order_id = 1;
    double amount = 2;
    string status = 3;
}

message ChatMessage {
    string user = 1;
    string message = 2;
    int64 timestamp = 3;
}

Go server implementation:

package main

import (
    "context"
    "fmt"
    "log"
    "net"
    "time"
    
    "google.golang.org/grpc"
    pb "example.com/customer"
)

type server struct {
    pb.UnimplementedCustomerServiceServer
}

func (s *server) GetCustomer(ctx context.Context, req *pb.GetCustomerRequest) (*pb.Customer, error) {
    return &pb.Customer{
        CustomerId: req.CustomerId,
        Name:       "John Doe",
        Balance:    5000.00,
    }, nil
}

func (s *server) UpdateCustomer(ctx context.Context, customer *pb.Customer) (*pb.UpdateResponse, error) {
    log.Printf("Updating customer %d", customer.CustomerId)
    
    return &pb.UpdateResponse{
        Success: true,
        Message: "Customer updated successfully",
    }, nil
}

func (s *server) StreamOrders(req *pb.StreamOrdersRequest, stream pb.CustomerService_StreamOrdersServer) error {
    orders := []*pb.Order{
        {OrderId: 1, Amount: 99.99, Status: "shipped"},
        {OrderId: 2, Amount: 149.50, Status: "processing"},
        {OrderId: 3, Amount: 75.25, Status: "delivered"},
    }
    
    for _, order := range orders {
        if err := stream.Send(order); err != nil {
            return err
        }
        time.Sleep(time.Second)  // Simulate delay
    }
    
    return nil
}

func (s *server) BidirectionalChat(stream pb.CustomerService_BidirectionalChatServer) error {
    for {
        msg, err := stream.Recv()
        if err != nil {
            return err
        }
        
        log.Printf("Received: %s from %s", msg.Message, msg.User)
        
        // Echo back with server prefix
        response := &pb.ChatMessage{
            User:      "Server",
            Message:   fmt.Sprintf("Echo: %s", msg.Message),
            Timestamp: time.Now().Unix(),
        }
        
        if err := stream.Send(response); err != nil {
            return err
        }
    }
}

func main() {
    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("Failed to listen: %v", err)
    }
    
    s := grpc.NewServer()
    pb.RegisterCustomerServiceServer(s, &server{})
    
    log.Println("Server listening on :50051")
    if err := s.Serve(lis); err != nil {
        log.Fatalf("Failed to serve: %v", err)
    }
}

Go client:

package main

import (
    "context"
    "io"
    "log"
    "time"
    
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    pb "example.com/customer"
)

func main() {
    conn, err := grpc.Dial("localhost:50051", 
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("Failed to connect: %v", err)
    }
    defer conn.Close()
    
    client := pb.NewCustomerServiceClient(conn)
    ctx := context.Background()
    
    // Unary call
    customer, err := client.GetCustomer(ctx, &pb.GetCustomerRequest{
        CustomerId: 12345,
    })
    if err != nil {
        log.Fatalf("GetCustomer failed: %v", err)
    }
    log.Printf("Customer: %v", customer)
    
    // Server streaming
    stream, err := client.StreamOrders(ctx, &pb.StreamOrdersRequest{
        CustomerId: 12345,
    })
    if err != nil {
        log.Fatalf("StreamOrders failed: %v", err)
    }
    
    for {
        order, err := stream.Recv()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatalf("Receive error: %v", err)
        }
        log.Printf("Order: %v", order)
    }
}

The Load Balancing Challenge

gRPC had one major gotcha in Kubernetes: connection persistence breaks load balancing. I documented this exhaustively in my blog post The Complete Guide to gRPC Load Balancing in Kubernetes and Istio. HTTP/2 multiplexes multiple requests over a single TCP connection. Once that connection is established to one pod, all requests go there. Kubernetes Service load balancing happens at L4 (TCP), so it sees only one long-lived connection rather than the individual gRPC calls inside it. I used Istio’s Envoy sidecar, which operates at L7 and routes each gRPC call independently:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-service
spec:
  host: grpc-service
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 100
        maxRequestsPerConnection: 10  # Force connection rotation
    loadBalancer:
      simple: LEAST_REQUEST  # Better than ROUND_ROBIN
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s

I learned that modern protocols solve old problems but introduce new ones. gRPC is excellent, but you must understand how it interacts with infrastructure. Production systems require deep integration between application protocol and deployment environment.

Modern Messaging and Streaming

I have been using Apache Kafka for many years; it transformed how we think about data. It’s not just a message queue, it’s a distributed commit log:

from kafka import KafkaProducer, KafkaConsumer
import json
import time

# Producer
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

order = {
    'order_id': '12345',
    'customer_id': '67890',
    'amount': 99.99,
    'timestamp': time.time()
}

producer.send('orders', value=order)
producer.flush()

# Consumer
consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    group_id='order-processors'
)

for message in consumer:
    order = message.value
    print(f"Processing order: {order['order_id']}")
    # Process order

Kafka provided:

  • Durability: Messages are persisted to disk
  • Replayability: Consumers can reprocess historical events
  • Partitioning: Horizontal scalability through partitions
  • Consumer groups: Multiple consumers can process in parallel

Key Lesson: Event-driven architectures enable loose coupling and temporal decoupling. Systems can be rebuilt from the event log. This is Event Sourcing—a powerful pattern that Kafka makes practical at scale.
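The “rebuilt from the event log” claim is worth making concrete. In Event Sourcing the current state is a pure fold over the ordered events, so any consumer can reconstruct it by replaying from offset zero. A minimal sketch (plain dicts standing in for Kafka records; event names are illustrative):

```python
def apply(balances, event):
    """Fold one event into the account-balance view."""
    acct = event["account"]
    if event["type"] == "deposited":
        balances[acct] = balances.get(acct, 0) + event["amount"]
    elif event["type"] == "withdrawn":
        balances[acct] = balances.get(acct, 0) - event["amount"]
    return balances

def replay(events):
    """Rebuild the view from the full log, exactly as a new consumer would."""
    state = {}
    for event in events:
        state = apply(state, event)
    return state

log = [
    {"type": "deposited", "account": "alice", "amount": 100},
    {"type": "deposited", "account": "bob", "amount": 50},
    {"type": "withdrawn", "account": "alice", "amount": 30},
]
print(replay(log))  # {'alice': 70, 'bob': 50}
```

Because the log is the source of truth, a bug in the view logic is fixed by correcting `apply` and replaying, not by patching stored state.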

Agentic RPC: MCP and Agent-to-Agent Protocol

Over the last year, I have been building Agentic AI applications using the Model Context Protocol (MCP) and, more recently, the Agent-to-Agent (A2A) protocol. Both use JSON-RPC 2.0 underneath. After decades of RPC evolution, from Sun RPC to CORBA to gRPC, we’ve come full circle to JSON-RPC for AI agents.

Service Discovery

A2A immediately reminded me of Sun’s Network Information Service (NIS), originally called Yellow Pages, which I used in the early 1990s. NIS provided a centralized directory service for Unix systems to look up user accounts, host names, and configuration data across a network. I have seen this pattern repeated throughout the decades:

  • CORBA Naming Service (1990s): Objects registered themselves with a hierarchical naming service, and clients discovered them by name
  • JINI (late 1990s): Services advertised themselves via multicast, and clients discovered them through lookup registrars (as I described earlier in the JINI section)
  • UDDI (2000s): Universal Description, Discovery, and Integration for web services—a registry where SOAP services could be published and discovered
  • Consul, Eureka, etcd (2010s): Modern service discovery for microservices
  • Kubernetes DNS/Service Discovery (2010s-present): Built-in service registry and DNS-based discovery

Model Context Protocol (MCP)

MCP lets AI agents discover and invoke tools provided by servers. I recently built a daily minutes assistant that aggregates information from multiple sources into a morning briefing. Here’s the MCP server that exposes tools to the AI agent:

from mcp.server import Server
import mcp.types as types
from typing import Any
import asyncio
import json

class DailyMinutesServer:
    def __init__(self):
        self.server = Server("daily-minutes")
        self.setup_handlers()
        
    def setup_handlers(self):
        @self.server.list_tools()
        async def handle_list_tools() -> list[types.Tool]:
            return [
                types.Tool(
                    name="get_emails",
                    description="Fetch recent emails from inbox",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "hours": {
                                "type": "number",
                                "description": "Hours to look back"
                            },
                            "limit": {
                                "type": "number", 
                                "description": "Max emails to fetch"
                            }
                        }
                    }
                ),
                types.Tool(
                    name="get_hackernews",
                    description="Fetch top Hacker News stories",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "limit": {
                                "type": "number",
                                "description": "Number of stories"
                            }
                        }
                    }
                ),
                types.Tool(
                    name="get_rss_feeds",
                    description="Fetch latest RSS feed items",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "feed_urls": {
                                "type": "array",
                                "items": {"type": "string"}
                            }
                        }
                    }
                ),
                types.Tool(
                    name="get_weather",
                    description="Get current weather forecast",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "location": {"type": "string"}
                        }
                    }
                )
            ]
        
        @self.server.call_tool()
        async def handle_call_tool(
            name: str, 
            arguments: dict[str, Any]
        ) -> list[types.TextContent]:
            if name == "get_emails":
                result = await email_connector.fetch_recent(
                    hours=arguments.get("hours", 24),
                    limit=arguments.get("limit", 10)
                )
            elif name == "get_hackernews":
                result = await hn_connector.fetch_top_stories(
                    limit=arguments.get("limit", 10)
                )
            elif name == "get_rss_feeds":
                result = await rss_connector.fetch_feeds(
                    feed_urls=arguments["feed_urls"]
                )
            elif name == "get_weather":
                result = await weather_connector.get_forecast(
                    location=arguments["location"]
                )
            else:
                raise ValueError(f"Unknown tool: {name}")
            
            return [types.TextContent(
                type="text",
                text=json.dumps(result, indent=2)
            )]

Each connector is a simple async module. Here’s the Hacker News connector:

import aiohttp
from typing import List, Dict

class HackerNewsConnector:
    BASE_URL = "https://hacker-news.firebaseio.com/v0"
    
    async def fetch_top_stories(self, limit: int = 10) -> List[Dict]:
        async with aiohttp.ClientSession() as session:
            # Get top story IDs
            async with session.get(f"{self.BASE_URL}/topstories.json") as resp:
                story_ids = await resp.json()
            
            # Fetch details for top N stories
            stories = []
            for story_id in story_ids[:limit]:
                async with session.get(
                    f"{self.BASE_URL}/item/{story_id}.json"
                ) as resp:
                    story = await resp.json()
                    stories.append({
                        "title": story.get("title"),
                        "url": story.get("url"),
                        "score": story.get("score"),
                        "by": story.get("by"),
                        "time": story.get("time")
                    })
            
            return stories

RSS and weather connectors follow the same pattern—simple, focused modules that the MCP server orchestrates.

JSON-RPC Under the Hood

The beauty of MCP is that it’s just JSON-RPC 2.0 over stdio or HTTP. Here’s what a tool call looks like on the wire:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "get_emails",
    "arguments": {
      "hours": 12,
      "limit": 5
    }
  }
}

Response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": [
      {
        "type": "text",
        "text": "[{\"from\": \"john@example.com\", \"subject\": \"Q4 Review\", ...}]"
      }
    ]
  }
}

After using Sun RPC, CORBA, SOAP, and gRPC, I appreciate MCP’s simplicity. It solves a specific problem: letting AI agents discover and invoke tools.
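Because the wire format is that small, a conforming dispatcher fits in a few lines. A sketch of the server side of the exchange above (the handler here is an illustrative stand-in for a real MCP tool implementation):

```python
import json

def handle_jsonrpc(raw, methods):
    """Dispatch one JSON-RPC 2.0 request to a handler and wrap the reply."""
    req = json.loads(raw)
    try:
        result = methods[req["method"]](req.get("params", {}))
        return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})
    except KeyError:
        # Standard JSON-RPC error object for an unknown method
        return json.dumps({"jsonrpc": "2.0", "id": req.get("id"),
                           "error": {"code": -32601, "message": "Method not found"}})

methods = {
    "tools/call": lambda p: {"content": [{"type": "text",
                                          "text": f"called {p['name']}"}]}
}

request = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/call",
                      "params": {"name": "get_emails",
                                 "arguments": {"hours": 12, "limit": 5}}})
print(handle_jsonrpc(request, methods))
```

Compare this with the pages of stubs and skeletons that CORBA or SOAP required for the same round trip.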

The Agent Workflow

My daily minutes agent follows this workflow:

  1. Agent calls get_emails to fetch recent messages
  2. Agent calls get_hackernews for tech news
  3. Agent calls get_rss_feeds for blog updates
  4. Agent calls get_weather for local forecast
  5. Agent synthesizes everything into a concise morning briefing

The AI decides which tools to call, in what order, based on the user’s preferences. I don’t hardcode the workflow.

Agent-to-Agent Protocol (A2A)

While MCP focuses on tool calling, A2A addresses agent-to-agent discovery and communication. It’s the modern equivalent of NIS/Yellow Pages for agents. Agents register their capabilities in a directory, and other agents discover and invoke them. A2A also uses JSON-RPC 2.0, but adds a discovery layer. Here’s how an agent registers itself:

from a2a import Agent, Capability

class ResearchAgent(Agent):
    def __init__(self):
        super().__init__(
            agent_id="research-agent-01",
            name="Research Agent",
            description="Performs web research and summarization"
        )
        
        # Register capabilities
        self.register_capability(
            Capability(
                name="web_search",
                description="Search the web for information",
                input_schema={
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"},
                        "max_results": {"type": "integer", "default": 10}
                    },
                    "required": ["query"]
                },
                output_schema={
                    "type": "object",
                    "properties": {
                        "results": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "title": {"type": "string"},
                                    "url": {"type": "string"},
                                    "snippet": {"type": "string"}
                                }
                            }
                        }
                    }
                }
            )
        )
    
    async def handle_request(self, capability: str, params: dict):
        if capability == "web_search":
            return await self.perform_web_search(
                query=params["query"],
                max_results=params.get("max_results", 10)
            )
    
    async def perform_web_search(self, query: str, max_results: int):
        # Actual search implementation
        results = await search_engine.search(query, limit=max_results)
        return {"results": results}

Another agent discovers and invokes the research agent:

class CoordinatorAgent(Agent):
    def __init__(self):
        super().__init__(
            agent_id="coordinator-01",
            name="Coordinator Agent"
        )
        self.directory = AgentDirectory()
    
    async def research_topic(self, topic: str):
        # Discover agents with web_search capability
        agents = await self.directory.find_agents_with_capability("web_search")
        
        if not agents:
            raise Exception("No research agents available")
        
        # Select an agent (load balancing, availability, etc.)
        research_agent = agents[0]
        
        # Invoke the capability via JSON-RPC
        result = await research_agent.invoke(
            capability="web_search",
            params={
                "query": topic,
                "max_results": 20
            }
        )
        
        return result

The JSON-RPC exchange looks like this:

Discovery request:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "directory.find_agents",
  "params": {
    "capability": "web_search",
    "filters": {
      "availability": "online"
    }
  }
}

Discovery response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "agents": [
      {
        "agent_id": "research-agent-01",
        "name": "Research Agent",
        "endpoint": "http://agent-service:8080/rpc",
        "capabilities": ["web_search"],
        "metadata": {
          "load": 0.3,
          "response_time_ms": 150
        }
      }
    ]
  }
}
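The metadata in the discovery response exists so the caller can do better than blindly taking agents[0], as the coordinator sketch above does. A small load-aware selection over that response (field names taken from the JSON above; the default load is an assumption for agents that report none):

```python
def pick_agent(discovery_result):
    """Choose the discovered agent with the lowest reported load."""
    agents = discovery_result["agents"]
    if not agents:
        raise LookupError("no agents advertise this capability")
    # Agents that report no load metric are treated as fully loaded (1.0)
    return min(agents, key=lambda a: a["metadata"].get("load", 1.0))

discovery_result = {
    "agents": [
        {"agent_id": "research-agent-01", "metadata": {"load": 0.3}},
        {"agent_id": "research-agent-02", "metadata": {"load": 0.1}},
    ]
}
print(pick_agent(discovery_result)["agent_id"])  # research-agent-02
```

The same hook is where availability filters, response-time weighting, or sticky routing would go.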

The Security Problem

Though I appreciate the simplicity of MCP and A2A, here is what worries me: both protocols largely ignore decades of hard-won lessons about security. The Salesloft breach showed exactly what happens: their AI chatbot stored authentication tokens for hundreds of services. MCP and A2A give us standard protocols for tool calling and agent coordination, which is valuable. But they create a false sense of security while ignoring fundamentals we solved decades ago:

  • Authentication: How do we verify an agent’s identity?
  • Authorization: What capabilities should this agent have access to?
  • Credential rotation: How do we handle token expiration and renewal?
  • Observability: How do we trace agent interactions for debugging and auditing?
  • Principle of least privilege: How do we ensure agents only access what they need?
  • Rate limiting: How do we prevent a misbehaving agent from overwhelming services?

The community needs to address this before A2A and MCP see widespread enterprise adoption.
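
To make the authentication, expiration, and least-privilege points concrete, here is a minimal sketch of capability-scoped, short-lived tokens for agent invocations. The names (`TokenIssuer`, `CapabilityToken`) are hypothetical, not part of MCP or A2A; a real deployment would use signed tokens (e.g., JWTs) rather than an in-memory table.

```python
import secrets
import time

class CapabilityToken:
    """A short-lived token scoped to exactly one capability (least privilege)."""
    def __init__(self, agent_id, capability, ttl_seconds=300):
        self.agent_id = agent_id
        self.capability = capability
        self.expires_at = time.time() + ttl_seconds
        self.value = secrets.token_urlsafe(32)

class TokenIssuer:
    """Issues and verifies capability tokens; tracking them enables revocation."""
    def __init__(self):
        self._issued = {}

    def issue(self, agent_id, capability, ttl_seconds=300):
        token = CapabilityToken(agent_id, capability, ttl_seconds)
        self._issued[token.value] = token
        return token

    def verify(self, token_value, capability):
        token = self._issued.get(token_value)
        if token is None:
            return False  # unknown or revoked token
        if time.time() >= token.expires_at:
            del self._issued[token_value]  # expired: force re-issuance
            return False
        return token.capability == capability  # scope check

    def revoke(self, token_value):
        self._issued.pop(token_value, None)
```

Because each token names a single capability and expires quickly, a leaked token (the Salesloft failure mode) grants far less than a long-lived, all-purpose credential would.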

Lessons Learned

1. Complexity is the Enemy

Every failed technology I’ve used died of complexity. CORBA, SOAP, EJB—they all collapsed under their own weight. Successful technologies like REST, gRPC, and Kafka focused on doing one thing well.

Implication: Be suspicious of solutions that try to solve every problem. Prefer composable, focused tools.

2. Network Calls Are Expensive

The first Fallacy of Distributed Computing still haunts us: the network is not reliable. Nor is it zero-latency, infinite-bandwidth, or secure. I’ve watched this lesson be relearned in every generation:

  • EJB entity beans made chatty network calls
  • Microservices make chatty REST calls
  • GraphQL resolvers make chatty database queries (the N+1 problem)

Implication: Design APIs to minimize round trips. Batch operations. Cache aggressively. Monitor network latency religiously. (See my blog on fault tolerance in microservices for details.)
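
A toy illustration of the round-trip difference (the `CountingStore` stand-in for a remote service is mine, purely for demonstration):

```python
class CountingStore:
    """Stand-in for a remote service; counts network calls instead of making them."""
    def __init__(self, data):
        self.data = data
        self.calls = 0

    def get_one(self, uid):
        self.calls += 1  # one round trip per lookup
        return self.data[uid]

    def get_many(self, uids):
        self.calls += 1  # one round trip for the whole batch
        return [self.data[u] for u in uids]

def fetch_users_chatty(store, user_ids):
    """N round trips: one per id (the EJB entity-bean mistake, relived)."""
    return [store.get_one(uid) for uid in user_ids]

def fetch_users_batched(store, user_ids):
    """One round trip: a single batched lookup."""
    return store.get_many(user_ids)
```

Same results, one network call instead of N; at 1 ms or more per hop, that difference dominates latency long before CPU does.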

3. Statelessness Scales

Stateless services scale horizontally. But real applications need state—session data, shopping carts, user preferences. The solution isn’t to make services stateful; it’s to externalize state:

  • Session stores (Redis, Memcached)
  • Databases (PostgreSQL, DynamoDB)
  • Event logs (Kafka)
  • Distributed caches

Implication: Keep service logic stateless. Push state to specialized systems designed for it.
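
A sketch of the pattern, with an in-memory `SessionStore` standing in for an external store like Redis (the class and handler names are illustrative):

```python
class SessionStore:
    """In-memory stand-in for an external session store such as Redis."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key, {})

    def put(self, key, value):
        self._data[key] = value

def add_to_cart(store, session_id, item):
    """Stateless handler: all state lives in the external store,
    so any replica can serve the user's next request."""
    cart = store.get(session_id)
    items = cart.get("items", [])
    items.append(item)
    store.put(session_id, {"items": items})
    return items
```

Because the handler holds nothing between calls, you can run a hundred replicas behind a load balancer and route each request anywhere.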

4. The Actor Model Is Underappreciated

My research with actors and the Linda memory model convinced me that the Actor model simplifies concurrent and distributed systems. Today’s serverless functions are essentially actors. Frameworks like Akka, Orleans, and Dapr embrace it. Actors eliminate shared mutable state, which is the source of most concurrency bugs.

Implication: For event-driven systems, consider Actor-based frameworks. They map naturally to distributed problems.
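
To make the model concrete, here is a minimal actor sketch using Python’s asyncio (the `CounterActor` class and its message names are mine, not from any particular framework): private state, a mailbox, and sequential message processing, so no locks are needed.

```python
import asyncio

class CounterActor:
    """Minimal actor: private state plus a mailbox. Only the actor's own
    loop touches self.count, so there is no shared mutable state."""
    def __init__(self):
        self.count = 0
        self.mailbox = asyncio.Queue()

    async def run(self):
        while True:
            msg, reply = await self.mailbox.get()
            if msg == "incr":
                self.count += 1
            elif msg == "get":
                reply.set_result(self.count)  # respond via the reply future
            elif msg == "stop":
                return

    async def send(self, msg):
        reply = asyncio.get_running_loop().create_future()
        await self.mailbox.put((msg, reply))
        return reply

async def demo():
    actor = CounterActor()
    task = asyncio.create_task(actor.run())
    for _ in range(3):
        await actor.send("incr")
    count = await (await actor.send("get"))  # messages process in order
    await actor.send("stop")
    await task
    return count
```

Replace the in-process queue with a network transport and you have, in essence, an Erlang process, an Orleans grain, or a Lambda function behind a message broker.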

5. Observability

Modern distributed systems require extensive instrumentation. You need:

  • Structured logging with correlation IDs
  • Metrics for performance and health
  • Distributed tracing to follow requests across services
  • Alarms with proper thresholds

Implication: Instrument your services from day one. Observability is infrastructure, not a nice-to-have. (See my blog posts on fault tolerance and load shedding for specific metrics.)
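
A minimal sketch of the first item, structured logging with correlation IDs (the `log_event` helper is illustrative; in production you would use your logging framework’s JSON formatter):

```python
import json
import uuid

def new_correlation_id():
    """Generate an id at the edge; propagate it through every downstream call."""
    return str(uuid.uuid4())

def log_event(service, event, correlation_id, **fields):
    """Emit one structured log line. Grepping for the correlation id later
    reconstructs a request's path across every service it touched."""
    record = {"service": service, "event": event,
              "correlation_id": correlation_id, **fields}
    print(json.dumps(record, sort_keys=True))
    return record
```

The payoff comes at 3 a.m.: one id, pasted into your log search, shows the whole story of a failed request.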

6. Throttling and Load Shedding

Every production system eventually faces traffic spikes or DDoS attacks. Without throttling and load shedding, your system will collapse. Key techniques:

  • Rate limiting by client/user/IP
  • Admission control based on queue depth
  • Circuit breakers to fail fast
  • Backpressure to slow down producers

Implication: Build throttling and load shedding into your architecture early. They’re harder to retrofit. (See my comprehensive blog post on this topic.)
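
As one example of the rate-limiting technique, here is a sketch of a token bucket (clock passed in explicitly to keep it testable; a real one would read `time.monotonic()`):

```python
class TokenBucket:
    """Token-bucket rate limiter: refills at `rate` tokens per second up to
    `capacity`; a request is admitted only if a whole token is available."""
    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now):
        # Refill based on elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request instead of queueing it
```

The capacity absorbs short bursts; the rate bounds sustained load. Apply one bucket per client/user/IP at the edge.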

7. Idempotency

Network failures mean requests may be retried. If your operations aren’t idempotent, you’ll process payments twice, create duplicate orders, and corrupt data (see my blog post on idempotency). Make operations idempotent:

  • Use idempotency keys
  • Check if operation already succeeded
  • Design APIs to be safely retryable

Implication: Every non-read operation should be idempotent. It saves you from a world of hurt.
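
The three bullets above combine into one pattern; here is a sketch using the payments example (the `PaymentProcessor` class is illustrative, and a real one would persist recorded results rather than hold them in memory):

```python
class PaymentProcessor:
    """Dedupe retried requests with an idempotency key: the first call
    executes the side effect; replays return the recorded result."""
    def __init__(self):
        self._results = {}  # idempotency_key -> recorded result
        self.charges = 0    # counts real side effects, for demonstration

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # safe replay, no re-charge
        self.charges += 1  # the real side effect happens exactly once
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result
```

The client generates the key once per logical operation and reuses it on every retry; the server checks the key before doing anything irreversible.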

8. External and Internal APIs Should Differ

I have learned that external APIs need good UX and developer empathy: they should be intuitive, consistent, and well-documented. Internal APIs can optimize for performance, reliability, and operational needs. Don’t expose your internal architecture to external consumers. Use API gateways to translate between external contracts and internal services.

Implication: Design external APIs for developers using them. Design internal APIs for operational excellence.

9. Standards Beat Proprietary Solutions

Novell IPX failed because it was proprietary. Sun RPC succeeded as an open standard. REST thrived because it built on HTTP. gRPC uses open standards (HTTP/2, Protocol Buffers).

Implication: Prefer open standards. If you must use proprietary tech, understand the exit strategy.

10. Developer Experience Matters

Technologies with great developer experience get adopted. Java succeeded because it was easier than C++. REST beat SOAP because it was simpler. Kubernetes won because it offered a powerful abstraction.

Implication: Invest in developer tools, documentation, and ergonomics. Friction kills momentum.

Upcoming Trends

WebAssembly: The Next Runtime

WebAssembly (Wasm) is emerging as a universal runtime. Code written in Rust, Go, C, or AssemblyScript compiles to Wasm and runs anywhere. Platforms like wasmCloud, Fermyon, and Lunatic are building Actor-based systems on Wasm. Combined with the Component Model and WASI (WebAssembly System Interface), Wasm offers near-native performance, strong sandboxing, and portability. It might replace Docker containers for some workloads. Solomon Hykes, creator of Docker, famously said:

“If WASM+WASI existed in 2008, we wouldn’t have needed to create Docker. That’s how important it is. WebAssembly on the server is the future of computing. A standardized system interface was the missing link. Let’s hope WASI is up to the task!” — Solomon Hykes, March 2019

WebAssembly isn’t ready yet. Critical gaps:

  • WASI maturity: Still evolving (Preview 2 in development)
  • Async I/O: Limited compared to native runtimes
  • Database drivers: Many don’t support WASM
  • Networking: WASI sockets still experimental
  • Ecosystem tooling: Debugging, profiling still primitive

Service Meshes

Istio, Linkerd, Dapr move cross-cutting concerns out of application code:

  • Authentication/authorization
  • Rate limiting
  • Circuit breaking
  • Retries with exponential backoff
  • Distributed tracing
  • Metrics collection

Tradeoff: Complexity shifts from application code to infrastructure. Teams need deep Kubernetes and service mesh expertise.

The Edge Is Growing

Edge computing brings computation closer to users. CDNs like Cloudflare Workers and Fastly Compute@Edge run code globally with single-digit millisecond latency. This requires new thinking like eventual consistency, CRDTs (Conflict-free Replicated Data Types), and geo-distributed state management.
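
As a taste of CRDTs, here is a sketch of the simplest one, a grow-only counter (the `GCounter` class is my illustration, not from any library): each replica increments only its own slot, and merging takes per-replica maxima, so merges commute and every replica converges to the same value.

```python
class GCounter:
    """Grow-only counter CRDT. Each replica owns one slot; merge takes the
    per-replica maximum, so merge order and repetition don't matter."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        # A replica only ever bumps its own slot; no coordination needed.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: commutative, associative, idempotent.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)
```

Two edge locations can count independently and gossip their states whenever connectivity allows, with no central coordinator.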

AI Agents and Multi-Agent Systems

I’m currently building agentic AI systems using LangGraph, RAG, and MCP. These are inherently distributed: agents communicate asynchronously, maintain local state, and coordinate through message passing. It’s the Actor model again.

What’s Missing

Despite all this progress, we still struggle with:

  • Distributed transactions: Two-phase commit doesn’t scale; SAGA patterns are complex
  • Testing distributed systems: Mocking services, simulating failures, and reproducing production bugs remain hard. I have written a number of tools for mock testing.
  • Observability at scale: Tracing millions of requests generates too much data
  • Cost management: Cloud bills spiral as systems grow
  • Cognitive load: Modern systems require expertise in dozens of technologies

Conclusion

I’ve been writing network code for decades and have used dozens of protocols, frameworks, and paradigms. Here is what I have learned:

  • Simplicity beats complexity (SOAP died, REST thrived)
  • Network calls aren’t free (EJB entity beans, chatty microservices)
  • State is hard; externalize it (Erlang, serverless functions)
  • Observability is essential (You can’t fix what you can’t see)
  • Developer experience matters (Java beat C++, REST beat SOAP)
  • Make it work, then make it fast
  • Design for failure from day one (circuit breakers, retries, timeouts, and graceful degradation built in from the start)

Other tips from the evolution of remote services include:

  • Design systems as message-passing actors from the start. Whether that’s Erlang processes, Akka actors, Orleans grains, or Lambda functions—embrace isolated state and message passing.
  • Invest in Observability with structured logging with correlation IDs, instrumented metrics, distributed tracing and alarms.
  • Separate External and Internal APIs. Use REST or GraphQL for external APIs (with versioning) and use gRPC or Thrift for internal communication (efficient).
  • Build Throttling and Load Shedding by rate limiting by client/user/IP at the edge and implement admission control at the service level (See my blog on Effective Load Shedding and Throttling).
  • Make Everything Idempotent as networks fail and requests get retried. Use idempotency keys for all mutations.
  • Choose Boring Technology (See Choose Boring Technology). For your core infrastructure, use proven tech (PostgreSQL, Redis, Kafka).
  • Test for Failure. Most code only handles the happy path. Production is all about unhappy paths.
  • Learn about the Fallacies of Distributed Computing and read A Note on Distributed Computing (1994).
  • Make chaos engineering part of CI/CD and use property-based testing (See my blog on property-based testing).

The technologies change (mainframes to serverless, Assembly to Go, CICS to Kubernetes), but the underlying principles remain constant. We oscillate between extremes:

  • Monoliths -> Microservices -> (now) Modular Monoliths
  • Strongly typed IDLs (CORBA) -> Untyped JSON -> Strongly typed again (gRPC)
  • Centralized -> Distributed -> Edge -> (soon) Peer-to-peer?
  • Synchronous RPC -> Asynchronous messaging -> Reactive streams

Each swing teaches us something. CORBA was too complex, but IDL-first design is valuable. REST was liberating, but binary protocols are more efficient. Microservices enable agility, but operational complexity explodes. The sweet spot is usually in the middle: modular monoliths with clear boundaries, REST for external APIs with gRPC for internal communication, some synchronous calls and some async messaging.

Here are a few trends that I see becoming prevalent:

  1. WebAssembly may replace containers for some workloads: Faster startup, better security with platforms like wasmCloud and Fermyon.
  2. Service meshes are becoming invisible: Currently they are too complex. Ambient mesh (no sidecars) and eBPF-based routing are gaining wider adoption.
  3. The Actor model will eat the world: Serverless functions are actors and durable functions are actor orchestration.
  4. Edge computing will force new patterns: We can’t rely on centralized state and may need CRDTs and eventual consistency.
  5. AI agents will need distributed coordination. Multi-agent systems = distributed systems and may need message passing between agents.

The best engineers don’t just learn the latest framework; they study the history, understand the trade-offs, and recognize when old ideas solve new problems. The future of distributed systems won’t be built by inventing entirely new paradigms; it’ll be built by taking the best ideas from the past, learning from the failures, and applying them with better tools.

