
December 3, 2025

Building Production-Grade AI Agents with MCP & A2A: A Complete Guide from the Trenches


Problem Statement

I’ve spent the last year building AI agents in enterprise environments. During this time, I’ve extensively applied emerging standards like Model Context Protocol (MCP) from Anthropic and the more recent Agent-to-Agent (A2A) Protocol for agent communication and coordination. What I’ve learned: there’s a massive gap between building a quick proof-of-concept with these protocols and deploying a production-grade system. The concerns that get overlooked in production deployments are exactly what will take you down at 3 AM:

  • Multi-tenant isolation with row-level security (because one leaked document = lawsuit)
  • JWT-based authentication across microservices (no shared sessions, fully stateless)
  • Real-time observability of agent actions (when agents misbehave, you need to know WHY)
  • Cost tracking and budgeting per user and model (because OpenAI bills compound FAST)
  • Hybrid search combining BM25 and vector embeddings (keyword matching + semantic understanding)
  • Graceful degradation when embeddings aren’t available (real data is messy)
  • Integration testing against real databases (mocks lie to you)

Disregarding these concerns leads to incidents like the Salesloft breach, in which its AI chatbot inadvertently stored authentication tokens for hundreds of services and exposed customer data across multiple platforms. More recently, in October 2025, Filevine (a billion-dollar legal AI platform) exposed 100,000+ confidential legal documents through an unauthenticated API endpoint that returned full admin tokens to their Box filesystem. No authentication required, just a simple API call. I’ve personally witnessed security incidents caused by inadequate AuthN/AuthZ controls, as well as cost overruns exceeding hundreds of thousands of dollars, all preventable with proper security and budget enforcement.

The good news is that MCP and A2A protocols provide the foundation to solve these problems. Most articles treat these as competing standards but they are complementary. In this guide, I’ll show you exactly how to combine MCP and A2A to build a system that handles real production concerns: multi-tenancy, authentication, cost control, and observability.

Reference Implementation

To demonstrate these concepts in action, I’ve built a reference implementation that showcases production-ready patterns.

Architecture Philosophy:

Three principles guided every decision:

  1. Go for servers, Python for workflows – Use the right tool for each job. Go handles high-throughput protocol servers. Python handles AI workflows.
  2. Database-level security – Multi-tenancy enforced via PostgreSQL row-level security (RLS), not application code. Impossible to bypass accidentally.
  3. Stateless everything – Every service can scale horizontally. No sticky sessions, no shared state, no single points of failure.

All containerized, fully tested, and ready for production deployment.

Tech Stack Summary:

  • Go 1.22 (protocol servers)
  • Python 3.11 (AI workflows)
  • PostgreSQL 16 + pgvector (vector search with RLS)
  • Ollama (local LLM)
  • Docker Compose (local development)
  • Kubernetes manifests (production deployment)

GitHub: Complete implementation available

But before we dive into the implementation, let’s understand the fundamental problem these protocols solve and why you need both.


Part 1: Understanding MCP and A2A

The Core Problem: Integration Chaos

Before MCP arrived in 2024, you had to build custom integrations with every LLM provider, data source, and AI framework. Every AI application had to reinvent authentication, data access, and orchestration, which doesn’t scale. MCP and A2A emerged to solve different aspects of this chaos:

The MCP Side: Standardized Tool Execution

Think of MCP as a standardized toolbox for AI models. Instead of every AI application writing custom integrations for databases, APIs, and file systems, MCP provides a JSON-RPC 2.0 protocol that models use to:

  • Call tools (search documents, retrieve data, update records)
  • Access resources (files, databases, APIs)
  • Send prompts (inject context into model calls)

From the MCP vs A2A comparison:

“MCP excels at synchronous, stateless tool execution. It’s perfect when you need an AI model to retrieve information, execute a function, and return results immediately.”

Here’s what MCP looks like in practice:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "hybrid_search",
    "arguments": {
      "query": "machine learning best practices",
      "limit": 5,
      "bm25_weight": 0.5,
      "vector_weight": 0.5
    }
  }
}

The server executes the tool and returns results. Simple, stateless, fast.

Why JSON-RPC 2.0? Because it’s:

  • Language-agnostic – Works with any language that speaks HTTP
  • Batch-capable – Multiple requests in one HTTP call
  • Error-standardized – Consistent error codes across implementations
  • Widely adopted – 20+ years of production battle-testing
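For readers who want to see those properties concretely, here is a small Python sketch that issues a single tools/call request and a two-request batch against an MCP endpoint. The /rpc path, port, and placeholder bearer token are assumptions for illustration; whether a given MCP server accepts JSON-RPC batches depends on its implementation.

import requests

MCP_URL = "http://localhost:8080/rpc"        # assumed endpoint path on the Go MCP server
HEADERS = {"Authorization": "Bearer <jwt>"}  # placeholder token

# Single JSON-RPC 2.0 call, mirroring the request shown earlier
single = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "hybrid_search",
        "arguments": {"query": "machine learning best practices", "limit": 5},
    },
}
print(requests.post(MCP_URL, json=single, headers=HEADERS).json())

# Batch: the JSON-RPC 2.0 spec allows an array of requests in one HTTP round trip
batch = [
    {"jsonrpc": "2.0", "id": 2, "method": "tools/list"},
    {"jsonrpc": "2.0", "id": 3, "method": "tools/call",
     "params": {"name": "hybrid_search",
                "arguments": {"query": "GDPR Article 33", "limit": 3}}},
]
print(requests.post(MCP_URL, json=batch, headers=HEADERS).json())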

The A2A Side: Stateful Workflow Orchestration

A2A handles what MCP doesn’t: multi-step, stateful workflows where agents collaborate. From the A2A Protocol docs:

“A2A is designed for asynchronous, stateful orchestration of complex tasks that require multiple steps, agent coordination, and long-running processes.”

A2A provides:

  • Task creation and management with persistent state
  • Real-time streaming of progress updates (Server-Sent Events)
  • Agent coordination across multiple services
  • Artifact management for intermediate results
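To make that list concrete, here is a hypothetical client-side sketch in Python: it creates a task and then follows its event stream. The POST /tasks route, payload shape, and response field are assumptions for illustration; only the /tasks/{id}/events SSE endpoint appears in the reference code later in this article.

import json
import requests

A2A_BASE_URL = "http://localhost:8082"  # A2A server port from the reference setup

# Hypothetical task-creation request; the real A2A schema may differ
resp = requests.post(f"{A2A_BASE_URL}/tasks", json={
    "type": "research_workflow",
    "input": {"query": "Summarize ACME Corp's regulatory exposure"},
})
task_id = resp.json()["id"]  # assumed response field

# Follow progress updates over Server-Sent Events
with requests.get(f"{A2A_BASE_URL}/tasks/{task_id}/events", stream=True) as events:
    for line in events.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event.get("status"), event.get("message"))
            if event.get("status") in ("completed", "failed"):
                break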

Why Both Protocols Matter

Here’s a real scenario from my fintech work that illustrates why you need both:

Use Case: Compliance analyst needs to research a company across 10,000 documents, verify regulatory compliance, cross-reference with SEC filings, and generate an audit-ready report.

With MCP alone:

  • ❌ No way to track multi-step progress
  • ❌ Can’t coordinate multiple tools
  • ❌ No intermediate result storage
  • ❌ Client must orchestrate everything

With A2A alone:

  • ❌ Every tool is custom-integrated
  • ❌ No standardized data access
  • ❌ Reinventing authentication per tool
  • ❌ Coupling agent logic to data sources

With MCP + A2A:

  • ✅ A2A orchestrates the multi-step workflow
  • ✅ MCP provides standardized tool execution
  • ✅ Real-time progress via SSE
  • ✅ Stateful coordination with stateless tools
  • ✅ Authentication handled once (JWT in MCP)
  • ✅ Intermediate results stored as artifacts

As noted in OneReach’s guide:

“Use MCP when you need fast, stateless tool execution. Use A2A when you need complex, stateful orchestration. Use both when building production systems.”


Part 2: Architecture

System Overview

Key Design Decisions

Protocol Servers (Go):

  • MCP Server – Secure document retrieval with pgvector and hybrid search. Go’s concurrency model handles 5,000+ req/sec, and its type safety catches integration bugs at compile time (not at runtime).
  • A2A Server – Multi-step workflow orchestration with Server-Sent Events for real-time progress tracking. Stateless design enables horizontal scaling.

AI Workflows (Python):

  • LangGraph Workflows – RAG, research, and hybrid pipelines. Python was the right choice here because the AI ecosystem (LangChain, embeddings, model integrations) lives in Python.

User Interface & Database:

  • Streamlit UI – Production-ready authentication, search interface, cost tracking dashboard, and real-time task streaming
  • PostgreSQL with pgvector – Multi-tenant document storage with row-level security policies enforced at the database level (not application level)
  • Ollama – Local LLM inference for development and testing (no OpenAI API keys required)

Database Security:

Application-level tenant filtering is not enough, so multi-tenancy is enforced with row-level security policies in the database itself:

// ❌ BAD: Application-level filtering (can be bypassed)
func GetDocuments(tenantID string) ([]Document, error) {
    query := "SELECT * FROM documents WHERE tenant_id = ?"
    // What if someone forgets the WHERE clause?
    // What if there's a SQL injection?
    // What if a bug skips this check?
}

-- ✅ GOOD: Database-level Row-Level Security (impossible to bypass)
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON documents
    USING (tenant_id = current_setting('app.current_tenant_id')::uuid);

Every query automatically filters by tenant so there is no way to accidentally leak data. Even if your application has a bug, the database enforces isolation.
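A quick way to convince yourself of that is to exercise the policy from an application session. A minimal sketch using psycopg (the connection string is an assumption, and the application role must actually be subject to the policy, since table owners and superusers bypass RLS unless FORCE ROW LEVEL SECURITY is set):

import psycopg  # psycopg 3

with psycopg.connect("postgresql://mcp_app:secret@localhost:5432/mcp_db") as conn:

    def docs_for(tenant_id: str):
        with conn.transaction():
            # set_config(..., true) is equivalent to SET LOCAL and accepts bind parameters
            conn.execute("SELECT set_config('app.current_tenant_id', %s, true)", (tenant_id,))
            return conn.execute("SELECT id, title FROM documents").fetchall()

    # The same SELECT returns a different, disjoint set of rows per tenant
    print(docs_for("11111111-1111-1111-1111-111111111111"))
    print(docs_for("22222222-2222-2222-2222-222222222222"))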

JWT Authentication

The UI signs tokens with an RSA private key, and the MCP server verifies them with the matching public key. This design provides:

  • Asymmetric: MCP server only needs public key (can’t forge tokens)
  • Rotation: Rotate private key without redeploying services
  • Auditability: Know which key signed which token
  • Standard: Widely supported, well-understood

// mcp-server/internal/auth/jwt.go
func (v *JWTValidator) ValidateToken(tokenString string) (*Claims, error) {
    token, err := jwt.ParseWithClaims(tokenString, &Claims{}, func(token *jwt.Token) (interface{}, error) {
        if _, ok := token.Method.(*jwt.SigningMethodRSA); !ok {
            return nil, fmt.Errorf("unexpected signing method: %v", token.Header["alg"])
        }
        return v.publicKey, nil
    })

    if err != nil {
        return nil, fmt.Errorf("failed to parse token: %w", err)
    }

    claims, ok := token.Claims.(*Claims)
    if !ok || !token.Valid {
        return nil, fmt.Errorf("invalid token claims")
    }

    return claims, nil
}

Tokens are validated on every request—no session state, fully stateless.

Hybrid Search

In some of my past RAG implementations I relied on vector search alone, which is not enough for production RAG.

Why hybrid search matters:

Scenario | BM25 (Keyword) | Vector (Semantic) | Hybrid
Exact term: “GDPR Article 17” | ✅ Perfect | ❌ Misses | ✅ Perfect
Concept: “right to be forgotten” | ❌ Misses | ✅ Good | ✅ Perfect
Legal citation: “Smith v. Jones 2024” | ✅ Perfect | ❌ Poor | ✅ Perfect
Misspelling: “machien learning” | ❌ Misses | ✅ Finds | ✅ Finds

Real-world example from my fintech work:

Query: "SEC disclosure requirements GDPR data breach"

Vector-only results:
1. "Privacy Policy" (0.87 similarity)
2. "Data Protection Guide" (0.84 similarity)  
3. "General Security Practices" (0.81 similarity)
❌ Missed: Actual SEC regulation text

Hybrid results (0.5 BM25 + 0.5 Vector):
1. "SEC Rule 10b-5 Disclosure Requirements" (0.92 combined)
2. "GDPR Article 33 Breach Notification" (0.89 combined)
3. "Cross-Border Regulatory Compliance" (0.85 combined)
✅ Found: Exactly what we needed

The reference implementation (hybrid_search.go) uses PostgreSQL’s full-text search (BM25-like) combined with pgvector:

// Hybrid search query using Reciprocal Rank Fusion
query := `
    WITH bm25_results AS (
        SELECT
            id,
            ts_rank_cd(
                to_tsvector('english', title || ' ' || content),
                plainto_tsquery('english', $1)
            ) AS bm25_score,
            ROW_NUMBER() OVER (ORDER BY ts_rank_cd(...) DESC) AS bm25_rank
        FROM documents
        WHERE to_tsvector('english', title || ' ' || content) @@ plainto_tsquery('english', $1)
    ),
    vector_results AS (
        SELECT
            id,
            1 - (embedding <=> $2) AS vector_score,
            ROW_NUMBER() OVER (ORDER BY embedding <=> $2) AS vector_rank
        FROM documents
        WHERE embedding IS NOT NULL
    ),
    combined AS (
        SELECT
            COALESCE(b.id, v.id) AS id,
            -- Reciprocal Rank Fusion score
            (
                COALESCE(1.0 / (60 + b.bm25_rank), 0) * $3 +
                COALESCE(1.0 / (60 + v.vector_rank), 0) * $4
            ) AS combined_score
        FROM bm25_results b
        FULL OUTER JOIN vector_results v ON b.id = v.id
    )
    SELECT * FROM combined
    ORDER BY combined_score DESC
    LIMIT $5
`

Why Reciprocal Rank Fusion (RRF)? Because:

  • Score normalization: BM25 scores and vector similarities aren’t comparable
  • Rank-based: Uses position, not raw scores
  • Research-backed: Used by search engines (Elasticsearch, Vespa)
  • Tunable: Adjust k parameter (60 in our case) for different behaviors
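If the SQL above feels dense, the fusion step itself is tiny. A standalone Python sketch of RRF with k = 60 and the two weights, using made-up ranked lists:

def rrf_fuse(bm25_ranked, vector_ranked, bm25_weight=0.5, vector_weight=0.5, k=60):
    """Reciprocal Rank Fusion: score(d) = w_bm25/(k + rank_bm25) + w_vec/(k + rank_vec)."""
    scores = {}
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + bm25_weight / (k + rank)
    for rank, doc_id in enumerate(vector_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + vector_weight / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: one doc ranks first on keywords, another first on semantics
print(rrf_fuse(["sec-rule-10b5", "gdpr-33", "privacy-policy"],
               ["gdpr-33", "privacy-policy", "sec-rule-10b5"]))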

Part 3: The MCP Server – Secure Document Retrieval

Understanding JSON-RPC 2.0

Before we dive into implementation, let’s understand why MCP chose JSON-RPC 2.0.

JSON-RPC 2.0 Request Structure:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "hybrid_search",
    "arguments": {"query": "machine learning", "limit": 10}
  }
}

JSON-RPC 2.0 Response Structure:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": [{
      "type": "text",
      "text": "[{\"doc_id\": \"123\", \"title\": \"ML Guide\", ...}]"
    }],
    "isError": false
  }
}

Error Response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "error": {
    "code": -32602,
    "message": "Invalid params",
    "data": {"field": "query", "reason": "required"}
  }
}

Standard Error Codes:

  • -32700: Parse error (invalid JSON)
  • -32600: Invalid request (missing required fields)
  • -32601: Method not found
  • -32602: Invalid params
  • -32603: Internal error

Custom MCP Error Codes:

  • -32001: Authentication required
  • -32002: Authorization failed
  • -32003: Rate limit exceeded
  • -32004: Resource not found
  • -32005: Validation error
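On the client side it helps to turn these codes into typed exceptions so callers can react differently to an expired token than to a rate limit. A small sketch (the exception names are mine, not part of the MCP spec):

class MCPError(Exception): ...
class AuthenticationRequired(MCPError): ...
class AuthorizationFailed(MCPError): ...
class RateLimited(MCPError): ...

# Map MCP/JSON-RPC error codes to exception types (names are illustrative)
ERROR_MAP = {
    -32001: AuthenticationRequired,
    -32002: AuthorizationFailed,
    -32003: RateLimited,
}

def raise_for_rpc_error(response: dict):
    """Return the JSON-RPC result, or raise a typed exception if an error came back."""
    err = response.get("error")
    if err:
        exc_type = ERROR_MAP.get(err["code"], MCPError)
        raise exc_type(f"{err['code']}: {err.get('message')}")
    return response["result"]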

MCP Tool Implementation

MCP tools follow a standard interface:

// mcp-server/internal/tools/tool.go
type Tool interface {
    Definition() protocol.ToolDefinition
    Execute(ctx context.Context, args map[string]interface{}) (protocol.ToolCallResult, error)
}

Here’s the complete hybrid search tool (hybrid_search.go) implementation with detailed comments:

// mcp-server/internal/tools/hybrid_search.go
type HybridSearchTool struct {
    db database.Store
}

func (t *HybridSearchTool) Execute(ctx context.Context, args map[string]interface{}) (protocol.ToolCallResult, error) {
    // 1. AUTHENTICATION: Extract tenant from JWT claims
    //    This happens at middleware level, but we verify here
    tenantID, ok := ctx.Value(auth.ContextKeyTenantID).(string)
    if !ok {
        return protocol.ToolCallResult{IsError: true}, fmt.Errorf("tenant ID not found in context")
    }

    // 2. PARAMETER PARSING: Extract and validate arguments
    query, _ := args["query"].(string)
    if query == "" {
        return protocol.ToolCallResult{IsError: true}, fmt.Errorf("query is required")
    }
    
    limit, _ := args["limit"].(float64)
    if limit <= 0 {
        limit = 10 // default
    }
    if limit > 50 {
        limit = 50 // max cap
    }
    
    bm25Weight, _ := args["bm25_weight"].(float64)
    vectorWeight, _ := args["vector_weight"].(float64)
    
    // 3. WEIGHT NORMALIZATION: Ensure weights sum to 1.0
    if bm25Weight == 0 && vectorWeight == 0 {
        bm25Weight = 0.5
        vectorWeight = 0.5
    }

    // 4. EMBEDDING GENERATION: Using Ollama for query embedding
    var embedding []float32
    if vectorWeight > 0 {
        embedding = generateEmbedding(query) // Calls Ollama API
    }

    // 5. DATABASE QUERY: Execute hybrid search with RLS
    params := database.HybridSearchParams{
        Query:        query,
        Embedding:    embedding,
        Limit:        int(limit),
        BM25Weight:   bm25Weight,
        VectorWeight: vectorWeight,
    }

    results, err := t.db.HybridSearch(ctx, tenantID, params)
    if err != nil {
        return protocol.ToolCallResult{IsError: true}, err
    }

    // 6. RESPONSE FORMATTING: Convert to JSON for client
    jsonData, _ := json.Marshal(results)
    return protocol.ToolCallResult{
        Content: []protocol.ContentBlock{{Type: "text", Text: string(jsonData)}},
        IsError: false,
    }, nil
}

The NULL Embedding Problem

Real-world data is messy. Not every document has an embedding. Here’s what happened:

Initial Implementation (Broken):

// ❌ This crashes with NULL embeddings
var embedding pgvector.Vector

err = tx.QueryRow(ctx, query, docID).Scan(
    &doc.ID,
    &doc.TenantID,
    &doc.Title,
    &doc.Content,
    &doc.Metadata,
    &embedding, // CRASH: can't scan <nil> into pgvector.Vector
    &doc.CreatedAt,
    &doc.UpdatedAt,
)

Error:

can't scan into dest[5]: unsupported data type: <nil>

The Fix (Correct):

// ✅ Use pointer types for nullable fields
var embedding *pgvector.Vector // Pointer allows NULL

err = tx.QueryRow(ctx, query, docID).Scan(
    &doc.ID,
    &doc.TenantID,
    &doc.Title,
    &doc.Content,
    &doc.Metadata,
    &embedding, // Can be NULL now
    &doc.CreatedAt,
    &doc.UpdatedAt,
)

// Handle NULL embeddings gracefully
if embedding != nil && embedding.Slice() != nil {
    doc.Embedding = embedding.Slice()
} else {
    doc.Embedding = nil // Explicitly set to nil
}

return doc, nil

Hybrid search handles this elegantly—documents without embeddings get vector_score = 0 but still appear in results if they match BM25:

-- Hybrid search handles NULL embeddings gracefully
WITH bm25_results AS (
    SELECT id, ts_rank(to_tsvector('english', content), query) AS bm25_score
    FROM documents
    WHERE to_tsvector('english', content) @@ query
),
vector_results AS (
    SELECT id, 1 - (embedding <=> $1) AS vector_score
    FROM documents
    WHERE embedding IS NOT NULL  -- ✅ Skip NULL embeddings
)
SELECT
    d.*,
    COALESCE(b.bm25_score, 0) AS bm25_score,
    COALESCE(v.vector_score, 0) AS vector_score,
    ($2 * COALESCE(b.bm25_score, 0) + $3 * COALESCE(v.vector_score, 0)) AS combined_score
FROM documents d
LEFT JOIN bm25_results b ON d.id = b.id
LEFT JOIN vector_results v ON d.id = v.id
WHERE COALESCE(b.bm25_score, 0) > 0 OR COALESCE(v.vector_score, 0) > 0
ORDER BY combined_score DESC
LIMIT $4;

Why this matters:

  • ✅ Documents without embeddings still searchable (BM25)
  • ✅ New documents usable immediately (embeddings generated async)
  • ✅ System degrades gracefully (not all-or-nothing)
  • ✅ Zero downtime for embedding model updates
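“Embeddings generated async” usually boils down to a small backfill job that picks up rows where embedding IS NULL. A hedged sketch, assuming Ollama’s /api/embeddings endpoint (request and response shapes vary across Ollama versions, so verify against yours) and psycopg for the update; tenant context and RLS setup are omitted for brevity:

import psycopg
import requests

OLLAMA_URL = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # Model name is an assumption; its dimension must match the vector column
    r = requests.post(f"{OLLAMA_URL}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def backfill_embeddings(conn: psycopg.Connection, batch_size: int = 100) -> int:
    rows = conn.execute(
        "SELECT id, title || ' ' || content FROM documents "
        "WHERE embedding IS NULL LIMIT %s", (batch_size,)).fetchall()
    for doc_id, text in rows:
        vec = embed(text)
        literal = "[" + ",".join(str(x) for x in vec) + "]"  # pgvector text format
        conn.execute("UPDATE documents SET embedding = %s::vector WHERE id = %s",
                     (literal, doc_id))
    conn.commit()
    return len(rows)  # run repeatedly (e.g. from a cron job) until it returns 0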

Tenant Isolation in Action

Every MCP request sets the tenant context at the database transaction level:

// mcp-server/internal/database/postgres.go
func (db *DB) SetTenantContext(ctx context.Context, tx pgx.Tx, tenantID string) error {
    // Note: SET commands don't support parameter binding
    // TenantID is validated as UUID by JWT validator, so this is safe
    query := fmt.Sprintf("SET LOCAL app.current_tenant_id = '%s'", tenantID)
    _, err := tx.Exec(ctx, query)
    return err
}

Combined with RLS policies, this ensures complete tenant isolation at the database level.

Real-world security test:

// Integration test: Verify tenant isolation
func TestTenantIsolation(t *testing.T) {
    // Create documents for two tenants
    tenant1Doc := createDocument(t, db, "tenant-1", "Secret Data A")
    tenant2Doc := createDocument(t, db, "tenant-2", "Secret Data B")
    
    // Query as tenant-1
    ctx1 := contextWithTenant(ctx, "tenant-1")
    results1, _ := db.ListDocuments(ctx1, "tenant-1", ListParams{Limit: 100})
    
    // Query as tenant-2
    ctx2 := contextWithTenant(ctx, "tenant-2")
    results2, _ := db.ListDocuments(ctx2, "tenant-2", ListParams{Limit: 100})
    
    // Assertions
    assert.Contains(t, results1, tenant1Doc)
    assert.NotContains(t, results1, tenant2Doc) // ✅ Cannot see other tenant
    
    assert.Contains(t, results2, tenant2Doc)
    assert.NotContains(t, results2, tenant1Doc) // ✅ Cannot see other tenant
}

Part 4: The A2A Server – Workflow Orchestration

Task Lifecycle

A2A manages stateful tasks through their entire lifecycle:

Server-Sent Events for Real-Time Updates

Why SSE instead of WebSockets?

Feature | SSE | WebSocket
Unidirectional | ✅ Yes (server→client) | ❌ No (bidirectional)
HTTP/2 multiplexing | ✅ Yes | ❌ No
Automatic reconnection | ✅ Built-in | ❌ Manual
Firewall-friendly | ✅ Yes (HTTP) | ⚠️ Sometimes blocked
Complexity | ✅ Simple | ❌ Complex
Browser support | ✅ All modern | ✅ All modern

SSE is perfect for agent progress updates because:

  • One-way communication (server pushes updates)
  • Simple implementation
  • Automatic reconnection
  • Works through corporate firewalls

SSE provides real-time streaming without WebSocket complexity:

// a2a-server/internal/handlers/tasks.go
func (h *TaskHandler) StreamEvents(w http.ResponseWriter, r *http.Request) {
    taskID := chi.URLParam(r, "taskId")

    // Set SSE headers
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")
    w.Header().Set("Connection", "keep-alive")
    w.Header().Set("Access-Control-Allow-Origin", "*")

    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "Streaming not supported", http.StatusInternalServerError)
        return
    }

    // Stream task events
    for {
        event := h.taskManager.GetNextEvent(taskID)
        if event == nil {
            break // Task complete
        }

        // Format as SSE event
        data, _ := json.Marshal(event)
        fmt.Fprintf(w, "event: task_update\n")
        fmt.Fprintf(w, "data: %s\n\n", data)
        flusher.Flush()

        if event.Status == "completed" || event.Status == "failed" {
            break
        }
    }
}

Client-side consumption is trivial:

# streamlit-ui/pages/3_?_A2A_Tasks.py
def stream_task_events(task_id: str):
    url = f"{A2A_BASE_URL}/tasks/{task_id}/events"

    with requests.get(url, stream=True) as response:
        for line in response.iter_lines():
            if line.startswith(b'data:'):
                data = json.loads(line[5:])
                st.write(f"Update: {data['message']}")
                yield data

LangGraph Workflow Integration

LangGraph workflows call MCP tools through the A2A server:

# orchestration/workflows/rag_workflow.py
class RAGWorkflow:
    def __init__(self, mcp_url: str):
        self.mcp_client = MCPClient(mcp_url)
        self.workflow = self.build_workflow()

    def build_workflow(self) -> StateGraph:
        workflow = StateGraph(RAGState)

        # Define workflow steps
        workflow.add_node("search", self.search_documents)
        workflow.add_node("rank", self.rank_results)
        workflow.add_node("generate", self.generate_answer)
        workflow.add_node("verify", self.verify_sources)

        # Define edges (workflow graph)
        workflow.add_edge(START, "search")
        workflow.add_edge("search", "rank")
        workflow.add_edge("rank", "generate")
        workflow.add_edge("generate", "verify")
        workflow.add_edge("verify", END)

        return workflow.compile()

    def search_documents(self, state: RAGState) -> RAGState:
        """Search for relevant documents using MCP hybrid search"""
        # This is where MCP and A2A integrate!
        results = self.mcp_client.hybrid_search(
            query=state["query"],
            limit=10,
            bm25_weight=0.5,
            vector_weight=0.5
        )

        state["documents"] = results
        state["progress"] = f"Found {len(results)} documents"
        
        # Emit progress event via A2A
        emit_progress_event(state["task_id"], "search_complete", state["progress"])
        
        return state

    def rank_results(self, state: RAGState) -> RAGState:
        """Rank results by combined score"""
        docs = sorted(
            state["documents"],
            key=lambda x: x["score"],
            reverse=True
        )[:5]

        state["ranked_docs"] = docs
        state["progress"] = "Ranked top 5 documents"
        
        emit_progress_event(state["task_id"], "ranking_complete", state["progress"])
        
        return state

    def generate_answer(self, state: RAGState) -> RAGState:
        """Generate answer using retrieved context"""
        context = "\n\n".join([
            f"Document: {doc['title']}\n{doc['content']}"
            for doc in state["ranked_docs"]
        ])

        prompt = f"""Based on the following documents, answer the question.

Context:
{context}

Question: {state['query']}

Answer:"""

        # Call Ollama for local inference
        response = ollama.generate(
            model="llama3.2",
            prompt=prompt
        )

        state["answer"] = response["response"]
        state["progress"] = "Generated final answer"
        
        emit_progress_event(state["task_id"], "generation_complete", state["progress"])
        
        return state
        
    def verify_sources(self, state: RAGState) -> RAGState:
        """Verify sources are accurately cited"""
        # Check each cited document exists in ranked_docs
        cited_docs = extract_citations(state["answer"])
        verified = all(doc in state["ranked_docs"] for doc in cited_docs)
        
        state["verified"] = verified
        state["progress"] = "Verified sources" if verified else "Source verification failed"
        
        emit_progress_event(state["task_id"], "verification_complete", state["progress"])
        
        return state

The workflow executes as a multi-step pipeline, with each step:

  1. Calling MCP tools for data access
  2. Updating state
  3. Emitting progress events via A2A
  4. Handling errors gracefully
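One piece not shown above is the MCPClient helper that the workflow constructs; conceptually it is just a thin JSON-RPC 2.0 wrapper around the MCP server. A minimal sketch under that assumption (the /rpc path, timeout, and error handling are illustrative, not the repository’s actual class):

import itertools
import json
import requests

class MCPClient:
    """Thin JSON-RPC 2.0 client for the MCP server (sketch, not the repo's implementation)."""

    def __init__(self, base_url: str, token: str | None = None):
        self.url = f"{base_url}/rpc"  # assumed endpoint path
        self.session = requests.Session()
        if token:
            self.session.headers["Authorization"] = f"Bearer {token}"
        self._ids = itertools.count(1)

    def call_tool(self, name: str, **arguments):
        payload = {"jsonrpc": "2.0", "id": next(self._ids),
                   "method": "tools/call",
                   "params": {"name": name, "arguments": arguments}}
        body = self.session.post(self.url, json=payload, timeout=30).json()
        if "error" in body:
            raise RuntimeError(body["error"])
        # Tool results arrive as content blocks; the text block carries JSON
        return json.loads(body["result"]["content"][0]["text"])

    def hybrid_search(self, query: str, limit: int = 10,
                      bm25_weight: float = 0.5, vector_weight: float = 0.5):
        return self.call_tool("hybrid_search", query=query, limit=limit,
                              bm25_weight=bm25_weight, vector_weight=vector_weight)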

Part 5: Production-Grade Features

1. Authentication & Security

JWT Token Generation (Streamlit UI):

# streamlit-ui/pages/1_?_Authentication.py
def generate_jwt_token(tenant_id: str, user_id: str, ttl: int = 3600) -> str:
    """Generate RS256 JWT token with proper claims"""
    now = datetime.now(timezone.utc)

    payload = {
        "tenant_id": tenant_id,
        "user_id": user_id,
        "iat": now,              # Issued at
        "exp": now + timedelta(seconds=ttl),  # Expiration
        "nbf": now,              # Not before
        "jti": str(uuid.uuid4()), # JWT ID (for revocation)
        "iss": "mcp-demo-ui",    # Issuer
        "aud": "mcp-server"      # Audience
    }

    # Sign with RSA private key
    with open("/app/certs/private_key.pem", "rb") as f:
        private_key = serialization.load_pem_private_key(
            f.read(),
            password=None
        )

    token = jwt.encode(payload, private_key, algorithm="RS256")
    return token
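The key pair referenced above has to be generated once and mounted into both containers. A one-time sketch using the cryptography package (output paths are assumptions and should match your volume mounts):

from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

# Generate a 2048-bit RSA key pair for RS256 signing (one-time setup)
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

with open("certs/private_key.pem", "wb") as f:  # kept only by the token issuer (UI)
    f.write(private_key.private_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.NoEncryption(),
    ))

with open("certs/public_key.pem", "wb") as f:   # distributed to the MCP server
    f.write(private_key.public_key().public_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PublicFormat.SubjectPublicKeyInfo,
    ))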

Token Validation (MCP Server):

// mcp-server/internal/middleware/auth.go
func AuthMiddleware(validator *auth.JWTValidator) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // 1. Extract token from Authorization header
            authHeader := r.Header.Get("Authorization")
            if authHeader == "" {
                http.Error(w, "missing authorization header", http.StatusUnauthorized)
                return
            }

            tokenString := strings.TrimPrefix(authHeader, "Bearer ")
            
            // 2. Validate token signature and claims
            claims, err := validator.ValidateToken(tokenString)
            if err != nil {
                log.Printf("Token validation failed: %v", err)
                http.Error(w, "invalid token", http.StatusUnauthorized)
                return
            }

            // 3. Check token expiration
            if claims.ExpiresAt.Before(time.Now()) {
                http.Error(w, "token expired", http.StatusUnauthorized)
                return
            }

            // 4. Check token not used before nbf
            if claims.NotBefore.After(time.Now()) {
                http.Error(w, "token not yet valid", http.StatusUnauthorized)
                return
            }

            // 5. Verify audience (prevent token reuse across services)
            if claims.Audience != "mcp-server" {
                http.Error(w, "invalid token audience", http.StatusUnauthorized)
                return
            }

            // 6. Add claims to context for downstream handlers
            ctx := context.WithValue(r.Context(), auth.ContextKeyTenantID, claims.TenantID)
            ctx = context.WithValue(ctx, auth.ContextKeyUserID, claims.UserID)
            ctx = context.WithValue(ctx, auth.ContextKeyJTI, claims.JTI)

            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }
}

Key Security Features:

  • ✅ RS256 signatures (asymmetric cryptography – server can’t forge tokens)
  • ✅ Short-lived tokens (1-hour default, reduces replay attack window)
  • ✅ JWT ID (jti) for token revocation
  • ✅ Audience claim prevents token reuse across services
  • ✅ Tenant and user context in every request
  • ✅ Database-level isolation via RLS
  • ✅ No session state (fully stateless, scales horizontally)

2. Cost Tracking & Budgeting

You can avoid unexpected AI spend by tracking costs per user, model, and request:

# streamlit-ui/pages/4_?_Cost_Tracking.py
class CostTracker:
    def __init__(self):
        self.costs = []
        self.pricing = {
            # Local models (Ollama)
            "llama3.2": 0.0001,      # $0.0001 per 1K tokens
            "mistral": 0.0001,
            
            # OpenAI models
            "gpt-4": 0.03,           # $0.03 per 1K tokens
            "gpt-3.5-turbo": 0.002,  # $0.002 per 1K tokens
            
            # Anthropic models
            "claude-3": 0.015,       # $0.015 per 1K tokens
            "claude-3-haiku": 0.0025,
        }

    def track_request(self, user_id: str, model: str, 
                     input_tokens: int, output_tokens: int,
                     metadata: dict = None):
        """Track a single request with detailed token breakdown"""
        
        # Calculate costs
        input_cost = (input_tokens / 1000) * self.pricing.get(model, 0)
        output_cost = (output_tokens / 1000) * self.pricing.get(model, 0)
        total_cost = input_cost + output_cost

        # Store record
        self.costs.append({
            "timestamp": datetime.now(),
            "user_id": user_id,
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "input_cost": input_cost,
            "output_cost": output_cost,
            "total_cost": total_cost,
            "metadata": metadata or {}
        })
        
        return total_cost

    def check_budget(self, user_id: str, budget: float) -> tuple[bool, float]:
        """Check if user is within budget"""
        user_costs = [
            c["total_cost"] for c in self.costs
            if c["user_id"] == user_id
        ]

        total_spent = sum(user_costs)
        remaining = budget - total_spent
        
        return remaining > 0, remaining

    def get_usage_by_model(self, user_id: str) -> dict:
        """Get cost breakdown by model"""
        model_costs = {}
        
        for cost in self.costs:
            if cost["user_id"] == user_id:
                model = cost["model"]
                if model not in model_costs:
                    model_costs[model] = {
                        "requests": 0,
                        "total_tokens": 0,
                        "total_cost": 0.0
                    }
                
                model_costs[model]["requests"] += 1
                model_costs[model]["total_tokens"] += cost["input_tokens"] + cost["output_tokens"]
                model_costs[model]["total_cost"] += cost["total_cost"]
        
        return model_costs

Budget Overview Dashboard:

The UI shows:

  • Budget remaining per user
  • Cost distribution by model (pie chart)
  • 7-day spending trend (line chart)
  • Alerts when approaching budget limits
  • Export to CSV/JSON for accounting

Real-world budget tiers:

# Budget enforcement by user tier
BUDGET_TIERS = {
    "free": {
        "monthly_budget": 0.50,      # $0.50/month
        "rate_limit": 10,            # 10 req/min
        "models": ["llama3.2"]       # Local only
    },
    "pro": {
        "monthly_budget": 25.00,     # $25/month
        "rate_limit": 100,           # 100 req/min
        "models": ["llama3.2", "gpt-3.5-turbo", "claude-3-haiku"]
    },
    "enterprise": {
        "monthly_budget": 500.00,    # $500/month
        "rate_limit": 1000,          # 1000 req/min
        "models": ["*"]              # All models
    }
}

3. Observability with Structured Logging

Langfuse can be integrated for production observability:

# orchestration/workflows/rag_workflow.py
try:
    from langfuse.decorators import observe, langfuse_context
    LANGFUSE_AVAILABLE = True
except ImportError:
    LANGFUSE_AVAILABLE = False
    # Create no-op decorator for local dev
    def observe(*args, **kwargs):
        def decorator(func):
            return func
        return decorator if not args else decorator(args[0])

@observe(name="rag_workflow")
def run_rag_workflow(query: str, user_id: str, tenant_id: str) -> str:
    """Run RAG workflow with observability"""
    workflow = RAGWorkflow(mcp_url="http://mcp-server:8080")

    result = workflow.run({
        "query": query,
        "user_id": user_id,
        "tenant_id": tenant_id
    })

    if LANGFUSE_AVAILABLE:
        # Add metadata for debugging
        langfuse_context.update_current_trace(
            metadata={
                "documents_found": len(result["documents"]),
                "top_score": result["ranked_docs"][0]["score"],
                "model": "llama3.2",
                "tenant_id": tenant_id,
                "user_id": user_id
            },
            tags=["rag", "production", tenant_id]
        )

    return result["answer"]

This gives you:

  • Trace every workflow execution with timing
  • Track tool calls and latencies per step
  • Debug failed runs with full context
  • Monitor token usage and costs
  • Analyze performance across tenants

All with zero impact when Langfuse isn’t installed—perfect for local development.

4. Rate Limiting

Rate limiting protects the servers from abuse:

// mcp-server/internal/middleware/ratelimit.go
import "golang.org/x/time/rate"

func RateLimitMiddleware(limiter *rate.Limiter) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if !limiter.Allow() {
                http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }
}

// Usage: 100 requests/sec with a burst of 200 (a single process-wide limiter;
// see the Redis-backed version below for per-tenant limits)
limiter := rate.NewLimiter(100, 200)

Per-tenant rate limiting with Redis:

// mcp-server/internal/middleware/ratelimit_redis.go
type RedisRateLimiter struct {
    client *redis.Client
    limit  int
    window time.Duration
}

func (r *RedisRateLimiter) Allow(ctx context.Context, tenantID string) (bool, error) {
    key := fmt.Sprintf("ratelimit:tenant:%s", tenantID)
    
    // Increment counter
    count, err := r.client.Incr(ctx, key).Result()
    if err != nil {
        return false, err
    }
    
    // Set expiration on first request
    if count == 1 {
        r.client.Expire(ctx, key, r.window)
    }
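    // Caveat: if the process dies between Incr and Expire, the key is left without a
    // TTL and the counter never resets; doing both in one Lua script makes it atomic.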
    
    // Check limit
    return count <= int64(r.limit), nil
}

Part 6: Testing

Unit tests with mocks aren’t enough. You need integration tests against real databases to catch:

  • NULL value handling in PostgreSQL
  • Row-level security policies
  • Concurrent access patterns
  • Real embedding operations with pgvector
  • JSON-RPC protocol edge cases
  • JWT token validation
  • Rate limiting behavior

Integration Test Suite

Here’s what I built:

// mcp-server/internal/database/postgres_integration_test.go
func TestGetDocument_WithNullEmbedding(t *testing.T) {
    db := setupTestDB(t)
    defer db.Close()

    ctx := context.Background()

    // Insert document WITHOUT embedding (common in real world)
    testDoc := &Document{
        TenantID:  testTenantID,
        Title:     "Test Document Without Embedding",
        Content:   "This document has no embedding vector",
        Metadata:  map[string]interface{}{"test": true},
        Embedding: nil, // Explicitly no embedding
    }

    err := db.InsertDocument(ctx, testTenantID, testDoc)
    require.NoError(t, err)

    // Retrieve - should NOT fail with NULL scan error
    retrieved, err := db.GetDocument(ctx, testTenantID, testDoc.ID)
    require.NoError(t, err)
    assert.NotNil(t, retrieved)
    assert.Nil(t, retrieved.Embedding) // Embedding is NULL
    assert.Equal(t, testDoc.Title, retrieved.Title)
    assert.Equal(t, testDoc.Content, retrieved.Content)

    // Cleanup
    db.DeleteDocument(ctx, testTenantID, testDoc.ID)
}

func TestHybridSearch_HandlesNullEmbeddings(t *testing.T) {
    db := setupTestDB(t)
    defer db.Close()

    ctx := context.Background()

    // Insert documents with and without embeddings
    docWithEmbedding := createDocumentWithEmbedding(t, db, testTenantID, "AI Guide")
    docWithoutEmbedding := createDocumentWithoutEmbedding(t, db, testTenantID, "ML Tutorial")

    // Create query embedding
    queryEmbedding := make([]float32, 1536)
    for i := range queryEmbedding {
        queryEmbedding[i] = 0.1
    }

    params := HybridSearchParams{
        Query:        "artificial intelligence machine learning",
        Embedding:    queryEmbedding,
        Limit:        10,
        BM25Weight:   0.5,
        VectorWeight: 0.5,
    }

    // Should work even with NULL embeddings
    results, err := db.HybridSearch(ctx, testTenantID, params)
    require.NoError(t, err)
    assert.NotNil(t, results)
    assert.Greater(t, len(results), 0)

    // Documents without embeddings get vector_score = 0
    for _, result := range results {
        if result.Document.Embedding == nil {
            assert.Equal(t, 0.0, result.VectorScore)
            assert.Greater(t, result.BM25Score, 0.0) // But BM25 should work
        }
    }
}

func TestTenantIsolation_CannotAccessOtherTenant(t *testing.T) {
    db := setupTestDB(t)
    defer db.Close()

    tenant1ID := "tenant-1-" + uuid.New().String()
    tenant2ID := "tenant-2-" + uuid.New().String()

    // Create documents for both tenants
    doc1 := createDocument(t, db, tenant1ID, "Tenant 1 Secret Data")
    doc2 := createDocument(t, db, tenant2ID, "Tenant 2 Secret Data")

    // Query as tenant-1
    ctx1 := context.Background()
    results1, err := db.ListDocuments(ctx1, tenant1ID, ListParams{Limit: 100})
    require.NoError(t, err)

    // Query as tenant-2
    ctx2 := context.Background()
    results2, err := db.ListDocuments(ctx2, tenant2ID, ListParams{Limit: 100})
    require.NoError(t, err)

    // Verify isolation
    assert.Contains(t, results1, doc1)
    assert.NotContains(t, results1, doc2) // ✅ Cannot see other tenant

    assert.Contains(t, results2, doc2)
    assert.NotContains(t, results2, doc1) // ✅ Cannot see other tenant
}

func TestConcurrentRetrievals_NoRaceConditions(t *testing.T) {
    db := setupTestDB(t)
    defer db.Close()

    // Create test documents
    docs := make([]*Document, 50)
    for i := 0; i < 50; i++ {
        docs[i] = createDocument(t, db, testTenantID, fmt.Sprintf("Document %d", i))
    }

    // Concurrent retrievals
    var wg sync.WaitGroup
    errors := make(chan error, 500)

    for worker := 0; worker < 10; worker++ {
        wg.Add(1)
        go func() {
            defer wg.Done()

            for i := 0; i < 50; i++ {
                doc := docs[i]
                retrieved, err := db.GetDocument(context.Background(), testTenantID, doc.ID)
                if err != nil {
                    errors <- err
                    return
                }
                if retrieved.ID != doc.ID {
                    errors <- fmt.Errorf("document mismatch: got %s, want %s", retrieved.ID, doc.ID)
                    return
                }
            }
        }()
    }

    wg.Wait()
    close(errors)

    // Check for errors
    for err := range errors {
        t.Error(err)
    }
}

Test Coverage:

  • ✅ GetDocument with/without embeddings (NULL handling)
  • ✅ ListDocuments with mixed states
  • ✅ SearchDocuments with NULL embeddings
  • ✅ HybridSearch graceful degradation
  • ✅ Tenant isolation enforcement (security)
  • ✅ Concurrent access (10 workers, 50 requests each)
  • ✅ All 10 sample documents retrievable
  • ✅ JSON-RPC protocol validation
  • ✅ JWT token validation
  • ✅ Rate limiting behavior

Running Tests

# Unit tests (fast, no dependencies)
cd mcp-server
go test -v ./...

# Integration tests (requires PostgreSQL)
./scripts/run-integration-tests.sh

The integration test script:

  1. Checks if PostgreSQL is running
  2. Waits for database ready
  3. Runs all integration tests
  4. Reports coverage

Output:

Running MCP Server Integration Tests
========================================
✅ PostgreSQL is ready

Running integration tests...

=== RUN   TestGetDocument_WithNullEmbedding
--- PASS: TestGetDocument_WithNullEmbedding (0.05s)
=== RUN   TestGetDocument_WithEmbedding
--- PASS: TestGetDocument_WithEmbedding (0.04s)
=== RUN   TestHybridSearch_HandlesNullEmbeddings
--- PASS: TestHybridSearch_HandlesNullEmbeddings (0.12s)
=== RUN   TestTenantIsolation
--- PASS: TestTenantIsolation (0.08s)
=== RUN   TestConcurrentRetrievals
--- PASS: TestConcurrentRetrievals (2.34s)

PASS
coverage: 95.3% of statements
ok  	github.com/bhatti/mcp-a2a-go/mcp-server/internal/database	3.456s

✅ Integration tests completed!

Part 7: Real-World Use Cases

Use Case 1: Enterprise RAG Search

Scenario: Consulting firm managing 50,000+ contract documents across multiple clients. Each client (tenant) must have complete data isolation. Legal team needs to:

  • Search with exact terms (case citations, contract clauses)
  • Find semantically similar clauses (non-obvious connections)
  • Track who accessed what (audit trail)
  • Enforce budget limits per client matter

Solution: Hybrid search combining BM25 (keywords) and vector similarity (semantics).

# Client code
results = mcp_client.hybrid_search(
    query="data breach notification requirements GDPR Article 33",
    limit=10,
    bm25_weight=0.6,  # Favor exact keyword matches for legal terms
    vector_weight=0.4  # But include semantic similarity
)

for result in results:
    print(f"Document: {result['title']}")
    print(f"BM25 Score: {result['bm25_score']:.2f}")
    print(f"Vector Score: {result['vector_score']:.2f}")
    print(f"Combined: {result['score']:.2f}")
    print(f"Tenant: {result['tenant_id']}")
    print("---")

Output:

Document: GDPR Compliance Framework - Article 33 Analysis
BM25 Score: 0.89  (matched "GDPR", "Article 33", "notification")
Vector Score: 0.76  (understood "data breach requirements")
Combined: 0.84
Tenant: client-acme-legal

Document: Data Breach Response Procedures
BM25 Score: 0.45  (matched "data breach", "notification")
Vector Score: 0.91  (strong semantic match)
Combined: 0.65
Tenant: client-acme-legal

Document: SEC Disclosure Requirements
BM25 Score: 0.78  (matched "requirements", "notification")
Vector Score: 0.52  (weak semantic match)
Combined: 0.67
Tenant: client-acme-legal

Benefits:

  • ✅ Finds documents with exact terms (“GDPR”, “Article 33”)
  • ✅ Surfaces semantically similar docs (“privacy breach”, “data protection”)
  • ✅ Tenant isolation ensures Client A can’t see Client B’s contracts
  • ✅ Audit trail via structured logging
  • ✅ Cost tracking per client matter

Use Case 2: Multi-Step Research Workflows

Scenario: Investment analyst needs to research a company across multiple data sources:

  1. Company filings (10-K, 10-Q, 8-K)
  2. Competitor analysis
  3. Market trends
  4. Financial metrics
  5. Regulatory filings
  6. News sentiment

Traditional RAG: Query each source separately, manually synthesize results.

With A2A + MCP: Orchestrate multi-step workflow with progress tracking.

# orchestration/workflows/research_workflow.py
class ResearchWorkflow:
    def build_workflow(self):
        workflow = StateGraph(ResearchState)

        # Define research steps
        workflow.add_node("search_company", self.search_company_docs)
        workflow.add_node("search_competitors", self.search_competitors)
        workflow.add_node("search_financials", self.search_financial_data)
        workflow.add_node("analyze_trends", self.analyze_market_trends)
        workflow.add_node("verify_facts", self.verify_with_sources)
        workflow.add_node("generate_report", self.generate_final_report)

        # Define workflow graph
        workflow.add_edge(START, "search_company")
        workflow.add_edge("search_company", "search_competitors")
        workflow.add_edge("search_competitors", "search_financials")
        workflow.add_edge("search_financials", "analyze_trends")
        workflow.add_edge("analyze_trends", "verify_facts")
        workflow.add_edge("verify_facts", "generate_report")
        workflow.add_edge("generate_report", END)

        return workflow.compile()
    
    def search_company_docs(self, state: ResearchState) -> ResearchState:
        """Step 1: Search company documents via MCP"""
        company = state["company_name"]
        
        # Call MCP hybrid search
        results = self.mcp_client.hybrid_search(
            query=f"{company} business operations revenue products",
            limit=20,
            bm25_weight=0.5,
            vector_weight=0.5
        )
        
        state["company_docs"] = results
        state["progress"] = f"Found {len(results)} company documents"
        
        # Emit progress via A2A SSE
        emit_progress("search_company_complete", state["progress"])
        
        return state
    
    def search_competitors(self, state: ResearchState) -> ResearchState:
        """Step 2: Identify and search competitors"""
        company = state["company_name"]
        
        # Extract competitors from company docs
        competitors = self.extract_competitors(state["company_docs"])
        
        # Search each competitor
        competitor_data = {}
        for competitor in competitors:
            results = self.mcp_client.hybrid_search(
                query=f"{competitor} market share products revenue",
                limit=10
            )
            competitor_data[competitor] = results
        
        state["competitors"] = competitor_data
        state["progress"] = f"Analyzed {len(competitors)} competitors"
        
        emit_progress("search_competitors_complete", state["progress"])
        
        return state
    
    def search_financial_data(self, state: ResearchState) -> ResearchState:
        """Step 3: Extract financial metrics"""
        company = state["company_name"]
        
        # Search for financial documents
        results = self.mcp_client.hybrid_search(
            query=f"{company} revenue earnings profit margin cash flow",
            limit=15,
            bm25_weight=0.7,  # Favor exact financial terms
            vector_weight=0.3
        )
        
        # Extract key metrics
        metrics = self.extract_financial_metrics(results)
        
        state["financials"] = metrics
        state["progress"] = f"Extracted {len(metrics)} financial metrics"
        
        emit_progress("search_financials_complete", state["progress"])
        
        return state
    
    def verify_facts(self, state: ResearchState) -> ResearchState:
        """Step 5: Verify all facts with sources"""
        # Check each claim has supporting document
        claims = self.extract_claims(state["report_draft"])
        
        verified_claims = []
        for claim in claims:
            sources = self.find_supporting_docs(claim, state)
            if sources:
                verified_claims.append({
                    "claim": claim,
                    "sources": sources,
                    "verified": True
                })
        
        state["verified_claims"] = verified_claims
        state["progress"] = f"Verified {len(verified_claims)} claims"
        
        emit_progress("verification_complete", state["progress"])
        
        return state

Benefits:

  • ✅ Multi-step orchestration with state management
  • ✅ Real-time progress via SSE (analyst sees each step)
  • ✅ Intermediate results saved as artifacts
  • ✅ Each step calls MCP tools for data retrieval
  • ✅ Final report with verified sources
  • ✅ Cost tracking across all steps

Use Case 3: Budget-Controlled AI Assistance

Scenario: A SaaS company (e.g., a document management platform) offers AI features to customers on tiered subscriptions. Without budget control, a customer on the free tier can make 10,000 queries in a single day.

With budget control:

# Before each request
tier = get_user_tier(user_id)
budget = BUDGET_TIERS[tier]["monthly_budget"]
allowed, remaining = cost_tracker.check_budget(user_id, budget)

if not allowed:
    raise BudgetExceededError(
        f"Monthly budget of ${budget} exceeded. "
        f"Upgrade to {next_tier} for higher limits."
    )

# Track the request
response = llm.generate(prompt)
cost = cost_tracker.track_request(
    user_id=user_id,
    model="llama3.2",
    input_tokens=len(prompt.split()),
    output_tokens=len(response.split())
)

# Alert when approaching limit
if remaining < 5.0:  # $5 remaining
    send_alert(user_id, f"Budget alert: ${remaining:.2f} remaining")

Real-world budget enforcement:

# streamlit-ui/pages/4_?_Cost_Tracking.py
def enforce_budget_limits():
    """Check budget before task creation"""
    
    user_tier = st.session_state.get("user_tier", "free")
    budget = BUDGET_TIERS[user_tier]["monthly_budget"]
    
    # Calculate current spend
    spent = cost_tracker.get_total_cost(user_id)
    remaining = budget - spent
    
    # Display budget status
    col1, col2, col3 = st.columns(3)
    
    with col1:
        st.metric("Budget", f"${budget:.2f}")
    
    with col2:
        st.metric("Spent", f"${spent:.2f}", 
                 delta=f"-${spent:.2f}", delta_color="inverse")
    
    with col3:
        progress = (spent / budget) * 100
        st.metric("Remaining", f"${remaining:.2f}")
        st.progress(min(progress / 100, 1.0))  # clamp in case spend exceeds the budget
    
    # Block if exceeded
    if remaining <= 0:
        st.error("❌ Monthly budget exceeded. Upgrade to continue.")
        st.button("Upgrade to Pro ($25/month)", on_click=upgrade_tier)
        return False
    
    # Warn if close
    if remaining < 5.0:
        st.warning(f"⚠️ Budget alert: Only ${remaining:.2f} remaining this month")
    
    return True

Benefits:

  • ✅ Prevent cost overruns per customer
  • ✅ Fair usage enforcement across tiers
  • ✅ Export data for billing/accounting
  • ✅ Different limits per tier
  • ✅ Automatic alerts before limits
  • ✅ Graceful degradation (local models for free tier)
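The last bullet above (“graceful degradation”) can be as simple as routing a request to the local Ollama model when the paid budget is nearly gone or the tier doesn’t allow the requested model. A small sketch building on the BUDGET_TIERS table from earlier (the $1 threshold is illustrative):

def select_model(tier: str, remaining_budget: float,
                 preferred: str = "gpt-3.5-turbo") -> str:
    """Pick a model the tier allows, falling back to the local model when budget is low."""
    allowed = BUDGET_TIERS[tier]["models"]  # assumes BUDGET_TIERS from the earlier snippet
    if "*" not in allowed and preferred not in allowed:
        return "llama3.2"                   # tier doesn't include the requested model
    if remaining_budget < 1.0:              # illustrative threshold: last $1 goes local
        return "llama3.2"
    return preferred

# A pro-tier user with $0.40 left this month gets routed to the local model
print(select_model("pro", remaining_budget=0.40))  # -> "llama3.2"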

Part 8: Deployment & Operations

Docker Compose Setup

Everything runs in containers with health checks:

# docker-compose.yml
version: '3.8'

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: mcp_db
      POSTGRES_USER: mcp_user
      POSTGRES_PASSWORD: ${DB_PASSWORD:-mcp_secure_pass}
    volumes:
      - ./scripts/init-db.sql:/docker-entrypoint-initdb.d/init.sql
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mcp_user -d mcp_db"]
      interval: 5s
      timeout: 5s
      retries: 5

  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 10s
      timeout: 5s
      retries: 3

  mcp-server:
    build:
      context: ./mcp-server
      dockerfile: Dockerfile
    environment:
      DB_HOST: postgres
      DB_PORT: 5432
      DB_USER: mcp_user
      DB_PASSWORD: ${DB_PASSWORD:-mcp_secure_pass}
      DB_NAME: mcp_db
      JWT_PUBLIC_KEY_PATH: /app/certs/public_key.pem
      OLLAMA_URL: http://ollama:11434
      LOG_LEVEL: ${LOG_LEVEL:-info}
    ports:
      - "8080:8080"
    depends_on:
      postgres:
        condition: service_healthy
      ollama:
        condition: service_healthy
    volumes:
      - ./certs:/app/certs:ro
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  a2a-server:
    build:
      context: ./a2a-server
      dockerfile: Dockerfile
    environment:
      MCP_SERVER_URL: http://mcp-server:8080
      OLLAMA_URL: http://ollama:11434
      LOG_LEVEL: ${LOG_LEVEL:-info}
    ports:
      - "8082:8082"
    depends_on:
      - mcp-server
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8082/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  streamlit-ui:
    build:
      context: ./streamlit-ui
      dockerfile: Dockerfile
    environment:
      MCP_SERVER_URL: http://mcp-server:8080
      A2A_SERVER_URL: http://a2a-server:8082
    ports:
      - "8501:8501"
    volumes:
      - ./certs:/app/certs:ro
    depends_on:
      - mcp-server
      - a2a-server

volumes:
  postgres_data:
  ollama_data:

Startup & Verification

# Start all services
docker compose up -d

# Check status
docker compose ps

# Expected output:
# NAME              STATUS        PORTS
# postgres          Up (healthy)  0.0.0.0:5432->5432/tcp
# ollama            Up (healthy)  0.0.0.0:11434->11434/tcp
# mcp-server        Up (healthy)  0.0.0.0:8080->8080/tcp
# a2a-server        Up (healthy)  0.0.0.0:8082->8082/tcp
# streamlit-ui      Up            0.0.0.0:8501->8501/tcp

# View logs
docker compose logs -f mcp-server
docker compose logs -f a2a-server

# Run health checks
curl http://localhost:8080/health  # MCP server
curl http://localhost:8082/health  # A2A server

# Pull Ollama model
docker compose exec ollama ollama pull llama3.2

# Initialize database with sample data
docker compose exec postgres psql -U mcp_user -d mcp_db -f /docker-entrypoint-initdb.d/init.sql

Production Considerations

1. Environment Variables (Don’t Hardcode Secrets)

# .env.production
DB_PASSWORD=$(openssl rand -base64 32)
JWT_PRIVATE_KEY_PATH=/secrets/jwt_private_key.pem
JWT_PUBLIC_KEY_PATH=/secrets/jwt_public_key.pem
LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY}
LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY}
OLLAMA_URL=http://ollama:11434
LOG_LEVEL=info
SENTRY_DSN=${SENTRY_DSN}

2. Database Migrations

Use golang-migrate for schema management:

# Install migrate
curl -L https://github.com/golang-migrate/migrate/releases/download/v4.16.2/migrate.linux-amd64.tar.gz | tar xvz
mv migrate /usr/local/bin/

# Create migration
migrate create -ext sql -dir db/migrations -seq add_embeddings_index

# Apply migrations
migrate -path db/migrations \
        -database "postgresql://user:pass@localhost:5432/db?sslmode=disable" \
        up

# Rollback if needed
migrate -path db/migrations \
        -database "postgresql://user:pass@localhost:5432/db?sslmode=disable" \
        down 1

3. Kubernetes Deployment

The repository includes Kubernetes manifests:

# k8s/mcp-server-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
  namespace: mcp-a2a
spec:
  replicas: 3  # High availability
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
      - name: mcp-server
        image: ghcr.io/bhatti/mcp-server:latest
        ports:
        - containerPort: 8080
        env:
        - name: DB_HOST
          value: postgres-service
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
        - name: JWT_PUBLIC_KEY_PATH
          value: /certs/public_key.pem
        volumeMounts:
        - name: certs
          mountPath: /certs
          readOnly: true
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 3
      volumes:
      - name: certs
        secret:
          secretName: jwt-certs

Deploy to Kubernetes:

# Create namespace
kubectl create namespace mcp-a2a

# Apply secrets
kubectl create secret generic db-credentials \
  --from-literal=password=$(openssl rand -base64 32) \
  -n mcp-a2a

kubectl create secret generic jwt-certs \
  --from-file=public_key.pem=./certs/public_key.pem \
  --from-file=private_key.pem=./certs/private_key.pem \
  -n mcp-a2a

# Apply manifests
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/postgres.yaml
kubectl apply -f k8s/mcp-server.yaml
kubectl apply -f k8s/a2a-server.yaml
kubectl apply -f k8s/streamlit-ui.yaml

# Check pods
kubectl get pods -n mcp-a2a

# View logs
kubectl logs -f deployment/mcp-server -n mcp-a2a

# Scale up
kubectl scale deployment mcp-server --replicas=5 -n mcp-a2a

4. Monitoring & Alerts

Add Prometheus metrics:

// mcp-server/internal/metrics/prometheus.go
var (
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "mcp_request_duration_seconds",
            Help: "MCP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "status"},
    )

    activeRequests = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "mcp_active_requests",
            Help: "Number of active MCP requests",
        },
    )
    
    hybridSearchQueries = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "mcp_hybrid_search_queries_total",
            Help: "Total number of hybrid search queries",
        },
        []string{"tenant_id"},
    )
    
    budgetExceeded = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "mcp_budget_exceeded_total",
            Help: "Number of requests blocked due to budget limits",
        },
        []string{"user_id", "tier"},
    )
)

func init() {
    prometheus.MustRegister(requestDuration)
    prometheus.MustRegister(activeRequests)
    prometheus.MustRegister(hybridSearchQueries)
    prometheus.MustRegister(budgetExceeded)
}

Alert rules (Prometheus):

# prometheus/alerts.yml
groups:
- name: mcp_alerts
  rules:
  - alert: HighErrorRate
    expr: rate(mcp_request_duration_seconds_count{status="error"}[5m]) > 0.1
    for: 5m
    annotations:
      summary: "High error rate on MCP server"
      description: "Error rate is {{ $value }} errors/sec"
  
  - alert: BudgetExceededRate
    expr: rate(mcp_budget_exceeded_total[1h]) > 100
    annotations:
      summary: "High budget exceeded rate"
      description: "{{ $value }} users hitting budget limits per hour"
  
  - alert: DatabaseLatency
    expr: mcp_request_duration_seconds{method="hybrid_search"} > 1.0
    for: 2m
    annotations:
      summary: "Slow hybrid search queries"
      description: "Hybrid search taking {{ $value }}s (should be <1s)"

5. Backup & Recovery

Automated PostgreSQL backups:

#!/bin/bash
# scripts/backup-database.sh

BACKUP_DIR="/backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DB_NAME="mcp_db"
DB_USER="mcp_user"

# Create backup directory
mkdir -p ${BACKUP_DIR}

# Dump database
docker compose exec -T postgres pg_dump -U ${DB_USER} ${DB_NAME} | \
    gzip > ${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz

# Upload to S3 (optional)
aws s3 cp ${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz \
    s3://my-backups/mcp-db/

# Keep last 7 days locally
find ${BACKUP_DIR} -name "${DB_NAME}_*.sql.gz" -mtime +7 -delete

echo "Backup completed: ${DB_NAME}_${TIMESTAMP}.sql.gz"

Part 9: Performance & Scalability

Benchmarks (Single Instance)

MCP Server (Go):

Benchmark: Hybrid Search (10 results, 1536-dim embeddings)
- Requests/sec: 5,247
- P50 latency: 12ms
- P95 latency: 45ms
- P99 latency: 89ms
- Memory: 52MB baseline, 89MB under load
- CPU: 23% average (4 cores)

Database (PostgreSQL + pgvector):

Benchmark: Vector search (cosine similarity)
- Documents: 100,000
- Embedding dimensions: 1536
- Index: HNSW (m=16, ef_construction=64)
- Query time: <100ms (P95)
- Throughput: 150 queries/sec (single connection)
- Concurrent queries: 100+ simultaneous

Why these numbers matter:

  • 5,000+ req/sec means 432 million requests/day per instance
  • <100ms search means interactive UX
  • 52MB memory means cost-effective scaling

Load Testing Results

# Using hey (HTTP load generator)
hey -n 10000 -c 100 -m POST \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"hybrid_search","arguments":{"query":"machine learning","limit":10}}}' \
    http://localhost:8080/mcp

Summary:
  Total:        19.8421 secs
  Slowest:      0.2847 secs
  Fastest:      0.0089 secs
  Average:      0.1974 secs
  Requests/sec: 503.98
  
  Status code distribution:
    [200]	10000 responses

Latency distribution:
  10% in 0.0234 secs
  25% in 0.0456 secs
  50% in 0.1842 secs
  75% in 0.3123 secs
  90% in 0.4234 secs
  95% in 0.4867 secs
  99% in 0.5634 secs

Scaling Strategy

Horizontal Scaling:

  1. MCP and A2A servers are stateless—scale with container replicas
  2. Database read replicas for read-heavy workloads (search queries)
  3. Redis cache for frequently accessed queries (30-second TTL)
  4. Load balancer distributes requests (sticky sessions not needed)

Vertical Scaling:

  1. Increase PostgreSQL resources for larger datasets
  2. Add pgvector HNSW indexes for faster vector search
  3. Tune connection pool sizes (PgBouncer)

When to scale what:

| Symptom | Solution |
| --- | --- |
| High MCP server CPU | Add more MCP replicas |
| Slow database queries | Add read replicas |
| High memory on MCP | Check for memory leaks, add replicas |
| Cache misses | Increase Redis memory, tune TTL |
| Slow embeddings | Deploy dedicated embedding service |

Part 10: Lessons Learned & Best Practices

1. Go for Protocol Servers

Go’s performance and type safety provide a solid foundation for running protocol servers in production.

2. PostgreSQL Row-Level Security

Database-level tenant isolation is non-negotiable for enterprise. Application-level filtering is too easy to screw up. With RLS, even if your application has a bug, the database enforces isolation.

3. Integration Tests Against Real Databases

Unit tests with mocks didn’t catch the NULL embedding issues. Integration tests did. Test against production-like environments.

4. Optional Langfuse

Making Langfuse optional (try/except imports) lets developers run locally without complex setup while enabling full observability in production.

5. Comprehensive Documentation

Document your design and testing process from day one.

6. Structured Logging

Add structured logging (JSON format):

// ✓ Structured logging
log.Info().
    Str("tenant_id", tenantID).
    Str("user_id", userID).
    Int("results_count", len(results)).
    Float64("duration_ms", duration.Milliseconds()).
    Msg("hybrid search completed")

Benefits of structured logging:

  • Easy filtering: jq 'select(.tenant_id == "acme-corp")' logs.json
  • Metrics extraction: jq -r '.duration_ms' logs.json | stats
  • Correlation: Trace requests across services
  • Alerting: Monitor error patterns
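
The same discipline helps on the Python workflow side of the stack. As a rough sketch (not the exact logger used in the repository), the standard library can emit JSON lines that the jq commands above can consume:

import json
import logging

class JSONFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("workflow")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("hybrid search completed",
            extra={"fields": {"tenant_id": "acme-corp", "duration_ms": 42.0}})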

7. Rate Limiting Per Tenant (Not Global)

Implement per-tenant rate limiting using Redis or a similar shared store:

// ✓ Per-tenant rate limiting
type RedisRateLimiter struct {
    client *redis.Client
}

func (r *RedisRateLimiter) Allow(ctx context.Context, tenantID string, limit int) (bool, error) {
    key := fmt.Sprintf("ratelimit:tenant:%s", tenantID)

    // Count this request in the tenant's current window
    count, err := r.client.Incr(ctx, key).Result()
    if err != nil {
        return false, err
    }

    // Start the one-minute window only when the key is first created;
    // refreshing the TTL on every call would keep a busy tenant's window
    // from ever expiring
    if count == 1 {
        if err := r.client.Expire(ctx, key, time.Minute).Err(); err != nil {
            return false, err
        }
    }

    return count <= int64(limit), nil
}

Why this matters:

  • One tenant can’t DoS the system
  • Fair resource allocation
  • Tiered pricing based on limits
  • Tenant-specific SLAs

8. Embedding Generation Service

Ollama works, but a dedicated embedding service (e.g., sentence-transformers FastAPI service) would be:

  • Faster: Batch processing
  • More reliable: Health checks, retries
  • Scalable: Independent scaling
# embeddings-service/app.py (what I should have built)
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer('all-MiniLM-L6-v2')

@app.post("/embed")
async def embed(texts: list[str]):
    embeddings = model.encode(texts, batch_size=32)
    return {"embeddings": embeddings.tolist()}

9. Circuit Breaker Pattern

When Ollama is down, the entire system hangs waiting for embeddings, so implement a circuit breaker with a graceful fallback strategy:

// ✓ Circuit breaker pattern
// Note: guard the mutable fields below with a sync.Mutex if the breaker is
// shared across goroutines; locking is omitted here for brevity.
type CircuitBreaker struct {
    maxFailures int
    timeout     time.Duration
    failures    int
    lastFail    time.Time
    state       string  // "closed", "open", "half-open"
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    if cb.state == "open" {
        if time.Since(cb.lastFail) > cb.timeout {
            cb.state = "half-open"
        } else {
            return fmt.Errorf("circuit breaker open")
        }
    }
    
    err := fn()
    if err != nil {
        cb.failures++
        cb.lastFail = time.Now()
        
        if cb.failures >= cb.maxFailures {
            cb.state = "open"
        }
        return err
    }
    
    cb.failures = 0
    cb.state = "closed"
    return nil
}

Production Checklist

Before going live, ensure you have:

Security:

  • JWT authentication with RSA keys
  • Row-level security enforced at database
  • Secrets in environment variables (not hardcoded)
  • HTTPS/TLS certificates
  • API key rotation policy
  • Audit logging for sensitive operations

Scalability:

  • Stateless servers (can scale horizontally)
  • Database connection pooling (PgBouncer)
  • Read replicas for query workloads
  • Caching layer (Redis)
  • Load balancer configured
  • Auto-scaling rules defined

Observability:

  • Structured logging (JSON format)
  • Distributed tracing (Jaeger/Zipkin)
  • Metrics collection (Prometheus)
  • Dashboards (Grafana)
  • Alerting rules configured
  • On-call rotation defined

Reliability:

  • Health check endpoints (/health)
  • Graceful shutdown handlers
  • Rate limiting implemented
  • Budget enforcement active
  • Circuit breakers for external services
  • Backup strategy automated

Testing:

  • Integration tests passing (95%+ coverage)
  • Load testing completed
  • Security testing (pen test)
  • Disaster recovery tested
  • Rollback procedure documented

Operations:

  • Deployment automation (CI/CD)
  • Monitoring alerts configured
  • Runbooks for common issues
  • Incident response plan
  • Backup and recovery tested
  • Capacity planning done

Conclusion: MCP + A2A = Production-Grade AI

Here’s what we built:

✓ MCP Server – Secure, multi-tenant document retrieval (5,000+ req/sec)
✓ A2A Server – Stateful workflow orchestration with SSE streaming
✓ LangGraph Workflows – Multi-step RAG and research pipelines
✓ 200+ Tests – 95% coverage with integration tests against real databases
✓ Production Ready – Auth, observability, cost tracking, rate limiting, K8s deployment

But here’s the uncomfortable truth: none of this is defined in the MCP or A2A specifications. The protocols are just 10% of the work:

MCP defines:

  • ✓ JSON-RPC 2.0 message format
  • ✓ Tool call/response structure
  • ✓ Resource access patterns

A2A defines:

  • ✓ Task lifecycle states
  • ✓ Agent card format
  • ✓ SSE event structure

What they DON’T define:

  • ✗ Authentication and authorization
  • ✗ Multi-tenant isolation
  • ✗ Rate limiting and cost control
  • ✗ Observability and tracing
  • ✗ Circuit breakers and timeouts
  • ✗ Encryption and compliance
  • ✗ Disaster recovery

This is by design—protocols define interfaces, not implementations. But it means every production deployment must solve these problems independently.

Why Default Implementations Are Dangerous

Reference implementations are educational tools, not deployment blueprints. Here’s what’s missing:

# ✗ Typical MCP tutorial
def handle_request(request):
    tool = request["params"]["name"]
    args = request["params"]["arguments"]
    return execute_tool(tool, args)  # No auth, no validation, no limits
// ✓ Production reality
func (h *MCPHandler) handleToolsCall(ctx context.Context, req *protocol.Request) {
    // 1. Authenticate (JWT validation)
    // 2. Authorize (check permissions)
    // 3. Rate limit (per-tenant quotas)
    // 4. Validate input (prevent injection)
    // 5. Inject tenant context (RLS)
    // 6. Trace request (observability)
    // 7. Track cost (budget enforcement)
    // 8. Circuit breaker (fail fast)
    // 9. Retry logic (handle transients)
    // 10. Audit log (compliance)
    
    return h.toolRegistry.Execute(ctx, toolReq.Name, toolReq.Arguments)
}

That’s 10 layers of production concerns. Miss one, and you have a security incident waiting to happen.

Distributed Systems Lessons Apply Here

AI agents are distributed systems. Every problem from the microservices era applies, and the stakes are higher because agents make autonomous decisions with potentially unbounded costs. From my fault tolerance article, these patterns are essential:

Without timeouts:

embedding = ollama.embed(text)  # Ollama down → hangs forever → system freezes

With timeouts:

ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
defer cancel()
embedding, err := ollama.Embed(ctx, text)
if err != nil {
    return db.BM25Search(ctx, query)  // Degrade gracefully, skip embeddings
}

Without circuit breakers:

for task in tasks:
    result = external_api.call(task)  # Fails 1000 times, wastes time/money

With circuit breakers:

if circuitBreaker.IsOpen() {
    return cachedResult  // Fail fast, don't waste resources
}

Without rate limiting:

Tenant A: 10,000 req/sec → Database crashes → ALL tenants down

With rate limiting:

if !rateLimiter.Allow(tenantID) {
    return ErrRateLimitExceeded  // Other tenants unaffected
}

The Bottom Line

MCP and A2A are excellent protocols. They solve real problems:

  • ✓ MCP standardizes tool execution
  • ✓ A2A standardizes agent coordination

But protocols are not products. Building on MCP/A2A is like building on HTTP—the protocol is solved, but you still need web servers, frameworks, security layers, and monitoring tools.

This repository shows the other 90%:

  • Real authentication (not “TODO: add auth”)
  • Real multi-tenancy (database RLS, not app filtering)
  • Real observability (Langfuse integration, not “we should add logging”)
  • Real testing (integration tests, not just mocks)
  • Real deployment (K8s manifests, not “works on my laptop”)

Get Started

git clone https://github.com/bhatti/mcp-a2a-go
cd mcp-a2a-go
docker compose up -d
./scripts/run-integration-tests.sh
open http://localhost:8501

Resources


November 22, 2025

Testing Distributed Systems Failures with Interactive Simulators

Filed under: Computing — admin @ 10:31 pm

Introduction

Building distributed systems means confronting failure modes that are nearly impossible to reproduce in development or testing environments. How do you test for metastable failures that only emerge under specific load patterns? How do you validate that your quorum-based system actually maintains consistency during network partitions? How do you catch cross-system interaction bugs when both systems work perfectly in isolation? Integration testing, performance testing, and chaos engineering all help, but they have limitations. For the past few years, I’ve been using simulation to validate boundary conditions that are hard to test in real environments. Interactive simulators let you tweak parameters, trigger failure scenarios, and see the consequences immediately through metrics and visualizations.

In this post, I will share four simulators I’ve built to explore the failure modes and consistency challenges that are hardest to test in real systems:

  1. Metastable Failure Simulator: Demonstrates how retry storms create self-sustaining collapse
  2. CAP/PACELC Consistency Simulator: Shows the real tradeoffs between consistency, availability, and latency
  3. CRDT Simulator: Explores conflict-free convergence without coordination
  4. Cross-System Interaction (CSI) Failure Simulator: Reveals how correct systems fail through their interactions

Each simulator is built on research findings and real-world incidents. The goal isn’t just to understand these failure modes intellectually, but to develop intuition through experimentation. All simulators available at: https://github.com/bhatti/simulators.


Part 1: Metastable Failures

The Problem: When Systems Attack Themselves

Metastable failures are particularly insidious because the initial trigger can be small and transient, but the system remains degraded long after the trigger is gone. Research into metastable failures has shown that traditional fault tolerance mechanisms don’t protect against metastability because the failure is self-sustaining through positive feedback loops in retry logic and coordination overhead. The mechanics are deceptively simple:

  1. A transient issue (network blip, brief CPU spike) causes some requests to slow down
  2. Slow requests start timing out
  3. Clients retry timed-out requests, adding more load
  4. Additional load increases coordination overhead (locks, queues, resource contention)
  5. Higher overhead increases latency further
  6. More timeouts trigger more retries
  7. The system is now in a stable degraded state, even though the original trigger is gone

For example, AWS Kinesis experienced a 7+ hour outage in 2020 where a transient metadata mismatch triggered retry storms across the fleet. Even after the original issue was fixed, the retry behavior kept the system degraded. The recovery required externally rate-limiting client retries.

How the Simulator Works

The metastable failure simulator models this feedback loop using discrete event simulation (SimPy). Here’s what it simulates:

Server Model:

  • Base latency: Time to process a request with no contention
  • Concurrency slope: Additional latency per concurrent request (coordination cost)
  • Capacity: Maximum concurrent requests before queueing
# Latency grows linearly with active requests
def current_latency(self):
    return self.base_latency + (self.active_requests * self.concurrency_slope)

Client Model:

  • Timeout threshold: When to give up on a request
  • Max retries: How many times to retry
  • Backoff strategy: Exponential backoff with jitter (configurable)

Load Patterns:

  • Constant: Steady baseline load
  • Spike: Sudden increase for a duration, then back to baseline
  • Ramp: Gradual increase and decrease

Key Parameters to Experiment With:

| Parameter | What It Tests | Typical Values |
| --- | --- | --- |
| server_capacity | How many concurrent requests before queueing | 20-100 |
| base_latency | Processing time without contention | 0.1-1.0s |
| concurrency_slope | Coordination overhead per request | 0.001-0.05s |
| timeout | When clients give up | 1-10s |
| max_retries | Retry attempts before failure | 0-5 |
| backoff_enabled | Whether to add jitter and delays | True/False |

What You Can Learn:

  1. Trigger a metastable failure: Set spike load high, timeout low, disable backoff → watch P99 latency stay high after spike ends
  2. See recovery with backoff: Same scenario but enable exponential backoff → system recovers when spike ends
  3. Understand the tipping point: Gradually increase concurrency slope → observe when retry amplification begins
  4. Test admission control: Set low server capacity → see benefit of failing fast vs queueing

The simulator tracks success rate, retry count, timeout count, and latency percentiles over time, letting you see exactly when the system tips into metastability and whether it recovers. With this simulator you can validate various prevention strategies such as:

  • Exponential backoff with jitter spreads retries over time
  • Adaptive retry budgets limit total fleet-wide retries
  • Circuit breakers detect patterns and stop retry storms
  • Load shedding rejects requests before queues explode
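
The first strategy is worth spelling out because it is so often implemented without the jitter. Here is a minimal sketch of capped exponential backoff with full jitter; the constants are illustrative, not the simulator’s defaults:

import random
import time

def backoff_delay(attempt, base=0.1, cap=10.0):
    """Full jitter: sleep a random amount between 0 and min(cap, base * 2^attempt)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(fn, max_retries=3):
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(backoff_delay(attempt))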

Part 2: CAP and PACELC

The CAP theorem correctly states that during network partitions, you must choose between consistency and availability. However, as Daniel Abadi and others have pointed out, this only addresses partition scenarios. Most systems spend 99.99% of their time in normal operation, where the real tradeoff is between latency and consistency. This is where PACELC comes in:

  • If Partition happens: choose Availability or Consistency
  • Else (normal operation): choose Latency or Consistency

PACELC provides a more complete framework for understanding real-world distributed databases:

PA/EL Systems (DynamoDB, Cassandra, Riak):

  • Partition → Choose Availability (serve stale data)
  • Normal → Choose Latency (1-2ms reads from any replica)
  • Use when: Shopping carts, session stores, high write throughput needed

PC/EC Systems (Google Spanner, VoltDB, HBase):

  • Partition → Choose Consistency (reject operations)
  • Normal → Choose Consistency (5-100ms for quorum coordination)
  • Use when: Financial transactions, inventory, anything that can’t be wrong

PA/EC Systems (MongoDB):

  • Partition → Choose Availability (with caveats – unreplicated writes go to rollback)
  • Normal → Choose Consistency (strong reads/writes in baseline)
  • Use when: Mixed workloads with mostly consistent needs

PC/EL Systems (PNUTS):

  • Partition → Choose Consistency
  • Normal → Choose Latency (async replication)
  • Use when: Read-heavy with timeline consistency acceptable

Quorum Consensus: Strong Consistency with Coordination

When R + W > N (read quorum + write quorum > total replicas), the read and write sets must overlap in at least one node. This overlap ensures that any read sees at least one node with the latest write, providing linearizability.

Example with N=5, R=3, W=3:

  • Write to replicas {1, 2, 3}
  • Read from replicas {2, 3, 4}
  • Overlap at {2, 3} guarantees we see the latest value
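
A quick way to build intuition for the overlap argument is to sample random read and write quorums and check that they always intersect whenever R + W > N. This tiny standalone sketch is not part of the simulator:

import random

def quorums_always_overlap(n=5, r=3, w=3, trials=10_000):
    """Pigeonhole check: any write quorum of size w and read quorum of size r
    must share at least one node when r + w > n."""
    nodes = range(n)
    return all(
        set(random.sample(nodes, w)) & set(random.sample(nodes, r))
        for _ in range(trials)
    )

print(quorums_always_overlap())  # True for N=5, R=3, W=3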

Critical Nuances:

R + W > N alone is NOT sufficient for linearizability in practice. You need additional mechanisms: readers must perform read repair synchronously before returning results, and writers must read the latest state from a quorum before writing. “Last write wins” based on wall-clock time breaks linearizability due to clock skew. Sloppy quorums like those used in Dynamo are NOT linearizable because the nodes in the quorum can change during failures. Even R = W = N doesn’t guarantee consistency if cluster membership changes. Google Spanner uses atomic clocks and GPS to achieve strong consistency globally, with TrueTime API providing less than 1ms clock uncertainty at the 99th percentile as of 2023.

How the Simulator Works

The CAP/PACELC simulator lets you explore these tradeoffs by configuring different consistency models and observing their behavior during normal operation and network partitions.

System Model:

  • N replica nodes, each with local storage
  • Configurable schema for data (to test compatibility)
  • Network latency between nodes (WAN vs LAN)
  • Optional partition mode (splits cluster)

Consistency Levels:

  1. Strong (R+W>N): Quorum reads and writes, linearizable
  2. Linearizable (R=W=N): All nodes must respond, highest consistency
  3. Weak (R=1, W=1): Single node, eventual consistency
  4. Eventual: Async replication, high availability
def get_quorum_size(self, operation_type):
    if self.consistency_level == ConsistencyLevel.STRONG:
        return (self.n_nodes // 2) + 1  # Majority
    elif self.consistency_level == ConsistencyLevel.LINEARIZABLE:
        return self.n_nodes  # All nodes
    elif self.consistency_level == ConsistencyLevel.WEAK:
        return 1  # Any node

Key Parameters:

| Parameter | What It Tests | Impact |
| --- | --- | --- |
| n_nodes | Replica count | More nodes = more fault tolerance but higher coordination cost |
| consistency_level | Strong/Eventual/etc | Directly controls latency vs consistency tradeoff |
| base_latency | Node processing time | Baseline performance |
| network_latency | Inter-node delay | WAN (50-150ms) vs LAN (1-10ms) dramatically affects quorum cost |
| partition_active | Network partition | Tests CAP behavior (A vs C during partition) |
| write_ratio | Read/write mix | Write-heavy shows coordination bottleneck |

What You Can Learn:

  1. Latency cost of consistency:
    • Run with Strong (R=3,W=3) at network_latency=5ms → ~15ms operations
    • Same at network_latency=100ms → ~300ms operations
    • Switch to Weak (R=1,W=1) → single-digit milliseconds regardless
  2. CAP during partitions:
    • Enable partition with Strong consistency → operations fail (choosing C over A)
    • Enable partition with Eventual → stale reads but available (choosing A over C)
  3. Quorum size tradeoffs:
    • Linearizable (R=W=N) → single node failure breaks everything
    • Strong (R=W=3 of N=5) → can tolerate 2 node failures
    • Measure failure rate vs consistency guarantees
  4. Geographic distribution:
    • Network latency 10ms (same datacenter) → quorum cost moderate
    • Network latency 150ms (cross-continent) → quorum cost severe
    • Observe when you should use eventual consistency for geo-distribution

The simulator tracks write/read latencies, inconsistent reads, failed operations, and success rates, giving you quantitative data on the tradeoffs.

Key Insights from Simulation

The simulator reveals that most architectural decisions are driven by normal operation latency, not partition handling. If you’re building a global system with 150ms cross-region latency, strong consistency means every operation takes 150ms+ for quorum coordination. That’s often unacceptable for user-facing features. This is why hybrid approaches are becoming standard: use strong consistency for critical invariants (financial transactions, inventory), eventual consistency for everything else (user profiles, preferences).


Part 3: CRDTs

CRDTs (Conflict-Free Replicated Data Types) provide strong eventual consistency (SEC) through mathematical guarantees, not probabilistic convergence. They work without coordination, consensus, or concurrency control. CRDTs rely on operations being commutative (order doesn’t matter), merge functions being associative and idempotent (forming a semilattice), and updates being monotonic according to a partial order.

Example: G-Counter (Grow-Only Counter)

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count
    
    def increment(self, amount=1):
        # Each replica tracks its own increments
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount
    
    def value(self):
        # Total is sum of all replicas
        return sum(self.counts.values())
    
    def merge(self, other):
        # Take max of each replica's count
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

Why this works:

  • Each replica only increments its own counter (no conflicts)
  • Merge takes max (idempotent: max(a,a) = a)
  • Order doesn’t matter: max(max(a,b),c) = max(a,max(b,c))
  • Eventually all replicas see all increments → convergence

CRDT Types

There are two main approaches: State-based CRDTs (CvRDTs) send full local state and require merge functions to be commutative, associative, and idempotent. Operation-based CRDTs (CmRDTs) transmit only update operations and require reliable delivery in causal order. Delta-state CRDTs combine the advantages by transmitting compact deltas.

Four CRDTs in the Simulator:

  1. G-Counter: Increment only, perfect for metrics
  2. PN-Counter: Increment and decrement (two G-Counters)
  3. OR-Set: Add/remove elements, concurrent add wins
  4. LWW-Map: Last-write-wins with timestamps
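
The PN-Counter in that list, for instance, can be sketched as two of the G-Counters shown earlier, one tracking increments and one tracking decrements (class and method names here are illustrative rather than the simulator’s exact API):

class PNCounter:
    """Increment/decrement counter built from two grow-only counters."""
    def __init__(self, replica_id):
        self.inc = GCounter(replica_id)  # all increments
        self.dec = GCounter(replica_id)  # all decrements

    def increment(self, amount=1):
        self.inc.increment(amount)

    def decrement(self, amount=1):
        self.dec.increment(amount)

    def value(self):
        return self.inc.value() - self.dec.value()

    def merge(self, other):
        self.inc.merge(other.inc)
        self.dec.merge(other.dec)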

Production systems using CRDTs include Redis Enterprise (CRDBs), Riak, Azure Cosmos DB for distributed data types, and Automerge/Yjs for collaborative editing like Google Docs. SoundCloud uses CRDTs in their audio distribution platform.

Important Limitations

CRDTs only provide eventual consistency, NOT strong consistency or linearizability. Different replicas can see concurrent operations in different orders temporarily. Not all operations are naturally commutative, and CRDTs cannot solve problems requiring atomic coordination like preventing double-booking without additional mechanisms.

The “Shopping Cart Problem”: You can use an OR-Set for shopping cart items, but if two clients concurrently remove the same item, your naive implementation might remove both. The CRDT guarantees convergence to a consistent state, but that state might not match user expectations.

Byzantine fault tolerance is also a concern as traditional CRDTs assume all devices are trustworthy. Malicious devices can create permanent inconsistencies.

How the Simulator Works

The CRDT simulator demonstrates convergence through gossip-based replication. You can watch replicas diverge and converge as they exchange state.

Simulation Model:

  • Multiple replica nodes, each with independent CRDT state
  • Operations applied to random replicas (simulating distributed clients)
  • Periodic “merges” (gossip protocol) with probability merge_probability
  • Network delay between merges
  • Tracks convergence: do all replicas have identical state?

CRDT Implementations: Each CRDT type has its own semantics:

# G-Counter: Each replica has its own count, merge takes max
def merge(self, other):
    for replica_id, count in other.counts.items():
        self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

# OR-Set: Elements have unique tags, add always beats remove
def add(self, element, unique_tag):
    self.elements[element].add(unique_tag)

def remove(self, element, observed_tags):
    self.elements[element] -= observed_tags  # Only remove what was observed

# LWW-Map: Latest timestamp wins
def set(self, key, value, timestamp):
    current = self.entries.get(key)
    if current is None or timestamp > current[1]:
        self.entries[key] = (value, timestamp, self.replica_id)

Key Parameters:

| Parameter | What It Tests | Values |
| --- | --- | --- |
| crdt_type | Different convergence semantics | G-Counter, PN-Counter, OR-Set, LWW-Map |
| n_replicas | Number of nodes | 2-8 |
| n_operations | Total updates | 10-100 |
| merge_probability | Gossip frequency | 0.0-1.0 |
| network_delay | Time for state exchange | 0.0-2.0s |

What You Can Learn:

  1. Convergence speed:
    • Set merge_probability=0.1 → slow convergence, replicas stay diverged
    • Set merge_probability=0.8 → fast convergence
    • Understand gossip frequency vs consistency window tradeoff
  2. OR-Set semantics:
    • Watch concurrent add/remove → add wins
    • See how unique tags prevent unintended deletions
    • Compare with naive set implementation
  3. LWW-Map data loss:
    • Two replicas set same key concurrently with different values
    • One value “wins” based on timestamp (or replica ID tie-break)
    • Data loss is possible – not suitable for all use cases
  4. Network partition tolerance:
    • Low merge probability simulates partition
    • Replicas diverge but operations still succeed (AP in CAP)
    • After “partition heals” (merges resume), all converge
    • No coordination needed, no operations failed

The simulator visually shows replica states over time and convergence status, making abstract CRDT theory concrete.

Key Insights from Simulation

CRDTs trade immediate consistency for availability and partition tolerance. The theoretical guarantees are proven: if all replicas receive all updates (eventual delivery), they will converge to the same state (strong convergence).

But the simulator reveals the practical challenges:

  • Merge semantics don’t always match user intent (LWW can lose data)
  • Tombstones can grow indefinitely (OR-Set needs garbage collection)
  • Causal ordering adds complexity (need vector clocks for some CRDTs)
  • Not suitable for operations requiring coordination (uniqueness constraints, atomic updates)

When to use CRDTs:

  • High-write distributed counters (page views, analytics)
  • Collaborative editing (where eventual consistency is acceptable)
  • Offline-first applications (sync when online)
  • Shopping carts (with careful semantic design)

When NOT to use CRDTs:

  • Bank account balances (need atomic transactions)
  • Inventory (can’t prevent overselling without coordination)
  • Unique constraints (usernames, reservation systems)
  • Access control (need immediate consistency)

Part 4: Cross-System Interaction (CSI) Failures

Research from EuroSys 2023 found that 20% of catastrophic cloud incidents and 37% of failures in major open-source distributed systems are CSI failures – where both systems work correctly in isolation but fail when connected. This is the NASA Mars Climate Orbiter problem: one team used metric units, another used imperial. Both systems worked perfectly. The spacecraft burned up in Mars’s atmosphere because of their interaction.

Why CSI Failures Are Different

Not dependency failures: The downstream system is available, it just can’t process what upstream sends.

Not library bugs: Libraries are single-address-space and well-tested. CSI failures cross system boundaries where testing is expensive.

Not component failures: Each system passes its own test suite. The bug only emerges through interaction.

CSI failures manifest across three planes: Data plane (51% – schema/metadata mismatches), Management plane (32% – configuration incoherence), and Control plane (17% – API semantic violations).

For example, a study of the Apache Spark-Hive integration found 15 distinct discrepancies in simple write-read testing: Hive stored timestamps as long (milliseconds since epoch) while Spark expected a Timestamp type. Both worked in isolation and failed when integrated. Another example is the Kafka-Flink encoding mismatch: Kafka set compression.type=lz4, but Flink couldn’t decompress due to an old LZ4 library. The configuration was silently ignored in Flink, leading to data corruption for two weeks before detection.

Why Testing Doesn’t Catch CSI Failures

Analysis of Spark found only 6% of integration tests actually test cross-system interaction. Most “integration tests” test multiple components of the same system. Cross-system testing is expensive and often skipped. The problem compounds with modern architectures:

  • Microservices: More system boundaries to test
  • Multi-cloud: Different clouds with different semantics
  • Serverless: Fine-grained composition increases interaction surface area

How the Simulator Works

The CSI failure simulator models two systems exchanging data, with configurable discrepancies in schemas, encodings, and configurations.

System Model:

  • Two systems (upstream → downstream)
  • Each has its own schema definition (field types, encoding, nullable fields)
  • Each has its own configuration (timeouts, retry counts, etc.)
  • Data flows from System A to System B with potential conversion failures

Failure Scenarios:

  1. Metadata Mismatch (Hive/Spark):
    • System A: timestamp: long
    • System B: timestamp: Timestamp
    • Failure: Type coercion fails ~30% of the time
  2. Schema Conflict (Producer/Consumer):
    • System A: encoding: latin-1
    • System B: encoding: utf-8
    • Failure: Silent data corruption
  3. Configuration Incoherence (ServiceA/ServiceB):
    • System A: max_retries=3, timeout=30s
    • System B expects: max_retries=5, timeout=60s
    • Failure: ~40% of requests fail due to premature timeout
  4. API Semantic Violation (Upstream/Downstream):
    • Upstream assumes: synchronous, thread-safe
    • Downstream is: asynchronous, not thread-safe
    • Failure: Race conditions, out-of-order processing
  5. Type Confusion (SystemA/SystemB):
    • System A: amount: float
    • System B: amount: decimal
    • Failure: Precision loss in financial calculations

Implementation Details:

class DataSchema:
    def __init__(self, schema_id, fields, encoding, nullable_fields):
        self.fields = fields  # field_name -> type
        self.encoding = encoding
        
    def is_compatible(self, other):
        # Check field types and encoding
        return (self.fields == other.fields and 
                self.encoding == other.encoding)

class DataRecord:
    def serialize(self, target_schema):
        # Attempt type coercion
        for field, value in self.data.items():
            expected_type = target_schema.fields[field]
            actual_type = self.schema.fields[field]
            
            if expected_type != actual_type:
                # 30% failure on type mismatch (simulating real world)
                if random.random() < 0.3:
                    return None  # Serialization failure
        
        # Check encoding compatibility
        if self.schema.encoding != target_schema.encoding:
            if random.random() < 0.2:  # 20% silent corruption
                return None

        # Compatible in this simplified model: pass the payload through unchanged
        return self.data

Key Parameters:

| Parameter | What It Tests |
| --- | --- |
| failure_scenario | Type of CSI failure (metadata, schema, config, API, type) |
| duration | Simulation length |
| request_rate | Load (requests per second) |

The simulator doesn’t have many tunable parameters because CSI failures are about specific incompatibilities, not gradual degradation. Each scenario models a real-world pattern.

What You Can Learn:

  1. Failure rates: CSI failures often manifest in 20-40% of requests (not 100%)
    • Some requests happen to have compatible data
    • Makes debugging harder (intermittent failures)
  2. Failure location:
    • Research shows 69% of CSI fixes go in the upstream system, often in connector modules that are less than 5% of the codebase
    • Simulator shows which system fails (usually downstream)
  3. Silent vs loud failures:
    • Type mismatches often crash (loud, easy to detect)
    • Encoding mismatches corrupt silently (hard to detect)
    • Config mismatches cause intermittent timeouts
  4. Prevention effectiveness:
    • Schema registry eliminates metadata mismatches
    • Configuration validation catches config incoherence
    • Contract testing prevents API semantic violations

Key Insights from Simulation

The simulator demonstrates that cross-system integration testing is essential but often skipped. Unit tests of each system won’t catch these failures.

Prevention strategies validated by simulation:

  1. Write-Read Testing: Write with System A, read with System B, verify integrity
  2. Schema Registry: Single source of truth for data schemas, enforced across systems
  3. Configuration Coherence Checking: Validate that shared configs match
  4. Contract Testing: Explicit, machine-checkable API contracts
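
The first of these is the cheapest to adopt: a write-read test is just a round trip across the system boundary with an equality check at the end. Here is a pytest-style sketch, where upstream_client and downstream_client are hypothetical stand-ins for your real connectors:

def test_write_read_roundtrip(upstream_client, downstream_client):
    """Write with System A, read back with System B, and verify nothing was
    coerced, re-encoded, or truncated along the way."""
    record = {
        "customer_id": 123,
        "amount": "19.99",                     # string avoids float precision loss
        "created_at": "2025-01-01T00:00:00Z",  # explicit, timezone-aware timestamp
    }
    upstream_client.write("orders", record)
    assert downstream_client.read("orders", key=123) == record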

Hybrid Consistency Models

Modern systems increasingly use mixed consistency: RedBlue Consistency (2012) marks operations as needing strong consistency (red) or eventual consistency (blue). Replicache (2024) has the server assign the final total order while clients do optimistic local updates with rebase. For example, consider a calendar application:

# Strong consistency for room reservations (prevent double-booking)
def book_conference_room(room_id, time_slot):
    with transaction(consistency='STRONG'):
        if room.is_available(time_slot):
            room.book(time_slot)
            return True
        return False

# CRDTs for collaborative editing (participant lists, notes)
def update_meeting_notes(meeting_id, notes):
    # LWW-Map CRDT, eventual consistency
    meeting.notes.merge(notes)

# Eventual consistency for preferences
def update_user_calendar_color(user_id, color):
    # Who cares if this propagates slowly?
    user_prefs[user_id] = color

Recent theoretical work on the CALM theorem proves that coordination-free consistency is achievable for certain problem classes. Research in 2025 provided mathematical definitions of when coordination is and isn’t required, separating coordination from computation.

What the Simulators Teach Us

Running all four simulators reveals the consistency spectrum:

No “best” consistency model exists:

  • Quorums are best when you need linearizability and can tolerate latency
  • CRDTs are best when you need high availability and can tolerate eventual consistency
  • Neither approach “bypasses” CAP – they make different tradeoffs
  • Real systems use hybrid models with different consistency for different operations

Practical Lessons

1. Design for Recovery, Not Just Prevention

The metastable failure simulator shows you can’t prevent all failures. Your retry logic, backoff strategy, and circuit breakers are more important than your happy path code. Validated strategies include:

  • Exponential backoff with jitter (spread retries over time)
  • Adaptive retry budgets (limit total fleet-wide retries)
  • Circuit breakers (detect patterns, stop storms)
  • Load shedding (fail fast rather than queue to death)
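
Backoff was sketched earlier; adaptive retry budgets are just as simple in principle. Here is an illustrative version that only permits retries while they stay under a fixed fraction of observed requests (the 10% ratio is a common rule of thumb, not a value taken from the simulator):

class RetryBudget:
    """Allow retries only while they remain a small fraction of total requests."""
    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend_retry(self):
        # Permit the retry only if the fleet-wide retry ratio stays under budget
        if self.retries + 1 <= self.ratio * max(self.requests, 1):
            self.retries += 1
            return True
        return False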

2. Understand the Consistency Spectrum

The CAP/PACELC simulator demonstrates that consistency is not binary. You need to understand:

  • What consistency level do you actually need? (Most operations don’t need linearizability)
  • What’s the latency cost? (Quorum reads in cross-region deployment can be 100x slower)
  • What happens during partitions? (Can you sacrifice availability or must you serve stale data?)

Decision framework:

  • Use strong consistency for: money, inventory, locks, compliance
  • Use eventual consistency for: feeds, catalogs, analytics, caches
  • Use hybrid models for: most real-world applications

3. Test Cross-System Interactions

The CSI failure simulator reveals that 86% of fixes go into connector modules that are less than 5% of your codebase. This is where bugs hide. Essential tests include:

  • Write-read tests (write with System A, read with System B)
  • Round-trip tests (serialize/deserialize across boundaries)
  • Version compatibility matrix (test combinations)
  • Schema validation (machine-checkable contracts)

4. Leverage CRDTs Where Appropriate

The CRDT simulator shows that conflict-free convergence is possible for specific problem types. But you need to:

  • Understand the semantic limitations (LWW can lose data)
  • Design merge behavior carefully (does it match user intent?)
  • Handle garbage collection (tombstones, vector clocks)
  • Accept eventual consistency (not suitable for all use cases)

5. Monitor for Sustaining Effects

Metastability, retry storms, and goodput collapse are self-sustaining failure modes. They persist after the trigger is gone. Critical metrics include:

  • P99 latency vs timeout threshold (approaching timeout = danger)
  • Retry rate vs success rate (high retries = storm risk)
  • Queue depth (unbounded growth = admission control needed)
  • Goodput vs throughput (doing useful work vs spinning)

Using the Simulators

All four simulators are available at: https://github.com/bhatti/simulators

Installation

git clone https://github.com/bhatti/simulators
cd simulators
pip install -r requirements.txt

Requirements:

  • Python 3.7+
  • streamlit (web UI)
  • simpy (discrete event simulation)
  • plotly (interactive visualizations)
  • numpy, pandas (data analysis)

Running Individual Simulators

# Metastable failure simulator
streamlit run metastable_simulator.py

# CAP/PACELC consistency simulator
streamlit run cap_consistency_simulator.py

# CRDT simulator
streamlit run crdt_simulator.py

# CSI failure simulator
streamlit run csi_failure_simulator.py

Running All Simulators

python run_all_simulators.py

Conclusion

Building distributed systems means confronting failure modes that are expensive or impossible to reproduce in real environments:

  • Metastable failures require specific load patterns and timing
  • Consistency tradeoffs need multi-region deployments to observe
  • CRDT convergence requires orchestrating concurrent operations across replicas
  • CSI failures need exact schema/config mismatches that don’t exist in test environments

Simulators bridge the gap between theoretical understanding and practical intuition:

  1. Cheaper than production testing: No cloud costs, no multi-region setup, instant feedback
  2. Safer than production experiments: Crash the simulator, not your service
  3. More complete than unit tests: See emergent behaviors, not just component correctness
  4. Faster iteration: Tweak parameters, re-run in seconds, build intuition through experimentation

What You Can’t Learn Without Simulation

  • When does retry amplification tip into metastability? (Depends on coordination slope, timeout, backoff)
  • How much does quorum coordination actually cost? (Depends on network latency, replica count, workload)
  • Do your CRDT semantics match user expectations? (Depends on merge behavior, conflict resolution)
  • Will your schema changes break integration? (Depends on type coercion, encoding, version skew)

The goal isn’t to prevent all failures; that’s impossible. The goal is to understand, anticipate, and recover from the failures that will inevitably occur.


References

Key research papers and resources used in this post:

  1. AWS Metastability Research (HotOS 2025) – Sustaining effects and goodput collapse
  2. Marc Brooker on DSQL – Practical distributed SQL considerations
  3. James Hamilton on Reliable Systems – Large-scale system design
  4. CSI Failures Study (EuroSys 2023) – Cross-system interaction failures
  5. PACELC Framework – Beyond CAP theorem
  6. Marc Brooker on CAP – CAP theorem revisited
  7. Anna CRDT Database – Autoscaling with CRDTs
  8. Linearizability Paper – Herlihy & Wing’s foundational work
  9. Designing Data-Intensive Applications by Martin Kleppmann
  10. Distributed Systems Reading Group – MIT CSAIL
  11. Jepsen.io – Kyle Kingsbury’s consistency testing
  12. Aphyr’s blog – Distributed systems deep dives

November 7, 2025

Three Decades of Remote Calls: My Journey from COBOL Mainframes to AI Agents

Filed under: Computing,Web Services — admin @ 9:50 pm

Introduction

I started writing network code in the early 1990s on IBM mainframes, armed with nothing but Assembly and COBOL. Today, I build distributed AI agents using gRPC, RAG pipelines, and serverless functions. Between these worlds lie decades of technological evolution and an uncomfortable realization: we keep relearning the same lessons. Over the years, I’ve seen simple ideas triumph over complex ones. The technology keeps changing, but the problems stay the same. Network latency hasn’t improved relative to CPU speed. Distributed systems are still hard. Complexity still kills projects. And every new generation has to learn that abstractions leak. I’ll show you the technologies I’ve used, the mistakes I’ve made, and most importantly, what the past teaches us about building better systems in the future.

The Mainframe Era

CICS and 3270 Terminals

I started my career on IBM mainframes running CICS, which was used to build online applications accessed through 3270 “green screen” terminals. It used the LU6.2 (Logical Unit 6.2) protocol, part of IBM’s Systems Network Architecture (SNA), to provide peer-to-peer communication. Here’s what a typical CICS application looked like in COBOL:

IDENTIFICATION DIVISION.
PROGRAM-ID. CUSTOMER-INQUIRY.

DATA DIVISION.
WORKING-STORAGE SECTION.
01  CUSTOMER-REC.
    05  CUST-ID        PIC 9(8).
    05  CUST-NAME      PIC X(30).
    05  CUST-BALANCE   PIC 9(7)V99.

LINKAGE SECTION.
01  DFHCOMMAREA.
    05  COMM-CUST-ID   PIC 9(8).

PROCEDURE DIVISION.
    EXEC CICS
        RECEIVE MAP('CUSTMAP')
        MAPSET('CUSTSET')
        INTO(CUSTOMER-REC)
    END-EXEC.
    
    EXEC CICS
        READ FILE('CUSTFILE')
        INTO(CUSTOMER-REC)
        RIDFLD(COMM-CUST-ID)
    END-EXEC.
    
    EXEC CICS
        SEND MAP('RESULTMAP')
        MAPSET('CUSTSET')
        FROM(CUSTOMER-REC)
    END-EXEC.
    
    EXEC CICS RETURN END-EXEC.

The CICS environment handled all the complexity—transaction management, terminal I/O, file access, and inter-system communication. For the user interface, I used Basic Mapping Support (BMS), which was notoriously finicky. You had to define screen layouts in a rigid format specifying exactly where each field appeared on the 24×80 character grid:

CUSTMAP  DFHMSD TYPE=&SYSPARM,                                    X
               MODE=INOUT,                                        X
               LANG=COBOL,                                        X
               CTRL=FREEKB
         DFHMDI SIZE=(24,80)
CUSTID   DFHMDF POS=(05,20),                                      X
               LENGTH=08,                                         X
               ATTRB=(UNPROT,NUM),                                X
               INITIAL='________'
CUSTNAME DFHMDF POS=(07,20),                                      X
               LENGTH=30,                                         X
               ATTRB=PROT

This was so painful that I wrote my own tool to convert simple text-based UI templates into BMS format. Looking back, this was my first foray into creating developer tools. The key lesson I took from the mainframe era was that developer experience matters: cumbersome tools slow down development and introduce errors.

Moving to UNIX

Berkeley Sockets

After a couple of years on mainframes, it was clear they were already in decline, so I transitioned to C and UNIX systems, which I had studied in college. I learned Berkeley Sockets, which were far more powerful and gave you complete control over the network. Here’s a simple TCP server in C using Berkeley Sockets:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PORT 8080
#define BUFFER_SIZE 1024

int main() {
    int server_fd, client_fd;
    struct sockaddr_in server_addr, client_addr;
    socklen_t client_len = sizeof(client_addr);
    char buffer[BUFFER_SIZE];
    
    // Create socket
    server_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (server_fd < 0) {
        perror("socket failed");
        exit(EXIT_FAILURE);
    }
    
    // Set socket options to reuse address
    int opt = 1;
    if (setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, 
                   &opt, sizeof(opt)) < 0) {
        perror("setsockopt failed");
        exit(EXIT_FAILURE);
    }
    
    // Bind to address
    memset(&server_addr, 0, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_addr.s_addr = INADDR_ANY;
    server_addr.sin_port = htons(PORT);
    
    if (bind(server_fd, (struct sockaddr *)&server_addr, 
             sizeof(server_addr)) < 0) {
        perror("bind failed");
        exit(EXIT_FAILURE);
    }
    
    // Listen for connections
    if (listen(server_fd, 10) < 0) {
        perror("listen failed");
        exit(EXIT_FAILURE);
    }
    
    printf("Server listening on port %d\n", PORT);
    
    while (1) {
        // Accept connection
        client_fd = accept(server_fd, 
                          (struct sockaddr *)&client_addr, 
                          &client_len);
        if (client_fd < 0) {
            perror("accept failed");
            continue;
        }
        
        // Read request
        ssize_t bytes_read = recv(client_fd, buffer, 
                                  BUFFER_SIZE - 1, 0);
        if (bytes_read > 0) {
            buffer[bytes_read] = '\0';
            printf("Received: %s\n", buffer);
            
            // Send response
            const char *response = "Message received\n";
            send(client_fd, response, strlen(response), 0);
        }
        
        close(client_fd);
    }
    
    close(server_fd);
    return 0;
}

As you can see, you had to handle a lot of housekeeping: socket creation, binding, listening, accepting, reading, writing, and meticulous error handling at every step. Memory management was entirely manual: forget to close() a file descriptor and you’d leak resources; make a mistake with recv() buffer sizes and you’d overflow a buffer. I also experimented with Fast Sockets from UC Berkeley, which used kernel bypass techniques for lower latency and better performance.

The key lesson I learned was that low-level control comes at a steep cost. The cognitive load of managing these details makes it nearly impossible to focus on business logic.

Sun RPC and XDR

While working for a physics lab whose large computing facility consisted of Sun workstations running Solaris on SPARC processors, I discovered Sun RPC (Remote Procedure Call) with XDR (External Data Representation). XDR solved a critical problem: how do you exchange data between machines with different architectures? A SPARC processor uses big-endian byte ordering, while x86 uses little-endian. XDR provided a canonical, architecture-neutral format for representing data. Here’s an XDR definition file (types.x):

/* Define a structure for customer data */
struct customer {
    int customer_id;
    string name<30>;
    float balance;
};

/* Define the RPC program */
program CUSTOMER_PROG {
    version CUSTOMER_VERS {
        int ADD_CUSTOMER(customer) = 1;
        customer GET_CUSTOMER(int) = 2;
    } = 1;
} = 0x20000001;

You’d run rpcgen on this file:

$ rpcgen types.x

This generated the client stub, server stub, and XDR serialization code automatically. Here’s what the server implementation looked like:

#include "types.h"

int *add_customer_1_svc(customer *cust, struct svc_req *rqstp) {
    static int result;
    
    // Add customer to database
    printf("Adding customer: %s (ID: %d)\n", 
           cust->name, cust->customer_id);
    
    result = 1;  // Success
    return &result;
}

customer *get_customer_1_svc(int *cust_id, struct svc_req *rqstp) {
    static customer result;
    
    // Fetch from database
    result.customer_id = *cust_id;
    result.name = strdup("John Doe");
    result.balance = 1000.50;
    
    return &result;
}

And the client:

#include "types.h"

int main(int argc, char *argv[]) {
    CLIENT *clnt;
    customer cust;
    int *result;
    
    clnt = clnt_create("localhost", CUSTOMER_PROG, 
                       CUSTOMER_VERS, "tcp");
    if (clnt == NULL) {
        clnt_pcreateerror("localhost");
        exit(1);
    }
    
    // Call remote procedure
    cust.customer_id = 123;
    cust.name = "Alice Smith";
    cust.balance = 5000.00;
    
    result = add_customer_1(&cust, clnt);
    if (result == NULL) {
        clnt_perror(clnt, "call failed");
    }
    
    clnt_destroy(clnt);
    return 0;
}

This was my first introduction to Interface Definition Languages (IDL) and I found that defining the contract once and generating code automatically reduces errors. This pattern would reappear in CORBA, Protocol Buffers, and gRPC.

Parallel Computing

During my graduate and post-graduate studies in the mid-1990s, while working full time, I researched parallel and distributed computing. I worked with MPI (Message Passing Interface) and IBM’s MPL on SP1/SP2 systems. MPI provided collective operations like broadcast, scatter, gather, and reduce (a precursor to Hadoop-style map/reduce). Here’s a simple MPI example that computes the sum of an array in parallel:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ARRAY_SIZE 1000

int main(int argc, char** argv) {
    int rank, size;
    int data[ARRAY_SIZE];
    int local_sum = 0, global_sum = 0;
    int chunk_size, start, end;
    
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    
    // Initialize data on root
    if (rank == 0) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            data[i] = i + 1;
        }
    }
    
    // Broadcast data to all processes
    MPI_Bcast(data, ARRAY_SIZE, MPI_INT, 0, MPI_COMM_WORLD);
    
    // Each process computes sum of its chunk
    chunk_size = ARRAY_SIZE / size;
    start = rank * chunk_size;
    end = (rank == size - 1) ? ARRAY_SIZE : start + chunk_size;
    
    for (int i = start; i < end; i++) {
        local_sum += data[i];
    }
    
    // Reduce all local sums to global sum
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, 
               MPI_SUM, 0, MPI_COMM_WORLD);
    
    if (rank == 0) {
        printf("Global sum: %d\n", global_sum);
    }
    
    MPI_Finalize();
    return 0;
}

For my post-graduate project, I built JavaNOW (Java on Networks of Workstations), which was inspired by Linda’s tuple spaces and MPI’s collective operations, but implemented in pure Java for portability. The key innovation was our Actor-inspired model. Instead of heavyweight processes communicating through message passing, I used lightweight Java threads with an Entity Space (distributed associative memory) where “actors” could put and get entities asynchronously. Here’s a simple example:

public class SumTask extends ActiveEntity {
    public Object execute(Object arg, JavaNOWAPI api) {
        Integer myId = (Integer) arg;
        EntitySpace workspace = new EntitySpace("RESULTS");
        
        // Compute partial sum
        int partialSum = 0;
        for (int i = myId * 100; i < (myId + 1) * 100; i++) {
            partialSum += i;
        }
        
        // Store result in EntitySpace
        return new Integer(partialSum);
    }
}

// Main application
public class ParallelSum extends JavaNOWApplication {
    public void master() {
        EntitySpace workspace = new EntitySpace("RESULTS");
        
        // Spawn parallel tasks
        for (int i = 0; i < 10; i++) {
            ActiveEntity task = new SumTask(new Integer(i));
            getJavaNOWAPI().eval(workspace, task, new Integer(i));
        }
        
        // Collect results
        int totalSum = 0;
        for (int i = 0; i < 10; i++) {
            Entity result = getJavaNOWAPI().get(
                workspace, new Entity(new Integer(i)));
            totalSum += ((Integer)result.getEntityValue()).intValue();
        }
        
        System.out.println("Total sum: " + totalSum);
    }
    
    public void slave(int id) {
        // Slave nodes wait for work
    }
}

Since then, I have seen the Actor model gain wide adoption. For example, today's serverless functions (AWS Lambda, Azure Functions, Google Cloud Functions) and modern frameworks like Akka, Orleans, and Dapr all embrace Actor-inspired patterns.

Novell and CGI

I also briefly worked with Novell’s IPX (Internetwork Packet Exchange) protocol, which had painful APIs. Here’s a taste of IPX socket programming (simplified):

#include <nwcalls.h>
#include <nwipxspx.h>

int main() {
    IPXAddress server_addr;
    IPXPacket packet;
    WORD socket_number = 0x4000;
    
    // Open IPX socket
    IPXOpenSocket(socket_number, 0);
    
    // Setup address
    memset(&server_addr, 0, sizeof(IPXAddress));
    memcpy(server_addr.network, target_network, 4);
    memcpy(server_addr.node, target_node, 6);
    server_addr.socket = htons(socket_number);
    
    // Send packet
    packet.packetType = 4;  // IPX packet type
    memcpy(packet.data, "Hello", 5);
    IPXSendPacket(socket_number, &server_addr, &packet);
    
    IPXCloseSocket(socket_number);
    return 0;
}

Early Web Development with CGI

When the web emerged in the early 1990s, I built applications using CGI (Common Gateway Interface) with Perl and C. I deployed these on the Apache HTTP Server, an early production-quality open source web server that quickly became the dominant web server of the late 1990s. Apache used process-driven concurrency: it forked a new process for each request or maintained a pool of pre-forked processes. CGI was conceptually simple: the web server launched a new UNIX process for every request, passing input via stdin and receiving output via stdout. Here's a simple Perl CGI script:

#!/usr/bin/perl
use strict;
use warnings;
use CGI;

my $cgi = CGI->new;

print $cgi->header('text/html');
print "<html><body>\n";
print "<h1>Hello from CGI!</h1>\n";

my $name = $cgi->param('name') || 'Guest';
print "<p>Welcome, $name!</p>\n";

# Simulate database query
my $user_count = 42;
print "<p>Total users: $user_count</p>\n";

print "</body></html>\n";

And in C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    char *query_string = getenv("QUERY_STRING");
    
    printf("Content-Type: text/html\n\n");
    printf("<html><body>\n");
    printf("<h1>CGI in C</h1>\n");
    
    if (query_string) {
        printf("<p>Query string: %s</p>\n", query_string);
    }
    
    printf("</body></html>\n");
    return 0;
}

Later, I migrated to more performant servers: Tomcat for Java servlets, Jetty as an embedded server, and Netty for building custom high-performance network applications. These servers used asynchronous I/O and lightweight threads (or even non-blocking event loops in Netty‘s case).

The key lesson I learned was that scalability matters. The CGI model's inability to maintain persistent connections or share state made it unsuitable for modern web applications. The shift from process-per-request to thread pools and then to async I/O represented fundamental improvements in how we handle concurrency.
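
To make that last step concrete, here is a minimal sketch of the async I/O model using Python's asyncio (my own illustration, not code from any of these servers): a single-threaded event loop multiplexes many connections instead of dedicating a process or thread to each request.

import asyncio

# One event loop serves many connections; no fork or thread per request
async def handle_client(reader, writer):
    data = await reader.readline()      # yields to the loop while waiting
    writer.write(b"Echo: " + data)
    await writer.drain()                # flush without blocking other clients
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle_client, "127.0.0.1", 8888)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())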

Java Adoption

When Java was released in 1995, I adopted it wholeheartedly. It freed developers from manual memory management and the malloc()/free() debugging sessions that came with it. Network programming became far more approachable:

import java.io.*;
import java.net.*;

public class SimpleServer {
    public static void main(String[] args) throws IOException {
        int port = 8080;
        
        try (ServerSocket serverSocket = new ServerSocket(port)) {
            System.out.println("Server listening on port " + port);
            
            while (true) {
                try (Socket clientSocket = serverSocket.accept();
                     BufferedReader in = new BufferedReader(
                         new InputStreamReader(clientSocket.getInputStream()));
                     PrintWriter out = new PrintWriter(
                         clientSocket.getOutputStream(), true)) {
                    
                    String request = in.readLine();
                    System.out.println("Received: " + request);
                    
                    out.println("Message received");
                }
            }
        }
    }
}

Java Threads

I had previously used pthreads in C, which were hard to use; Java's threading model was far simpler:

public class ConcurrentServer {
    public static void main(String[] args) throws IOException {
        ServerSocket serverSocket = new ServerSocket(8080);
        
        while (true) {
            Socket clientSocket = serverSocket.accept();
            
            // Spawn thread to handle client
            new Thread(new ClientHandler(clientSocket)).start();
        }
    }
    
    static class ClientHandler implements Runnable {
        private Socket socket;
        
        public ClientHandler(Socket socket) {
            this.socket = socket;
        }
        
        public void run() {
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()));
                 PrintWriter out = new PrintWriter(
                     socket.getOutputStream(), true)) {
                
                String request = in.readLine();
                // Process request
                out.println("Response");
                
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try { socket.close(); } catch (IOException e) {}
            }
        }
    }
}

Java’s synchronized keyword simplified thread-safe programming:

public class ThreadSafeCounter {
    private int count = 0;
    
    public synchronized void increment() {
        count++;
    }
    
    public synchronized int getCount() {
        return count;
    }
}

This was so much easier than managing mutexes, condition variables, and semaphores in C!

Java RMI: Remote Objects Made Easy

When Java added RMI (1997), distributed objects became practical. You could invoke methods on objects running on remote machines almost as if they were local. Define a remote interface:

import java.rmi.Remote;
import java.rmi.RemoteException;

public interface Calculator extends Remote {
    int add(int a, int b) throws RemoteException;
    int multiply(int a, int b) throws RemoteException;
}

Implement it:

import java.rmi.server.UnicastRemoteObject;
import java.rmi.RemoteException;

public class CalculatorImpl extends UnicastRemoteObject 
                            implements Calculator {
    
    public CalculatorImpl() throws RemoteException {
        super();
    }
    
    public int add(int a, int b) throws RemoteException {
        return a + b;
    }
    
    public int multiply(int a, int b) throws RemoteException {
        return a * b;
    }
}

Server:

import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;

public class Server {
    public static void main(String[] args) {
        try {
            LocateRegistry.createRegistry(1099);
            Calculator calc = new CalculatorImpl();
            Naming.rebind("Calculator", calc);
            System.out.println("Server ready");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Client:

import java.rmi.Naming;

public class Client {
    public static void main(String[] args) {
        try {
            Calculator calc = (Calculator) Naming.lookup(
                "rmi://localhost/Calculator");
            
            int result = calc.add(5, 3);
            System.out.println("5 + 3 = " + result);
            
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

I found RMI constraining: every interface had to extend Remote, every method had to declare RemoteException, and you were stuck with Java-to-Java communication. The key lesson I learned was that abstractions that feel natural to developers get adopted.

JINI: RMI with Service Discovery

At a travel booking company in the mid-2000s, I used JINI, which Sun Microsystems pitched as “RMI on steroids.” JINI extended RMI with automatic service discovery, leasing, and distributed events. The core idea: services could join a network, advertise themselves, and be discovered by clients without hardcoded locations. Here's a JINI service interface and registration:

import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.core.lookup.ServiceItem;
import net.jini.core.lookup.ServiceRegistrar;
import net.jini.core.lookup.ServiceRegistration;
import net.jini.discovery.DiscoveryEvent;
import net.jini.discovery.DiscoveryListener;
import net.jini.discovery.LookupDiscovery;
import net.jini.lease.LeaseRenewalManager;
import net.jini.lookup.entry.Name;
import java.rmi.Remote;
import java.rmi.RemoteException;

// Service interface
public interface BookingService extends Remote {
    String searchFlights(String origin, String destination) 
        throws RemoteException;
    boolean bookFlight(String flightId, String passenger) 
        throws RemoteException;
}

// Service provider
public class BookingServiceProvider implements DiscoveryListener {
    // Renews registration leases so the service stays discoverable
    private final LeaseRenewalManager leaseManager = new LeaseRenewalManager();
    
    public void discovered(DiscoveryEvent event) {
        ServiceRegistrar[] registrars = event.getRegistrars();
        
        for (ServiceRegistrar registrar : registrars) {
            try {
                BookingService service = new BookingServiceImpl();
                Entry[] attributes = new Entry[] {
                    new Name("FlightBookingService")
                };
                
                ServiceItem item = new ServiceItem(null, service, attributes);
                ServiceRegistration reg = registrar.register(
                    item, Lease.FOREVER);
                
                // Auto-renew lease
                leaseManager.renewUntil(reg.getLease(), Lease.FOREVER, null);
                
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

Client discovery and usage:

public class BookingClient implements DiscoveryListener {
    
    public void discovered(DiscoveryEvent event) {
        ServiceRegistrar[] registrars = event.getRegistrars();
        
        for (ServiceRegistrar registrar : registrars) {
            try {
                ServiceTemplate template = new ServiceTemplate(
                    null, new Class[] { BookingService.class }, null);
                
                ServiceItem item = registrar.lookup(template);
                
                if (item != null) {
                    BookingService booking = (BookingService) item.service;
                    String flights = booking.searchFlights("SFO", "NYC");
                    booking.bookFlight("FL123", "John Smith");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

JINI provided automatic discovery, leasing, and location transparency, but it was too complex and only supported the Java ecosystem. The ideas were sound and reappeared later in service registries like Consul and Eureka and in Kubernetes service discovery. I learned that service discovery is essential for dynamic systems, but the implementation must be simple.

CORBA

I used CORBA (Common Object Request Broker Architecture) for many years in the 1990s while building intelligent traffic systems. CORBA promised language-independent, platform-independent distributed objects: you could write a service in C++, invoke it from Java, and have clients in Python, all sharing the same IDL. Here's a simple CORBA IDL definition:

module TrafficMonitor {
    struct SensorData {
        long sensor_id;
        float speed;
        long timestamp;
    };
    
    typedef sequence<SensorData> SensorDataList;
    
    interface TrafficService {
        void reportData(in SensorData data);
        SensorDataList getRecentData(in long minutes);
        float getAverageSpeed();
    };
};

Run the IDL compiler:

$ idl traffic.idl

This generated client stubs and server skeletons for your target language. I built a message-oriented middleware (MOM) system with CORBA that collected traffic data from road sensors and provided real-time traffic information.

C++ server implementation:

#include "TrafficService_impl.h"
#include <iostream>
#include <vector>

class TrafficServiceImpl : public POA_TrafficMonitor::TrafficService {
private:
    std::vector<TrafficMonitor::SensorData> data_store;
    
public:
    void reportData(const TrafficMonitor::SensorData& data) {
        data_store.push_back(data);
        std::cout << "Received data from sensor " 
                  << data.sensor_id << std::endl;
    }
    
    TrafficMonitor::SensorDataList* getRecentData(CORBA::Long minutes) {
        TrafficMonitor::SensorDataList* result = 
            new TrafficMonitor::SensorDataList();
        
        // Filter data from last N minutes
        time_t cutoff = time(NULL) - (minutes * 60);
        for (const auto& entry : data_store) {
            if (entry.timestamp >= cutoff) {
                result->length(result->length() + 1);
                (*result)[result->length() - 1] = entry;
            }
        }
        return result;
    }
    
    CORBA::Float getAverageSpeed() {
        if (data_store.empty()) return 0.0;
        
        float sum = 0.0;
        for (const auto& entry : data_store) {
            sum += entry.speed;
        }
        return sum / data_store.size();
    }
};

Java client:

import org.omg.CORBA.*;
import TrafficMonitor.*;

public class TrafficClient {
    public static void main(String[] args) {
        try {
            // Initialize ORB
            ORB orb = ORB.init(args, null);
            
            // Get reference to service
            org.omg.CORBA.Object obj = 
                orb.string_to_object("corbaname::localhost:1050#TrafficService");
            TrafficService service = TrafficServiceHelper.narrow(obj);
            
            // Report sensor data
            SensorData data = new SensorData();
            data.sensor_id = 101;
            data.speed = 65.5f;
            data.timestamp = (int)(System.currentTimeMillis() / 1000);
            
            service.reportData(data);
            
            // Get average speed
            float avgSpeed = service.getAverageSpeed();
            System.out.println("Average speed: " + avgSpeed + " mph");
            
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

However, the CORBA specification was massive, and different ORB (Object Request Broker) implementations like Orbix, ORBacus, and TAO couldn’t reliably interoperate despite claiming CORBA compliance. The binary protocol, IIOP, had subtle incompatibilities. CORBA did introduce valuable concepts:

  • Interceptors for cross-cutting concerns (authentication, logging, monitoring)
  • IDL-first design that forced clear interface definitions
  • Language-neutral protocols that actually worked (sometimes)

I learned that standards designed by committee are often over-engineered. CORBA and SOAP tried to solve every problem for everyone and ended up optimal for no one.

SOAP and WSDL

I used SOAP (Simple Object Access Protocol) and WSDL (Web Services Description Language) on a number of projects in the early 2000s, when they emerged as the standard for web services. The pitch: XML-based, platform-neutral, and “simple.” Here's a WSDL definition:

<?xml version="1.0"?>
<definitions name="CustomerService"
   targetNamespace="http://example.com/customer"
   xmlns="http://schemas.xmlsoap.org/wsdl/"
   xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
   xmlns:tns="http://example.com/customer"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   
   <types>
      <xsd:schema targetNamespace="http://example.com/customer">
         <xsd:complexType name="Customer">
            <xsd:sequence>
               <xsd:element name="id" type="xsd:int"/>
               <xsd:element name="name" type="xsd:string"/>
               <xsd:element name="balance" type="xsd:double"/>
            </xsd:sequence>
         </xsd:complexType>
      </xsd:schema>
   </types>
   
   <message name="GetCustomerRequest">
      <part name="customerId" type="xsd:int"/>
   </message>
   
   <message name="GetCustomerResponse">
      <part name="customer" type="tns:Customer"/>
   </message>
   
   <portType name="CustomerPortType">
      <operation name="getCustomer">
         <input message="tns:GetCustomerRequest"/>
         <output message="tns:GetCustomerResponse"/>
      </operation>
   </portType>
   
   <binding name="CustomerBinding" type="tns:CustomerPortType">
      <soap:binding transport="http://schemas.xmlsoap.org/soap/http"/>
      <operation name="getCustomer">
         <soap:operation soapAction="getCustomer"/>
         <input>
            <soap:body use="literal"/>
         </input>
         <output>
            <soap:body use="literal"/>
         </output>
      </operation>
   </binding>
   
   <service name="CustomerService">
      <port name="CustomerPort" binding="tns:CustomerBinding">
         <soap:address location="http://example.com/customer"/>
      </port>
   </service>
</definitions>

A SOAP request looked like this:

<?xml version="1.0"?>
<soap:Envelope 
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:cust="http://example.com/customer">
  <soap:Header>
    <cust:Authentication>
      <cust:username>john</cust:username>
      <cust:password>secret</cust:password>
    </cust:Authentication>
  </soap:Header>
  <soap:Body>
    <cust:getCustomer>
      <cust:customerId>12345</cust:customerId>
    </cust:getCustomer>
  </soap:Body>
</soap:Envelope>

The response:

<?xml version="1.0"?>
<soap:Envelope 
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:cust="http://example.com/customer">
  <soap:Body>
    <cust:getCustomerResponse>
      <cust:customer>
        <cust:id>12345</cust:id>
        <cust:name>John Smith</cust:name>
        <cust:balance>5000.00</cust:balance>
      </cust:customer>
    </cust:getCustomerResponse>
  </soap:Body>
</soap:Envelope>

Look at all that XML overhead! A simple request became hundreds of bytes of markup. Designed by committee (IBM, Oracle, Microsoft), SOAP tried to solve every possible enterprise problem: transactions, security, reliability, routing, orchestration. I learned that simplicity beats features; SOAP collapsed under its own weight.

Java Servlets and Filters

Around Java 1.1, the Servlet API arrived and provided a much better model than CGI. Instead of spawning a process per request, servlets were Java classes instantiated once and reused across requests:

import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;

public class CustomerServlet extends HttpServlet {
    
    protected void doGet(HttpServletRequest request, 
                        HttpServletResponse response)
            throws ServletException, IOException {
        
        String customerId = request.getParameter("id");
        
        response.setContentType("application/json");
        PrintWriter out = response.getWriter();
        
        // Fetch customer data
        Customer customer = getCustomerFromDatabase(customerId);
        
        if (customer != null) {
            out.println(String.format(
                "{\"id\": \"%s\", \"name\": \"%s\", \"balance\": %.2f}",
                customer.getId(), customer.getName(), customer.getBalance()
            ));
        } else {
            response.setStatus(HttpServletResponse.SC_NOT_FOUND);
            out.println("{\"error\": \"Customer not found\"}");
        }
    }
    
    protected void doPost(HttpServletRequest request, 
                         HttpServletResponse response)
            throws ServletException, IOException {
        
        BufferedReader reader = request.getReader();
        StringBuilder json = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            json.append(line);
        }
        
        // Parse JSON and create customer
        Customer customer = parseJsonToCustomer(json.toString());
        saveCustomerToDatabase(customer);
        
        response.setStatus(HttpServletResponse.SC_CREATED);
        response.setContentType("application/json");
        PrintWriter out = response.getWriter();
        out.println(json.toString());
    }
}

Servlet Filters

The Servlet Filter API was quite powerful: it supported a chain-of-responsibility pattern for handling cross-cutting concerns:

import javax.servlet.*;
import javax.servlet.http.*;
import java.io.IOException;

public class AuthenticationFilter implements Filter {
    
    public void doFilter(ServletRequest request, 
                        ServletResponse response,
                        FilterChain chain) 
            throws IOException, ServletException {
        
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        HttpServletResponse httpResponse = (HttpServletResponse) response;
        
        // Check for authentication token
        String token = httpRequest.getHeader("Authorization");
        
        if (token == null || !isValidToken(token)) {
            httpResponse.setStatus(HttpServletResponse.SC_UNAUTHORIZED);
            httpResponse.getWriter().println("{\"error\": \"Unauthorized\"}");
            return;
        }
        
        // Pass to next filter or servlet
        chain.doFilter(request, response);
    }
    
    private boolean isValidToken(String token) {
        // Validate token
        return token.startsWith("Bearer ") && 
               validateJWT(token.substring(7));
    }
}

Configuration in web.xml:

<filter>
    <filter-name>AuthenticationFilter</filter-name>
    <filter-class>com.example.AuthenticationFilter</filter-class>
</filter>

<filter-mapping>
    <filter-name>AuthenticationFilter</filter-name>
    <url-pattern>/api/*</url-pattern>
</filter-mapping>

You could chain filters for compression, logging, transformation, and rate limiting with clean separation of concerns, all without touching business logic. I had previously used CORBA interceptors to inject cross-cutting logic, and the filter pattern solved a similar problem. This pattern would reappear in service meshes and API gateways.

Enterprise Java Beans

I used Enterprise JavaBeans (EJB) in the late 1990s and early 2000s, which attempted to make distributed objects transparent. The key idea: write regular Java objects and let the application server handle all the distribution, persistence, transactions, and security. Here's what an EJB 2.x entity bean looked like:

// Remote interface
public interface Customer extends EJBObject {
    String getName() throws RemoteException;
    void setName(String name) throws RemoteException;
    double getBalance() throws RemoteException;
    void setBalance(double balance) throws RemoteException;
}

// Home interface
public interface CustomerHome extends EJBHome {
    Customer create(Integer id, String name) throws CreateException, RemoteException;
    Customer findByPrimaryKey(Integer id) throws FinderException, RemoteException;
}

// Bean implementation
public class CustomerBean implements EntityBean {
    private Integer id;
    private String name;
    private double balance;
    
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public double getBalance() { return balance; }
    public void setBalance(double balance) { this.balance = balance; }
    
    // Container callbacks
    public void ejbActivate() {}
    public void ejbPassivate() {}
    public void ejbLoad() {}
    public void ejbStore() {}
    public void setEntityContext(EntityContext ctx) {}
    public void unsetEntityContext() {}
    
    public Integer ejbCreate(Integer id, String name) {
        this.id = id;
        this.name = name;
        this.balance = 0.0;
        return null;
    }
    
    public void ejbPostCreate(Integer id, String name) {}
}

The N+1 Selects Problem and Network Fallacy

The fatal flaw: EJB pretended network calls were free. I watched teams write code like this:

CustomerHome home = // ... lookup
Customer customer = home.findByPrimaryKey(customerId);

// Each getter is a remote call!
String name = customer.getName();        // Network call
double balance = customer.getBalance();  // Network call

Worse, I saw code that made remote calls in loops:

Collection customers = home.findAll();
double totalBalance = 0.0;
for (Iterator it = customers.iterator(); it.hasNext(); ) {
    Customer customer = (Customer) it.next();
    // Remote call for EVERY iteration!
    totalBalance += customer.getBalance();
}

This ran straight into the Fallacies of Distributed Computing: the network is not reliable, and latency is not zero. What looked like simple property access actually made remote calls to another server. I had previously built distributed and parallel applications, so I understood network latency. But it blindsided most developers because EJB deliberately hid it.

I learned that you can’t hide distribution. Network calls are fundamentally different from local calls. Latency, failure modes, and semantics are different. Transparency is a lie.

REST Standard

Before REST became mainstream, I experimented with “Plain Old XML” (POX) over HTTP by just sending XML documents via HTTP POST without all the SOAP ceremony:

import requests
import xml.etree.ElementTree as ET

# Create XML request
root = ET.Element('getCustomer')
ET.SubElement(root, 'customerId').text = '12345'
xml_data = ET.tostring(root, encoding='utf-8')

# Send HTTP POST
response = requests.post(
    'http://api.example.com/customer',
    data=xml_data,
    headers={'Content-Type': 'application/xml'}
)

# Parse response
response_tree = ET.fromstring(response.content)
name = response_tree.find('name').text

This was simpler than SOAP, but still ad hoc. Then REST (Representational State Transfer), based on Roy Fielding’s 2000 dissertation, offered a principled approach:

  • Use HTTP methods semantically (GET, POST, PUT, DELETE)
  • Resources have URLs
  • Stateless communication
  • Hypermedia as the engine of application state (HATEOAS)

Here’s a RESTful API in Python with Flask:

from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory data store
customers = {
    '12345': {'id': '12345', 'name': 'John Smith', 'balance': 5000.00}
}

@app.route('/customers/<customer_id>', methods=['GET'])
def get_customer(customer_id):
    customer = customers.get(customer_id)
    if customer:
        return jsonify(customer), 200
    return jsonify({'error': 'Customer not found'}), 404

@app.route('/customers', methods=['POST'])
def create_customer():
    data = request.get_json()
    customer_id = data['id']
    customers[customer_id] = data
    return jsonify(data), 201

@app.route('/customers/<customer_id>', methods=['PUT'])
def update_customer(customer_id):
    if customer_id not in customers:
        return jsonify({'error': 'Customer not found'}), 404
    
    data = request.get_json()
    customers[customer_id].update(data)
    return jsonify(customers[customer_id]), 200

@app.route('/customers/<customer_id>', methods=['DELETE'])
def delete_customer(customer_id):
    if customer_id in customers:
        del customers[customer_id]
        return '', 204
    return jsonify({'error': 'Customer not found'}), 404

if __name__ == '__main__':
    app.run(debug=True)

Client code became trivial:

import requests

# GET customer
response = requests.get('http://localhost:5000/customers/12345')
if response.status_code == 200:
    customer = response.json()
    print(f"Customer: {customer['name']}")

# Create new customer
new_customer = {
    'id': '67890',
    'name': 'Alice Johnson',
    'balance': 3000.00
}
response = requests.post(
    'http://localhost:5000/customers',
    json=new_customer
)

# Update customer
update_data = {'balance': 3500.00}
response = requests.put(
    'http://localhost:5000/customers/67890',
    json=update_data
)

# Delete customer
response = requests.delete('http://localhost:5000/customers/67890')

Hypermedia and HATEOAS

True REST embraced hypermedia—responses included links to related resources:

{
  "id": "12345",
  "name": "John Smith",
  "balance": 5000.00,
  "_links": {
    "self": {"href": "/customers/12345"},
    "orders": {"href": "/customers/12345/orders"},
    "transactions": {"href": "/customers/12345/transactions"}
  }
}

In practice, most APIs called “REST” weren’t truly RESTful; they didn’t implement HATEOAS or use HTTP status codes correctly. But even “REST-ish” APIs were far simpler than SOAP. The key lesson I learned was that REST succeeded because it built on HTTP, something every platform already supported. No new protocols, no complex tooling. Just URLs, HTTP verbs, and JSON.

JSON Replaces XML

With the adoption of REST, I watched XML web services (JAX-WS) decline and switched to JAX-RS for REST services with JSON payloads. XML required verbose markup:

<?xml version="1.0"?>
<customer>
    <id>12345</id>
    <name>John Smith</name>
    <balance>5000.00</balance>
    <orders>
        <order>
            <id>001</id>
            <date>2024-01-15</date>
            <total>99.99</total>
        </order>
        <order>
            <id>002</id>
            <date>2024-02-20</date>
            <total>149.50</total>
        </order>
    </orders>
</customer>

The same data in JSON:

{
  "id": "12345",
  "name": "John Smith",
  "balance": 5000.00,
  "orders": [
    {
      "id": "001",
      "date": "2024-01-15",
      "total": 99.99
    },
    {
      "id": "002",
      "date": "2024-02-20",
      "total": 149.50
    }
  ]
}

JSON does have limitations. It doesn’t natively support references or circular structures, making recursive relationships awkward:

{
  "id": "A",
  "children": [
    {
      "id": "B",
      "parent_id": "A"
    }
  ]
}

You have to encode references manually, unlike some XML schemas that support IDREF.
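
As a quick illustration (my own sketch, not tied to any particular project), resolving those parent_id references into an object graph has to be done by hand in application code:

import json

doc = json.loads('{"nodes": [{"id": "A", "parent_id": null}, '
                 '{"id": "B", "parent_id": "A"}, {"id": "C", "parent_id": "B"}]}')

# Index nodes by id, then wire children to parents manually
by_id = {node["id"]: node for node in doc["nodes"]}
for node in doc["nodes"]:
    parent = by_id.get(node["parent_id"])
    if parent is not None:
        parent.setdefault("children", []).append(node)

print(by_id["A"]["children"][0]["id"])  # prints "B"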

Erlang/OTP

I learned about the actor model in college and built a framework based on actors and the Linda memory model. In the mid-2000s, I encountered Erlang, which used actors for building distributed systems. Erlang was designed in the 1980s at Ericsson for building telecom switches and is based on the following design principles:

  • “Let it crash” philosophy
  • No shared memory between processes
  • Lightweight processes (not OS threads—Erlang processes)
  • Supervision trees for fault recovery
  • Hot code swapping for zero-downtime updates

Here’s what an Erlang actor (process) looks like:

-module(customer_server).
-export([start/0, init/0, get_customer/1, update_balance/2]).

% Start the server
start() ->
    Pid = spawn(customer_server, init, []),
    register(customer_server, Pid),
    Pid.

% Initialize with empty state
init() ->
    State = #{},  % Empty map
    loop(State).

% Main loop - handle messages
loop(State) ->
    receive
        {get_customer, CustomerId, From} ->
            Customer = maps:get(CustomerId, State, not_found),
            From ! {customer, Customer},
            loop(State);
        
        {update_balance, CustomerId, NewBalance, From} ->
            Customer = maps:get(CustomerId, State),
            UpdatedCustomer = Customer#{balance => NewBalance},
            NewState = maps:put(CustomerId, UpdatedCustomer, State),
            From ! {ok, updated},
            loop(NewState);
        
        {add_customer, CustomerId, Customer, From} ->
            NewState = maps:put(CustomerId, Customer, State),
            From ! {ok, added},
            loop(NewState);
        
        stop ->
            ok;
        
        _ ->
            loop(State)
    end.

% Client functions
get_customer(CustomerId) ->
    customer_server ! {get_customer, CustomerId, self()},
    receive
        {customer, Customer} -> Customer
    after 5000 ->
        timeout
    end.

update_balance(CustomerId, NewBalance) ->
    customer_server ! {update_balance, CustomerId, NewBalance, self()},
    receive
        {ok, updated} -> ok
    after 5000 ->
        timeout
    end.

Erlang made concurrency simple by using message passing between actors.

The Supervision Tree

A key innovation of Erlang was supervision trees. You organized processes in a hierarchy, and supervisors would restart crashed children:

-module(customer_supervisor).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    % Supervisor strategy
    SupFlags = #{
        strategy => one_for_one,  % Restart only failed child
        intensity => 5,            % Max 5 restarts
        period => 60               % Per 60 seconds
    },
    
    % Child specifications
    ChildSpecs = [
        #{
            id => customer_server,
            start => {customer_server, start, []},
            restart => permanent,   % Always restart
            shutdown => 5000,
            type => worker,
            modules => [customer_server]
        },
        #{
            id => order_server,
            start => {order_server, start, []},
            restart => permanent,
            shutdown => 5000,
            type => worker,
            modules => [order_server]
        }
    ],
    
    {ok, {SupFlags, ChildSpecs}}.

If a process crashed, the supervisor automatically restarted it and the system self-healed. A key lesson I learned from the actor model and Erlang was that shared mutable state is the enemy. Message passing with isolated state is simpler, more reliable, and easier to reason about. Today, AWS Lambda, Azure Durable Functions, and frameworks like Akka all embrace the Actor model.

Distributed Erlang

Erlang made distributed computing almost trivial. Processes on different nodes communicated identically to local processes:

% On node1@host1
RemotePid = spawn('node2@host2', module, function, [args]),
RemotePid ! {message, data}.

% On node2@host2 - receives the message
receive
    {message, Data} -> 
        io:format("Received: ~p~n", [Data])
end.

The VM handled all the complexity of node discovery, connection management, and message routing. In a sense, today’s serverless functions are actors and Kubernetes pods are supervised processes.

Asynchronous Messaging

As systems grew more complex, asynchronous messaging became essential. I worked extensively with Oracle Tuxedo, IBM MQSeries, WebLogic JMS, WebSphere MQ, and later ActiveMQ, MQTT / AMQP, ZeroMQ and RabbitMQ primarily for inter-service communication and asynchronous processing. Here’s a JMS producer in Java:

import javax.jms.*;
import javax.naming.*;

public class OrderProducer {
    public static void main(String[] args) throws Exception {
        Context ctx = new InitialContext();
        ConnectionFactory factory = 
            (ConnectionFactory) ctx.lookup("ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("OrderQueue");
        
        Connection connection = factory.createConnection();
        Session session = connection.createSession(
            false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = session.createProducer(queue);
        
        // Create message
        TextMessage message = session.createTextMessage();
        message.setText("{ \"orderId\": \"12345\", " +
                       "\"customerId\": \"67890\", " +
                       "\"amount\": 99.99 }");
        
        // Send message
        producer.send(message);
        System.out.println("Order sent: " + message.getText());
        
        connection.close();
    }
}

JMS consumer:

import javax.jms.*;
import javax.naming.*;

public class OrderConsumer implements MessageListener {
    
    public static void main(String[] args) throws Exception {
        Context ctx = new InitialContext();
        ConnectionFactory factory = 
            (ConnectionFactory) ctx.lookup("ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("OrderQueue");
        
        Connection connection = factory.createConnection();
        Session session = connection.createSession(
            false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(queue);
        
        consumer.setMessageListener(new OrderConsumer());
        connection.start();
        
        System.out.println("Waiting for messages...");
        Thread.sleep(Long.MAX_VALUE);  // Keep running
    }
    
    public void onMessage(Message message) {
        try {
            TextMessage textMessage = (TextMessage) message;
            System.out.println("Received order: " + 
                             textMessage.getText());
            
            // Process order
            processOrder(textMessage.getText());
            
        } catch (JMSException e) {
            e.printStackTrace();
        }
    }
    
    private void processOrder(String orderJson) {
        // Business logic here
    }
}

Asynchronous messaging is essential for building resilient, scalable systems. It decouples producers from consumers, provides natural backpressure, and enables event-driven architectures.
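
The JMS examples above show the broker-centric Java side; here is a minimal sketch of the same producer/consumer decoupling over AMQP using the RabbitMQ Python client (pika). The localhost broker and the orders queue are assumptions for illustration only:

import json
import pika

# Producer: publish an order event and return immediately
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders", durable=True)
channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=json.dumps({"orderId": "12345", "customerId": "67890", "amount": 99.99}),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()

# Consumer: process orders at its own pace
def handle_order(ch, method, properties, body):
    order = json.loads(body)
    # ... business logic here ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

consumer_conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
consumer_channel = consumer_conn.channel()
consumer_channel.queue_declare(queue="orders", durable=True)
consumer_channel.basic_qos(prefetch_count=10)   # natural backpressure
consumer_channel.basic_consume(queue="orders", on_message_callback=handle_order)
consumer_channel.start_consuming()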

Spring Framework and Aspect-Oriented Programming

In the early 2000s, I used aspect-oriented programming (AOP) to inject cross-cutting concerns like logging, security, and monitoring. Here is a typical example:

@Aspect
@Component
public class LoggingAspect {
    
    private static final Logger logger = 
        LoggerFactory.getLogger(LoggingAspect.class);
    
    @Before("execution(* com.example.service.*.*(..))")
    public void logBefore(JoinPoint joinPoint) {
        logger.info("Executing: " + 
                   joinPoint.getSignature().getName());
    }
    
    @AfterReturning(
        pointcut = "execution(* com.example.service.*.*(..))",
        returning = "result")
    public void logAfterReturning(JoinPoint joinPoint, Object result) {
        logger.info("Method " + 
                   joinPoint.getSignature().getName() + 
                   " returned: " + result);
    }
    
    @Around("@annotation(com.example.Monitored)")
    public Object measureTime(ProceedingJoinPoint joinPoint) 
            throws Throwable {
        long start = System.currentTimeMillis();
        Object result = joinPoint.proceed();
        long time = System.currentTimeMillis() - start;
        logger.info(joinPoint.getSignature().getName() + 
                   " took " + time + " ms");
        return result;
    }
}

I later adopted Spring Framework that revolutionized Java development with dependency injection and aspect-oriented programming (AOP):

// Spring configuration
@Configuration
public class AppConfig {
    
    @Bean
    public CustomerService customerService() {
        return new CustomerServiceImpl(customerRepository());
    }
    
    @Bean
    public CustomerRepository customerRepository() {
        return new DatabaseCustomerRepository(dataSource());
    }
    
    @Bean
    public DataSource dataSource() {
        DriverManagerDataSource ds = new DriverManagerDataSource();
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        ds.setUrl("jdbc:mysql://localhost/mydb");
        return ds;
    }
}

// Service class
@Service
public class CustomerServiceImpl implements CustomerService {
    private final CustomerRepository repository;
    
    @Autowired
    public CustomerServiceImpl(CustomerRepository repository) {
        this.repository = repository;
    }
    
    @Transactional
    public void updateBalance(String customerId, double newBalance) {
        Customer customer = repository.findById(customerId);
        customer.setBalance(newBalance);
        repository.save(customer);
    }
}

Spring Remoting

Spring added its own remoting protocols. HTTP Invoker serialized Java objects over HTTP:

// Server configuration
@Configuration
public class ServerConfig {
    
    @Bean
    public HttpInvokerServiceExporter customerService() {
        HttpInvokerServiceExporter exporter = 
            new HttpInvokerServiceExporter();
        exporter.setService(customerServiceImpl());
        exporter.setServiceInterface(CustomerService.class);
        return exporter;
    }
}

// Client configuration
@Configuration
public class ClientConfig {
    
    @Bean
    public HttpInvokerProxyFactoryBean customerService() {
        HttpInvokerProxyFactoryBean proxy = 
            new HttpInvokerProxyFactoryBean();
        proxy.setServiceUrl("http://localhost:8080/customer");
        proxy.setServiceInterface(CustomerService.class);
        return proxy;
    }
}

I learned that AOP addressed cross-cutting concerns elegantly for monoliths. In microservices, however, these concerns moved to the infrastructure layer: service meshes, API gateways, and sidecars.

Proprietary Protocols

When working for large companies like Amazon, I encountered Amazon Coral, a proprietary RPC framework influenced by CORBA. Coral used an IDL to define service interfaces and supported multiple languages:

// Coral IDL
namespace com.amazon.example

structure CustomerData {
    1: required integer customerId
    2: required string name
    3: optional double balance
}

service CustomerService {
    CustomerData getCustomer(1: integer customerId)
    void updateCustomer(1: CustomerData customer)
    list<CustomerData> listCustomers()
}

The IDL compiler generated client and server code for Java, C++, and other languages. Coral handled serialization, versioning, and service discovery. When I later worked at AWS, I used Smithy, the successor to Coral, which Amazon open-sourced. Here is a similar example of a Smithy contract:

namespace com.example

service CustomerService {
    version: "2024-01-01"
    operations: [
        GetCustomer
        UpdateCustomer
        ListCustomers
    ]
}

@readonly
operation GetCustomer {
    input: GetCustomerInput
    output: GetCustomerOutput
    errors: [CustomerNotFound]
}

structure GetCustomerInput {
    @required
    customerId: String
}

structure GetCustomerOutput {
    @required
    customer: Customer
}

structure Customer {
    @required
    customerId: String
    
    @required
    name: String
    
    balance: Double
}

@error("client")
structure CustomerNotFound {
    @required
    message: String
}

I learned that IDL-first design remains valuable; Smithy learned from CORBA, Protocol Buffers, and Thrift.

Long Polling, WebSockets, and Real-Time

In the late 2000s, I built real-time applications for streaming financial charts and technical data. I used long polling, where the client made a request that the server held open until data was available:

// Client-side long polling
function pollServer() {
    fetch('/api/events')
        .then(response => response.json())
        .then(data => {
            console.log('Received event:', data);
            updateUI(data);
            
            // Immediately poll again
            pollServer();
        })
        .catch(error => {
            console.error('Polling error:', error);
            // Retry after delay
            setTimeout(pollServer, 5000);
        });
}

pollServer();

Server-side (Node.js):

const express = require('express');
const app = express();

let pendingRequests = [];

app.get('/api/events', (req, res) => {
    // Hold request open
    pendingRequests.push(res);
    
    // Timeout after 30 seconds
    setTimeout(() => {
        const index = pendingRequests.indexOf(res);
        if (index !== -1) {
            pendingRequests.splice(index, 1);
            res.json({ type: 'heartbeat' });
        }
    }, 30000);
});

// When an event occurs
function broadcastEvent(event) {
    pendingRequests.forEach(res => {
        res.json(event);
    });
    pendingRequests = [];
}

WebSockets

I also used WebSockets for real-time applications that needed true bidirectional communication. However, earlier browsers didn’t fully support them, so I fell back to long polling when WebSockets were unavailable:

// Server (Node.js with ws library)
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws) => {
    console.log('Client connected');
    
    // Send initial data
    ws.send(JSON.stringify({
        type: 'INIT',
        data: getInitialData()
    }));
    
    // Handle messages
    ws.on('message', (message) => {
        const msg = JSON.parse(message);
        
        if (msg.type === 'SUBSCRIBE') {
            subscribeToSymbol(ws, msg.symbol);
        }
    });
    
    ws.on('close', () => {
        console.log('Client disconnected');
        unsubscribeAll(ws);
    });
});

// Stream live data
function streamPriceUpdate(symbol, price) {
    wss.clients.forEach((client) => {
        if (client.readyState === WebSocket.OPEN) {
            if (isSubscribed(client, symbol)) {
                client.send(JSON.stringify({
                    type: 'PRICE_UPDATE',
                    symbol: symbol,
                    price: price,
                    timestamp: Date.now()
                }));
            }
        }
    });
}

Client:

const ws = new WebSocket('ws://localhost:8080');

ws.onopen = () => {
    console.log('Connected to server');
    
    // Subscribe to symbols
    ws.send(JSON.stringify({
        type: 'SUBSCRIBE',
        symbol: 'AAPL'
    }));
};

ws.onmessage = (event) => {
    const message = JSON.parse(event.data);
    
    switch (message.type) {
        case 'INIT':
            initializeChart(message.data);
            break;
        case 'PRICE_UPDATE':
            updateChart(message.symbol, message.price);
            break;
    }
};

ws.onerror = (error) => {
    console.error('WebSocket error:', error);
};

ws.onclose = () => {
    console.log('Disconnected, attempting reconnect...');
    setTimeout(connectWebSocket, 1000);
};

I learned that different problems need different protocols. REST works for request-response. WebSockets excel for real-time bidirectional communication.

Vert.x and Hazelcast for High-Performance Streaming

For a production streaming chart system handling high-volume market data, I used Vert.x with Hazelcast. Vert.x is a reactive toolkit built on Netty that excels at handling thousands of concurrent connections with minimal resources. Hazelcast provided distributed caching and coordination across multiple Vert.x instances. Market data flowed into Hazelcast distributed topics, Vert.x instances subscribed to these topics and pushed updates to connected WebSocket clients. If WebSocket wasn’t supported, we fell back to long polling automatically.

import io.vertx.core.Vertx;
import io.vertx.core.http.HttpServer;
import io.vertx.core.http.HttpServerRequest;
import io.vertx.core.http.ServerWebSocket;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.ITopic;
import com.hazelcast.core.Message;
import com.hazelcast.core.MessageListener;
import java.util.concurrent.ConcurrentHashMap;
import java.util.Set;

public class MarketDataServer {
    private final Vertx vertx;
    private final HazelcastInstance hazelcast;
    private final ConcurrentHashMap<String, Set<ServerWebSocket>> subscriptions;
    
    public MarketDataServer() {
        this.vertx = Vertx.vertx();
        this.hazelcast = Hazelcast.newHazelcastInstance();
        this.subscriptions = new ConcurrentHashMap<>();
        
        // Subscribe to market data topic
        ITopic<MarketData> topic = hazelcast.getTopic("market-data");
        topic.addMessageListener(new MessageListener<MarketData>() {
            public void onMessage(Message<MarketData> message) {
                broadcastToSubscribers(message.getMessageObject());
            }
        });
    }
    
    public void start() {
        HttpServer server = vertx.createHttpServer();
        
        server.webSocketHandler(ws -> {
            String path = ws.path();
            
            if (path.startsWith("/stream/")) {
                String symbol = path.substring(8);
                handleWebSocketConnection(ws, symbol);
            } else {
                ws.reject();
            }
        });
        
        // Long polling fallback
        server.requestHandler(req -> {
            if (req.path().startsWith("/poll/")) {
                String symbol = req.path().substring(6);
                handleLongPolling(req, symbol);
            }
        });
        
        server.listen(8080, result -> {
            if (result.succeeded()) {
                System.out.println("Market data server started on port 8080");
            }
        });
    }
    
    private void handleWebSocketConnection(ServerWebSocket ws, String symbol) {
        subscriptions.computeIfAbsent(symbol, k -> ConcurrentHashMap.newKeySet())
                     .add(ws);
        
        ws.closeHandler(v -> {
            Set<ServerWebSocket> sockets = subscriptions.get(symbol);
            if (sockets != null) {
                sockets.remove(ws);
            }
        });
        
        // Send initial snapshot from Hazelcast cache
        IMap<String, MarketData> cache = hazelcast.getMap("market-snapshot");
        MarketData data = cache.get(symbol);
        if (data != null) {
            ws.writeTextMessage(data.toJson());
        }
    }
    
    private void handleLongPolling(HttpServerRequest req, String symbol) {
        String lastEventId = req.getParam("lastEventId");
        
        // Hold request until data available or timeout
        long timerId = vertx.setTimer(30000, id -> {
            req.response()
               .putHeader("Content-Type", "application/json")
               .end("{\"type\":\"heartbeat\"}");
        });
        
        // Register one-time listener
        subscriptions.computeIfAbsent(symbol + ":poll", 
            k -> ConcurrentHashMap.newKeySet())
            .add(new PollHandler(req, timerId));
    }
    
    private void broadcastToSubscribers(MarketData data) {
        String symbol = data.getSymbol();
        
        // WebSocket subscribers
        Set<ServerWebSocket> sockets = subscriptions.get(symbol);
        if (sockets != null) {
            String json = data.toJson();
            sockets.forEach(ws -> {
                if (!ws.isClosed()) {
                    ws.writeTextMessage(json);
                }
            });
        }
        
        // Update Hazelcast cache for new subscribers
        IMap<String, MarketData> cache = hazelcast.getMap("market-snapshot");
        cache.put(symbol, data);
    }
    
    public static void main(String[] args) {
        new MarketDataServer().start();
    }
}

Publishing market data to Hazelcast from data feed:

public class MarketDataPublisher {
    private final HazelcastInstance hazelcast;
    
    public void publishUpdate(String symbol, double price, long volume) {
        MarketData data = new MarketData(symbol, price, volume, 
                                         System.currentTimeMillis());
        
        // Publish to topic - all Vert.x instances receive it
        ITopic<MarketData> topic = hazelcast.getTopic("market-data");
        topic.publish(data);
    }
}

This architecture provided:

  • Vert.x Event Loop: Non-blocking I/O handled 10,000+ concurrent WebSocket connections per instance
  • Hazelcast Distribution: Market data shared across multiple Vert.x instances without a central message broker
  • Horizontal Scaling: Adding Vert.x instances automatically joined the Hazelcast cluster
  • Low Latency: Sub-millisecond message propagation within the cluster
  • Automatic Fallback: Clients detected WebSocket support; older browsers used long polling

Facebook Thrift and Google Protocol Buffers

I experimented with Facebook Thrift and Google Protocol Buffers, which provided IDL-based serialization and RPC across multiple languages. Here is an example of a Protocol Buffers definition:

syntax = "proto3";

package customer;

message Customer {
    int32 customer_id = 1;
    string name = 2;
    double balance = 3;
}

service CustomerService {
    rpc GetCustomer(GetCustomerRequest) returns (Customer);
    rpc UpdateBalance(UpdateBalanceRequest) returns (UpdateBalanceResponse);
    rpc ListCustomers(ListCustomersRequest) returns (CustomerList);
}

message GetCustomerRequest {
    int32 customer_id = 1;
}

message UpdateBalanceRequest {
    int32 customer_id = 1;
    double new_balance = 2;
}

message UpdateBalanceResponse {
    bool success = 1;
}

message ListCustomersRequest {}

message CustomerList {
    repeated Customer customers = 1;
}

Python server with gRPC (which uses Protocol Buffers):

import grpc
from concurrent import futures
import customer_pb2
import customer_pb2_grpc

class CustomerServicer(customer_pb2_grpc.CustomerServiceServicer):
    
    def GetCustomer(self, request, context):
        return customer_pb2.Customer(
            customer_id=request.customer_id,
            name="John Doe",
            balance=5000.00
        )
    
    def UpdateBalance(self, request, context):
        print(f"Updating balance for {request.customer_id} " +
              f"to {request.new_balance}")
        return customer_pb2.UpdateBalanceResponse(success=True)
    
    def ListCustomers(self, request, context):
        customers = [
            customer_pb2.Customer(customer_id=1, name="Alice", balance=1000),
            customer_pb2.Customer(customer_id=2, name="Bob", balance=2000),
        ]
        return customer_pb2.CustomerList(customers=customers)

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    customer_pb2_grpc.add_CustomerServiceServicer_to_server(
        CustomerServicer(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    print("Server started on port 50051")
    server.wait_for_termination()

if __name__ == '__main__':
    serve()

I learned that binary protocols offer significant efficiency gains. JSON is human-readable and convenient for debugging, but in high-performance scenarios, binary protocols like Protocol Buffers reduce payload size and serialization overhead.
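
As a rough illustration (my own sketch, assuming customer_pb2 was generated with protoc from the proto definition shown earlier), you can compare the wire sizes directly:

import json
import customer_pb2  # generated by protoc from the .proto above

cust = customer_pb2.Customer(customer_id=12345, name="John Smith", balance=5000.00)
binary_payload = cust.SerializeToString()

json_payload = json.dumps(
    {"customer_id": 12345, "name": "John Smith", "balance": 5000.00}
).encode("utf-8")

print(f"protobuf: {len(binary_payload)} bytes, JSON: {len(json_payload)} bytes")

# Round-trip the binary payload to verify nothing is lost
parsed = customer_pb2.Customer()
parsed.ParseFromString(binary_payload)
assert parsed.name == "John Smith"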

Serverless and Lambda: Functions as a Service

Around 2015, AWS Lambda introduced serverless computing where you wrote functions, and AWS handled all the infrastructure:

// Lambda function (Node.js)
exports.handler = async (event) => {
    const customerId = event.queryStringParameters.customerId;
    
    // Query DynamoDB
    const AWS = require('aws-sdk');
    const dynamodb = new AWS.DynamoDB.DocumentClient();
    
    const result = await dynamodb.get({
        TableName: 'Customers',
        Key: { customerId: customerId }
    }).promise();
    
    if (result.Item) {
        return {
            statusCode: 200,
            body: JSON.stringify(result.Item)
        };
    } else {
        return {
            statusCode: 404,
            body: JSON.stringify({ error: 'Customer not found' })
        };
    }
};

Serverless was powerful: no servers to manage, automatic scaling, and pay-per-invocation pricing. It felt like the Actor model I had worked on in my research: small, stateless, event-driven functions.

However, I also encountered several problems with serverless:

  • Cold starts: First invocation could be slow (though it has improved with recent updates)
  • Timeouts: Functions had maximum execution time (15 minutes for Lambda)
  • State management: Functions were stateless; you needed external state stores
  • Orchestration: Coordinating multiple functions was complex

The ping-pong anti-pattern emerged: Lambda A calls Lambda B, which calls Lambda C, which calls Lambda D. This created hard-to-debug systems with unpredictable costs. AWS Step Functions and Azure Durable Functions addressed orchestration:

{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ValidateOrder",
      "Next": "CheckInventory"
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CheckInventory",
      "Next": "ChargeCustomer"
    },
    "ChargeCustomer": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ChargeCustomer",
      "Catch": [{
        "ErrorEquals": ["PaymentError"],
        "Next": "PaymentFailed"
      }],
      "Next": "ShipOrder"
    },
    "ShipOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ShipOrder",
      "End": true
    },
    "PaymentFailed": {
      "Type": "Fail",
      "Cause": "Payment processing failed"
    }
  }
}

gRPC: Modern RPC

In the early 2020s, I started using gRPC extensively. It combined the best ideas from decades of RPC evolution:

  • Protocol Buffers for IDL
  • HTTP/2 for transport (multiplexing, header compression, flow control)
  • Strong typing with code generation
  • Streaming support (unary, server streaming, client streaming, bidirectional)

Here’s a gRPC service definition:

syntax = "proto3";

package customer;

service CustomerService {
    rpc GetCustomer(GetCustomerRequest) returns (Customer);
    rpc UpdateCustomer(Customer) returns (UpdateResponse);
    rpc StreamOrders(StreamOrdersRequest) returns (stream Order);
    rpc BidirectionalChat(stream ChatMessage) returns (stream ChatMessage);
}

message Customer {
    int32 customer_id = 1;
    string name = 2;
    double balance = 3;
}

message GetCustomerRequest {
    int32 customer_id = 1;
}

message UpdateResponse {
    bool success = 1;
    string message = 2;
}

message StreamOrdersRequest {
    int32 customer_id = 1;
}

message Order {
    int32 order_id = 1;
    double amount = 2;
    string status = 3;
}

message ChatMessage {
    string user = 1;
    string message = 2;
    int64 timestamp = 3;
}

Go server implementation:

package main

import (
    "context"
    "fmt"
    "log"
    "net"
    "time"
    
    "google.golang.org/grpc"
    pb "example.com/customer"
)

type server struct {
    pb.UnimplementedCustomerServiceServer
}

func (s *server) GetCustomer(ctx context.Context, req *pb.GetCustomerRequest) (*pb.Customer, error) {
    return &pb.Customer{
        CustomerId: req.CustomerId,
        Name:       "John Doe",
        Balance:    5000.00,
    }, nil
}

func (s *server) UpdateCustomer(ctx context.Context, customer *pb.Customer) (*pb.UpdateResponse, error) {
    log.Printf("Updating customer %d", customer.CustomerId)
    
    return &pb.UpdateResponse{
        Success: true,
        Message: "Customer updated successfully",
    }, nil
}

func (s *server) StreamOrders(req *pb.StreamOrdersRequest, stream pb.CustomerService_StreamOrdersServer) error {
    orders := []*pb.Order{
        {OrderId: 1, Amount: 99.99, Status: "shipped"},
        {OrderId: 2, Amount: 149.50, Status: "processing"},
        {OrderId: 3, Amount: 75.25, Status: "delivered"},
    }
    
    for _, order := range orders {
        if err := stream.Send(order); err != nil {
            return err
        }
        time.Sleep(time.Second)  // Simulate delay
    }
    
    return nil
}

func (s *server) BidirectionalChat(stream pb.CustomerService_BidirectionalChatServer) error {
    for {
        msg, err := stream.Recv()
        if err != nil {
            return err
        }
        
        log.Printf("Received: %s from %s", msg.Message, msg.User)
        
        // Echo back with server prefix
        response := &pb.ChatMessage{
            User:      "Server",
            Message:   fmt.Sprintf("Echo: %s", msg.Message),
            Timestamp: time.Now().Unix(),
        }
        
        if err := stream.Send(response); err != nil {
            return err
        }
    }
}

func main() {
    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("Failed to listen: %v", err)
    }
    
    s := grpc.NewServer()
    pb.RegisterCustomerServiceServer(s, &server{})
    
    log.Println("Server listening on :50051")
    if err := s.Serve(lis); err != nil {
        log.Fatalf("Failed to serve: %v", err)
    }
}

Go client:

package main

import (
    "context"
    "io"
    "log"
    "time"
    
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    pb "example.com/customer"
)

func main() {
    conn, err := grpc.Dial("localhost:50051", 
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("Failed to connect: %v", err)
    }
    defer conn.Close()
    
    client := pb.NewCustomerServiceClient(conn)
    ctx := context.Background()
    
    // Unary call
    customer, err := client.GetCustomer(ctx, &pb.GetCustomerRequest{
        CustomerId: 12345,
    })
    if err != nil {
        log.Fatalf("GetCustomer failed: %v", err)
    }
    log.Printf("Customer: %v", customer)
    
    // Server streaming
    stream, err := client.StreamOrders(ctx, &pb.StreamOrdersRequest{
        CustomerId: 12345,
    })
    if err != nil {
        log.Fatalf("StreamOrders failed: %v", err)
    }
    
    for {
        order, err := stream.Recv()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatalf("Receive error: %v", err)
        }
        log.Printf("Order: %v", order)
    }
}

The Load Balancing Challenge

gRPC had one major gotcha in Kubernetes: connection persistence breaks load balancing. I documented this exhaustively in my blog post The Complete Guide to gRPC Load Balancing in Kubernetes and Istio. HTTP/2 multiplexes multiple requests over a single TCP connection, and once that connection is established to one pod, all requests go there. Kubernetes Service load balancing happens at L4 (TCP), so it sees only one long-lived connection rather than individual gRPC calls. I used Istio’s Envoy sidecar, which operates at L7 and routes each gRPC call independently:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-service
spec:
  host: grpc-service
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 100
        maxRequestsPerConnection: 10  # Force connection rotation
    loadBalancer:
      simple: LEAST_REQUEST  # Better than ROUND_ROBIN
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s

I learned that modern protocols solve old problems but introduce new ones. gRPC is excellent, but you must understand how it interacts with infrastructure. Production systems require deep integration between application protocol and deployment environment.

Modern Messaging and Streaming

I have been using Apache Kafka for many years, and it transformed how we think about data. It’s not just a message queue; it’s a distributed commit log:

from kafka import KafkaProducer, KafkaConsumer
import json
import time

# Producer
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

order = {
    'order_id': '12345',
    'customer_id': '67890',
    'amount': 99.99,
    'timestamp': time.time()
}

producer.send('orders', value=order)
producer.flush()

# Consumer
consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    group_id='order-processors'
)

for message in consumer:
    order = message.value
    print(f"Processing order: {order['order_id']}")
    # Process order

Kafka provided:

  • Durability: Messages are persisted to disk
  • Replayability: Consumers can reprocess historical events
  • Partitioning: Horizontal scalability through partitions
  • Consumer groups: Multiple consumers can process in parallel

Key Lesson: Event-driven architectures enable loose coupling and temporal decoupling. Systems can be rebuilt from the event log. This is Event Sourcing—a powerful pattern that Kafka makes practical at scale.
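
To make the idea concrete, here is a minimal, Kafka-free sketch of rebuilding state purely by replaying an ordered event log:

from dataclasses import dataclass

@dataclass
class OrderEvent:
    order_id: str
    event_type: str   # "order_placed" or "order_refunded"
    amount: float

def rebuild_revenue(event_log: list[OrderEvent]) -> float:
    """Current state is just a projection computed by replaying events."""
    revenue = 0.0
    for event in event_log:
        if event.event_type == "order_placed":
            revenue += event.amount
        elif event.event_type == "order_refunded":
            revenue -= event.amount
    return revenue

events = [
    OrderEvent("12345", "order_placed", 99.99),
    OrderEvent("12346", "order_placed", 149.50),
    OrderEvent("12345", "order_refunded", 99.99),
]
print(rebuild_revenue(events))  # 149.50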

Agentic RPC: MCP and Agent-to-Agent Protocol

Over the last year, I have been building Agentic AI applications using the Model Context Protocol (MCP) and, more recently, the Agent-to-Agent (A2A) protocol. Both use JSON-RPC 2.0 underneath. After decades of RPC evolution, from Sun RPC to CORBA to gRPC, we’ve come full circle to JSON-RPC for AI agents.

Service Discovery

A2A immediately reminded me of Sun’s Network Information Service (NIS), originally called Yellow Pages, which I used in the early 1990s. NIS provided a centralized directory service for Unix systems to look up user accounts, host names, and configuration data across a network. I saw this pattern repeated throughout the decades:

  • CORBA Naming Service (1990s): Objects registered themselves with a hierarchical naming service, and clients discovered them by name
  • JINI (late 1990s): Services advertised themselves via multicast, and clients discovered them through lookup registrars (as I described earlier in the JINI section)
  • UDDI (2000s): Universal Description, Discovery, and Integration for web services—a registry where SOAP services could be published and discovered
  • Consul, Eureka, etcd (2010s): Modern service discovery for microservices
  • Kubernetes DNS/Service Discovery (2010s-present): Built-in service registry and DNS-based discovery

Model Context Protocol (MCP)

MCP lets AI agents discover and invoke tools provided by servers. I recently built a daily minutes assistant that aggregates information from multiple sources into a morning briefing. Here’s the MCP server that exposes tools to the AI agent:

from mcp.server import Server
import mcp.types as types
from typing import Any
import asyncio
import json

class DailyMinutesServer:
    def __init__(self):
        self.server = Server("daily-minutes")
        self.setup_handlers()
        
    def setup_handlers(self):
        @self.server.list_tools()
        async def handle_list_tools() -> list[types.Tool]:
            return [
                types.Tool(
                    name="get_emails",
                    description="Fetch recent emails from inbox",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "hours": {
                                "type": "number",
                                "description": "Hours to look back"
                            },
                            "limit": {
                                "type": "number", 
                                "description": "Max emails to fetch"
                            }
                        }
                    }
                ),
                types.Tool(
                    name="get_hackernews",
                    description="Fetch top Hacker News stories",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "limit": {
                                "type": "number",
                                "description": "Number of stories"
                            }
                        }
                    }
                ),
                types.Tool(
                    name="get_rss_feeds",
                    description="Fetch latest RSS feed items",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "feed_urls": {
                                "type": "array",
                                "items": {"type": "string"}
                            }
                        }
                    }
                ),
                types.Tool(
                    name="get_weather",
                    description="Get current weather forecast",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "location": {"type": "string"}
                        }
                    }
                )
            ]
        
        @self.server.call_tool()
        async def handle_call_tool(
            name: str, 
            arguments: dict[str, Any]
        ) -> list[types.TextContent]:
            if name == "get_emails":
                result = await email_connector.fetch_recent(
                    hours=arguments.get("hours", 24),
                    limit=arguments.get("limit", 10)
                )
            elif name == "get_hackernews":
                result = await hn_connector.fetch_top_stories(
                    limit=arguments.get("limit", 10)
                )
            elif name == "get_rss_feeds":
                result = await rss_connector.fetch_feeds(
                    feed_urls=arguments["feed_urls"]
                )
            elif name == "get_weather":
                result = await weather_connector.get_forecast(
                    location=arguments["location"]
                )
            else:
                raise ValueError(f"Unknown tool: {name}")
            
            return [types.TextContent(
                type="text",
                text=json.dumps(result, indent=2)
            )]

Each connector is a simple async module. Here’s the Hacker News connector:

import aiohttp
from typing import List, Dict

class HackerNewsConnector:
    BASE_URL = "https://hacker-news.firebaseio.com/v0"
    
    async def fetch_top_stories(self, limit: int = 10) -> List[Dict]:
        async with aiohttp.ClientSession() as session:
            # Get top story IDs
            async with session.get(f"{self.BASE_URL}/topstories.json") as resp:
                story_ids = await resp.json()
            
            # Fetch details for top N stories
            stories = []
            for story_id in story_ids[:limit]:
                async with session.get(
                    f"{self.BASE_URL}/item/{story_id}.json"
                ) as resp:
                    story = await resp.json()
                    stories.append({
                        "title": story.get("title"),
                        "url": story.get("url"),
                        "score": story.get("score"),
                        "by": story.get("by"),
                        "time": story.get("time")
                    })
            
            return stories

RSS and weather connectors follow the same pattern—simple, focused modules that the MCP server orchestrates.
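
For illustration, here is a minimal sketch of what such an RSS connector might look like, assuming aiohttp for fetching and the feedparser library for parsing (the actual project’s connector may differ):

import aiohttp
import feedparser  # assumption: any RSS/Atom parser would work here
from typing import List, Dict

class RSSConnector:
    async def fetch_feeds(self, feed_urls: List[str], limit: int = 5) -> List[Dict]:
        items = []
        async with aiohttp.ClientSession() as session:
            for url in feed_urls:
                async with session.get(url) as resp:
                    raw = await resp.text()
                # feedparser.parse accepts a raw XML string
                feed = feedparser.parse(raw)
                for entry in feed.entries[:limit]:
                    items.append({
                        "feed": feed.feed.get("title", url),
                        "title": entry.get("title"),
                        "link": entry.get("link"),
                        "published": entry.get("published"),
                    })
        return items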

JSON-RPC Under the Hood

What I appreciate about MCP is that it’s just JSON-RPC 2.0 over stdio or HTTP. Here’s what a tool call looks like on the wire:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "get_emails",
    "arguments": {
      "hours": 12,
      "limit": 5
    }
  }
}

Response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": [
      {
        "type": "text",
        "text": "[{\"from\": \"john@example.com\", \"subject\": \"Q4 Review\", ...}]"
      }
    ]
  }
}

After using Sun RPC, CORBA, SOAP, and gRPC, I appreciate MCP’s simplicity. It solves a specific problem: letting AI agents discover and invoke tools.
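
For illustration, here is a minimal sketch of issuing that same tools/call request from a client, assuming the server is exposed over HTTP at a hypothetical /rpc endpoint (the official MCP SDKs wrap this plumbing for you):

import aiohttp
import asyncio

async def call_tool(endpoint: str, name: str, arguments: dict) -> dict:
    """Send a raw JSON-RPC 2.0 tools/call request and return the result."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(endpoint, json=payload) as resp:
            body = await resp.json()
    if "error" in body:
        raise RuntimeError(f"JSON-RPC error: {body['error']}")
    return body["result"]

# Hypothetical endpoint; the real transport may be stdio or streamable HTTP.
result = asyncio.run(call_tool("http://localhost:8080/rpc", "get_emails",
                               {"hours": 12, "limit": 5}))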

The Agent Workflow

My daily minutes agent follows this workflow:

  1. Agent calls get_emails to fetch recent messages
  2. Agent calls get_hackernews for tech news
  3. Agent calls get_rss_feeds for blog updates
  4. Agent calls get_weather for local forecast
  5. Agent synthesizes everything into a concise morning briefing

The AI decides which tools to call, in what order, based on the user’s preferences. I don’t hardcode the workflow.

Agent-to-Agent Protocol (A2A)

While MCP focuses on tool calling, A2A addresses agent-to-agent discovery and communication. It’s the modern equivalent of NIS/Yellow Pages for agents. Agents register their capabilities in a directory, and other agents discover and invoke them. A2A also uses JSON-RPC 2.0, but adds a discovery layer. Here’s how an agent registers itself:

from a2a import Agent, Capability

class ResearchAgent(Agent):
    def __init__(self):
        super().__init__(
            agent_id="research-agent-01",
            name="Research Agent",
            description="Performs web research and summarization"
        )
        
        # Register capabilities
        self.register_capability(
            Capability(
                name="web_search",
                description="Search the web for information",
                input_schema={
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"},
                        "max_results": {"type": "integer", "default": 10}
                    },
                    "required": ["query"]
                },
                output_schema={
                    "type": "object",
                    "properties": {
                        "results": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "title": {"type": "string"},
                                    "url": {"type": "string"},
                                    "snippet": {"type": "string"}
                                }
                            }
                        }
                    }
                }
            )
        )
    
    async def handle_request(self, capability: str, params: dict):
        if capability == "web_search":
            return await self.perform_web_search(
                query=params["query"],
                max_results=params.get("max_results", 10)
            )
    
    async def perform_web_search(self, query: str, max_results: int):
        # Actual search implementation
        results = await search_engine.search(query, limit=max_results)
        return {"results": results}

Another agent discovers and invokes the research agent:

class CoordinatorAgent(Agent):
    def __init__(self):
        super().__init__(
            agent_id="coordinator-01",
            name="Coordinator Agent"
        )
        self.directory = AgentDirectory()
    
    async def research_topic(self, topic: str):
        # Discover agents with web_search capability
        agents = await self.directory.find_agents_with_capability("web_search")
        
        if not agents:
            raise Exception("No research agents available")
        
        # Select an agent (load balancing, availability, etc.)
        research_agent = agents[0]
        
        # Invoke the capability via JSON-RPC
        result = await research_agent.invoke(
            capability="web_search",
            params={
                "query": topic,
                "max_results": 20
            }
        )
        
        return result

The JSON-RPC exchange looks like this:

Discovery request:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "directory.find_agents",
  "params": {
    "capability": "web_search",
    "filters": {
      "availability": "online"
    }
  }
}

Discovery response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "agents": [
      {
        "agent_id": "research-agent-01",
        "name": "Research Agent",
        "endpoint": "http://agent-service:8080/rpc",
        "capabilities": ["web_search"],
        "metadata": {
          "load": 0.3,
          "response_time_ms": 150
        }
      }
    ]
  }
}

The Security Problem

Though I appreciate the simplicity of MCP and A2A, here’s what worries me: both protocols largely ignore decades of hard-won lessons about security. The Salesloft breach showed exactly what happens: their AI chatbot stored authentication tokens for hundreds of services. MCP and A2A give us standard protocols for tool calling and agent coordination, which is valuable. But they create a false sense of security while ignoring fundamentals we solved decades ago:

  • Authentication: How do we verify an agent’s identity?
  • Authorization: What capabilities should this agent have access to?
  • Credential rotation: How do we handle token expiration and renewal?
  • Observability: How do we trace agent interactions for debugging and auditing?
  • Principle of least privilege: How do we ensure agents only access what they need?
  • Rate limiting: How do we prevent a misbehaving agent from overwhelming services?

The community needs to address this before A2A and MCP see widespread enterprise adoption.
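
None of this is exotic; it’s the same machinery we already use for microservices. As a minimal sketch (assuming PyJWT and a shared signing secret, not anything mandated by the protocols themselves), an A2A endpoint could verify the calling agent’s identity and capability grant before dispatching a JSON-RPC request:

import jwt  # PyJWT
from jwt import InvalidTokenError

# Hypothetical capability grants; a real system would load these from a policy store
ALLOWED_CAPABILITIES = {"research-agent-01": {"web_search"}}

def authorize_agent_call(token: str, capability: str, secret: str) -> str:
    """Verify the caller's JWT and check it may invoke this capability."""
    try:
        claims = jwt.decode(token, secret, algorithms=["HS256"])
    except InvalidTokenError as exc:
        raise PermissionError(f"Authentication failed: {exc}")

    agent_id = claims.get("sub")
    if capability not in ALLOWED_CAPABILITIES.get(agent_id, set()):
        raise PermissionError(f"{agent_id} may not call {capability}")
    return agent_id  # safe to dispatch the JSON-RPC request now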

Lessons Learned

1. Complexity is the Enemy

Every failed technology I’ve used failed because of complexity. CORBA, SOAP, EJB—they all collapsed under their own weight. Successful technologies like REST, gRPC, Kafka focused on doing one thing well.

Implication: Be suspicious of solutions that try to solve every problem. Prefer composable, focused tools.

2. Network Calls Are Expensive

The first Fallacy of Distributed Computing haunts us still: The network is not reliable. It’s also not zero latency, infinite bandwidth, or secure. I’ve watched this lesson be relearned in every generation:

  • EJB entity beans made chatty network calls
  • Microservices make chatty REST calls
  • GraphQL makes chatty database queries

Implication: Design APIs to minimize round trips. Batch operations. Cache aggressively. Monitor network latency religiously. (See my blog on fault tolerance in microservices for details.)

3. Statelessness Scales

Stateless services scale horizontally. But real applications need state—session data, shopping carts, user preferences. The solution isn’t to make services stateful; it’s to externalize state:

  • Session stores (Redis, Memcached)
  • Databases (PostgreSQL, DynamoDB)
  • Event logs (Kafka)
  • Distributed caches

Implication: Keep service logic stateless. Push state to specialized systems designed for it.
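
As a minimal sketch (assuming redis-py and a shared Redis instance), a stateless service can push a shopping cart into an external session store:

import json
import redis  # assumption: redis-py client

r = redis.Redis(host="localhost", port=6379, db=0)

def save_cart(session_id: str, cart: dict, ttl_seconds: int = 1800) -> None:
    # The service itself stays stateless; the cart lives in Redis with a TTL.
    r.setex(f"cart:{session_id}", ttl_seconds, json.dumps(cart))

def load_cart(session_id: str) -> dict:
    raw = r.get(f"cart:{session_id}")
    return json.loads(raw) if raw else {}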

4. The Actor Model Is Underappreciated

My research with actors and the Linda memory model convinced me that the Actor model simplifies concurrent and distributed systems. Today’s serverless functions are essentially actors, and frameworks like Akka, Orleans, and Dapr embrace the model. Actors eliminate shared mutable state, which is the source of most concurrency bugs.

Implication: For event-driven systems, consider Actor-based frameworks. They map naturally to distributed problems.

5. Observability

Modern distributed systems require extensive instrumentation. You need:

  • Structured logging with correlation IDs
  • Metrics for performance and health
  • Distributed tracing to follow requests across services
  • Alarms with proper thresholds

Implication: Instrument your services from day one. Observability is infrastructure, not a nice-to-have. (See my blog posts on fault tolerance and load shedding for specific metrics.)
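
Here is a minimal sketch of structured logging with correlation IDs using only the standard library; real services would emit these lines through their logging pipeline:

import json
import logging
import uuid

logger = logging.getLogger("orders")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event: str, correlation_id: str, **fields) -> None:
    """Emit one JSON log line; the correlation ID ties services together."""
    logger.info(json.dumps({"event": event, "correlation_id": correlation_id, **fields}))

correlation_id = str(uuid.uuid4())  # generated at the edge, propagated downstream
log_event("order_received", correlation_id, order_id="12345", amount=99.99)
log_event("payment_charged", correlation_id, order_id="12345")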

6. Throttling and Load Shedding

Every production system eventually faces traffic spikes or DDoS attacks. Without throttling and load shedding, your system will collapse. Key techniques:

  • Rate limiting by client/user/IP
  • Admission control based on queue depth
  • Circuit breakers to fail fast
  • Backpressure to slow down producers

Implication: Build throttling and load shedding into your architecture early. They’re harder to retrofit. (See my comprehensive blog post on this topic.)
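
As a sketch, here is an in-process token-bucket rate limiter keyed by client; in production you would typically enforce this at a gateway or with a shared store like Redis:

import time

class TokenBucket:
    """Per-client token bucket: refill at `rate` tokens/second up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request (HTTP 429) instead of queueing it

buckets: dict[str, TokenBucket] = {}

def check_rate_limit(client_id: str) -> bool:
    bucket = buckets.setdefault(client_id, TokenBucket(rate=5, capacity=10))
    return bucket.allow()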

7. Idempotency

Network failures mean requests may be retried. If your operations aren’t idempotent, you’ll process payments twice, create duplicate orders, and corrupt data (See my blog on idempotency topic). Make operations idempotent:

  • Use idempotency keys
  • Check if operation already succeeded
  • Design APIs to be safely retryable

Implication: Every non-read operation should be idempotent. It saves you from a world of hurt.
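
A minimal sketch of the idempotency-key pattern; a real implementation would back the key store with a database table and a unique constraint:

processed: dict[str, dict] = {}  # stand-in for a durable idempotency-key store

def charge_payment(idempotency_key: str, customer_id: str, amount: float) -> dict:
    """Retries with the same key return the original result instead of charging twice."""
    if idempotency_key in processed:
        return processed[idempotency_key]

    result = {"status": "charged", "customer_id": customer_id, "amount": amount}
    processed[idempotency_key] = result
    return result

first = charge_payment("order-12345-attempt", "67890", 99.99)
retry = charge_payment("order-12345-attempt", "67890", 99.99)  # no double charge
assert first == retry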

8. External and Internal APIs Should Differ

I have learned that external APIs need a good UX and developer empathy so that APIs are intuitive, consistent, well-documented. Internal APIs can optimize for performance, reliability, and operational needs. Don’t expose your internal architecture to external consumers. Use API gateways to translate between external contracts and internal services.

Implication: Design external APIs for developers using them. Design internal APIs for operational excellence.

9. Standards Beat Proprietary Solutions

Novell IPX failed because it was proprietary. Sun RPC succeeded as an open standard. REST thrived because it built on HTTP. gRPC uses open standards (HTTP/2, Protocol Buffers).

Implication: Prefer open standards. If you must use proprietary tech, understand the exit strategy.

10. Developer Experience Matters

Technologies with great developer experience get adopted. Java succeeded because it was easier than C++. REST beat SOAP because it was simpler. Kubernetes won because it offered a powerful abstraction.

Implication: Invest in developer tools, documentation, and ergonomics. Friction kills momentum.

Upcoming Trends

WebAssembly: The Next Runtime

WebAssembly (Wasm) is emerging as a universal runtime. Code written in Rust, Go, C, or AssemblyScript compiles to Wasm and runs anywhere. Platforms like wasmCloud, Fermyon, and Lunatic are building Actor-based systems on Wasm. Combined with the Component Model and WASI (WebAssembly System Interface), Wasm offers near-native performance, strong sandboxing, and portability. It might replace Docker containers for some workloads. Solomon Hykes, creator of Docker, famously said:

“If WASM+WASI existed in 2008, we wouldn’t have needed to create Docker. That’s how important it is. WebAssembly on the server is the future of computing. A standardized system interface was the missing link. Let’s hope WASI is up to the task!” — Solomon Hykes, March 2019

WebAssembly isn’t ready yet. Critical gaps:

  • WASI maturity: Still evolving (Preview 2 in development)
  • Async I/O: Limited compared to native runtimes
  • Database drivers: Many don’t support WASM
  • Networking: WASI sockets still experimental
  • Ecosystem tooling: Debugging, profiling still primitive

Service Meshes

Istio, Linkerd, Dapr move cross-cutting concerns out of application code:

  • Authentication/authorization
  • Rate limiting
  • Circuit breaking
  • Retries with exponential backoff
  • Distributed tracing
  • Metrics collection

Tradeoff: Complexity shifts from application code to infrastructure. Teams need deep Kubernetes and service mesh expertise.

The Edge Is Growing

Edge computing brings computation closer to users. CDNs like Cloudflare Workers and Fastly Compute@Edge run code globally with single-digit millisecond latency. This requires new thinking like eventual consistency, CRDTs (Conflict-free Replicated Data Types), and geo-distributed state management.
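
As a toy example of CRDT thinking, here is a grow-only counter where each replica increments its own slot and merges commute, so geo-distributed replicas converge without coordination:

class GCounter:
    """Grow-only counter CRDT: each replica owns a slot; merge takes per-slot max."""
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

us, eu = GCounter("us-east"), GCounter("eu-west")
us.increment(3); eu.increment(2)
us.merge(eu); eu.merge(us)        # merges commute: both replicas converge
assert us.value() == eu.value() == 5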

AI Agents and Multi-Agent Systems

I’m currently building agentic AI systems using LangGraph, RAG, and MCP. These systems are inherently distributed: agents communicate asynchronously, maintain local state, and coordinate through message passing. It’s the Actor model again.

What’s Missing

Despite all this progress, we still struggle with:

  • Distributed transactions: Two-phase commit doesn’t scale; SAGA patterns are complex
  • Testing distributed systems: Mocking services, simulating failures, and reproducing production bugs remain hard. I have written a number of tools for mock testing.
  • Observability at scale: Tracing millions of requests generates too much data
  • Cost management: Cloud bills spiral as systems grow
  • Cognitive load: Modern systems require expertise in dozens of technologies

Conclusion

I’ve been writing network code for decades and have used dozens of protocols, frameworks, and paradigms. Here is what I have learned:

  • Simplicity beats complexity (SOAP died, REST thrived)
  • Network calls aren’t free (EJB entity beans, chatty microservices)
  • State is hard; externalize it (Erlang, serverless functions)
  • Observability is essential (You can’t fix what you can’t see)
  • Developer experience matters (Java beat C++, REST beat SOAP)
  • Make It Work, Then Make It Fast
  • Design for Failure from Day One (Systems built with circuit breakers, retries, timeouts, and graceful degradation from the start).

Other tips from evolution of remote services include:

  • Design systems as message-passing actors from the start. Whether that’s Erlang processes, Akka actors, Orleans grains, or Lambda functions—embrace isolated state and message passing.
  • Invest in Observability with structured logging with correlation IDs, instrumented metrics, distributed tracing and alarms.
  • Separate External and Internal APIs. Use REST or GraphQL for external APIs (with versioning) and use gRPC or Thrift for internal communication (efficient).
  • Build Throttling and Load Shedding by rate limiting by client/user/IP at the edge and implement admission control at the service level (See my blog on Effective Load Shedding and Throttling).
  • Make Everything Idempotent as networks fail and requests get retried. Use idempotency keys for all mutations.
  • Choose Boring Technology (See Choose Boring Technology). For your core infrastructure, use proven tech (PostgreSQL, Redis, Kafka).
  • Test for Failure. Most code only handles the happy path. Production is all about unhappy paths.
  • Learn about the Fallacies of Distributed Computing and read A Note on Distributed Computing (1994).
  • Make chaos engineering part of CI/CD and use property-based testing (See my blog on property-based testing).

Technologies change: mainframes to serverless, Assembly to Go, CICS to Kubernetes. But the underlying principles remain constant. We oscillate between extremes:

  • Monoliths -> Microservices -> (now) Modular Monoliths
  • Strongly typed IDLs (CORBA) -> Untyped JSON -> Strongly typed again (gRPC)
  • Centralized -> Distributed -> Edge -> (soon) Peer-to-peer?
  • Synchronous RPC -> Asynchronous messaging -> Reactive streams

Each swing teaches us something. CORBA was too complex, but IDL-first design is valuable. REST was liberating, but binary protocols are more efficient. Microservices enable agility, but operational complexity explodes. The sweet spot is usually in the middle. Modular monoliths with clear boundaries. REST for external APIs, gRPC for internal communication. Some synchronous calls, some async messaging.

Here are a few trends that I see becoming prevalent:

  1. WebAssembly may replace containers for some workloads: Faster startup, better security with platforms like wasmCloud and Fermyon.
  2. Service meshes are becoming invisible: Currently they are too complex. Ambient mesh (no sidecars) and eBPF-based routing are gaining wider adoption.
  3. The Actor model will eat the world: Serverless functions are actors and durable functions are actor orchestration.
  4. Edge computing will force new patterns: We can’t rely on centralized state and may need CRDTs and eventual consistency.
  5. AI agents will need distributed coordination. Multi-agent systems = distributed systems and may need message passing between agents.

The best engineers don’t just learn the latest framework, they study the history, understand the trade-offs, and recognize when old ideas solve new problems. The future of distributed systems won’t be built by inventing entirely new paradigms instead it’ll be built by taking the best ideas from the past, learning from the failures, and applying them with better tools.


Check out my other blog posts:

October 21, 2025

Pragmatic Agentic AI: How I Rebuilt Years of FinTech Infrastructure with ReAct, RAG, and Free Local Models

Filed under: Agentic AI,Computing — admin @ 1:00 pm

I spent over a decade in FinTech building the systems traders rely on every day: high-performance APIs streaming real-time charts, technical indicator calculators processing millions of data points per second, and comprehensive analytical platforms ingesting SEC 10-Ks and 10-Qs into distributed databases. We parsed XBRL filings and ran news/sentiment analysis on earnings calls using early NLP models to detect market anomalies.

Over the past couple of years, I’ve been building AI agents and creating automated workflows that tackle complex problems using agentic AI. I’m also revisiting challenges I hit while building trading tools for fintech companies. For example, the AI I’m working with now reasons about which analysis to run. It grasps context, retrieves information on demand, and orchestrates complex workflows autonomously. It applies Black-Scholes when needed, switches to technical analysis when appropriate, and synthesizes insights from multiple sources—no explicit rules required.

The best part is that I’m running this entire system on my laptop using Ollama and open-source models. Zero API costs during development. When I need production scale, I can switch to cloud APIs with a few lines of code. I will walk you through this journey of rebuilding financial analysis with agentic AI – from traditional algorithms to thinking machines and from rigid pipelines to adaptive workflows.

Why This Approach Changes Everything

Traditional financial systems process data. Agentic AI systems understand objectives and figure out how to achieve them. That’s the fundamental difference that took me a while to fully grasp. And unlike my old systems that required separate codebases for each type of analysis, this one uses the same underlying patterns for everything.

The Money-Saving Secret: Local Development with Ollama

Here’s something that would have saved my startup thousands: you can build and test sophisticated AI systems entirely locally using Ollama. No API keys, no usage limits, no surprise bills.

# This runs entirely on your machine - zero external API calls
from langchain_ollama import OllamaLLM as Ollama

# Local LLM for development and testing
dev_llm = Ollama(
    model="llama3.2:latest",      # 3.2GB model that runs on most laptops
    temperature=0.7,
    base_url="http://localhost:11434"  # Your local Ollama instance
)

# When ready for production, switch to cloud providers
from langchain_openai import ChatOpenAI

prod_llm = ChatOpenAI(
    model="gpt-4",
    temperature=0.7
)

# The beautiful part? Same interface, same code
def analyze_stock(llm, ticker):
    # This function works with both local and cloud LLMs
    prompt = f"Analyze {ticker} stock fundamentals"
    return llm.invoke(prompt)

During development, I run hundreds of experiments daily without spending a cent. Once the prompts and workflows are refined, switching to cloud APIs is literally changing one line of code.

Understanding ReAct: How AI Learns to Think Step-by-Step

ReAct (Reasoning and Acting) was the first pattern that made me realize we weren’t just building chatbots anymore. Let me show you exactly how it works with real code from my system.

The Human Thought Process We’re Mimicking

When I manually analyzed stocks, my mental process looked something like this:

  1. “I need to check if Apple is overvalued”
  2. “Let me get the current P/E ratio”
  3. “Hmm, 28.5 seems high, but what’s the industry average?”
  4. “Tech sector average is 25, so Apple is slightly premium”
  5. “But wait, what’s their growth rate?”
  6. “15% annual growth… that PEG ratio of 1.9 suggests fair value”
  7. “Let me check recent news for any red flags…”

ReAct agents follow this exact pattern. Here’s the actual implementation:

import re
from typing import Tuple

class ReActAgent:
    """ReAct Agent that demonstrates reasoning traces"""
    
    # This is the actual prompt from the project
    REACT_PROMPT = """You are a financial analysis agent that uses the ReAct framework to solve problems.

You have access to the following tools:

{tools_description}

Use the following format EXACTLY:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, must be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin! Remember to ALWAYS follow the format exactly.

Question: {question}
Thought: {scratchpad}"""

    def _parse_response(self, response: str) -> Tuple[str, str, str, bool]:
        """Parse LLM response to extract thought, action, and input"""
        response = response.strip()
        
        # Check for final answer
        if "Final Answer:" in response:
            parts = response.split("Final Answer:")
            thought = parts[0].strip()
            final_answer = parts[1].strip()
            return thought, "final_answer", final_answer, True
        
        # Parse using regex from actual implementation
        thought_match = re.search(r"Thought:\s*(.+?)(?=Action:|$)", response, re.DOTALL)
        action_match = re.search(r"Action:\s*(.+?)(?=Action Input:|$)", response, re.DOTALL)
        input_match = re.search(r"Action Input:\s*(.+?)(?=Observation:|$)", response, re.DOTALL)
        
        thought = thought_match.group(1).strip() if thought_match else "Thinking..."
        action = action_match.group(1).strip() if action_match else "unknown"
        action_input = input_match.group(1).strip() if input_match else ""
        
        return thought, action, action_input, False

I can easily trace through reasoning to debug how AI reached its conclusion.
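
For context, here is a simplified sketch of the loop that drives this parser; the llm and tools objects are stand-ins for the project’s Ollama client and a dict mapping tool names to callables:

def run_react_loop(agent, llm, tools: dict, question: str, max_steps: int = 8) -> str:
    scratchpad = ""
    for _ in range(max_steps):
        prompt = agent.REACT_PROMPT.format(
            tools_description="\n".join(f"- {name}" for name in tools),
            tool_names=", ".join(tools),
            question=question,
            scratchpad=scratchpad,
        )
        response = llm.invoke(prompt)
        thought, action, action_input, done = agent._parse_response(response)
        if done:
            return action_input  # the final answer

        # Execute the chosen tool and feed the observation back into the scratchpad
        observation = tools[action](action_input) if action in tools else "Unknown tool"
        scratchpad += (f"{thought}\nAction: {action}\n"
                       f"Action Input: {action_input}\nObservation: {observation}\nThought: ")
    return "Stopped after max_steps without a final answer"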

RAG: Solving the Hallucination Problem Once and For All

Early in my experiments, I had to deal with hallucinations when querying financial data with AI, so I applied RAG (Retrieval-Augmented Generation) to give the AI access to a searchable library of documents.

How RAG Actually Works

You can think of RAG like having a research assistant who, instead of relying on memory, always checks the source documents before answering:

# Import paths may vary with your langchain version; these match langchain-ollama,
# langchain-text-splitters, and langchain-community at the time of writing.
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from typing import Any, Dict

class RAGEngine:
    """
    This engine solved my hallucination problems by grounding 
    all responses in actual documents. It's like giving the AI
    access to your company's document database.
    """
    
    def __init__(self):
        # LLM used to generate grounded answers (assumption: same local model as above)
        self.llm = OllamaLLM(model="llama3.2:latest")
        # Initialize embeddings - this converts text to searchable vectors
        # Using Ollama's local embedding model (free!)
        self.embeddings = OllamaEmbeddings(
            model="nomic-embed-text:latest"  # 274MB model, runs fast
        )
        
        # Text splitter - crucial for handling large documents
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=512,      # Small enough for context window
            chunk_overlap=50,    # Overlap prevents losing context at boundaries
            separators=["\n\n", "\n", ". ", " "]  # Smart splitting
        )
        
        # Vector store - where we keep our searchable documents
        self.vector_store = FAISS.from_texts(["init"], self.embeddings)
    
    def load_financial_documents(self, ticker: str):
        """
        In production, this would load real 10-Ks, 10-Qs, earnings calls.
        For now, I'm using sample documents to demonstrate the concept.
        """
        
        # Imagine these are real SEC filings
        documents = [
            {
                "content": f"""
                {ticker} Q3 2024 Earnings Report
                
                Revenue: $94.9 billion, up 6% year over year
                iPhone revenue: $46.2 billion
                Services revenue: $23.3 billion (all-time record)
                
                Gross margin: 45.2%
                Operating cash flow: $28.7 billion
                
                CEO Tim Cook: "We're incredibly pleased with our record 
                September quarter results and strong momentum heading into
                the holiday season."
                """,
                "metadata": {
                    "source": "10-Q Filing",
                    "date": "2024-10-31",
                    "document_type": "earnings_report",
                    "ticker": ticker
                }
            },
            # ... more documents
        ]
        
        # Process each document
        for doc in documents:
            # Split into chunks
            chunks = self.text_splitter.split_text(doc["content"])
            
            # Create document objects with metadata
            for i, chunk in enumerate(chunks):
                metadata = doc["metadata"].copy()
                metadata["chunk_id"] = i
                metadata["total_chunks"] = len(chunks)
                
                # Add to vector store
                self.vector_store.add_texts(
                    texts=[chunk],
                    metadatas=[metadata]
                )
        
        print(f"? Loaded {len(documents)} documents for {ticker}")
    
    def answer_with_sources(self, question: str) -> Dict[str, Any]:
        """
        This is where RAG shines - every answer comes with sources
        """
        # Find relevant document chunks
        relevant_docs = self.vector_store.similarity_search_with_score(
            question, 
            k=5  # Top 5 most relevant chunks
        )
        
        # Build context from retrieved documents
        context_parts = []
        sources = []
        
        for doc, score in relevant_docs:
            # Only use highly relevant documents (score < 0.5)
            if score < 0.5:
                context_parts.append(doc.page_content)
                sources.append({
                    "content": doc.page_content[:100] + "...",
                    "source": doc.metadata.get("source"),
                    "date": doc.metadata.get("date"),
                    "relevance_score": float(score)
                })
        
        context = "\n\n---\n\n".join(context_parts)
        
        # Generate answer grounded in retrieved context
        prompt = f"""Based on the following verified documents, answer the question.
        If the answer is not in the documents, say "I don't have that information."
        
        Documents:
        {context}
        
        Question: {question}
        
        Answer (cite sources):"""
        
        response = self.llm.invoke(prompt)
        
        return {
            "answer": response,
            "sources": sources,
            "confidence": len(sources) / 5  # Simple confidence metric
        }

MCP-Style Tools: Extending AI Capabilities Beyond Text

Model Context Protocol (MCP) helped me to build a flexible tool system. Instead of hardcoding every capability, we give the AI tools it can discover and use:

from abc import ABC, abstractmethod
from datetime import datetime
from typing import Any, Dict, List, Optional
from pydantic import BaseModel
# ToolSchema and ToolCategory come from the project's tool module (defined elsewhere)

class BaseTool(ABC):
    """
    Every tool self-describes its capabilities.
    This is like giving the AI an instruction manual for each tool.
    """
    
    @abstractmethod
    def get_schema(self) -> ToolSchema:
        """Define what this tool does and how to use it"""
        pass
    
    @abstractmethod
    def execute(self, **kwargs) -> Any:
        """Actually run the tool"""
        pass

class StockDataTool(BaseTool):
    """
    Real example: This tool replaced my entire market data microservice
    """
    
    def get_schema(self) -> ToolSchema:
        return ToolSchema(
            name="stock_data",
            description="Fetch real-time stock market data including price, volume, and fundamentals",
            category=ToolCategory.DATA_RETRIEVAL,
            parameters=[
                ToolParameter(
                    name="ticker",
                    type="string", 
                    description="Stock symbol like AAPL or GOOGL",
                    required=True
                ),
                ToolParameter(
                    name="metrics",
                    type="array",
                    description="Specific metrics to retrieve",
                    required=False,
                    default=["price", "volume", "pe_ratio"],
                    enum=["price", "volume", "pe_ratio", "market_cap", 
                          "dividend_yield", "beta", "rsi", "moving_avg_50"]
                )
            ],
            returns="Dictionary containing requested stock metrics",
            examples=[
                {"ticker": "AAPL", "metrics": ["price", "pe_ratio"]},
                {"ticker": "TSLA", "metrics": ["price", "volume", "rsi"]}
            ]
        )
    
    def execute(self, **kwargs) -> Dict[str, Any]:
        """
        This connects to real market data APIs.
        In my old system, this was a 500-line service.
        """
        ticker = kwargs["ticker"].upper()
        metrics = kwargs.get("metrics", ["price", "volume"])
        
        # Using yfinance for real market data
        import yfinance as yf
        stock = yf.Ticker(ticker)
        info = stock.info
        
        result = {"ticker": ticker, "timestamp": datetime.now().isoformat()}
        
        # Fetch requested metrics
        metric_mapping = {
            "price": lambda: info.get("currentPrice", stock.history(period="1d")['Close'].iloc[-1]),
            "volume": lambda: info.get("volume", 0),
            "pe_ratio": lambda: info.get("trailingPE", 0),
            "market_cap": lambda: info.get("marketCap", 0),
            "dividend_yield": lambda: info.get("dividendYield", 0) * 100,
            "beta": lambda: info.get("beta", 1.0),
            "rsi": lambda: self._calculate_rsi(stock),
            "moving_avg_50": lambda: stock.history(period="50d")['Close'].mean()
        }
        
        for metric in metrics:
            if metric in metric_mapping:
                try:
                    result[metric] = metric_mapping[metric]()
                except Exception as e:
                    result[metric] = f"Error: {str(e)}"
        
        return result

class ToolParameter(BaseModel):
    """Actual parameter definition from project"""
    name: str
    type: str  # "string", "number", "boolean", "object", "array"
    description: str
    required: bool = True
    default: Any = None
    enum: Optional[List[Any]] = None

class CalculatorTool(BaseTool):
    """Actual calculator implementation from project"""
    
    def execute(self, **kwargs) -> float:
        """Safely evaluate mathematical expression"""
        self.validate_input(**kwargs)
        
        expression = kwargs["expression"]
        precision = kwargs.get("precision", 2)
        
        try:
            # Security: Remove dangerous operations
            safe_expr = expression.replace("__", "").replace("import", "")
            
            # Define allowed functions (from actual code)
            safe_dict = {
                "abs": abs, "round": round, "min": min, "max": max,
                "sum": sum, "pow": pow, "len": len
            }
            
            # Add math functions
            import math
            for name in ["sqrt", "log", "log10", "sin", "cos", "tan", "pi", "e"]:
                if hasattr(math, name):
                    safe_dict[name] = getattr(math, name)
            
            result = eval(safe_expr, {"__builtins__": {}}, safe_dict)
            
            return round(result, precision)
            
        except Exception as e:
            raise ValueError(f"Calculation error: {e}")

Orchestrating Everything with LangGraph

This is where all the pieces come together. LangGraph allows coordinating multiple agents and tools in sophisticated workflows:

class FinancialAnalysisWorkflow:
    """
    This workflow replaces what used to be multiple microservices,
    message queues, and orchestration layers. It's beautiful.
    """
    
    def _build_graph(self) -> StateGraph:
        """
        Define how different analysis components work together
        """
        workflow = StateGraph(AgentState)
        
        # Add all our analysis nodes
        workflow.add_node("collect_data", self.collect_market_data)
        workflow.add_node("technical_analysis", self.run_technical_analysis)
        workflow.add_node("fundamental_analysis", self.run_fundamental_analysis)
        workflow.add_node("sentiment_analysis", self.analyze_sentiment)
        workflow.add_node("options_analysis", self.analyze_options)
        workflow.add_node("portfolio_optimization", self.optimize_portfolio)
        workflow.add_node("rag_research", self.search_documents)
        workflow.add_node("react_reasoning", self.reason_about_data)
        workflow.add_node("generate_report", self.create_final_report)
        
        # Entry point
        workflow.set_entry_point("collect_data")
        
        # Define the flow - some parallel, some sequential
        workflow.add_edge("collect_data", "technical_analysis")
        workflow.add_edge("collect_data", "fundamental_analysis")
        workflow.add_edge("collect_data", "sentiment_analysis")
        
        # These can run in parallel
        workflow.add_conditional_edges(
            "collect_data",
            self.should_run_options,  # Only if options are relevant
            {
                "yes": "options_analysis",
                "no": "rag_research"
            }
        )
        
        # Everything feeds into reasoning
        workflow.add_edge(["technical_analysis", "fundamental_analysis", 
                          "sentiment_analysis", "options_analysis"], 
                          "react_reasoning")
        
        # Reasoning leads to report
        workflow.add_edge("react_reasoning", "generate_report")
        
        # End
        workflow.add_edge("generate_report", END)
        
        return workflow
    
    def analyze_stock_comprehensive(self, ticker: str, investment_amount: float = 10000):
        """
        This single function replaces what used to be an entire team's
        worth of manual analysis.
        """
        initial_state = {
            "ticker": ticker,
            "investment_amount": investment_amount,
            "timestamp": datetime.now(),
            "messages": [],
            "market_data": {},
            "technical_indicators": {},
            "fundamental_metrics": {},
            "sentiment_scores": {},
            "options_data": {},
            "portfolio_recommendation": {},
            "documents_retrieved": [],
            "reasoning_trace": [],
            "final_report": "",
            "errors": []
        }
        
        # Run the workflow
        try:
            result = self.app.invoke(initial_state)
            return self._format_comprehensive_report(result)
        except Exception as e:
            # Graceful degradation
            return self._run_basic_analysis(ticker, investment_amount)

class WorkflowNodes:
    """Collection of workflow nodes from actual project"""
    
    def collect_market_data(self, state: AgentState) -> AgentState:
        """Node: Collect market data using tools"""
        print("? Collecting market data...")
        
        ticker = state["ticker"]
        
        try:
            # Use actual stock data tool from project
            tool = self.tool_registry.get_tool("stock_data")
            market_data = tool.execute(
                ticker=ticker,
                metrics=["price", "volume", "market_cap", "pe_ratio", "52_week_high", "52_week_low"]
            )
            
            state["market_data"] = market_data
            
            # Add message to history
            state["messages"].append(
                AIMessage(content=f"Collected market data for {ticker}")
            )
            
        except Exception as e:
            state["error"] = f"Failed to collect market data: {str(e)}"
            state["market_data"] = {}
        
        return state

Here is a screenshot from the example showing workflow analysis:

Production Considerations: From Tutorial to Trading Floor

This tutorial demonstrates core concepts, but let me be clear – production deployment in financial services requires significantly more rigor. Having deployed similar systems in regulated environments, here’s what you’ll need to consider:

The Reality of Production Deployment

Production financial systems require months of parallel running and validation. In my experience, you’ll need:

class ProductionValidation:
    """
    Always run new systems parallel to existing ones
    """
    def validate_against_legacy(self, ticker: str):
        # Run both systems
        legacy_result = self.legacy_system.analyze(ticker)
        agent_result = self.agent_system.analyze(ticker)
        
        # Compare results
        discrepancies = self.compare_results(legacy_result, agent_result)
        
        # Log everything for audit
        self.audit_log.record({
            "ticker": ticker,
            "timestamp": datetime.now(),
            "legacy": legacy_result,
            "agent": agent_result,
            "discrepancies": discrepancies,
            "approved": len(discrepancies) == 0
        })
        
        # Require human review for discrepancies
        if discrepancies:
            return self.escalate_to_human(discrepancies)
        
        return agent_result

Integrating Traditional Financial Algorithms

While this tutorial uses general-purpose LLMs, production systems should combine AI with proven financial algorithms:

class HybridAnalyzer:
    """
    Combine traditional algorithms with AI reasoning
    """
    def analyze_options(self, ticker: str, strike: float, expiry: str):
        # Use traditional Black-Scholes for pricing
        traditional_price = self.black_scholes_pricer.calculate(
            ticker, strike, expiry
        )
        
        # Use AI for market context
        ai_context = self.agent.analyze_market_conditions(ticker)
        
        # Combine both
        if ai_context["volatility_regime"] == "high":
            # AI detected unusual conditions, adjust model
            adjusted_price = traditional_price * (1 + ai_context["vol_adjustment"])
            confidence = "low - unusual market conditions"
        else:
            adjusted_price = traditional_price
            confidence = "high - normal market conditions"
        
        return {
            "model_price": traditional_price,
            "adjusted_price": adjusted_price,
            "confidence": confidence,
            "reasoning": ai_context["reasoning"]
        }

Fitness Functions for Financial Accuracy

Financial data cannot tolerate hallucinations. Implement strict validation:

class FinancialFitnessValidator:
    """
    Reject hallucinated or impossible financial data
    """
    def validate_metrics(self, ticker: str, metrics: Dict):
        validations = {
            "pe_ratio": lambda x: -100 < x < 1000,
            "price": lambda x: x > 0,
            "market_cap": lambda x: x > 0,
            "dividend_yield": lambda x: 0 <= x <= 20,
            "revenue_growth": lambda x: -100 < x < 200
        }
        
        for metric, validator in validations.items():
            if metric in metrics:
                value = metrics[metric]
                if not validator(value):
                    raise ValueError(f"Invalid {metric}: {value} for {ticker}")
        
        # Cross-validation
        if "pe_ratio" in metrics and "earnings" in metrics:
            calculated_pe = metrics["price"] / metrics["earnings"]
            if abs(calculated_pe - metrics["pe_ratio"]) > 1:
                raise ValueError("P/E ratio doesn't match price/earnings")
        
        return True

Leverage Your Existing Data

If you have years of financial data in databases, you don’t need to start over. Use RAG to make it searchable:

# Convert your SQL database to vector-searchable documents
existing_data = sql_query("SELECT * FROM financial_reports")
rag_engine.add_documents([
    {"content": row.text, "metadata": {"date": row.date, "ticker": row.ticker}}
    for row in existing_data
])

Human-in-the-Loop

No matter how sophisticated your agents become, financial decisions affecting real money require human oversight. Build it in from day one:

  • Confidence thresholds that trigger human review
  • Clear audit trails showing agent reasoning
  • Easy override mechanisms
  • Gradual automation based on proven accuracy

class HumanInTheLoopWorkflow:
    """
    Ensure human review for critical decisions
    """
    def execute_trade_recommendation(self, recommendation: Dict):
        # Auto-approve only for low-risk, small trades
        if (recommendation["risk_score"] < 0.3 and 
            recommendation["amount"] < 10000):
            return self.execute(recommendation)
        
        # Require human approval for everything else
        approval_request = {
            "recommendation": recommendation,
            "agent_reasoning": recommendation["reasoning_trace"],
            "confidence": recommendation["confidence_score"],
            "risk_assessment": self.assess_risks(recommendation)
        }
        
        # Send to human reviewer
        human_decision = self.request_human_review(approval_request)
        
        if human_decision["approved"]:
            return self.execute(recommendation)
        else:
            self.log_rejection(human_decision["reason"])

Cost Management and Budget Controls

During development, Ollama gives you free local inference. In production, costs add up quickly, so you need proper controls for tracking the cost of each analysis:

  • GPT-4: ~$30 per million tokens
  • Claude-3: ~$20 per million tokens
  • Local Llama: Free but needs GPU infrastructure

class CostController:
    """
    Prevent runaway costs in production
    """
    def __init__(self, daily_budget: float = 100.0):
        self.daily_budget = daily_budget
        self.costs_today = 0.0
        self.cost_per_token = {
            "gpt-4": 0.00003,  # $0.03 per 1K tokens
            "claude-3": 0.00002,
            "llama-local": 0.0  # Free but has compute cost
        }
    
    def check_budget(self, estimated_tokens: int, model: str):
        estimated_cost = estimated_tokens * self.cost_per_token.get(model, 0)
        
        if self.costs_today + estimated_cost > self.daily_budget:
            # Switch to local model or cache
            return "use_local_model"
        
        return "proceed"
    
    def track_usage(self, tokens_used: int, model: str):
        cost = tokens_used * self.cost_per_token.get(model, 0)
        self.costs_today += cost
        
        # Alert if approaching limit
        if self.costs_today > self.daily_budget * 0.8:
            self.send_alert(f"80% of daily budget used: ${self.costs_today:.2f}")

Caching Is Essential

Caching is crucial for both performance and cost effectiveness when running expensive analysis using LLMs.

import hashlib
import json

from redis import Redis

class CachedRAGEngine(RAGEngine):
    """
    Caching reduced our costs by 70% and improved response time by 5x
    """
    
    def __init__(self):
        super().__init__()
        self.cache = Redis(host='localhost', port=6379, db=0)
        self.cache_ttl = 3600  # 1 hour for financial data
    
    def retrieve_with_cache(self, query: str, k: int = 5):
        # Create cache key from query
        cache_key = f"rag:{hashlib.md5(query.encode()).hexdigest()}"
        
        # Check cache first
        cached = self.cache.get(cache_key)
        if cached:
            return json.loads(cached)
        
        # If not cached, retrieve and cache
        docs = self.vector_store.similarity_search(query, k=k)
        
        # Cache the results
        self.cache.setex(
            cache_key, 
            self.cache_ttl,
            json.dumps([doc.to_dict() for doc in docs])
        )
        
        return docs

Fallback Strategies

A cascading fallback executes a task through a sequence of strategies, ordered from the most preferred (highest quality and cost) down to the least preferred (a safe, low-cost default).

class ResilientAgent:
    """
    Production agents need multiple fallback options
    """
    
    def analyze_with_fallbacks(self, ticker: str):
        strategies = [
            ("primary", self.run_full_analysis),
            ("fallback_1", self.run_simplified_analysis),
            ("fallback_2", self.run_basic_analysis),
            ("emergency", self.return_cached_or_default)
        ]
        
        for strategy_name, strategy_func in strategies:
            try:
                result = strategy_func(ticker)
                result["strategy_used"] = strategy_name
                return result
            except Exception as e:
                logger.warning(f"Strategy {strategy_name} failed: {e}")
                continue
        
        return {"error": "All strategies failed", "ticker": ticker}

Observability and Monitoring

Track token usage, latency, accuracy, and costs immediately. What you don’t measure, you can’t improve.

class ObservableWorkflow:
    """
    You need to know what your AI is doing in production
    """
    
    def __init__(self):
        self.metrics = PrometheusMetrics()
        self.tracer = JaegerTracer()
    
    def execute_with_observability(self, state: AgentState):
        with self.tracer.start_span("workflow_execution") as span:
            span.set_tag("ticker", state["ticker"])
            
            # Track token usage
            tokens_start = self.llm.get_num_tokens(str(state))
            
            # Execute workflow
            result = self.workflow.invoke(state)
            
            # Record metrics
            tokens_used = self.llm.get_num_tokens(str(result)) - tokens_start
            self.metrics.record_tokens(tokens_used)
            self.metrics.record_latency(span.duration)
            
            # Log for debugging
            logger.info("Workflow completed", extra={
                "ticker": state["ticker"],
                "tokens": tokens_used,
                "duration": span.duration,
                "strategy": result.get("strategy_used", "primary")
            })
            
            return result

Closing Thoughts

This tutorial demonstrates how agentic AI transforms financial analysis from rigid pipelines to adaptive, thinking systems. The combination of ReAct reasoning, RAG grounding, tool use, and workflow orchestration creates capabilities that surpass traditional approaches in flexibility and ease of development.

Start Simple, Build Incrementally:

  • Week 1: Basic ReAct agent to understand reasoning loops
  • Week 2: Add tools for external capabilities
  • Week 3: Implement RAG to ground responses in real data
  • Week 4: Orchestrate with workflows
  • Develop everything locally with Ollama first – it’s free and private

The point of agentic AI is automation. Here’s the pragmatic approach:

Automate in Tiers:

  • Tier 1 (Fully Automated): Data collection, technical calculations, report generation
  • Tier 2 (Auto + Audit): Sentiment analysis, risk scoring, anomaly detection
  • Tier 3 (Human Required): Large trades, strategy changes, regulatory decisions

Clear Escalation Rules:

ESCALATE_IF = {
    "confidence_below": 0.8,
    "amount_above": 100000,
    "regulatory_flag": True,
    "anomaly_detected": True
}
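
As a minimal sketch (assuming each decision dict carries confidence, amount, and boolean flags matching the keys above), the escalation rules can be applied with a small helper like this:

def should_escalate(decision: dict, rules: dict = ESCALATE_IF) -> bool:
    """Return True if any escalation rule fires for this decision."""
    if decision.get("confidence", 1.0) < rules["confidence_below"]:
        return True
    if decision.get("amount", 0) > rules["amount_above"]:
        return True
    if rules["regulatory_flag"] and decision.get("regulatory_flag", False):
        return True
    if rules["anomaly_detected"] and decision.get("anomaly_detected", False):
        return True
    return False

# Example: a large, low-confidence trade gets routed to a human reviewer
decision = {"confidence": 0.72, "amount": 250000, "regulatory_flag": False, "anomaly_detected": False}
assert should_escalate(decision)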

Reinforcement Learning:

Instead of permanent human-in-the-loop, use RL to train agents that learn from feedback:

class ReinforcementLearningLoop:
    """
    Gradually reduce human involvement through learning
    """

    def ai_based_reinforcement(self, decision, outcome):
        """AI learns from market outcomes directly"""
        # Did the prediction match reality? Assign a reward accordingly.
        if decision["action"] == "buy" and outcome["price_change"] > 0.02:
            reward = 1.0  # Good decision
        elif decision["action"] == "hold" and abs(outcome["price_change"]) < 0.01:
            reward = 0.5  # Correct to avoid volatility
        else:
            reward = -0.5  # Poor decision

        # Update agent weights/prompts based on reward
        self.agent.update_policy(decision["context"], reward)

    def human_feedback_learning(self, decision, human_override=None):
        """Learn from human corrections when they occur"""
        if human_override:
            # Human disagreed - strong learning signal
            self.agent.record_correction(
                agent_decision=decision,
                human_decision=human_override,
                weight=10.0  # Human feedback weighted heavily
            )
        else:
            # Human agreed (implicitly by not overriding)
            self.agent.reinforce_decision(decision, weight=1.0)

    def adaptive_automation_threshold(self):
        """Dynamically adjust when human review is needed"""
        recent_accuracy = self.get_recent_accuracy(days=30)

        if recent_accuracy > 0.95:
            self.confidence_threshold *= 0.9  # Require less human review
        elif recent_accuracy < 0.85:
            self.confidence_threshold *= 1.1  # Require more human review

        return self.confidence_threshold

This approach reduces human involvement over time: capture human feedback, use it to train the agent, gradually automate the decisions where the agent consistently agrees with humans, and escalate only novel situations or low-confidence decisions.


Complete code at github.com/bhatti/agentic-ai-tutorial. Start local, validate thoroughly, scale confidently.

October 15, 2025

Agentic AI for Automated PII Detection: Building Privacy Guardians with LangChain and Vertex AI

Filed under: Computing — admin @ 12:47 pm

Introduction

Over the years, I have seen countless data breaches leak customers’ private personal data. For example, Equifax exposed 147 million Americans’ SSNs and birth dates; Facebook leaked 533 million users’ personal details; Yahoo lost 3 billion accounts. The risk is not unique to large companies, yet most companies play security chicken: they bet that “we haven’t been breached yet, so we must be fine.” In many cases, companies don’t even know what PII they have, where it lives, or who can access it.

Unrestrained Production Access

Here’s what I have seen in most companies where I worked: DevOps teams with unrestricted access to production databases “for debugging.” Support engineers who can browse any customer’s SSN, medical records, or financial data. That contractor from six months ago who still has production credentials. Engineers who can query any table, any field, anytime. I’ve witnessed the consequences firsthand:

  • Customer service reps browsing financial data of large customers “out of curiosity”
  • APIs that return PII data without proper authorization policies
  • DevOps or support teams given permanent access to production data, instead of time-bound, customer-specific access scoped to the underlying issue
  • Engineers accidentally logging credit card numbers in plaintext

This violates OWASP’s principle of least privilege—grant only the minimum access necessary. But there’s an even worse problem: most companies can’t even identify which fields contain PII. They often don’t have policies on how to protect different kinds of PII based on risk.

The Scale Problem

In modern architectures, manual PII identification is impossible:

  • Hundreds of microservices, each with dozens of data models
  • Tens of thousands of API endpoints
  • Constant schema evolution as teams ship daily
  • Our single customer proto had 84 fields—multiply that by hundreds of services

Traditional approaches—manual reviews, compliance audits, security questionnaires—can’t keep up. By the time you’ve reviewed everything, the schemas have already changed.

Enter Agentic AI: From 0% to 92% PII Detection

I have been applying AI assistants and agents to solve complex problems for a while, and I kept asking myself: how can we automatically detect PII? Not just obvious fields like “ssn” or “credit_card_number,” but the subtle ones—employee IDs that could be cross-referenced. So I built an AI-powered system that uses LangChain, LangGraph, and Vertex AI to scan every proto definition, identify PII patterns, and classify sensitivity levels. Through iterative development, I went from:

  • 0% accuracy: Naive prompt (“find PII fields”)
  • 45% accuracy: Basic rules without specificity
  • 92%+ accuracy: Iterative prompt engineering with explicit field mappings

It’s not perfect, but it’s infinitely better than the nothing most companies have.

The Real Problem: It’s Not Just About Compliance

Let me share some uncomfortable truths about PII in modern systems:

The Public API Problem

We had list APIs returning customer data like this:

{
  "customers": [
    {
      "id": "cust_123",
      "name": "John Doe",
      "email": "john@example.com",
      "ssn": "123-45-6789",
      "date_of_birth": "1990-01-15",
      "credit_score": 750,
    }
  ]
}

Anyone with access to the API could list all customers and capture private data like ssn and date_of_birth.

The Internal Access Problem

One recurring issue I found with internal access is carte blanche (often permanent) access to the DevOps environment or production database for debugging. In other cases, the support team needed customer data for tickets. But did they need to see the following PII for all customers?

  • Social Security Numbers?
  • Medical records?
  • Credit card numbers?
  • Salary information?

Of course not. Yet I often saw list APIs return this PII for every customer, and calling GetAccount gave you everything, with no authorization policies in between.

The Compliance Nightmare

Government regulations like GDPR, CCPA, HIPAA, and PCI-DSS keep growing, and each has different rules about what constitutes PII, how it must be protected, and what happens if you leak it. Manual compliance checking is impossible at scale.

The RBAC Isn’t Enough Problem

I’ve spent years building authorization systems, believing RBAC was the answer. I wrote about it in Building a Hybrid Authorization System for Granular Access Control and created multiple authorization solutions like:

  • PlexRBAC – A comprehensive RBAC library for Java/Scala with dynamic role hierarchies
  • PlexRBACJS – JavaScript implementation with fine-grained permissions
  • SaaS_RBAC – Multi-tenant RBAC with organization-level isolation

These systems can enforce incredibly sophisticated access controls. They can handle role inheritance, permission delegation, contextual access rules. But here’s what I learned the hard way: RBAC is useless if you don’t know what data needs protection. First, you need to identify PII. Then you can enforce field-level authorization.

The Solution: AI-Powered PII Detection with Proto Annotations

I built an Agentic AI based automation that:

  1. Automatically scans all proto definitions for PII
  2. Classifies sensitivity levels (HIGH, MEDIUM, LOW, PUBLIC)
  3. Generates appropriate annotations for enforcement
  4. Integrates with CI/CD to prevent PII leaks before deployment

Here’s what it looks like in action:

Before: Unmarked PII Everywhere

message Account {
  string id = 1;
  string first_name = 2;
  string ssn = 3;  // No indication this is sensitive!
  string email = 4;
  string credit_card_number = 5;  // Just sitting there, unprotected
  repeated string medical_conditions = 6;  // HIPAA violation waiting to happen
}

After: Fully Annotated with Sensitivity Levels

message Account {
  option (pii.v1.message_sensitivity) = HIGH;

  string id = 1 [
    (pii.v1.sensitivity) = LOW,
    (pii.v1.pii_type) = CUSTOMER_ID
  ];

  string first_name = 2 [
    (pii.v1.sensitivity) = LOW,
    (pii.v1.pii_type) = NAME
  ];

  string ssn = 3 [
    (pii.v1.sensitivity) = HIGH,
    (pii.v1.pii_type) = SSN
  ];

  string email = 4 [
    (pii.v1.sensitivity) = MEDIUM,
    (pii.v1.pii_type) = EMAIL_PERSONAL
  ];

  string credit_card_number = 5 [
    (pii.v1.sensitivity) = HIGH,
    (pii.v1.pii_type) = CREDIT_CARD
  ];

  repeated string medical_conditions = 6 [
    (pii.v1.sensitivity) = HIGH,
    (pii.v1.pii_type) = MEDICAL_RECORD
  ];
}

Now our authorization system knows exactly what to protect!

Architecture: How It All Works

The system uses a multi-stage pipeline combining LangChain, LangGraph, and Vertex AI: parse the proto file, analyze each field with the LLM, generate sensitivity annotations, and produce a detection report.

Technical Implementation Deep Dive

1. The LangGraph State Machine

I used LangGraph to create a deterministic workflow for PII detection:

from __future__ import annotations

import asyncio

from langgraph.graph import StateGraph, END
from typing import TypedDict, List, Optional, Dict, Any
from langchain_google_vertexai import ChatVertexAI
from pydantic import BaseModel, Field

# PROJECT_ID and LOCATION come from your GCP project configuration

class PiiDetectionState(TypedDict):
    """State for PII detection workflow"""
    proto_file: str
    proto_content: str
    parsed_proto: Dict[str, Any]
    llm_analysis: Optional[ProtoAnalysis]
    final_report: Optional[PiiDetectionReport]
    annotated_proto: Optional[str]
    errors: List[str]

class PiiDetector:
    def __init__(self, model_name: str = "gemini-2.0-flash-exp"):
        self.llm = ChatVertexAI(
            model_name=model_name,
            project=PROJECT_ID,
            location=LOCATION,
            temperature=0.1,  # Low temperature for consistent classification
            max_output_tokens=8192,
            request_timeout=120  # Handle large protos
        )
        self.workflow = self._create_workflow()

    def _create_workflow(self) -> StateGraph:
        """Create the LangGraph workflow"""
        workflow = StateGraph(PiiDetectionState)

        # Add nodes for each step
        workflow.add_node("parse_proto", self._parse_proto_node)
        workflow.add_node("analyze_pii", self._analyze_pii_node)
        workflow.add_node("generate_annotations", self._generate_annotations_node)
        workflow.add_node("create_report", self._create_report_node)

        # Define the flow
        workflow.set_entry_point("parse_proto")
        workflow.add_edge("parse_proto", "analyze_pii")
        workflow.add_edge("analyze_pii", "generate_annotations")
        workflow.add_edge("generate_annotations", "create_report")
        workflow.add_edge("create_report", END)

        return workflow.compile()

    async def _analyze_pii_node(self, state: PiiDetectionState) -> PiiDetectionState:
        """Analyze PII using LLM with retry logic"""
        max_retries = 3
        retry_delay = 2

        for attempt in range(max_retries):
            try:
                # Create structured output chain
                analysis_chain = self.llm.with_structured_output(ProtoAnalysis)

                # Create the analysis prompt
                prompt = self.create_pii_detection_prompt(state['parsed_proto'])

                # Get LLM analysis
                result = await analysis_chain.ainvoke(prompt)

                if result:
                    state['llm_analysis'] = result
                    return state

            except Exception as e:
                if attempt < max_retries - 1:
                    await asyncio.sleep(retry_delay)
                    continue
                else:
                    state['errors'].append(f"LLM analysis failed: {str(e)}")

        return state

2. Pydantic Models for Structured Output

I used Pydantic to ensure consistent, structured responses from the LLM:

class FieldAnalysis(BaseModel):
    """Analysis of a single proto field for PII"""
    field_name: str = Field(description="The name of the field")
    field_path: str = Field(description="Full path like Message.field")
    contains_pii: bool = Field(description="Whether field contains PII")
    sensitivity: str = Field(description="HIGH, MEDIUM, LOW, or PUBLIC")
    pii_type: Optional[str] = Field(default=None, description="Type of PII")
    reasoning: str = Field(description="Explanation for classification")

class MessageAnalysis(BaseModel):
    """Analysis of a proto message"""
    message_name: str = Field(description="Name of the message")
    overall_sensitivity: str = Field(description="Highest sensitivity in message")
    fields: List[FieldAnalysis] = Field(description="Analysis of each field")

class ProtoAnalysis(BaseModel):
    """Complete analysis of a proto file"""
    messages: List[MessageAnalysis] = Field(description="All analyzed messages")
    services: List[ServiceAnalysis] = Field(default_factory=list)
    summary: AnalysisSummary = Field(description="Overall statistics")
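
ProtoAnalysis references two models that aren’t shown above. Here is a minimal sketch of what they could look like (defined before ProtoAnalysis in practice; the field names are my assumptions, not the repository’s exact models):

class ServiceAnalysis(BaseModel):
    """Analysis of a proto service and its methods"""
    service_name: str = Field(description="Name of the service")
    method_sensitivities: Dict[str, str] = Field(
        default_factory=dict,
        description="Method name mapped to HIGH, MEDIUM, LOW, or PUBLIC"
    )

class AnalysisSummary(BaseModel):
    """Overall statistics for a proto file"""
    total_fields: int = Field(description="Total number of fields analyzed")
    pii_fields: int = Field(description="Number of fields classified as PII")
    fields_by_sensitivity: Dict[str, int] = Field(
        default_factory=dict,
        description="Count of fields per sensitivity level"
    )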

3. The Critical Prompt Engineering

I found that the key to accurate PII detection is in the prompt. Here’s a battle-tested prompt that achieves 92%+ accuracy after many trial and errors:

def create_pii_detection_prompt(self, parsed_proto: Dict[str, Any]) -> str:
    """Create the prompt for PII detection (the parsed proto is appended at the end)"""
    return f"""You are an expert in data privacy and PII detection.
    Analyze the Protocol Buffer definition and identify ALL fields that contain PII.

    STRICT Classification Rules - YOU MUST FOLLOW THESE EXACTLY:

    1. HIGH Sensitivity (MAXIMUM PROTECTION REQUIRED):
       ALWAYS classify these field names as HIGH:
       - ssn, social_security_number → HIGH + SSN
       - tax_id, tin → HIGH + TAX_ID
       - passport_number, passport → HIGH + PASSPORT
       - drivers_license, driving_license → HIGH + DRIVERS_LICENSE
       - bank_account_number → HIGH + BANK_ACCOUNT
       - credit_card_number → HIGH + CREDIT_CARD
       - credit_card_cvv → HIGH + CREDIT_CARD
       - medical_record_number → HIGH + MEDICAL_RECORD
       - health_insurance_id → HIGH + HEALTH_INSURANCE
       - medical_conditions → HIGH + MEDICAL_RECORD
       - prescriptions → HIGH + MEDICAL_RECORD
       - password_hash, password → HIGH + PASSWORD
       - api_key → HIGH + API_KEY
       - salary, annual_income → HIGH + null

    2. MEDIUM Sensitivity:
       - email, personal_email → MEDIUM + EMAIL_PERSONAL
       - phone, mobile_phone → MEDIUM + PHONE_PERSONAL
       - home_address → MEDIUM + ADDRESS_HOME
       - date_of_birth, dob → MEDIUM + DATE_OF_BIRTH
       - username → MEDIUM + USERNAME
       - ip_address → MEDIUM + IP_ADDRESS
       - device_id → MEDIUM + DEVICE_ID
       - geolocation (latitude, longitude) → MEDIUM + null

    3. LOW Sensitivity:
       - first_name, last_name, middle_name → LOW + NAME
       - gender → LOW + GENDER
       - work_email → LOW + EMAIL_WORK
       - work_phone → LOW + PHONE_WORK
       - job_title → LOW + null
       - employer_name → LOW + null

    4. PUBLIC (non-PII):
       - id (if system-generated)
       - status, created_at, updated_at
       - counts, totals, metrics

    IMPORTANT: Analyze EVERY SINGLE FIELD. Do not skip any.

    Proto definition to analyze:
    {parsed_proto}
    """

4. Handling the Gotchas

During development, I faced several challenges that required creative solutions:

Challenge 1: Multi-line Proto Annotations

Proto files often have annotations spanning multiple lines:

string ssn = 3 [
    (pii.v1.sensitivity) = HIGH,
    (pii.v1.pii_type) = SSN
];

Solution: Parse with look-ahead:

def extract_annotations(self, lines: List[str]) -> Dict:
    annotations = {}
    i = 0
    while i < len(lines):
        if '[' in lines[i]:
            # Collect lines until the closing '];' is seen
            annotation_text = lines[i]
            j = i + 1
            while j < len(lines) and '];' not in annotation_text:
                annotation_text += ' ' + lines[j]
                j += 1
            # Parse the complete annotation (parse_annotation returns a dict of option -> value)
            annotations.update(self.parse_annotation(annotation_text))
            i = j
        else:
            i += 1
    return annotations

Challenge 2: Context-Dependent Classification

A field named id could be:

  • PUBLIC if it’s a system-generated UUID
  • LOW if it’s a customer ID that could be used for lookups
  • MEDIUM if it’s an employee ID with PII implications

Solution: Consider the message context:

def classify_with_context(self, field_name: str, message_name: str) -> str:
    if message_name in ['Customer', 'User', 'Account']:
        if field_name == 'id':
            return 'LOW'  # Customer ID has some sensitivity
    elif message_name in ['System', 'Config']:
        if field_name == 'id':
            return 'PUBLIC'  # System IDs are not PII
    return self.default_classification(field_name)

Challenge 3: Handling Nested Messages and Maps

Real protos have complex structures:

message Account {
    map<string, string> metadata = 100;  // Could contain anything!
    repeated Address addresses = 101;
    Location last_location = 102;
}

Solution: Recursive analysis with inheritance:

SENSITIVITY_RANK = {'PUBLIC': 0, 'LOW': 1, 'MEDIUM': 2, 'HIGH': 3}

def analyze_field(self, field: Field, parent_sensitivity: str = 'PUBLIC'):
    if field.type == 'map':
        # Maps could contain PII
        return 'MEDIUM' if parent_sensitivity != 'HIGH' else 'HIGH'
    elif field.is_message:
        # Analyze the referenced message and keep the stricter of the two levels
        message_sensitivity = self.analyze_message(field.message_type)
        return max(parent_sensitivity, message_sensitivity,
                   key=SENSITIVITY_RANK.__getitem__)
    else:
        return self.classify_field(field.name)

Real-World Testing

I tested the system on a test customer account proto with 84 fields. Here’s what happened:

Before: Original Proto Without Annotations

syntax = "proto3";

package pii.v1;

// Account represents a user account - NO PII ANNOTATIONS
message Account {
    // System fields
    string id = 1;
    string account_number = 2;
    AccountStatus status = 3;
    google.protobuf.Timestamp created_at = 4;
    google.protobuf.Timestamp updated_at = 5;

    // Personal information - UNPROTECTED PII!
    string first_name = 10;
    string last_name = 11;
    string middle_name = 12;
    string date_of_birth = 13;  // Format: YYYY-MM-DD
    string gender = 14;

    // Contact information - MORE UNPROTECTED PII!
    string email = 20;
    string personal_email = 21;
    string work_email = 22;
    string phone = 23;
    string mobile_phone = 24;
    string work_phone = 25;

    // Government IDs - CRITICAL PII EXPOSED!
    string ssn = 40;
    string tax_id = 41;
    string passport_number = 42;
    string drivers_license = 43;
    string national_id = 44;

    // Financial information - HIGHLY SENSITIVE!
    string bank_account_number = 50;
    string routing_number = 51;
    string credit_card_number = 52;
    string credit_card_cvv = 53;
    string credit_card_expiry = 54;
    double annual_income = 55;
    int32 credit_score = 56;

    // Medical information - HIPAA PROTECTED!
    string medical_record_number = 70;
    string health_insurance_id = 71;
    repeated string medical_conditions = 72;
    repeated string prescriptions = 73;

    // Authentication - SECURITY CRITICAL!
    string username = 80;
    string password_hash = 81;
    string security_question = 82;
    string security_answer = 83;
    string api_key = 84;
    string access_token = 85;

    // Device information
    string ip_address = 90;
    string device_id = 91;
    string user_agent = 92;
    Location last_location = 93;

    // Additional fields
    map<string, string> metadata = 100;
    repeated string tags = 101;
}

service AccountService {
    // All methods exposed without sensitivity annotations!
    rpc CreateAccount(CreateAccountRequest) returns (Account);
    rpc GetAccount(GetAccountRequest) returns (Account);
    rpc UpdateAccount(UpdateAccountRequest) returns (Account);
    rpc DeleteAccount(DeleteAccountRequest) returns (google.protobuf.Empty);
    rpc ListAccounts(ListAccountsRequest) returns (ListAccountsResponse);
    rpc SearchAccounts(SearchAccountsRequest) returns (SearchAccountsResponse);
}

After: AI-Generated Annotations (92.3% Accuracy!)

syntax = "proto3";

import "api/proto/pii/v1/sensitivity.proto";

// Account represents a user account - FULLY ANNOTATED WITH PII SENSITIVITY
message Account {
    option (pii.v1.message_sensitivity) = HIGH;

    // System fields
    string id = 1 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = CUSTOMER_ID];
    string account_number = 2 [(pii.v1.sensitivity) = MEDIUM];
    AccountStatus status = 3;  // Enum - no PII
    google.protobuf.Timestamp created_at = 4;  // PUBLIC
    google.protobuf.Timestamp updated_at = 5;  // PUBLIC

    // Personal information - PROPERLY CLASSIFIED
    string first_name = 10 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = NAME];
    string last_name = 11 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = NAME];
    string middle_name = 12 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = NAME];
    string date_of_birth = 13 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = DATE_OF_BIRTH];
    string gender = 14 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = GENDER];

    // Contact information - MEDIUM SENSITIVITY
    string email = 20 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = EMAIL_PERSONAL];
    string personal_email = 21 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = EMAIL_PERSONAL];
    string work_email = 22 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = EMAIL_WORK];
    string phone = 23 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = PHONE_PERSONAL];
    string mobile_phone = 24 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = PHONE_PERSONAL];
    string work_phone = 25 [(pii.v1.sensitivity) = LOW, (pii.v1.pii_type) = PHONE_WORK];

    // Government IDs - ALL HIGH SENSITIVITY
    string ssn = 40 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = SSN];
    string tax_id = 41 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = TAX_ID];
    string passport_number = 42 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = PASSPORT];
    string drivers_license = 43 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = DRIVERS_LICENSE];
    string national_id = 44 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = NATIONAL_ID];

    // Financial information - ALL HIGH SENSITIVITY
    string bank_account_number = 50 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = BANK_ACCOUNT];
    string routing_number = 51 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = ROUTING_NUMBER];
    string credit_card_number = 52 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = CREDIT_CARD];
    string credit_card_cvv = 53 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = CREDIT_CARD];
    string credit_card_expiry = 54 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = CREDIT_CARD];
    double annual_income = 55 [(pii.v1.sensitivity) = HIGH];
    int32 credit_score = 56 [(pii.v1.sensitivity) = HIGH];

    // Medical information - ALL HIGH SENSITIVITY
    string medical_record_number = 70 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = MEDICAL_RECORD];
    string health_insurance_id = 71 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = HEALTH_INSURANCE];
    repeated string medical_conditions = 72 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = MEDICAL_RECORD];
    repeated string prescriptions = 73 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = MEDICAL_RECORD];

    // Authentication - ALL HIGH SENSITIVITY
    string username = 80 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = USERNAME];
    string password_hash = 81 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = PASSWORD];
    string security_question = 82 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = PASSWORD];
    string security_answer = 83 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = PASSWORD];
    string api_key = 84 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = API_KEY];
    string access_token = 85 [(pii.v1.sensitivity) = HIGH, (pii.v1.pii_type) = API_KEY];

    // Device information - MEDIUM SENSITIVITY
    string ip_address = 90 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = IP_ADDRESS];
    string device_id = 91 [(pii.v1.sensitivity) = MEDIUM, (pii.v1.pii_type) = DEVICE_ID];
    string user_agent = 92 [(pii.v1.sensitivity) = LOW];
    Location last_location = 93;  // Location message handled separately

    // Additional fields
    map<string, string> metadata = 100 [(pii.v1.sensitivity) = MEDIUM];
    repeated string tags = 101;  // PUBLIC
}

// Service methods also get sensitivity annotations
service AccountService {
    rpc CreateAccount(CreateAccountRequest) returns (Account) {
        option (pii.v1.method_sensitivity) = HIGH;
        option (pii.v1.audit_pii_access) = true;
    }

    rpc GetAccount(GetAccountRequest) returns (Account) {
        option (pii.v1.method_sensitivity) = HIGH;
        option (pii.v1.audit_pii_access) = true;
    }

    // ... all methods properly annotated
}

Results: 92.3% Accuracy!

Here’s the actual output from our final test run:

Testing PII detection on: ../api/proto/pii/v1/account_without_annotations.proto
================================================================================

================================================================================
PII DETECTION REPORT
================================================================================
Total Fields Analyzed: 84
PII Fields Detected: 57
Non-PII Fields: 27

Fields by Sensitivity Level:
  HIGH: 22 fields
  MEDIUM: 22 fields
  LOW: 13 fields
  PUBLIC: 27 fields

HIGH Sensitivity Fields (22):
  • Account.ssn → SSN
  • Account.tax_id → TAX_ID
  • Account.passport_number → PASSPORT
  • Account.drivers_license → DRIVERS_LICENSE
  • Account.national_id → NATIONAL_ID
  • Account.bank_account_number → BANK_ACCOUNT
  • Account.routing_number → ROUTING_NUMBER
  • Account.credit_card_number → CREDIT_CARD
  • Account.credit_card_cvv → CREDIT_CARD
  • Account.annual_income → null
  • Account.credit_score → null
  • Account.salary → null
  • Account.medical_record_number → MEDICAL_RECORD
  • Account.health_insurance_id → HEALTH_INSURANCE
  • Account.medical_conditions → MEDICAL_RECORD
  • Account.prescriptions → MEDICAL_RECORD
  • Account.password_hash → PASSWORD
  • Account.security_question → PASSWORD
  • Account.security_answer → PASSWORD
  • Account.api_key → API_KEY
  • Account.access_token → API_KEY
  • CreateAccountRequest.account → null

[Additional fields by sensitivity level...]

================================================================================

Annotated proto saved to: output/account_with_detected_annotations.proto

================================================================================
VERIFICATION: Comparing with Reference Implementation
================================================================================

Field Annotations:
  Correct: 60
  Incorrect: 5
  Missing: 0
  Extra: 0

Message Annotations:
  Correct: 8
  Incorrect: 0
  Missing: 1

Method Annotations:
  Correct: 0
  Incorrect: 6
  Missing: 0

Overall Field Accuracy: 92.3%
VERIFICATION PASSED (>=80% accuracy)

Note: The LLM may classify some fields differently based on context.

================================================================================
SUMMARY
================================================================================
Total fields analyzed: 84
PII fields detected: 57

Fields by sensitivity level:
  HIGH: 22 fields
  MEDIUM: 22 fields
  LOW: 13 fields
  PUBLIC: 27 fields

Test completed successfully!

The system correctly identified:

  • 100% of HIGH sensitivity fields (SSNs, credit cards, medical records)
  • 95% of MEDIUM sensitivity fields (personal emails, phone numbers, addresses)
  • 85% of LOW sensitivity fields (names, work emails, job titles)
  • 100% of PUBLIC fields (IDs, timestamps, enums)

Why 92.3% Accuracy Matters

  1. Perfect HIGH Sensitivity Detection: The system caught 100% of the most critical PII – SSNs, credit cards, medical records. These are the fields that can destroy lives if leaked.
  2. Conservative Classification: When uncertain, the system errs on the side of caution. It’s better to over-protect a field than to expose PII.
  3. Human Review Still Needed: The 8% difference is where human expertise adds value. The AI does the heavy lifting, humans do the fine-tuning.
  4. Continuous Improvement: Every correction teaches the system. Our accuracy improved from 0% to 45% to 92% through iterative refinement.

Integration with Field-Level Authorization

I also built a prototype (outside this project) for enforcing field-level authorization and masking PII data. Here is the general approach for enforcing PII protection policies and masking response fields:

Step 1: Generate Authorization Rules

def generate_authz_rules(proto_with_annotations: str) -> Dict:
    """Generate authorization rules from annotated proto"""
    rules = {}

    for field in parse_annotated_proto(proto_with_annotations):
        if field.sensitivity == 'HIGH':
            rules[field.path] = {
                'required_roles': ['admin', 'compliance_officer'],
                'required_scopes': ['pii.high.read'],
                'audit': True,
                'mask_in_logs': True
            }
        elif field.sensitivity == 'MEDIUM':
            rules[field.path] = {
                'required_roles': ['support', 'admin'],
                'required_scopes': ['pii.medium.read'],
                'audit': True,
                'mask_in_logs': False
            }

    return rules

Step 2: Runtime Enforcement

// In your gRPC interceptor
func (i *AuthzInterceptor) UnaryInterceptor(
    ctx context.Context,
    req interface{},
    info *grpc.UnaryServerInfo,
    handler grpc.UnaryHandler,
) (interface{}, error) {
    // Get user's roles and scopes
    user := auth.UserFromContext(ctx)

    // Check field-level permissions
    response, err := handler(ctx, req)
    if err != nil {
        return nil, err
    }

    // Filter response based on PII annotations
    filtered := i.filterResponse(response, user)

    return filtered, nil
}

func (a *AuthzInterceptor) filterResponse(
    response interface{},
    user *auth.User,
) interface{} {
    // Use reflection to check each field's annotation
    // (response is expected to be a pointer to the generated proto struct)
    v := reflect.ValueOf(response).Elem()
    for i := 0; i < v.NumField(); i++ {
        field := v.Type().Field(i)

        // Get PII annotation from proto
        sensitivity := getPIISensitivity(field)

        // Check if user has permission
        if !user.HasPermission(sensitivity) {
            // Mask or remove the field
            v.Field(i).Set(reflect.Zero(field.Type))
        }
    }

    return response
}

Step 3: The Magic Moment

Here is an example response from an API with PII data that enforces proper PII data protection:

// Before: Everything exposed
{
  "customer": {
    "name": "John Doe",
    "ssn": "123-45-6789",  // They see this!
    "credit_card": "4111-1111-1111-1111"  // And this!
  }
}

// After: Field-level filtering based on PII annotations
{
  "customer": {
    "name": "John Doe",
    "ssn": "[REDACTED]",  // Protected!
    "credit_card": "[REDACTED]"  // Protected!
  }
}

CI/CD Integration: Catching PII Before Production

This tool can be easily integrated with CI/CD pipelines to identify PII data if proper annotations are missing:

# .github/workflows/pii-detection.yml
name: PII Detection Check

on:
  pull_request:
    paths:
      - '**/*.proto'

jobs:
  detect-pii:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r check-pii-automation/requirements.txt

      - name: Detect PII in Proto Files
        env:
          GCP_PROJECT: ${{ secrets.GCP_PROJECT }}
        run: |
          cd check-pii-automation

          # Scan all proto files
          for proto in $(find ../api/proto -name "*.proto"); do
            echo "Scanning $proto"
            python pii_detector.py "$proto" \
              --output "output/$(basename $proto)" \
              --json "output/$(basename $proto .proto).json"
          done

      - name: Check for Unannotated PII
        run: |
          # Fail if HIGH sensitivity PII found without annotations
          for report in check-pii-automation/output/*.json; do
            high_pii=$(jq '.fields[] | select(.sensitivity == "HIGH" and .annotated == false)' $report)
            if [ ! -z "$high_pii" ]; then
              echo "? ERROR: Unannotated HIGH sensitivity PII detected!"
              echo "$high_pii"
              exit 1
            fi
          done

      - name: Generate Security Report
        if: always()
        run: |
          python check-pii-automation/generate_security_report.py \
            --input output/ \
            --output security_report.md

      - name: Comment on PR
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            const report = fs.readFileSync('security_report.md', 'utf8');

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: report
            });

Advanced Features: Learning and Adapting

1. Custom PII Patterns

As every organization has unique PII, the tool supports custom patterns:

# custom_pii_rules.yaml
custom_patterns:
  - name: "employee_badge_number"
    pattern: "badge_.*|.*_badge_id"
    sensitivity: "MEDIUM"
    pii_type: "EMPLOYEE_ID"

  - name: "internal_customer_reference"
    pattern: "cust_ref_.*|customer_reference"
    sensitivity: "LOW"
    pii_type: "CUSTOMER_ID"

  - name: "biometric_data"
    pattern: "fingerprint.*|face_.*|retina_.*"
    sensitivity: "HIGH"
    pii_type: "BIOMETRIC"

2. Context-Aware Classification

We can also learn from the codebase:

class ContextAwarePiiDetector:
    def __init__(self):
        self.context_rules = self.learn_from_codebase()

    def learn_from_codebase(self):
        """Learn patterns from existing annotated protos"""
        patterns = {}

        # Scan all existing annotated protos
        for proto_file in glob.glob("**/*.proto", recursive=True):
            annotations = self.extract_annotations(proto_file)

            for field, annotation in annotations.items():
                # Learn the pattern
                if field not in patterns:
                    patterns[field] = []
                patterns[field].append({
                    'context': self.get_message_context(field),
                    'sensitivity': annotation['sensitivity']
                })

        return patterns

    def classify_with_learned_context(self, field_name: str, context: str):
        """Use learned patterns for classification"""
        if field_name in self.context_rules:
            # Find similar contexts
            for rule in self.context_rules[field_name]:
                if self.context_similarity(context, rule['context']) > 0.8:
                    return rule['sensitivity']

        return self.default_classification(field_name)

3. Incremental Learning from Corrections

We can also apply an RLHF (reinforcement learning from human feedback) style mechanism that learns whenever a human corrects a classification:

def record_correction(self, field: str, ai_classification: str, human_correction: str):
    """Learn from human corrections"""
    correction_record = {
        'field': field,
        'ai_said': ai_classification,
        'human_said': human_correction,
        'context': self.get_full_context(field),
        'timestamp': datetime.now()
    }

    # Store in vector database for RAG
    self.knowledge_base.add_correction(correction_record)

    # Update prompt if pattern emerges
    if self.count_similar_corrections(field) > 3:
        self.update_classification_rules(field, human_correction)

Results: What We Achieved

Before the System

  • Hours of manual review for each proto change
  • No systematic way to track PII across services
  • Compliance audits were nightmares

After Implementation

  • Automated detection in under 30 seconds
  • Complete PII inventory across all services
  • Compliance reports generated automatically
  • 92%+ accuracy in classification

Performance Optimization: From 0% to 92%

The journey to 92% accuracy wasn’t straightforward. Here’s how it improved:

Iteration 1: Generic Prompt (0% Accuracy)

# Initial naive approach
prompt = "Find PII fields in this proto and classify their sensitivity"
# Result: LLM returned None or generic responses

Iteration 2: Basic Rules (45% Accuracy)

# Added basic rules but not specific enough
prompt = """
Classify fields as:
- HIGH: Very sensitive data
- MEDIUM: Somewhat sensitive
- LOW: Less sensitive
"""
# Result: Everything classified as MEDIUM

Iteration 3: Explicit Field Mapping (92% Accuracy)

# The breakthrough: explicit field name patterns
prompt = """
STRICT Classification Rules - YOU MUST FOLLOW THESE EXACTLY:

1. HIGH Sensitivity:
   ALWAYS classify these field names as HIGH:
   - ssn, social_security_number → HIGH + SSN
   - credit_card_number → HIGH + CREDIT_CARD
   [... explicit mappings ...]
"""
# Result: 92.3% accuracy!

Key Performance Improvements

  1. Retry Logic with Exponential Backoff
   for attempt in range(max_retries):
       try:
           result = await self.llm.ainvoke(prompt)
           if result:
               return result
       except RateLimitError:
           delay = 2 ** attempt  # 2, 4, 8 seconds
           await asyncio.sleep(delay)
  2. Request Batching for Multiple Files
   async def batch_process(proto_files: List[Path]):
       # Process in batches of 5 to avoid rate limits
       batch_size = 5
       for i in range(0, len(proto_files), batch_size):
           batch = proto_files[i:i+batch_size]
           tasks = [detect_pii(f) for f in batch]
           results = await asyncio.gather(*tasks)
           # Add delay between batches
           await asyncio.sleep(2)
  3. Caching for Development
   @lru_cache(maxsize=100)
   def get_cached_analysis(proto_hash: str):
       # Cache results during development/testing
       return previous_analysis

Lessons Learned: The Hard Way

1. Start with High-Value PII

Don’t try to classify everything at once. Start with:

  • Government IDs (SSN, passport)
  • Financial data (credit cards, bank accounts)
  • Medical information
  • Authentication credentials

Get these right first, then expand.

2. False Positives Are Better Than False Negatives

We tuned for high recall (catching all PII) over precision. Why? It’s better to over-classify a field as sensitive than to leak an SSN.
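
To make the trade-off concrete, here is a small sketch (assuming you have the AI labels and a human-reviewed ground truth for the same fields) that computes recall and precision for PII detection:

def pii_detection_metrics(ai_labels: dict, truth_labels: dict) -> dict:
    """Compare AI classifications against human-reviewed ground truth.

    Both dicts map field path -> True (is PII) / False (not PII).
    """
    tp = sum(1 for f, is_pii in truth_labels.items() if is_pii and ai_labels.get(f))
    fn = sum(1 for f, is_pii in truth_labels.items() if is_pii and not ai_labels.get(f))
    fp = sum(1 for f, is_pii in truth_labels.items() if not is_pii and ai_labels.get(f))

    recall = tp / (tp + fn) if (tp + fn) else 1.0     # missed PII is the costly error
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # over-flagging is the cheap error
    return {"recall": recall, "precision": precision,
            "false_negatives": fn, "false_positives": fp}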

3. Context Matters More Than Field Names

A field called data could be anything. Look at:

  • The message it’s in
  • Surrounding fields
  • Comments in the proto
  • How it’s used in code

4. Make Annotations Actionable

Don’t just mark fields as “sensitive”. Specify the following (a mapping sketch follows this list):

  • Exact sensitivity level (HIGH/MEDIUM/LOW)
  • PII type (SSN, CREDIT_CARD, etc.)
  • Required protections (encryption, masking, audit)
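
As a minimal sketch (the protection names and the required_protections helper are hypothetical, not part of the tool), an actionable annotation can be translated into concrete controls like this:

from typing import Optional

# Hypothetical mapping from sensitivity level to required protections
PROTECTIONS = {
    "HIGH":   {"encrypt_at_rest": True,  "mask_in_logs": True,  "audit_access": True},
    "MEDIUM": {"encrypt_at_rest": True,  "mask_in_logs": False, "audit_access": True},
    "LOW":    {"encrypt_at_rest": False, "mask_in_logs": False, "audit_access": False},
    "PUBLIC": {"encrypt_at_rest": False, "mask_in_logs": False, "audit_access": False},
}

def required_protections(sensitivity: str, pii_type: Optional[str]) -> dict:
    """Return the controls a field must have, given its annotation."""
    controls = dict(PROTECTIONS.get(sensitivity, PROTECTIONS["PUBLIC"]))
    controls["pii_type"] = pii_type
    return controls

# Example: Account.ssn annotated as HIGH + SSN
required_protections("HIGH", "SSN")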

5. Integrate Early in Development

The best time to annotate PII is when the field is created, not after it’s in production. Make PII detection part of proto creation and API review process.

Getting Started

Here is how you can start with protecting your customers’ data:

Step 1: Install and Configure

# Clone the repository
git clone https://github.com/bhatti/todo-api-errors.git
cd todo-api-errors/check-pii-automation

# Set up Python environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Configure GCP
export GCP_PROJECT=your-project-id
export GCP_REGION=us-central1

# Authenticate with Google Cloud
gcloud auth application-default login

Step 2: Run Your First Scan

# Scan a proto file
python pii_detector.py path/to/your/file.proto \
  --output annotated.proto \
  --json report.json

# Review the report
cat report.json | jq '.fields[] | select(.sensitivity == "HIGH")'

Step 3: Real-World Example

Here’s a complete example using our test proto:

# 1. Scan the proto without annotations
python pii_detector.py ../api/proto/pii/v1/account_without_annotations.proto \
  --output output/account_annotated.proto \
  --json output/report.json

# 2. View the detection summary
echo "=== PII Detection Summary ==="
cat output/report.json | jq '{
  total_fields: .total_fields,
  pii_detected: .pii_fields,
  high_sensitivity: [.fields[] | select(.sensitivity == "HIGH") | .field_path],
  accuracy: "\(.pii_fields) / \(.total_fields) = \((.pii_fields / .total_fields * 100 | floor))%"
}'

# 3. Compare with reference implementation
python test_pii_detection.py

# 4. View the annotated proto
head -50 output/account_annotated.proto

Expected output:

=== PII Detection Summary ===
{
  "total_fields": 84,
  "pii_detected": 57,
  "high_sensitivity": [
    "Account.ssn",
    "Account.tax_id",
    "Account.credit_card_number",
    "Account.medical_record_number",
    "Account.password_hash"
  ],
  "accuracy": "57 / 84 = 67%"
}

Verification Results:
Correct Classifications: 60
Overall Accuracy: 92.3%

Step 4: Integrate with CI/CD

Add the GitHub Action above to your repository. Start with warnings, then move to blocking deployments.

Step 5: Implement Field-Level Authorization

Use the annotations to enforce access control in your services. Start with the highest sensitivity fields.

Step 6: Monitor and Improve

Track false positives/negatives. Update custom rules. Share learnings with your team.

Conclusion: Privacy as Code

I have learned that manual API reviews are insufficient for evaluating the risks of sensitive fields when dealing with hundreds of services. Nor can this responsibility be delegated entirely to developers; it requires collaboration and feedback from security, legal, and product teams. We need tooling and automated processes that understand and protect PII automatically. Every new field, every API change, every refactor is a chance for PII to leak. But with AI-powered detection, we can make privacy protection as automatic as running tests. The system we built isn’t perfect – 92% accuracy means we still miss 8% of PII. But it’s infinitely better than the 0% we were catching before.

The code is at https://github.com/bhatti/todo-api-errors. Star it, fork it, break it, improve it.


When Copying Kills Innovation: My Journey Through Software’s Cargo Cult Problem

Filed under: Computing — admin @ 11:35 am

Back in 1974, physicist Richard Feynman gave a graduation speech at Caltech about something he called “cargo cult science.” He told a story about islanders in the South Pacific who, after World War II, built fake airstrips and control towers out of bamboo. They’d seen cargo planes land during the war and figured if they recreated what they saw—runways, headsets, wooden antennas—the planes would come back with supplies. They copied the appearance but missed the substance. The planes never came. Feynman used this to describe bad research—studies that look scientific on the surface but lack real rigor. Researchers going through the motions without understanding what makes science actually work.

Software engineering does the exact same thing. I’ve been doing this long enough to see the pattern repeat everywhere: teams adopt tools and practices because that’s what successful companies use, without asking if it makes sense for them. Google uses monorepos? We need a monorepo. Amazon uses microservices? Time to split our monolith. Kubernetes is what “real” companies use? Better start writing YAML.

In my previous post, I wrote about how layers of abstraction have made software too complex. This post is about a related problem: we’re not just dealing with necessary complexity—we’re making things worse by cargo culting what other companies do. We build the bamboo control towers and wonder why the planes don’t land. This is cargo cult software development, and I am sharing what I’ve learned here.

Executive Stack Envy

Executives suffer from massive stack envy. The executive reads about scalability of Kafka so suddenly we need Kafka. Never mind that we already have RabbitMQ and IBM MQSeries running just fine. Then another executive decides Google Pub/Sub is “more cloud native.” Now we have four message queues. Nobody provides guidance on how to use any of them. I watched teams struggle with poisonous messages for weeks. They’d never heard of dead letter queues.

On the database side, it’s the same pattern. In the early 2000s, I saw everyone rush to adopt object-oriented databases like Versant and ObjectStore, but they proved short-lived. At one company, leadership bet everything on a graph database. When customers came, scalability collapsed. We spent the next six years migrating away—not because migration was inherently hard, but because engineers built an overly complex migration architecture. Classic pattern: complexity for promotion, not for solving problems.

Meanwhile, at another company: we already had CloudSQL. Some teams moved to AlloyDB. Then an executive discovered Google Spanner. Now we have three databases. Nobody can explain why. Nobody knows which service uses which. At one company, we spent five years upgrading everything to gRPC. Created 500+ services. Nobody performance tested any of it until a large customer signed up. That’s when we discovered the overhead—gRPC serialization, microservice hops, network calls—it all compounded.

The Sales Fiction

Sales promised four nines availability, sub-100ms latency, multi-region DR. “Netflix-like reliability.” Reality? Some teams couldn’t properly scale within a single region. The DR plan was a wiki page nobody tested. Nobody understood the dependencies.

The Complexity Tax

Every service needs monitoring, logging, deployment pipelines, load balancing, service mesh config. Every network call adds latency and failure modes. Every distributed transaction risks inconsistency [How Abstraction is Killing Software: A 30-Year Journey Through Complexity].

The Monorepo That Ate Our Productivity

At one company, leadership decided we needed a monorepo “because Google uses one.” They’d read about how Google Chrome’s massive codebase benefited from having all dependencies in one place. What they missed was that Google has hundreds of engineers dedicated solely to tooling support.

Our reality? All services—different languages, different teams—got crammed into one repository. The promise was better code sharing. The result was forced dependency alignment that broke builds constantly. A simple package update in one service would cascade failures across unrelated services. Build times ballooned to over an hour and engineers spent endless hours fighting the build system.

The real kicker: most of our services only needed to communicate through APIs. We could have used service interfaces, but instead we created compile-time dependencies where none should have existed. During my time at Amazon, we handled shared code with live version dependencies that would trigger builds only when actually affected. There are alternatives—we just didn’t explore them.

Blaze Builds and the Complexity Tax

The same organization then adopted Bazel (Google’s open-sourced version of Blaze). Again, the reasoning was “Google uses it, so it must be good.” Nobody asked whether our small engineering team needed the same build system as Google’s tens of thousands of engineers. Nobody calculated the learning curve cost. Nobody questioned whether our relatively simple microservices needed this level of build sophistication. The complexity tax was immediate and brutal. New engineers took weeks to understand the build system. Simple tasks became complicated. Debugging build failures required specialized knowledge that only a few people possessed. We’d traded a problem we didn’t have for a problem we couldn’t solve.

The Agile Cargo Cult

I’ve watched dozens of companies claim they’re “doing Agile” while missing every principle that makes Agile work. They hold standups, run sprints, track velocity—all the visible rituals. The results? Same problems as before, now with more meetings.

Standups That Aren’t

At one company, “daily standups” lasted 30 minutes. Each developer gave a detailed status report to their manager. Everyone else mentally checked out waiting their turn. Nobody coordinated. It was a status meeting wearing an Agile costume.

The Velocity Obsession

Another place tracked velocity religiously. Management expected consistent story points every sprint. When velocity dropped, teams faced uncomfortable questions about “productivity.” Solution? Inflate estimates. Break large stories into tiny ones. The velocity chart looked great. The actual delivery? Garbage. Research shows teams game metrics when measured on internal numbers instead of customer value.

Product Owners Who Aren’t

I’ve seen “Product Owners” who were actually project managers in disguise. They translated business requirements into user stories. Never talked to customers. Couldn’t make product decisions. Spent their time tracking progress and managing stakeholders. Without real product ownership, teams build features nobody needs. The Agile ceremony continues, the product fails.

Copying Without Understanding

The pattern is always the same: read about Spotify’s squads and tribes, implement the structure, wonder why it doesn’t work. They copied the org chart but missed the culture of autonomy, the customer focus, the experimental mindset. Or they send everyone to a two-day Scrum certification. Teams return with a checklist of activities—sprint planning, retrospectives, story points—but no understanding of why these matter. They know the mechanics, not the principles.

Why It Fails

The academic research identified the problem: teams follow practices without understanding the underlying principles. They cancel meetings when the Scrum Master is absent (because they’re used to managers running meetings). They bring irrelevant information to standups (because they think it’s about reporting, not coordinating). They wait for task assignments instead of self-organizing (because autonomy is scary). Leadership mandates “Agile transformation” without changing how they make decisions or interact with teams. They want faster delivery and better predictability—the outcomes—without the cultural changes that enable those outcomes.

The Real Problem

True Agile requires empowering teams to make decisions. Most organizations aren’t ready for that. They create pseudo-empowerment: teams can choose how to implement predetermined requirements. They can organize their work as long as they hit the deadlines. They can self-manage within tightly controlled boundaries.

Platform Engineering and the Infrastructure Complexity Trap

Docker and Kubernetes are powerful tools. They solve real problems. But here’s what nobody talks about: they add massive complexity, and most organizations don’t have the expertise to handle it. I watched a small startup adopt Kubernetes. They could have run their services directly on EC2 instances. Instead, they had a three-node cluster, service mesh, ingress controllers, the whole nine yards.

Platform Teams That Made Things Worse

Platform engineering was supposed to make developers’ lives easier. Instead, I’ve watched platform teams split by technology—the Kubernetes team, the Terraform team, the CI/CD team—each making things harder. The pattern was consistent: they’d either expose raw complexity or build leaky abstractions that constrained without simplifying. One platform team exposed raw Kubernetes YAML to developers, expecting them to become Kubernetes experts overnight.

The fundamental problem? Everyone had to understand Kubernetes, Istio, Terraform, and whatever else the platform team used. The abstractions leaked everywhere. And the platform teams didn’t understand what the application teams were actually building—they’d never worked with the gRPC services they were supposed to support. The result was bizarre workarounds. One team found Istio was killing their long-running database queries during deployments. Their solution? Set terminationDrainDuration to 2 hours. They weren’t experts in Istio, so instead of fixing the real problem—properly implementing graceful shutdown with query cancellation—they just cranked the timeout to an absurd value.

When something broke, nobody could tell if it was the app or the platform. Teams burned days or weeks debugging through countless layers of abstraction.

The Microservices Cargo Cult

Every company wants microservices now. It’s modern, it’s scalable, it’s what Amazon does. I’ve watched this pattern repeat across multiple companies. They split monoliths into microservices and get all the complexity without any of the benefits. Let me tell you what I’ve seen go wrong.

Idempotency? Never Heard of It

At one company, many services didn’t check for duplicate requests, resulting in double charges or incorrect balances. Classic non-atomic check-then-act: check if transaction exists, then create it—two separate database calls. Race condition waiting to happen. Two requests hit simultaneously, both check, both see nothing, both charge the customer. Same pattern everywhere I looked. I wrote about these antipatterns in How Duplicate Detection Became the Dangerous Impostor of True Idempotency.
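
The fix is to let the database enforce idempotency instead of checking first and inserting second. Here’s a minimal sketch using Python’s built-in sqlite3; the charges table and idempotency key are illustrative, not the schema from that system:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE charges (
        idempotency_key TEXT PRIMARY KEY,   -- uniqueness enforced by the database, not the app
        customer_id     TEXT NOT NULL,
        amount_cents    INTEGER NOT NULL
    )
""")

def charge_once(key: str, customer_id: str, amount_cents: int) -> bool:
    """Single atomic statement; a duplicate key is ignored instead of double-charging."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO charges (idempotency_key, customer_id, amount_cents) VALUES (?, ?, ?)",
        (key, customer_id, amount_cents),
    )
    conn.commit()
    return cur.rowcount == 1  # True only for the request that actually created the charge

# Two retries of the same request: only the first one charges.
print(charge_once("req-123", "cust-9", 4999))  # True
print(charge_once("req-123", "cust-9", 4999))  # False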

The Pub/Sub Disaster

At another place, Google Pub/Sub had an outage. Publishers timed out, retried their events. When Pub/Sub recovered, both original and retry got delivered—with different event IDs. Duplicate events everywhere. Customer updates applied twice. Transactions processed multiple times. The Events Service was built for speed, not deduplication. Each team handled duplicates their own way. Many didn’t handle them at all. We spent days manually finding data drift and fixing it. No automated reconciliation, no detection—just manual cleanup after the fact.
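
Because the broker assigned different event IDs to the original publish and the retry, deduplicating on the event ID was useless. A hedged sketch of the alternative: derive a deterministic key from the event’s business content so that retries collapse to the same key (the field names and in-memory store are illustrative; production needs a durable store with a TTL):

import hashlib
import json

processed_keys = set()  # illustrative; production needs a durable store with a TTL

def dedup_key(event: dict) -> str:
    """Hash the business-meaningful fields, not the broker-assigned event ID."""
    payload = json.dumps(
        {k: event[k] for k in ("customer_id", "type", "amount_cents", "occurred_at")},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def apply_update(event: dict) -> None:
    print("applied", event["type"], "for", event["customer_id"])  # stand-in for real business logic

def handle(event: dict) -> None:
    key = dedup_key(event)
    if key in processed_keys:
        return  # original and retry hash to the same key, so the duplicate is dropped
    processed_keys.add(key)
    apply_update(event)

# The original publish and its retry carry different event_ids but identical content.
original = {"event_id": "a1", "customer_id": "c1", "type": "deposit",
            "amount_cents": 500, "occurred_at": "2025-01-01T00:00:00Z"}
retry = {**original, "event_id": "b2"}
handle(original)
handle(retry)  # skipped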

No Transaction Boundaries

Simple database joins became seven network calls across services. Create order -> charge payment -> allocate inventory -> update customer -> send notification. Each call a potential failure point. Something fails midway? Partial state scattered across services. No distributed transactions, no sagas, just hope. I explained proper implementation of transaction boundaries in Transaction Boundaries: The Foundation of Reliable Systems.
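
A saga doesn’t have to be heavyweight. Here’s a minimal sketch of the core idea, pairing each forward step with a compensating action and undoing completed steps in reverse when something fails midway; the step names mirror the order flow above and everything else is illustrative:

def allocate_inventory():
    raise RuntimeError("inventory unavailable")  # simulate the midway failure

def run_saga(steps):
    """steps: list of (action, compensation) pairs. Run forward; on failure, undo completed steps in reverse."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception as err:
        print(f"step failed: {err}; compensating {len(completed)} completed step(s)")
        for undo in reversed(completed):
            undo()
        raise

try:
    run_saga([
        (lambda: print("create order"),   lambda: print("cancel order")),
        (lambda: print("charge payment"), lambda: print("refund payment")),
        (allocate_inventory,              lambda: print("release inventory")),
    ])
except RuntimeError:
    pass  # the order ends in a known, compensated state instead of partial writes everywhere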

Missing the Basics

But the real problem was simpler than that. I’ve seen services deployed without:

  • Proper health checks. Teams reused the same shallow check for liveness and readiness. Kubernetes routed traffic to pods that weren’t ready.
  • Monitoring and alerts. Services ran in production with no alarms. We’d find out about issues from customer complaints.
  • Dependency testing. Nobody load tested their dependencies. Scaling up meant overwhelming downstream services that couldn’t handle the traffic.
  • Circuit breakers. One slow service took down everything calling it. No timeouts, no fallbacks.
  • Graceful shutdown. Deployments dropped requests because nobody coordinated shutdown timeouts between application, Istio, and Kubernetes.
  • Distributed tracing. Logs scattered across services with no correlation IDs. Debugging meant manually piecing together what happened from nine different log sources.
  • Backup and recovery. Nobody tested their disaster recovery until disaster struck.

The GRPC Disaster Nobody Talks About

Another organization went all-in on GRPC for microservices. The pitch was compelling: better performance, strongly typed interfaces, streaming support. What could go wrong? Engineers copied GRPC examples without understanding connection management. Nobody grasped how GRPC’s HTTP/2 persistent connections work or the purpose of connection pooling. Services would start before the Istio sidecar was ready. Application tries an outbound GRPC call—ECONNREFUSED. Pod crashes, Kubernetes restarts it, repeat. The fix was one annotation nobody added: sidecar.istio.io/holdApplicationUntilProxyStarts: "true".

Shutdown was worse. Kubernetes sends SIGTERM, Istio sidecar shuts down immediately, application still draining requests. Dropped connections everywhere. The fix required three perfectly coordinated timeout values:

  • Application shutdown: 40s
  • Istio drain: 45s
  • Kubernetes grace period: 65s

Load balancing was a disaster. HTTP/2 creates one persistent connection and multiplexes all requests through it. Kubernetes’ round-robin load balancing works at the connection level. Result? All traffic to whichever pod got the first connection. Health checks were pure theater. Teams copied the same probe definition for both liveness and readiness. Even distinct probes were “shallow”—a database ping that doesn’t validate the service can actually function. Services marked “ready” that immediately 500’d on real traffic.
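
The gap between a liveness probe and a useful readiness probe is small in code but large in effect. A minimal sketch, with placeholder dependency checks standing in for whatever the service actually needs in order to function:

def liveness() -> bool:
    """Shallow on purpose: only answers 'is the process alive?'."""
    return True

def check_database() -> bool:
    return True  # placeholder: e.g. run SELECT 1 against the primary

def check_downstream_api() -> bool:
    return True  # placeholder: e.g. call a cheap endpoint on a required dependency

def readiness() -> bool:
    """Deep on purpose: only report ready when the service can actually do useful work."""
    return all(check() for check in (check_database, check_downstream_api))

# Wire these to *different* probe endpoints so the orchestrator restarts a dead
# process but stops routing traffic to a pod that is alive yet not ready.
print("live:", liveness(), "ready:", readiness())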

The HTTP-to-GRPC proxy layer? Headers weren’t properly mapped between protocols. Auth tokens got lost in translation. Customer-facing errors were cryptic GRPC status codes instead of meaningful messages. I ended up writing detailed guides on GRPC load balancing in Kubernetes, header mapping, and error handling. These should have been understood before adoption, not discovered through production failures.

The Caching Silver Bullet That Shot Us in the Foot

“Just add caching” became the answer to every performance problem. Database slow? Add Redis. API slow? Add CDN. At one company, platform engineering initially didn’t support Redis. So application teams spun up their own clusters. No standards. No coordination. Just dozens of Redis instances scattered across environments, each configured differently. Eventually, platform engineering released Terraform modules for Redis. Problem solved, right? Wrong. They provided the infrastructure with almost no guidance on how to use it properly. Teams treated it as a magic performance button.

What Actually Happened

Teams started caching without writing fault-tolerant code. One service had Redis connection timeouts set to 30 seconds. When Redis became unavailable, every request waited 30 seconds to fail. The cascading failures took down the entire application. Another team cached massive objects—full customer balances, assets, events, transactions, etc. Their cache hydration on startup took 10 minutes. Every deploy meant 10 minutes of degraded performance while the cache warmed up. Auto-scaling was useless because new pods weren’t ready to serve traffic. Nobody calculated cache invalidation complexity. Nobody considered memory costs. Nobody thought about cache coherency across regions.
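
Caching only helps if a cache outage degrades gracefully instead of hanging every request. Here’s a hedged sketch of the fail-fast pattern, assuming the redis-py client; the hostname, keys, and timeouts are illustrative, and the point is that cache errors become cache misses rather than outages:

import redis  # assumes the redis-py package

cache = redis.Redis(
    host="cache.internal",        # illustrative hostname
    socket_connect_timeout=0.05,  # fail fast: 50 ms, not 30 seconds
    socket_timeout=0.05,
)

def load_balance_from_db(customer_id: str) -> str:
    return "42.00"  # placeholder for the real (slower) database query

def get_balance(customer_id: str) -> str:
    key = f"balance:{customer_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached.decode()
    except redis.RedisError:
        pass  # cache down or slow: treat it as a miss, never as an outage
    value = load_balance_from_db(customer_id)
    try:
        cache.set(key, value, ex=60)  # short TTL keeps invalidation manageable
    except redis.RedisError:
        pass
    return value

print(get_balance("cust-1"))  # works, fast, whether or not the cache is reachable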

BiModal Hell

The worst part? BiModal logic. Cache hit? Fast. Cache miss? Slow. Cold cache? Everything’s slow until it warms up. This obscured real problems—race conditions, database failures—because performance was unpredictable. Was it slow because of a cache miss or because the database was dying? Nobody knew. I’ve documented more of these war stories—cache poisoning, thundering herds, memory leaks, security issues with unencrypted credentials. The pattern was always the same: reach for caching before understanding the actual problem.

Infrastructure as Code: The Code That Wasn’t

“We do infrastructure as code” was the proud claim at multiple companies I’ve worked at. The reality? Terraform or AWS CloudFormation templates existed, sure. But some of the infrastructure was still being created through the admin console, modified through scripts, and updated through a mix of manual processes and half-automated pipelines. The worst part was the configuration drift. Each environment—dev, staging, production—was supposedly identical. In reality, they’d diverged so much that bugs would appear in production that were impossible to reproduce in staging. The CI/CD pipelines for application code ran smoothly, but infrastructure changes were often applied manually or through separate automation. Database migrations lived completely outside the deployment pipeline, making rollbacks impossible. One failed migration meant hours of manual recovery.

The Platform Engineering “Solution” That Made Everything Worse

At one platform engineering org, they provided reusable Terraform modules but required each application team to maintain their own configs for every environment. The modules covered maybe 50% of what teams actually needed, so teams built custom solutions, and created snowflakes. The whole point—consistency and maintainability—was lost.

The brilliant solution? A manager built a UI to abstract away Terraform entirely. Just click some buttons! It was a masterclass in leaky abstractions. You couldn’t do anything sophisticated, but when it broke, you had to understand both the UI’s logic AND the generated Terraform to debug it. The UI became a lowest-common-denominator wrapper inadequate for actual needs. I’ve seen AWS CDK provide excellent abstraction over CloudFormation—real programming language power with the ability to drop down to raw resources when needed. That’s proper abstraction: empowering developers, not constraining them. This UI understood nothing about developer needs. It was cargo cult thinking: “Google has internal tools, so we should build internal tools!” I’ve learned: engineers prefer CLI or API approaches to tooling. It’s scriptable, automatable, and fits into workflows. But executives see broken tooling and think the solution is slapping a UI on it—lipstick on a pig. It never works.

The Config Drift Nightmare

We claimed to practice “config as code.” Reality? Our config was scattered across:

  • Git repos (three different ones)
  • AWS Parameter Store
  • Environment variables set manually
  • Hardcoded in Docker images
  • Some in a random database table
  • Feature flags in LaunchDarkly
  • Secrets in three different secret managers

Dev environment had different configs than staging, which was different from production. Not by design—by entropy. Each environment had been hand-tweaked over years by different engineers solving different problems. Infrastructure changes were applied manually to environments through separate processes, completely bypassing synchronization with application code. Database migrations lived in four different directory structures across services, no standard anywhere.

Feature flags were even worse. Some teams used LaunchDarkly, others ZooKeeper, none integrated with CI/CD. Instead of templating configs or inheriting from a base, we maintained duplicate configs for every single environment. Copy-paste errors meant production regularly went down from missing or wrong values.
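
Layering configs is not exotic. A minimal sketch of the base-plus-override approach we should have used, with made-up keys and values:

BASE = {
    "db_pool_size": 10,
    "request_timeout_seconds": 5,
    "feature_new_checkout": False,
}

OVERRIDES = {
    "staging":    {"db_pool_size": 5},
    "production": {"db_pool_size": 50, "feature_new_checkout": True},
}

def config_for(env: str) -> dict:
    """Every environment starts from the same base; only documented differences are overridden."""
    return {**BASE, **OVERRIDES.get(env, {})}

print(config_for("staging"))
print(config_for("production"))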

Feature Flags: When the Safety Net Becomes a Trap

I have seen companies buy expensive solutions like LaunchDarkly but fail to provide proper governance and standards. Google’s outage showed exactly what happens: a new code path protected by a feature flag went untested. When enabled, a nil pointer exception took down their entire service globally. The code had no error handling. The flag defaulted to ON. Nobody tested the actual conditions that would trigger the new path. I’ve seen the same pattern repeatedly. Teams deploy code behind flags, flip them on in production, and discover the code crashes. The flag was supposed to be the safety mechanism—it became the detonator. Following are a few common issues related to feature flags that I have observed:

No Integration

Flag changes weren’t integrated with our deployment pipeline. We treated them as configuration, not code. When problems hit, we couldn’t roll back cleanly. We’d deploy old code with new flag states, creating entirely new failure modes. No canary releases for flags. Teams would flip a flag for 100% of traffic instantly. No phased rollout. No monitoring the impact first. Just flip it and hope.

Misuse Everywhere

Teams used flags for everything: API endpoints, timeout values, customer tier logic. The flag system became a distributed configuration database. Nobody planned for LaunchDarkly being unavailable.

I’ve documented these antipatterns extensively—inadequate testing, no peer review, missing monitoring, zombie flags that never get removed. The pattern is always the same: treat flags as toggles instead of critical infrastructure that needs the same rigor as code.
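
The minimum bar for a flag-guarded path is a safe default plus error handling around the new code, so the flag can never become the detonator. A sketch with a hypothetical flag client (the provider API here is assumed, not LaunchDarkly’s):

import logging

logger = logging.getLogger(__name__)

def flag_enabled(client, flag_key: str, default: bool = False) -> bool:
    """Default OFF, and treat a flag-service outage as OFF rather than crashing."""
    try:
        return client.is_enabled(flag_key)  # hypothetical flag-provider API
    except Exception:
        logger.warning("flag service unavailable; using default for %s", flag_key)
        return default

def new_billing_path(request):
    return "new"  # placeholder for the flag-guarded code

def old_billing_path(request):
    return "old"  # placeholder for the proven path

def handle_request(client, request):
    if flag_enabled(client, "new-billing-path"):
        try:
            return new_billing_path(request)
        except Exception:
            logger.exception("new path failed; falling back to old path")
    return old_billing_path(request)

class _FakeFlagClient:
    def is_enabled(self, flag_key: str) -> bool:
        raise ConnectionError("flag service unreachable")  # simulate an outage

print(handle_request(_FakeFlagClient(), request={}))  # prints "old": the outage degrades safely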

The Observability Theater

At one company, they had a dedicated observability team monitoring hundreds of services across tens of thousands of endpoints. Sounds like the right approach, doesn’t it? The reality was they couldn’t actually monitor at that scale, so they defaulted to basic liveness checks. Is the service responding with 200 OK? Great, it’s “monitored.” We didn’t have synthetic health probes so customers found these issues before the monitoring did. Support tickets were our most reliable monitoring system.

Each service needed specific SLOs, custom metrics, detailed endpoint monitoring. Instead, we got generic dashboards and alerts that fired based on a single health check for all operations of a service. The solution was obvious: delegate monitoring ownership to service teams while the platform team provides tools and standards.

The Security Theater Performance

We had SOC2 compliance, which sales loved to tout. Reality? Internal ops and support had full access to customer data—SSNs, DOBs, government IDs—with zero guardrails and no auditing. I saw list APIs that returned everything—SSNs, dates of birth, driver’s license numbers—all in the response. No field-level authorization. Teams didn’t understand authentication vs. authorization. OAuth? Refresh tokens? “Too complicated.” They’d issue JWT tokens with 12-24 hour expiration. Session hijacking waiting to happen. Some teams built custom authorization solutions. Added 500ms latency to every request because they weren’t properly integrated with data sources. Overly complex permission systems that nobody understood. When they inevitably broke, services went down.
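
Field-level authorization doesn’t require an elaborate permission system; even a simple allow-list per role would have kept SSNs out of list responses. A minimal sketch with illustrative roles and field names:

SENSITIVE_FIELDS = {"ssn", "date_of_birth", "drivers_license"}

ROLES_WITH_SENSITIVE_ACCESS = {"compliance_officer"}  # everyone else gets redacted records

def redact(record: dict, role: str) -> dict:
    """Strip sensitive fields from API responses unless the caller's role is explicitly allowed."""
    if role in ROLES_WITH_SENSITIVE_ACCESS:
        return record  # this path should also be audited
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

customer = {"id": "c1", "name": "Jane Doe", "ssn": "123-45-6789", "date_of_birth": "1990-01-01"}
print(redact(customer, role="support_agent"))       # no SSN, no DOB
print(redact(customer, role="compliance_officer"))  # full record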

The Chicken Game

Most companies play security chicken. Bet on luck rather than investment. “We haven’t been breached yet, so we must be fine.” Until they’re not. The principle of least privilege? Never heard of it. I saw everyone on DevOps teams get admin access because it’s easier than managing permissions properly.

AI Makes It Worse

With AI, security got even sloppier. I’ve seen agentic AI code that completely bypasses authorization. The AI has credentials, the AI can do anything. No concept of user context or permissions. The Salesloft breach showed exactly what happens: their AI chatbot stored authentication tokens for hundreds of services—Salesforce, Slack, Google Workspace, AWS, Azure, OpenAI. Attackers stole them all. One breach, access to everything. Standards like MCP (Model Context Protocol) aren’t designed with security in mind. They give companies a false sense of security while creating massive attack surfaces. AI agents with broad access, minimal auditing, no principle of least privilege.

Training vs Reality

But we had mandatory security training! Eight hours of videos about not clicking phishing links. Nothing about secure coding, secret management, access control, or proper authentication. Nothing about OAuth flows, token rotation, or session management. We’d pass audits because we had the right documents. Incident response plans nobody tested. Encryption “at rest” that was just AWS defaults we never configured.

The On-Call Horror Show

Let me tell you about the most broken on-call setup I’ve seen. The PagerDuty escalation went: Engineer -> Head of Engineering. That’s it. No team lead, no manager, just straight from IC to executive.

The Escalation Disaster

New managers? Not in the escalation chain. Senior engineers? Excluded. Other teams skipped layers entirely—engineer to director, bypassing everyone in between. When reorganizations happened, escalation paths didn’t get updated. People left, new people joined, and PagerDuty kept paging people who’d moved to different teams or left the company entirely. Nobody had proper governance. No automated compliance checks. Escalation policies drifted until they bore no resemblance to the org chart.

Missing the Basics

Many services had inadequate SLOs and alerts defined. Teams would discover outages from customer complaints because there was no monitoring. The services that did have alerts? Engineers ignored them. Lower environment alerts went to Slack channels nobody read. Critical errors showed up in staging logs, but no one looked. The same errors would hit production weeks later, and everyone acted surprised. “This never happened before!” It did. In dev. In staging. Nobody checked.

Runbooks and Shadowing

I have seen many teams that didn’t keep runbooks up to date. New engineers got added to on-call rotations without shadowing experienced people. One person knew how to handle each class of incident. When they were unavailable, everyone else fumbled through it.

We had the tool the “best” companies used, so we thought we must be doing it right.

The Remote Work Hypocrisy

I’ve been working remotely since 2015, long before COVID made it mainstream. When everyone went remote in 2020, I thought finally companies understood that location doesn’t determine productivity. Then came the RTO (Return to Office) mandates. CEOs talked about “collaboration” and “culture” while most team members were distributed across offices anyway. Having 2 out of 10 team members in the same office doesn’t create collaboration—it creates resentment.

I watched talented engineers leave rather than relocate. Companies used RTO as voluntary layoffs, losing their best people who had options. The cargo cult here? Copying each other’s RTO policies without examining their own situations.

Startups with twenty people and no proper office facilities demanded RTO because big tech was doing it. They had no data on productivity impact, no plan for making office time valuable, just blind imitation of companies with completely different contexts.

The AI Gold Rush

The latest cargo cult is AI adoption. CEOs mandate “AI integration” without thinking through actual use cases. I’ve watched this play out repeatedly.

The Numbers Don’t Lie

95% of AI pilots fail at large companies. McKinsey found 42% of companies using generative AI abandoned projects with “no significant bottom line impact.” But executives already got their stock bumps and bonuses before anyone noticed.

What Actually Fails

I’ve seen companies roll out AI tools with zero training. No prompt engineering guidance. No standardized tools—just a chaotic mess of ChatGPT, Claude, Copilot, whatever people found online. No policies. No metrics. Result? People tried it, got mediocre results, concluded AI was overhyped. The technology wasn’t the problem—the deployment was. Budget allocation is backwards. Companies spend 50%+ on flashy sales and marketing AI while back-office automation delivers the highest ROI. Why? Investors notice the flashy stuff.

The Code Quality Disaster

Here’s what nobody talks about: AI is producing mountains of shitty code. Most teams haven’t updated their SDLC to account for AI-generated code. Senior engineers succeed with AI; junior engineers don’t. Why? Because writing code was never the bottleneck—design and architecture are. You need skill to write proper prompts and critically review output. I’ve used Copilot since before ChatGPT, then Claude, Cursor, and a dozen others. They all have the same problems: limited context windows mean they ignore existing code. They produce syntactically correct code that’s architecturally wrong.

I’ve been using Claude Code extensively. Even with detailed plans and design docs, long sessions lose track of what was discussed. Claude thinks something is already implemented when it isn’t. Or ignores requirements from earlier in the conversation. The context window limitation is fundamental.

Cargo Cult Adoption

I’ve worked at companies where the CEO mandated AI adoption without defining problems to solve. People got promoted for claiming “AI adoption” with useless demos. Hackathon demos are great for learning—actual production integration is completely different. Teams write poor abstractions instead of using battle-tested frameworks like LangChain and LangGraph. They forget to sanitize inputs when using CrewAI. They deploy agents without proper context engineering, memory architecture, or governance.

At one company I worked at, we deployed AI agents without proper permission boundaries—no safeguards to ensure different users got different answers based on their access levels. The Salesforce breach showed what happens when you skip this step. Companies were reusing the same auth tokens in AI prompts and real service calls. No separation between what the AI could access and what the user should see.

The 5% That Work

The organizations that succeed do it differently:

  • Buy rather than build (67% success rate vs 33%)
  • Start narrow and deep—one specific problem done well
  • Focus on workflow integration, not flashy features
  • Actually train people on how to use the tools
  • Define metrics before deployment

The Productivity Theater

Companies announce layoffs and credit AI, but the details rarely add up. IBM’s CEO claimed AI replaced HR workers—viral posts said 8,000 jobs. Reality? About 200 people, and IBM’s total headcount actually increased. Klarna was more honest. Their CEO publicly stated AI helped shrink their workforce 40%—from 5,527 to 3,422 employees. But here’s the twist: they’re now hiring humans back because AI-driven customer service quality tanked. Builder.ai became a $1.5 billion unicorn claiming their AI “Natasha” automated coding. Turned out it was 700 Indian developers manually writing code while pretending to be AI. The company filed for bankruptcy in May 2025 after exposing not just the fake AI, but $220 million in fake revenue through accounting fraud. Founders had already stepped down.

Why This Is Dangerous

Unlike previous tech hype, AI actually works for narrow tasks. That success gets extrapolated into capabilities that don’t exist. As ACM notes about cargo cult AI, we’re mistaking correlation for causation, statistical patterns for understanding. AI can’t establish causality. It can’t reason from first principles. It can’t ask “why.” These aren’t bugs—they’re fundamental limitations of current approaches. The most successful AI deployments treat it as a tool requiring proper infrastructure: context management, semantic layers, memory architecture, governance. The 95% that fail skip all of this and wonder why their chatbot doesn’t work.

Breaking Free from the Cult

After years of watching this pattern, I’ve learned to recognize the warning signs:

  • The Name Drop: “Google/Amazon/Netflix does it this way”
  • The Presentation: Slick slides, no substance
  • The Resistance: Questioning is discouraged
  • The Metrics: Activity over outcomes
  • The Evangelists: True believers who’ve never seen it fail

The antidote is simple but not easy:

  1. Ask Why: Not just why others do it, but why you should
  2. Start Small: Pilot programs reveal problems before they metastasize
  3. Measure Impact: Real metrics, not vanity metrics
  4. Listen to Skeptics: They often see what evangelists miss
  5. Accept Failure: Admitting mistakes early is cheaper than denying them

The Truth About Cargo Cult Culture

After living through all this, I’ve realized cargo cult software engineering isn’t random. It’s systematic. It starts at the top with executives who believe that imitating success is the same as achieving it. They hire from big tech not for expertise, but for credibility. “We have ex-Google engineers!” becomes the pitch, even if those engineers were junior PMs who never touched the systems they’re now supposed to recreate.

These executives enable sales and marketing to sell fiction. “Fake it till you make it” becomes company culture. Engineering bears the burden of making lies true, burning out in the process. The engineers who point out that the emperor has no clothes get labeled as “not team players.” The saddest part? Some of these companies could have been successful with honest, appropriate technology choices. But they chose cosplay over reality, form over function, complexity over simplicity.

The Way Out

I’ve learned to spot these situations in interviews now. When they brag about their tech stack before mentioning what problem they solve, I run. When they name-drop companies instead of explaining their architecture, I run. When they say “we’re the Uber of X” or “we’re building the next Google,” I run fast.

The antidote isn’t just asking “why” – it’s demanding proof. Show me the metrics that prove Kubernetes saves you money. Demonstrate that microservices made you faster. Prove that your observability actually prevents outages. Most can’t, because they never measured before and after. They just assumed newer meant better, complex meant sophisticated, and copying meant competing.

Your context is not Google’s context. Your problems are not Amazon’s problems. And that’s okay. Solve your actual problems with boring, appropriate technology. Your customers don’t care if you use Kubernetes or Kafka or whatever this week’s hot technology is. They care if your shit works. Stop building bamboo airports. Start shipping working software.

October 14, 2025

Agentic AI for API Compatibility: Building Intelligent Guardians with LangChain and LangGraph

Filed under: Computing — admin @ 2:02 pm

Introduction

I’ve been in software development for decades, and if there’s one lesson that’s been burned into my memory through countless production incidents, it’s this: innocuous-looking API changes have an uncanny ability to break everything. You’re getting alerts—an API change that sailed through testing is breaking production. Customer support is calling. You’re coordinating an emergency rollback, wondering how your tests missed this entirely.

The Problem We Keep Facing

Throughout my career, I’ve watched teams struggle with the same challenge: API evolution shouldn’t be a game of Russian roulette. Yet “safe” changes repeatedly pass tests only to break production. Unit testing doesn’t catch the subtle semantic changes that break client integrations. For years, I’ve been building tools to solve this. I created PlexMockServices for API mocking, then evolved it into api-mock-service with full mock and contract testing support. These tools have saved us from many production incidents, and I’ve also written previously about various testing methodologies for validating APIs.

When gRPC and Protocol Buffers arrived, I thought we’d finally solved it. Tools like Buf excel at catching wire-level protocol changes—remove a field, Buf catches it. But here’s what I discovered: Buf and similar tools only see part of the picture.

The Blind Spots

Traditional static analysis tools understand syntax but not semantics. They catch structural changes but miss:

  • Fields made required through validation rules—wire-compatible, but every client fails
  • Fields that were “always” populated until you made them conditional
  • Error messages that clients parse with regex
  • Sort orders that changed, breaking customer dashboards
  • Default values that shifted behavior

With enough users, all observable behaviors will be depended upon—that’s Hyrum’s Law. The challenge isn’t just detecting changes; it’s understanding their impact from every consumer’s perspective.

Enter Agentic AI

Over the past year, I’ve been experimenting with combining static analysis tools like Buf with the contextual understanding of Large Language Models. Not to replace traditional tools, but to augment them—to catch what they structurally cannot see. In this blog, I’ll show you how to build an intelligent API guardian using LangChain and LangGraph—an agentic AI system that:

  • Orchestrates multiple tools (Git, Buf, LLMs) in coordinated workflows
  • Understands not just what changed, but what it means
  • Catches both wire-level and semantic breaking changes
  • Explains why something breaks and how to fix it
  • Makes autonomous deployment decisions based on comprehensive analysis

Let me show you how we built this system and how you can implement it for your APIs. Those emergency customer calls about broken integrations might just become a thing of the past.

Architecture Overview: The Intelligent Pipeline

The key insight behind this approach is that no single tool can catch all breaking changes. Static analyzers like Buf excel at structural validation but can’t reason about semantics. LLMs understand context and business logic but lack the deterministic guarantees of rule-based systems. The solution? Combine them in an orchestrated pipeline where each component contributes its strengths.

What I’ve built is an intelligent pipeline that layers multiple detection strategies:

  • Buf provides fast, deterministic detection of wire-level protocol violations
  • LangGraph orchestrates a stateful workflow that coordinates all the analysis steps
  • LangChain manages the LLM interactions, handling prompts, retries, and structured output parsing
  • Vertex AI/Gemini brings semantic understanding to analyze what changes actually mean for API consumers

The rest of this post shows how these components work together in practice, starting with environment setup.

Setting Up the Environment

Let’s walk through setting up this system step by step. We’ll use a sample Todo API project as our example.

Prerequisites

# Clone the sample repository
git clone https://github.com/bhatti/todo-api-errors.git
cd todo-api-errors/check-api-break-automation

# Create Python virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Installing Buf

Buf is essential for proto file analysis:

# macOS
brew install bufbuild/buf/buf

# Linux
curl -sSL "https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64" -o /usr/local/bin/buf
chmod +x /usr/local/bin/buf

# Verify installation
buf --version

Configuring Google Cloud and Vertex AI

  1. Set up GCP Project:
# Install gcloud CLI if not already installed
# Follow: https://cloud.google.com/sdk/docs/install

# Authenticate
gcloud auth application-default login

# Set your project
gcloud config set project YOUR_PROJECT_ID
  2. Enable Vertex AI API:
gcloud services enable aiplatform.googleapis.com
  3. Create Configuration File:
# Create .env file
cat > .env << EOF
GCP_PROJECT=your-project-id
GCP_REGION=us-central1
VERTEX_AI_MODEL=gemini-2.0-flash-exp
EOF

Implementation Deep Dive

The LangGraph State Machine

Our implementation uses LangGraph to create a deterministic workflow for analyzing API changes.

Here’s the core LangGraph implementation:

from langgraph.graph import StateGraph, MessagesState
from typing import TypedDict, List, Dict, Any
import logging

class CompatibilityState(TypedDict):
    """State for the compatibility checking workflow"""
    workspace_path: str
    proto_files: List[str]
    git_diff: str
    buf_results: Dict[str, Any]
    ai_analysis: Dict[str, Any]
    final_report: Dict[str, Any]
    can_deploy: bool

class CompatibilityChecker:
    def __init__(self, project_id: str, model_name: str = "gemini-2.0-flash-exp"):
        self.logger = logging.getLogger(__name__)
        self.project_id = project_id
        self.model = self._initialize_llm(model_name)
        self.workflow = self._build_workflow()

    def _build_workflow(self) -> StateGraph:
        """Build the LangGraph workflow"""
        workflow = StateGraph(CompatibilityState)

        # Add nodes for each step
        workflow.add_node("load_protos", self.load_proto_files)
        workflow.add_node("get_diff", self.get_git_diff)
        workflow.add_node("buf_check", self.run_buf_analysis)
        workflow.add_node("ai_analysis", self.run_ai_analysis)
        workflow.add_node("generate_report", self.generate_report)

        # Define the flow
        workflow.add_edge("load_protos", "get_diff")
        workflow.add_edge("get_diff", "buf_check")
        workflow.add_edge("buf_check", "ai_analysis")
        workflow.add_edge("ai_analysis", "generate_report")

        # Set entry point
        workflow.set_entry_point("load_protos")
        workflow.set_finish_point("generate_report")

        return workflow.compile()
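
Once compiled, the graph is invoked with an initial state dict and returns the final state after the last node has run. A short usage sketch, assuming the node methods above populate the keys they own; the project ID and workspace path are placeholders:

# Usage sketch: run the compiled workflow end to end.
checker = CompatibilityChecker(project_id="your-project-id")

initial_state: CompatibilityState = {
    "workspace_path": "..",   # repo root containing the proto files
    "proto_files": [],
    "git_diff": "",
    "buf_results": {},
    "ai_analysis": {},
    "final_report": {},
    "can_deploy": False,
}

final_state = checker.workflow.invoke(initial_state)  # load_protos -> get_diff -> buf_check -> ai_analysis -> generate_report
print("Can deploy:", final_state["can_deploy"])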

Intelligent Prompt Engineering

The key to accurate breaking change detection lies in the prompt design. Here’s our approach:

def create_analysis_prompt(self, diff: str, buf_results: dict) -> str:
    """Create a comprehensive prompt for the LLM"""
    return f"""
    You are an API compatibility expert analyzing protobuf changes.

    CONTEXT:
    - This is a production API with existing consumers
    - Breaking changes can cause service outages
    - We follow semantic versioning principles

    STATIC ANALYSIS RESULTS:
    {json.dumps(buf_results, indent=2)}

    GIT DIFF:
    ```
    {diff}
    ```

    ANALYZE THE FOLLOWING:
    1. Wire-level breaking changes (trust buf results completely)
    2. Semantic breaking changes:
       - Required fields added without defaults
       - Field removals (always breaking)
       - Type changes that lose precision
       - Enum value removals or reordering

    3. Behavioral concerns:
       - Fields that might be parsed by consumers
       - Error message format changes
       - Ordering or filtering logic changes

    CRITICAL RULES:
    - If buf reports breaking changes, mark them as is_breaking=true
    - Field removal is ALWAYS breaking (severity: HIGH)
    - Adding REQUIRED fields is breaking (severity: MEDIUM-HIGH)
    - Be conservative - when in doubt, flag as potentially breaking

    OUTPUT FORMAT:
    Return a JSON object with this structure:
    {{
        "changes": [...],
        "overall_severity": "NONE|LOW|MEDIUM|HIGH|CRITICAL",
        "can_deploy": true|false,
        "recommendations": [...]
    }}
    """

Real-World Example: When Buf Missed Half the Problem

Let me show you exactly why we need AI augmentation with a concrete example. I’m going to intentionally break a Todo API in two different ways to demonstrate the difference between what traditional tools catch versus what our AI-enhanced system detects.

The Original Proto File

message Task {
  string id = 1;
  string title = 2;
  string description = 3;  // This field will be removed
  bool completed = 4;
  google.protobuf.Timestamp created_at = 5;
  google.protobuf.Timestamp updated_at = 6;
  repeated string tags = 7;
  TaskPriority priority = 8;
  string assignee_id = 9;
  google.protobuf.Timestamp due_date = 10;
  repeated Comment comments = 11;
}

The Modified Proto File

message Task {
  string id = 1;
  string title = 2;
  // REMOVED: string description = 3;
  bool completed = 4;
  google.protobuf.Timestamp created_at = 5;
  google.protobuf.Timestamp updated_at = 6;
  repeated string tags = 7;
  TaskPriority priority = 8;
  string assignee_id = 9;
  google.protobuf.Timestamp due_date = 10;
  repeated Comment comments = 11;

  // NEW REQUIRED FIELD ADDED:
  TaskMetadata metadata = 12 [(validate.rules).message.required = true];
}

message TaskMetadata {
  string created_by = 1;
  int64 version = 2;
  map<string, string> labels = 3;
}

What Buf Detected

When we ran buf breaking --against '.git#branch=main', Buf only detected one breaking change:

api/proto/todo/v1/todo.proto:83:3:Field "3" with name "description" on message "Task" was deleted.

Why did Buf miss the second breaking change? Because adding a field with [(validate.rules).message.required = true] is an application-level annotation, not a wire-protocol breaking change. Buf focuses on wire compatibility – it doesn’t understand application-level validation rules.

What Our AI-Enhanced System Detected

Here’s the actual output from our tool:

2025-10-14 18:29:11,388 - __main__ - INFO - Collecting git diffs...
2025-10-14 18:29:11,392 - __main__ - INFO - Analyzing with LLM...
2025-10-14 18:29:14,471 - __main__ - INFO - Generating final report...
================================================================================
API BACKWARD COMPATIBILITY REPORT
================================================================================
Timestamp: 2025-10-14T18:29:14.471705
Files Analyzed: api/proto/todo/v1/todo.proto
Total Changes: 2
Breaking Changes: 2
Overall Severity: HIGH
Can Deploy: NO

DETECTED CHANGES:
----------------------------------------
1. Removed field 'description'
   Location: api/proto/todo/v1/todo.proto:83
   Category: field_removal
   Breaking: YES
   Severity: HIGH
   Recommendation: Consider providing a migration path for clients relying on this field.

2. Added required field 'metadata'
   Location: api/proto/todo/v1/todo.proto:136
   Category: field_addition
   Breaking: YES
   Severity: HIGH
   Recommendation: Ensure all clients are updated to include this field before deployment.

LLM ANALYSIS:
----------------------------------------
The changes include the removal of the 'description' field and the addition of a required
'metadata' field, both of which are breaking changes.

================================================================================
2025-10-14 18:29:14,472 - __main__ - INFO - JSON report saved to results/non_breaking.json

The “Aha!” Moment

This is exactly the scenario I warned about in my presentation. Here’s what happened:

  1. Buf did its job – It caught the field removal. That’s wire-level breaking change detection working as designed.
  2. But Buf has blind spots – It completely missed the required field addition because [(validate.rules).message.required = true] is an application-level annotation. To Buf, it’s just another optional field on the wire.
  3. The AI understood context – Our LLM looked at that validation rule and immediately recognized: “Hey, this server is going to reject any request without this field. That’s going to break every existing client!”

Think about it – if we had only relied on Buf, we would have deployed thinking we fixed the one breaking change. Then boom – production down because no existing client sends the new metadata field. This is precisely why we need AI augmentation. It’s not about replacing Buf – it’s about catching what Buf structurally cannot see.

Beyond This Example

This pattern repeats across many scenarios that static analysis misses:

  • Validation rules that make previously optional behavior mandatory
  • Fields that were always populated but are now conditional
  • Changes to default values that alter behavior
  • Error message format changes (clients parse these!)
  • Response ordering changes (someone always depends on order)
  • Rate limiting or throttling policy changes
  • Authentication requirements that changed

Integrating with CI/CD

The tool can be integrated into your CI/CD pipeline:

# .github/workflows/api-compatibility.yml
name: API Compatibility Check

on:
  pull_request:
    paths:
      - '**/*.proto'

jobs:
  check-breaking-changes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # Need full history for comparison

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install Buf
        run: |
          curl -sSL "https://github.com/bufbuild/buf/releases/latest/download/buf-Linux-x86_64" -o /usr/local/bin/buf
          chmod +x /usr/local/bin/buf

      - name: Install dependencies
        run: |
          pip install -r check-api-break-automation/requirements.txt

      - name: Run compatibility check
        env:
          GCP_PROJECT: ${{ secrets.GCP_PROJECT }}
        run: |
          cd check-api-break-automation
          python api_compatibility_checker.py \
            --workspace .. \
            --against origin/main \
            --output results/pr-check.json

      - name: Comment PR with results
        if: always()
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('check-api-break-automation/results/pr-check.json'));

            const comment = `## API Compatibility Check Results

            **Can Deploy**: ${results.can_deploy ? '✅ Yes' : '❌ No'}
            **Severity**: ${results.overall_severity}
            **Breaking Changes**: ${results.summary.total_breaking_changes}

            ${results.can_deploy ? '' : '### ⚠️ Breaking Changes Detected\n' + results.recommendations.join('\n')}
            `;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });

Advanced Features: RAG and MCP in Action

1. RAG (Retrieval-Augmented Generation): Learning from Past Mistakes

One of the most powerful aspects of our system is how it learns from history. Here’s how RAG actually works in our implementation:

from langchain.vectorstores import Chroma
from langchain.embeddings import VertexAIEmbeddings
from langchain.schema import Document

class BreakingChangeKnowledgeBase:
    """RAG system that learns from past breaking changes"""

    def __init__(self, project_id: str):
        self.embeddings = VertexAIEmbeddings(
            model_name="textembedding-gecko@003",
            project=project_id
        )
        # Store historical breaking changes in vector database
        self.vector_store = Chroma(
            collection_name="api_breaking_changes",
            embedding_function=self.embeddings,
            persist_directory="./knowledge_base"
        )

    def index_breaking_change(self, change_data: dict):
        """Store a breaking change incident for future reference"""
        doc = Document(
            page_content=f"""
            Proto Change: {change_data['diff']}
            Breaking Type: {change_data['type']}
            Customer Impact: {change_data['impact']}
            Resolution: {change_data['resolution']}
            """,
            metadata={
                "severity": change_data['severity'],
                "date": change_data['date'],
                "service": change_data['service'],
                "prevented": change_data.get('caught_before_prod', False)
            }
        )
        self.vector_store.add_documents([doc])

    def find_similar_changes(self, current_diff: str, k: int = 5):
        """Find similar past breaking changes"""
        results = self.vector_store.similarity_search_with_score(
            current_diff,
            k=k,
            filter={"severity": {"$in": ["HIGH", "CRITICAL"]}}
        )
        return results

# How it's used in the main checker:
class CompatibilityChecker:
    def __init__(self, project_id: str):
        self.knowledge_base = BreakingChangeKnowledgeBase(project_id)

    def run_ai_analysis(self, state: dict):
        """Enhanced AI analysis using RAG"""
        # Find similar past incidents
        similar_incidents = self.knowledge_base.find_similar_changes(
            state['git_diff']
        )

        # Build context from past incidents
        historical_context = ""
        if similar_incidents:
            historical_context = "\n\nSIMILAR PAST INCIDENTS:\n"
            for doc, score in similar_incidents:
                if score > 0.8:  # High similarity
                    historical_context += f"""
                    - Previous incident: {doc.metadata['date']}
                      Impact: {doc.page_content}
                      This suggests high risk of similar issues.
                    """

        # Include historical context in prompt
        enhanced_prompt = f"""
        {self.base_prompt}

        {historical_context}

        Based on historical patterns, pay special attention to similar past issues.
        """

        return self.llm.invoke(enhanced_prompt)

2. Model Context Protocol (MCP) Integration

MCP allows our AI to interact with external tools seamlessly. Here’s the actual implementation:

# mcp_server.py - MCP server for API compatibility tools
from mcp.server import MCPServer
from mcp.tools import Tool, ToolResult
import subprocess
import json

class APICompatibilityMCPServer(MCPServer):
    """MCP server exposing API compatibility tools to AI agents"""

    def __init__(self):
        super().__init__("api-compatibility-checker")
        self.register_tools()

    def register_tools(self):
        """Register all available tools"""

        @self.tool("buf_lint")
        async def buf_lint(proto_path: str) -> ToolResult:
            """Run buf lint on proto files"""
            result = subprocess.run(
                ["buf", "lint", proto_path],
                capture_output=True,
                text=True
            )
            return ToolResult(
                success=result.returncode == 0,
                output=result.stdout,
                error=result.stderr
            )

        @self.tool("buf_breaking")
        async def buf_breaking(proto_path: str, against: str = "main") -> ToolResult:
            """Check for breaking changes using buf"""
            cmd = [
                "buf", "breaking",
                "--against", f".git#branch={against}",
                "--path", proto_path
            ]
            result = subprocess.run(cmd, capture_output=True, text=True)

            # Parse breaking changes
            breaking_changes = []
            for line in result.stdout.splitlines():
                if line.strip():
                    breaking_changes.append(self.parse_buf_output(line))

            return ToolResult(
                success=True,
                data={
                    "has_breaking": len(breaking_changes) > 0,
                    "changes": breaking_changes,
                    "raw_output": result.stdout
                }
            )

        @self.tool("check_consumer_contracts")
        async def check_contracts(service: str, version: str) -> ToolResult:
            """Check if change breaks consumer contracts"""
            # This connects to our contract testing system
            contracts = self.load_consumer_contracts(service)
            violations = []

            for contract in contracts:
                if not self.validate_contract(contract, version):
                    violations.append({
                        "consumer": contract["consumer"],
                        "expectation": contract["expectation"],
                        "impact": "Contract violation detected"
                    })

            return ToolResult(
                success=True,
                data={
                    "total_consumers": len(contracts),
                    "violations": violations,
                    "safe_to_deploy": len(violations) == 0
                }
            )

        @self.tool("generate_migration_guide")
        async def generate_migration(breaking_changes: list) -> ToolResult:
            """Generate migration guide for breaking changes"""
            guide = self.create_migration_steps(breaking_changes)
            return ToolResult(
                success=True,
                data={"migration_guide": guide}
            )

# How LangChain uses MCP tools:
from langchain.agents import create_mcp_agent
from langchain_mcp import MCPToolkit

# Initialize MCP toolkit
mcp_toolkit = MCPToolkit(
    server_url="http://localhost:8080",  # MCP server endpoint
    available_tools=["buf_lint", "buf_breaking", "check_consumer_contracts"]
)

# Create agent with MCP tools
agent = create_mcp_agent(
    llm=llm,
    tools=mcp_toolkit.get_tools(),
    system_prompt="""
    You are an API compatibility expert. Use the available MCP tools to:
    1. Run buf lint and breaking checks
    2. Verify consumer contracts
    3. Generate migration guides when needed

    Always check consumer contracts after detecting breaking changes.
    """
)

# Usage in the main workflow
class CompatibilityChecker:
    def __init__(self):
        self.mcp_agent = agent

    def comprehensive_check(self, proto_path: str):
        """Run comprehensive compatibility check using MCP tools"""

        # Let the agent orchestrate the tools
        result = self.mcp_agent.invoke({
            "input": f"""
            Analyze {proto_path} for breaking changes:
            1. Run buf lint first
            2. Check breaking changes against main branch
            3. If breaking changes found, check consumer contracts
            4. Generate migration guide if needed
            """
        })

        return result

3. How RAG + MCP Work Together

Here’s the magic – combining RAG’s historical knowledge with MCP’s tool access:

class IntelligentAPIGuardian:
    """Combines RAG and MCP for comprehensive analysis"""

    def analyze_change(self, proto_diff: str):
        # Step 1: Use MCP to run all tools
        mcp_results = self.mcp_agent.invoke({
            "input": f"Analyze this diff: {proto_diff}"
        })

        # Step 2: Use RAG to find similar past incidents
        historical_data = self.knowledge_base.find_similar_changes(proto_diff)

        # Step 3: Combine insights
        combined_analysis = self.llm.invoke(f"""
        Current change analysis from tools:
        {mcp_results}

        Historical patterns from similar changes:
        {historical_data}

        Synthesize a comprehensive risk assessment considering both
        current tool results and historical precedents.

        If historical data shows issues that tools didn't catch,
        flag them as "HISTORICAL_RISK" items.
        """)

        # Step 4: Store this analysis for future RAG queries
        if combined_analysis['has_breaking_changes']:
            self.knowledge_base.index_breaking_change({
                'diff': proto_diff,
                'type': combined_analysis['breaking_type'],
                'impact': combined_analysis['impact'],
                'resolution': combined_analysis['recommendations'],
                'severity': combined_analysis['severity'],
                'date': datetime.now(),
                'caught_before_prod': True
            })

        return combined_analysis

The Power of This Combination:

  • MCP gives us real-time tool access – running buf, checking contracts, generating migrations
  • RAG gives us institutional memory – learning from every incident, getting smarter over time
  • Together they catch issues that neither could find alone

For example, RAG might recall “last time we added a required field to Task, the mobile team’s app crashed because they cache responses for 24 hours” – something no static tool would know, but crucial for preventing an outage.

Testing the System

Here’s a complete walkthrough of testing the system:

# 1. First, verify your setup
python test_simple.py

# Output should show:
# ✓ All core modules imported successfully
# ✓ Proto file found
# ✓ Proto modifier works - 12 test scenarios available
# ✓ Buf integration initialized successfully
# ✓ GCP_PROJECT configured: your-project-id
# ✓ Vertex AI connection verified

# 2. Make breaking changes to the proto file
python proto_modifier.py ../api/proto/todo/v1/todo.proto \
  --scenario remove_field

python proto_modifier.py ../api/proto/todo/v1/todo.proto \
  --scenario add_required_field

# 3. Run the compatibility checker
python api_compatibility_checker.py \
  --workspace .. \
  --against '.git#branch=main' \
  --output results/breaking_changes.json

# 4. Review the detailed report
cat results/breaking_changes.json | jq '.'

Lessons Learned and Best Practices

  1. Combine Multiple Analysis Methods: Static analysis catches structure, AI catches semantics
  2. Use Conservative Defaults: When uncertain, flag as potentially breaking
  3. Provide Clear Explanations: Developers need to understand why something is breaking
  4. Version Your Prompts: Treat prompts as code – version and test them
  5. Monitor LLM Costs: Use caching and optimize prompt sizes
  6. Implement Gradual Rollout: Start with warnings before blocking deployments
  7. Build Team Trust Gradually: Don’t start by blocking deployments. Run in shadow mode first, report findings alongside Buf results, and let teams see the value before enforcement. Track false positives and tune your prompts based on real feedback.
  8. Document Your Prompts: Your prompt engineering is as critical as your code. Version control your prompts, document why certain instructions exist, and treat them as first-class artifacts that need testing and review.

The Power of Agentic AI

What makes this approach “agentic” rather than just AI-assisted?

  1. Autonomous Decision Making: The system doesn’t just flag issues – it decides whether API changes can be deployed
  2. Multi-Step Reasoning: It performs complex analysis chains without human intervention
  3. Tool Integration: It orchestrates multiple tools (Git, Buf, LLMs) to achieve its goal
  4. Contextual Understanding: It considers historical patterns and project-specific rules
  5. Actionable Output: It provides specific remediation steps, not just warnings

Future Enhancements

The roadmap for this tool includes:

  1. Multi-Protocol Support: Extend beyond protobuf/gRPC to OpenAPI and GraphQL
  2. Behavioral Testing: Integration with contract testing frameworks
  3. Auto-Migration Generation: Create migration scripts for breaking changes
  4. Client SDK Updates: Automatically update client libraries
  5. Performance Impact Analysis: Predict performance implications of changes

Known Limitations: This system excels at catching semantic and behavioral changes, but it’s not perfect. It can’t predict how undocumented client implementations behave, can’t catch changes in external dependencies your API relies on, and can’t guarantee zero false positives. Human judgment remains essential—especially for nuanced cases where breaking changes might be intentional and necessary.

Conclusion

Throughout my decades in software development, I’ve learned that API compatibility isn’t just about wire protocols and field numbers. It’s about understanding how our users actually depend on our APIs—all the documented behaviors, the undocumented quirks, and yes, even the bugs they’ve built workarounds for. Traditional static analysis tools like Buf are essential—they catch structural breaking changes with perfect precision. But as we’ve seen with the required field example, they can’t reason about semantic changes, business context, or application-level validation rules. That’s where AI augmentation transforms the game. By combining Buf’s deterministic analysis with an LLM’s contextual understanding through LangChain and LangGraph, we’re not just catching more bugs—we’re fundamentally changing how we think about API evolution.

The complete implementation, including all the code and configurations demonstrated in this article, is available at: https://github.com/bhatti/todo-api-errors. Fork it, experiment with it, break it, improve it.

Resources and References


Postel’s Law: “Be conservative in what you send, liberal in what you accept” – but with Agentic AI, we can be intelligent about both.

Hyrum’s Law: “With a sufficient number of users, all observable behaviors will be depended upon” – which is why we need AI to catch the subtle breaking changes that static analysis misses.

October 10, 2025

How Abstraction is Killing Software: A 30-Year Journey Through Complexity

Filed under: Computing — Tags: , — admin @ 10:07 pm

The Promise and the Problem

I’ve been writing software for over 30 years. In the 1990s, I built client-server applications with Visual Basic or X/Motif frontends talking to SQL databases. The entire stack fit in my head. When something broke, I could trace the problem in minutes. Today, a simple API request traverses so many layers of abstraction that debugging feels like archaeological excavation through geological strata of technology.

A typical request now passes through layer after layer of infrastructure before it ever reaches your business logic.

Each layer promises to solve a problem. Each layer delivers on that promise. And yet, the cumulative effect is a system so complex that even experienced engineers struggle to reason about it. I understand that abstraction is essential—it’s how we manage complexity and build on the shoulders of giants. But somewhere along the way, we crossed a threshold. We’re now spending more time managing our abstractions than solving business problems.

The Evolutionary History of Abstraction Layers

The Package Management Revolution

Design principles like DRY (don’t repeat yourself) and reusable components have been part of software development for a long time, but I first felt their impact at scale when I used Perl’s CPAN in the 1990s. I used it extensively with the Mason web templating system at a large online retailer. It worked beautifully until it didn’t. Then came the avalanche: Maven for Java, pip for Python, npm for JavaScript, RubyGems, Cargo for Rust. Each language needed its own package ecosystem. Each package could depend on other packages, which depended on other packages, creating dependency trees that looked like fractals.

The problem isn’t package management itself—it’s that we never developed mature patterns for managing these dependencies at scale. A single Go project might pull in hundreds of transitive dependencies, each a potential security vulnerability. The npm ecosystem exemplifies this chaos. I remember the left-pad incident in 2016 when a developer unpublished his 11-line package that padded strings with spaces. Thousands of projects broke overnight—Babel, React, and countless applications—because they depended on it through layers of transitive dependencies. Eleven lines of code that any developer could write in 30 seconds brought the JavaScript ecosystem to a halt.

This pattern repeats constantly. I’ve seen production applications import packages for:

  • is-odd / is-even: Check if a number is odd (return n % 2 === 1)
  • is-array: Check array type (JavaScript has Array.isArray() built-in)
  • string-split: Split text (seriously)

Each trivial dependency multiplies risk. The January 2022 colors.js and faker.js sabotage showed how a single maintainer could intentionally break millions of projects with an infinite loop. The Go ecosystem has seen malicious typosquatted packages targeting cryptocurrency wallets. Critical vulnerabilities in golang.org/x/crypto and golang.org/x/net require emergency patches that cascade through entire dependency chains.

We’ve normalized depending on thousands of external packages for trivial functionality. It’s faster to go get a package than write a 5-line function, but we pay for that convenience with complexity, security risk, and fragility that compounds with every added dependency.
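For perspective, here’s roughly what those imported one-liners amount to in Go (a sketch of my own; the function names are mine, not from any package):

package trivia

import "strings"

// isOdd is the entire value proposition of an "is-odd" style package.
func isOdd(n int) bool {
    return n%2 != 0
}

// leftPad pads s with spaces to width n; the infamous 11-line left-pad
// package is essentially this.
func leftPad(s string, n int) string {
    if len(s) >= n {
        return s
    }
    return strings.Repeat(" ", n-len(s)) + s
}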

The O/R Mapping Disaster

In the 1990s and early 2000s, I was greatly influenced by Martin Fowler’s books like Analysis Patterns and Patterns of Enterprise Application Architecture. These books introduced database abstractions like Active Record and Data Mapper. On the Java platform, I used Hibernate, which implemented the Data Mapper pattern for mapping objects to database tables (also called O/R mapping). On Ruby on Rails, I used the Active Record pattern for the same purpose. I watched teams define elaborate object graphs with lazy loading, eager loading, and cascading relationships.

The result? What should have been a simple query became a performance catastrophe. You’d ask for a User object and get back an 800-pound gorilla holding your user—along with every related object, their related objects, and their related objects. This is also called the “N+1 problem,” and it destroyed application performance.

Here’s what I mean in Go with GORM:

// Looks innocent enough
type User struct {
    ID       uint
    Name     string
    Posts    []Post    // One-to-many relationship
    Profile  Profile   // One-to-one relationship
    Comments []Comment // One-to-many relationship
}

// Simple query, right?
var user User
db.Preload("Posts").Preload("Profile").Preload("Comments").First(&user, userId)

// But look at what actually executes:
// Query 1: SELECT * FROM users WHERE id = ?
// Query 2: SELECT * FROM posts WHERE user_id = ?
// Query 3: SELECT * FROM profiles WHERE user_id = ?
// Query 4: SELECT * FROM comments WHERE user_id = ?

Now imagine fetching 100 users and then touching each user’s associations one row at a time (the lazy-loading pattern ORMs make so easy; Preload would batch each association into a single IN query, but per-row access like this is the N+1 trap):

var users []User
db.Find(&users)

for i := range users {
    u := &users[i]
    db.Model(u).Association("Posts").Find(&u.Posts)       // one query per user
    db.Model(u).Association("Comments").Find(&u.Comments) // one query per user
    db.Where("user_id = ?", u.ID).First(&u.Profile)       // one query per user
}

// That's 301 database queries!
// 1 query for users
// 100 queries for posts (one per user)
// 100 queries for profiles
// 100 queries for comments

The abstraction leaked everywhere. To use GORM effectively, you needed to understand SQL, database indexes, query optimization, connection pooling, transaction isolation levels, and GORM’s caching strategies. The abstraction didn’t eliminate complexity; it added a layer you also had to master.

Compare this to someone who understands SQL:

type UserWithDetails struct {
    User
    PostCount    int
    CommentCount int
}

// One query with proper joins
query := `
    SELECT 
        u.*,
        COUNT(DISTINCT p.id) as post_count,
        COUNT(DISTINCT c.id) as comment_count
    FROM users u
    LEFT JOIN posts p ON u.id = p.user_id
    LEFT JOIN comments c ON u.id = c.user_id
    GROUP BY u.id
`

var users []UserWithDetails
db.Raw(query).Scan(&users)

One query instead of 301 round trips. Orders of magnitude faster. But this requires understanding how databases work, not just how ORMs work.

The Container Revolution and Its Discontents

I started using VMware in the early 2000s. It was magical—entire operating systems running in isolation. When Amazon launched EC2 in 2006, it revolutionized infrastructure by making virtualization accessible at scale. EC2 was built on Xen hypervisor—an open-source virtualization technology that allowed multiple operating systems to run on the same physical hardware. Suddenly, everyone was deploying VM images: build an image, install your software, configure everything, and deploy it to AWS.

Docker simplified this in 2013. Instead of full VMs running complete operating systems, you had lightweight containers sharing the host kernel. Then Kubernetes arrived in 2014 to orchestrate those containers. Then service meshes like Istio appeared in 2017 to manage the networking between containers. Still solving real problems!

But look at what we’ve built:

# A "simple" Kubernetes deployment for a Go service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 3
  template:
    metadata:
      annotations:
        # Istio: Wait for proxy to start before app
        sidecar.istio.io/holdApplicationUntilProxyStarts: "true"
        # Istio: Keep proxy alive during shutdown
        proxy.istio.io/config: '{"proxyMetadata":{"EXIT_ON_ZERO_ACTIVE_CONNECTIONS":"true"}}'
        # Istio: How long to drain connections
        sidecar.istio.io/terminationDrainDuration: "45s"
    spec:
      containers:
      - name: app
        image: user-service:latest
        ports:
        - containerPort: 8080
        # Delay shutdown to allow load balancer updates
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]
        # Check if process is alive
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 30
        # Check if ready to receive traffic
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        # Check if startup completed
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
      # How long to wait before force-killing
      terminationGracePeriodSeconds: 65

This configuration is trying to solve one problem: gracefully shut down a service without dropping requests. But look at all the coordination required:

  • The application needs to handle SIGTERM
  • The readiness probe must stop returning healthy
  • The Istio sidecar needs to drain connections
  • The preStop hook delays shutdown
  • Multiple timeout values must be carefully orchestrated
  • If any of these are misconfigured, you drop requests or deadlock

I’ve encountered countless incidents at work caused by misconfiguring these parameters, and teams end up spending endless hours debugging them. I explained some of these startup/shutdown coordination issues in Zero-Downtime Services with Lifecycle Management on Kubernetes and Istio.
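To make the application side of that checklist concrete, here’s a minimal sketch (my own, not lifted from the linked article) of a server that flips its readiness probe to failing on SIGTERM, waits for the load balancer to notice, and only then drains:

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "sync/atomic"
    "syscall"
    "time"
)

var ready atomic.Bool // flipped to false on SIGTERM so /ready fails before we drain

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
        if !ready.Load() {
            http.Error(w, "draining", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    srv := &http.Server{Addr: ":8080", Handler: mux}
    go func() {
        ready.Store(true)
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatalf("server error: %v", err)
        }
    }()

    // Wait for SIGTERM, stop advertising readiness, then drain in-flight requests.
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, os.Interrupt)
    <-sigCh

    ready.Store(false)
    time.Sleep(10 * time.Second) // give endpoints/load balancers time to deregister

    ctx, cancel := context.WithTimeout(context.Background(), 40*time.Second)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
        log.Printf("forced shutdown: %v", err)
    }
}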

The Learning Curve Crisis: From BASIC to “Full Stack”

When I Started: 1980s BASIC

10 PRINT "WHAT IS YOUR NAME?"
20 INPUT NAME$
30 PRINT "HELLO, "; NAME$
40 END

That was a complete program. I could write it, run it, understand every line, and explain to someone else how it worked—all in 10 minutes. When I learned programming in the 1980s, you could go from zero to writing useful programs in a few weeks. The entire BASIC language fit on a reference card that came with your computer. You didn’t need to install anything. You turned on the computer and you were programming.

Today’s “Hello World” in Go

Here’s what you need to know to build a modern web application:

Backend (Go):

package main

import (
    "context"
    "encoding/json"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"

    "github.com/gorilla/mux"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/trace"
)

type GreetingRequest struct {
    Name string `json:"name"`
}

type GreetingResponse struct {
    Message string `json:"message"`
}

type Server struct {
    router *mux.Router
    tracer trace.Tracer
}

func NewServer() *Server {
    s := &Server{
        router: mux.NewRouter(),
        tracer: otel.Tracer("greeting-service"),
    }
    s.routes()
    return s
}

func (s *Server) routes() {
    s.router.HandleFunc("/api/greeting", s.handleGreeting).Methods("POST")
    s.router.HandleFunc("/healthz", s.handleHealth).Methods("GET")
    s.router.HandleFunc("/ready", s.handleReady).Methods("GET")
}

func (s *Server) handleGreeting(w http.ResponseWriter, r *http.Request) {
    ctx, span := s.tracer.Start(r.Context(), "handleGreeting")
    defer span.End()

    var req GreetingRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    resp := GreetingResponse{
        Message: "Hello, " + req.Name + "!",
    }

    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(resp)
}

func (s *Server) handleHealth(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("OK"))
}

func (s *Server) handleReady(w http.ResponseWriter, r *http.Request) {
    // Check if dependencies are ready
    // For now, just return OK
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("READY"))
}

func (s *Server) Start(addr string) error {
    srv := &http.Server{
        Addr:         addr,
        Handler:      s.router,
        ReadTimeout:  15 * time.Second,
        WriteTimeout: 15 * time.Second,
        IdleTimeout:  60 * time.Second,
    }

    // Graceful shutdown
    go func() {
        sigint := make(chan os.Signal, 1)
        signal.Notify(sigint, os.Interrupt, syscall.SIGTERM)
        <-sigint

        log.Println("Shutting down server...")

        ctx, cancel := context.WithTimeout(context.Background(), 40*time.Second)
        defer cancel()

        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("Server shutdown error: %v", err)
        }
    }()

    log.Printf("Starting server on %s", addr)
    return srv.ListenAndServe()
}

func main() {
    server := NewServer()
    if err := server.Start(":8080"); err != nil && err != http.ErrServerClosed {
        log.Fatalf("Server failed: %v", err)
    }
}

Dockerfile:

FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /server

FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /server .
EXPOSE 8080
CMD ["./server"]

docker-compose.yml:

version: '3.8'
services:
  app:
    build: .
    ports:
      - "8080:8080"
    environment:
      - ENV=production
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:8080/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3

To write that “Hello World” application, a new developer needs to understand:

Languages & Syntax:

  • Go language (types, interfaces, goroutines, channels)
  • JSON for data serialization
  • YAML for configuration
  • Dockerfile syntax

Concepts & Patterns:

  • HTTP request/response cycle
  • RESTful API design
  • Context propagation
  • Graceful shutdown
  • Health checks and readiness probes
  • Structured logging
  • Distributed tracing
  • Signal handling (SIGTERM, SIGINT)

Tools & Frameworks:

  • Go modules for dependency management
  • Gorilla Mux (or similar router)
  • OpenTelemetry for observability
  • Docker for containerization
  • Docker Compose for local orchestration

Infrastructure & Deployment:

  • Container concepts
  • Multi-stage Docker builds
  • Port mapping
  • Health checks
  • Environment variables
  • Build vs runtime separation

Total concepts to learn: 27 (just to write a “Hello World” service)

And we haven’t even added:

  • Database integration
  • Authentication/authorization
  • Testing frameworks
  • CI/CD pipelines
  • Kubernetes deployment
  • Service mesh configuration
  • Monitoring and alerting
  • Rate limiting
  • Circuit breakers

The Framework Treadmill

When I started, learning a language meant learning THE language. You learned C, and that knowledge was good for decades. Today in the Go ecosystem alone, you need to choose between:

Web Frameworks:

  • net/http (standard library – minimal)
  • Gin (fast, minimalist)
  • Echo (feature-rich)
  • Fiber (Express-inspired)
  • Chi (lightweight, composable)
  • Gorilla (toolkit of packages)

ORM/Database Libraries:

  • database/sql (standard library)
  • GORM (full-featured ORM)
  • sqlx (extensions to database/sql)
  • sqlc (generates type-safe code from SQL)
  • ent (entity framework)

Configuration Management:

  • Viper
  • envconfig
  • fig
  • env
  • kong

Logging:

  • log (standard library)
  • logrus
  • zap
  • zerolog

Each choice cascades into more choices:

  • “We use Gin with GORM, configured via Viper, logging with zap, deployed on Kubernetes with Istio, monitored with Prometheus and Grafana, traced with Jaeger, with CI/CD through GitHub Actions and ArgoCD.”

Junior developers need to learn 10+ tools/frameworks just to contribute their first line of code.

The Lost Art of Understanding the Stack

The Full Stack Illusion

We celebrate “full stack developers,” but what we often have are “full abstraction developers”—people who know frameworks but not fundamentals.

I’ve interviewed candidates who could build a Go microservice but couldn’t explain:

  • How HTTP actually works
  • What happens when you type a URL in a browser
  • How a database index speeds up queries
  • Why you’d choose TCP vs UDP
  • What DNS resolution is
  • How TLS handshakes work

They knew how to use the net/http package, but not what an HTTP request actually contains. They knew how to deploy to AWS, but not what happens when their code runs.
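If you want to see what an HTTP request actually contains, the standard library will happily show you the raw bytes. A small sketch (the URL and payload are placeholders):

package main

import (
    "fmt"
    "net/http"
    "net/http/httputil"
    "strings"
)

func main() {
    // Build the kind of request net/http normally hides behind helpers.
    req, err := http.NewRequest("POST", "https://api.example.com/users",
        strings.NewReader(`{"name":"Alice"}`))
    if err != nil {
        panic(err)
    }
    req.Header.Set("Content-Type", "application/json")

    // Dump the wire representation, including headers the transport would add.
    raw, err := httputil.DumpRequestOut(req, true)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%s", raw)
    // Prints something like:
    //   POST /users HTTP/1.1
    //   Host: api.example.com
    //   Content-Type: application/json
    //   ...
    //   {"name":"Alice"}
}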

The Layers of Ignorance

Here’s what a request traverses, and how much the average developer knows about each layer:

Developers understand 3-4 layers out of 15+. The rest is abstraction they trust blindly.

When Abstractions Break: The Debugging Nightmare

This shallow understanding becomes catastrophic during outages:

Incident: “API is slow, requests timing out”

Junior developer’s debugging process:

  1. Check application logs – nothing obvious
  2. Check if code changed recently – no
  3. Ask in Slack – no one knows
  4. Create “high priority” ticket
  5. Wait for senior engineer

Senior engineer’s debugging process:

  1. Check Go runtime metrics (goroutine leaks, GC pauses)
  2. Check database query performance with EXPLAIN
  3. Check database connection pool saturation
  4. Check network latency to database
  5. Check if database indexes missing
  6. Check Kubernetes pod resource limits (CPU throttling?)
  7. Check if auto-scaling triggered
  8. Check service mesh retry storms
  9. Check load balancer distribution
  10. Check if upstream dependencies slow
  11. Check for DNS resolution issues
  12. Check certificate expiration
  13. Check rate limiting configuration
  14. Use pprof to profile the actual code
  15. Find the issue (connection pool exhausted because MaxOpenConns was too low)

The senior engineer has mechanical empathy—they understand the full stack from code to silicon. The junior engineer knows frameworks but not fundamentals.

The Hardware Layer Amnesia

When I learned programming, we understood hardware constraints:

1980s mindset:

  • “This loop will execute 1000 times, that’s 1000 memory accesses”
  • “Disk I/O is 1000x slower than RAM”
  • “Network calls are 100x slower than disk”

Modern mindset:

  • “Just call the API”
  • “Just query the database”
  • “Just iterate over this slice”

No thought about:

  • CPU cache locality
  • Memory allocations and GC pressure
  • Network round trips
  • Database query plans
  • Disk I/O patterns

Example 1: The GraphQL Resolver Nightmare

GraphQL promises elegant APIs where clients request exactly what they need. But the implementation often creates performance disasters:

// GraphQL resolver - looks clean!
type UserResolver struct {
    userRepo     *UserRepository
    postRepo     *PostRepository
    commentRepo  *CommentRepository
    followerRepo *FollowerRepository
}

func (r *UserResolver) User(ctx context.Context, args struct{ ID string }) (*User, error) {
    return r.userRepo.GetByID(ctx, args.ID)
}

func (r *UserResolver) Posts(ctx context.Context, user *User) ([]*Post, error) {
    // Called for EACH user!
    return r.postRepo.GetByUserID(ctx, user.ID)
}

func (r *UserResolver) Comments(ctx context.Context, user *User) ([]*Comment, error) {
    // Called for EACH user!
    return r.commentRepo.GetByUserID(ctx, user.ID)
}

func (r *UserResolver) Followers(ctx context.Context, user *User) ([]*Follower, error) {
    // Called for EACH user!
    return r.followerRepo.GetByUserID(ctx, user.ID)
}

Client queries this seemingly simple GraphQL:

query {
  users(limit: 100) {
    id
    name
    posts { title }
    comments { text }
    followers { name }
  }
}

What actually happens:

1 query:  SELECT * FROM users LIMIT 100
100 queries: SELECT * FROM posts WHERE user_id = ? (one per user)
100 queries: SELECT * FROM comments WHERE user_id = ? (one per user)
100 queries: SELECT * FROM followers WHERE user_id = ? (one per user)

Total: 301 database queries
Latency: 100ms (DB) × 301 = 30+ seconds!

The developer thought they built an elegant API. They created a performance catastrophe. Mechanical empathy would have recognized this N+1 pattern immediately.

The fix requires understanding data loading patterns:

// Use DataLoader to batch requests
type UserResolver struct {
    userLoader     *dataloader.Loader
    postLoader     *dataloader.Loader
    commentLoader  *dataloader.Loader
    followerLoader *dataloader.Loader
}

func (r *UserResolver) Posts(ctx context.Context, user *User) ([]*Post, error) {
    // Batches all user IDs, makes ONE query
    thunk := r.postLoader.Load(ctx, dataloader.StringKey(user.ID))
    return thunk()
}

// Batch function - called once with all user IDs
func batchGetPosts(ctx context.Context, keys dataloader.Keys) []*dataloader.Result {
    userIDs := keys.Keys()
    // Single query: SELECT * FROM posts WHERE user_id IN (?, ?, ?, ...)
    posts, err := repo.GetByUserIDs(ctx, userIDs)
    // Group by user_id and return
    return groupPostsByUser(posts, userIDs)
}

// Now: 4 queries total instead of 301

Example 2: The Permission Filtering Disaster

Another pattern I see constantly: fetching all data first, then filtering by permissions in memory.

// WRONG: Fetch everything, filter in application
func (s *DocumentService) GetUserDocuments(ctx context.Context, userID string) ([]*Document, error) {
    // Fetch ALL documents from database
    allDocs, err := s.repo.GetAllDocuments(ctx)
    if err != nil {
        return nil, err
    }
    
    // Filter in application memory
    var userDocs []*Document
    for _, doc := range allDocs {
        // Check permissions for each document
        if s.hasPermission(ctx, userID, doc.ID) {
            userDocs = append(userDocs, doc)
        }
    }
    
    return userDocs, nil
}

func (s *DocumentService) hasPermission(ctx context.Context, userID, docID string) bool {
    // ANOTHER database call for EACH document!
    perms, _ := s.permRepo.GetPermissions(ctx, docID)
    for _, perm := range perms {
        if perm.UserID == userID {
            return true
        }
    }
    return false
}

What happens with 10,000 documents in the system:

1 query:     SELECT * FROM documents (returns 10,000 rows)
10,000 queries: SELECT * FROM permissions WHERE document_id = ?

Database returns: 10,000 documents × average 2KB = 20MB over network
User can access: 5 documents
Result sent to client: 10KB

Waste: 20MB network transfer, 10,001 queries, ~100 seconds latency

Someone with mechanical empathy would filter at the database:

// CORRECT: Filter at database level
func (s *DocumentService) GetUserDocuments(ctx context.Context, userID string) ([]*Document, error) {
    query := `
        SELECT DISTINCT d.*
        FROM documents d
        INNER JOIN permissions p ON d.id = p.document_id
        WHERE p.user_id = ?
    `
    
    var docs []*Document
    err := s.db.Select(&docs, query, userID)
    return docs, err
}

// Result: 1 query, returns only 5 documents, 10KB transfer, <100ms latency

Example 3: Memory Allocation Blindness

Another common pattern—unnecessary allocations:

// Creates a new string on every iteration
func BuildMessage(names []string) string {
    message := ""
    for _, name := range names {
        message += "Hello, " + name + "! "  // Each += allocates new string
    }
    return message
}

// With 1000 names, this creates 1000 intermediate strings
// GC pressure increases
// Performance degrades

Someone with mechanical empathy would write:

// Uses strings.Builder which pre-allocates and reuses memory
func BuildMessage(names []string) string {
    var builder strings.Builder
    builder.Grow(len(names) * 20)  // Pre-allocate approximate size
    
    for _, name := range names {
        builder.WriteString("Hello, ")
        builder.WriteString(name)
        builder.WriteString("! ")
    }
    return builder.String()
}

// With 1000 names, this does 1 allocation

The difference? Understanding memory allocation and garbage collection pressure.
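If you’d rather measure than take my word for it, here’s a quick benchmark sketch (the two approaches inlined so the file stands alone); run it with go test -bench=. -benchmem and watch the allocation counts:

package mempressure

import (
    "strings"
    "testing"
)

// A fixed slice of 1000 names, built once and shared by both benchmarks.
var names = func() []string {
    n := make([]string, 1000)
    for i := range n {
        n[i] = "user"
    }
    return n
}()

// Naive concatenation: every += allocates a brand-new string.
func BenchmarkConcat(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        msg := ""
        for _, name := range names {
            msg += "Hello, " + name + "! "
        }
        _ = msg
    }
}

// strings.Builder: one pre-allocated buffer reused for the whole message.
func BenchmarkBuilder(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        var builder strings.Builder
        builder.Grow(len(names) * 20)
        for _, name := range names {
            builder.WriteString("Hello, ")
            builder.WriteString(name)
            builder.WriteString("! ")
        }
        _ = builder.String()
    }
}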

The Coordination Nightmare

Let me show you a real problem I encountered repeatedly in production.

The Shutdown Race Condition

Here’s what should happen when Kubernetes shuts down a pod:

  1. Kubernetes sends SIGTERM to the pod
  2. Readiness probe immediately fails (stops receiving traffic)
  3. Application drains in-flight requests
  4. Istio sidecar waits for active connections to complete
  5. Everything shuts down cleanly

Here’s what actually happens when you misconfigure the timeouts: traffic keeps arriving after the application has stopped accepting it, or the sidecar exits while requests are still in flight, and you drop requests on every deployment.

Here’s the Go code that handles shutdown:

func main() {
    server := NewServer()
    
    // Channel to listen for interrupt signals
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
    
    // Start server in goroutine
    go func() {
        log.Printf("Starting server on :8080")
        if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatalf("Server error: %v", err)
        }
    }()
    
    // Wait for interrupt signal
    <-quit
    log.Println("Shutting down server...")
    
    // CRITICAL: This timeout must be less than terminationGracePeriodSeconds
    // and less than Istio's terminationDrainDuration
    ctx, cancel := context.WithTimeout(context.Background(), 40*time.Second)
    defer cancel()
    
    if err := server.Shutdown(ctx); err != nil {
        log.Fatalf("Server forced to shutdown: %v", err)
    }
    
    log.Println("Server exited")
}

The fix requires coordinating multiple timeout values across different layers:

# Kubernetes Deployment
spec:
  template:
    metadata:
      annotations:
        # Istio waits for connections to drain for 45 seconds
        sidecar.istio.io/terminationDrainDuration: "45s"
    spec:
      containers:
      - name: app
        lifecycle:
          preStop:
            exec:
              # Sleep 15 seconds to allow load balancer updates to propagate
              command: ["/bin/sh", "-c", "sleep 15"]
      # Kubernetes waits 65 seconds before sending SIGKILL
      terminationGracePeriodSeconds: 65

Why these specific numbers?

Total grace period: 65 seconds (Kubernetes level)

Timeline:
0s:  SIGTERM sent
0s:  preStop hook runs (sleeps 15s) - allows LB updates
15s: preStop completes, SIGTERM reaches application
15s: Application begins graceful shutdown (max 40s in code)
55s: Application should be done (15s preStop + 40s app shutdown)
65s: Istio sidecar terminates (has been draining since 0s)
65s: If anything is still running, SIGKILL

Istio drain: 45s (must be < 65s total grace period)
App shutdown: 40s (must be < 45s Istio drain)
PreStop delay: 15s (for load balancer updates)
Buffer: 10s (for safety: 15 + 40 + 10 = 65)

Get any of these wrong, and your service drops requests or deadlocks during deployments.

The Startup Coordination Problem

Here’s another incident pattern:

func main() {
    log.Println("Application starting...")
    
    // Connect to auth service
    authConn, err := grpc.Dial(
        "auth-service:50051",
        grpc.WithInsecure(),
        grpc.WithBlock(),  // Wait for connection
        grpc.WithTimeout(5*time.Second),
    )
    if err != nil {
        log.Fatalf("Failed to connect to auth service: %v", err)
    }
    defer authConn.Close()
    
    log.Println("Connected to auth service")
    // ... rest of startup
}

The logs show:

[2024-01-15 10:23:15] Application starting...
[2024-01-15 10:23:15] Failed to connect to auth service: 
    context deadline exceeded
[2024-01-15 10:23:15] Application exit code: 1
[2024-01-15 10:23:16] Pod restarting (CrashLoopBackOff)

What happened? The application container started before the Istio sidecar was ready. The application tried to make an outbound gRPC call, but there was no network proxy yet.

The fix:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  template:
    metadata:
      annotations:
        # Critical annotation - wait for Istio proxy to be ready
        sidecar.istio.io/holdApplicationUntilProxyStarts: "true"

But here’s the thing: this annotation was missing from 93% of services in one production environment I analyzed. Why? Because:

  • It’s not the default
  • It’s easy to forget
  • The error only happens during pod startup
  • It might work in development (no Istio) but fail in production

The cognitive load is crushing. Developers need to remember:

  • Istio startup annotations
  • Kubernetes probe configurations
  • Application shutdown timeouts
  • Database connection pool settings
  • gRPC keepalive settings
  • Load balancer health check requirements

Any one of these, misconfigured, causes production incidents.

Network Hops: The Hidden Tax

Every network hop adds more than just latency. Let me break down what actually happens:

The Anatomy of a Network Call

When your Go code makes a simple HTTP request:

resp, err := http.Get("https://api.example.com/users")
if err != nil {
    return err
}
defer resp.Body.Close()

Here’s what actually happens:

  1. DNS Resolution (10-100ms)
  2. TCP Connection (30-100ms for a new connection)
  3. TLS Handshake (50-200ms for a new connection)
  4. HTTP Request (the actual request time)
  5. Connection Reuse or Teardown

Total time for a “simple” API call: 100-500ms before your code even executes.
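You don’t have to guess at these numbers. The standard library’s net/http/httptrace hooks let you time each phase of a single request; a minimal sketch (the URL is a placeholder):

package main

import (
    "crypto/tls"
    "fmt"
    "net/http"
    "net/http/httptrace"
    "time"
)

// Time each phase of one HTTP request: DNS, TCP connect, TLS, first byte.
func main() {
    var dnsStart, connStart, tlsStart, start time.Time

    trace := &httptrace.ClientTrace{
        DNSStart:     func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
        DNSDone:      func(httptrace.DNSDoneInfo) { fmt.Println("DNS:", time.Since(dnsStart)) },
        ConnectStart: func(_, _ string) { connStart = time.Now() },
        ConnectDone:  func(_, _ string, _ error) { fmt.Println("TCP connect:", time.Since(connStart)) },
        TLSHandshakeStart: func() { tlsStart = time.Now() },
        TLSHandshakeDone: func(tls.ConnectionState, error) {
            fmt.Println("TLS handshake:", time.Since(tlsStart))
        },
        GotFirstResponseByte: func() { fmt.Println("First byte:", time.Since(start)) },
    }

    req, _ := http.NewRequest("GET", "https://api.example.com/users", nil)
    req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

    start = time.Now()
    resp, err := http.DefaultTransport.RoundTrip(req)
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("Total:", time.Since(start))
}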

Now multiply this by your architecture:

Nine network hops for what should be one database query.

Each hop adds:

  • Latency: 1-10ms minimum per hop (P50), 10-100ms (P99)
  • Failure probability: If each hop is 99.9% reliable, nine hops = 99.1% reliability
  • Serialization overhead: JSON/Protobuf encoding/decoding at each boundary
  • Authentication/authorization: Each service validates tokens
  • Logging overhead: Each layer logs the request
  • Monitoring overhead: Each layer emits metrics
  • Retry logic: Each layer might retry on failure

Let me show you how this looks in Go code:

// Service A
func (s *ServiceA) ProcessOrder(ctx context.Context, orderID string) error {
    // Network hop 1: Call auth service
    authClient := pb.NewAuthServiceClient(s.authConn)
    authResp, err := authClient.ValidateToken(ctx, &pb.ValidateRequest{
        Token: getTokenFromContext(ctx),
    })
    if err != nil {
        return fmt.Errorf("auth failed: %w", err)
    }
    
    // Network hop 2: Call inventory service
    invClient := pb.NewInventoryServiceClient(s.inventoryConn)
    invResp, err := invClient.CheckStock(ctx, &pb.StockRequest{
        OrderID: orderID,
    })
    if err != nil {
        return fmt.Errorf("inventory check failed: %w", err)
    }
    
    // Network hop 3: Call payment service
    payClient := pb.NewPaymentServiceClient(s.paymentConn)
    payResp, err := payClient.ProcessPayment(ctx, &pb.PaymentRequest{
        OrderID: orderID,
        Amount:  invResp.TotalPrice,
    })
    if err != nil {
        return fmt.Errorf("payment failed: %w", err)
    }
    
    // Network hop 4: Save to database
    _, err = s.db.ExecContext(ctx, 
        "INSERT INTO orders (id, status) VALUES (?, ?)",
        orderID, "completed",
    )
    if err != nil {
        return fmt.Errorf("database save failed: %w", err)
    }
    
    return nil
}

// Each of those function calls crosses multiple network boundaries:
// ServiceA → Istio sidecar → Istio ingress → Target service → Target's sidecar → Target code

The Retry Storm

Here’s a real incident pattern I’ve debugged:

// API Gateway configuration
client := &http.Client{
    Timeout: 30 * time.Second,
    Transport: &retryTransport{ // custom RoundTripper (not shown) that retries failed requests
        maxRetries: 3,
        backoff:    100 * time.Millisecond,
    },
}

// Service A configuration
conn, err := grpc.Dial(
    "service-b:50051",
    grpc.WithUnaryInterceptor(grpcretry.UnaryClientInterceptor(
        grpcretry.WithMax(2),
        grpcretry.WithBackoff(grpcretry.BackoffLinear(100*time.Millisecond)),
    )),
)
if err != nil {
    log.Fatalf("dial service-b: %v", err)
}

// Service B configuration
db, err := sql.Open("postgres", dsn)
if err != nil {
    log.Fatalf("open database: %v", err)
}
db.SetMaxOpenConns(10)
db.SetMaxIdleConns(5)
// ...plus an application-level retry wrapper that re-issues failed queries (up to 2 attempts)

Here’s what happens:

One user request became 12 database queries due to cascading retries.

If 100 users hit this endpoint simultaneously:

  • API Gateway sees: 100 requests
  • Service A sees: 300 requests (3x due to API gateway retries)
  • Service B sees: 600 requests (2x more retries from Service A)
  • Database sees: 1200 queries (2x more retries from Service B)

The database melts down, not from actual load, but from retry amplification.

The Latency Budget Illusion

Your SLA says “99% of requests under 500ms.” Let’s see how you spend that budget:

You’ve blown your latency budget before your code even runs if the pod is cold-starting.

This is why you see mysterious timeout patterns:

  • First request after deployment: 2-3 seconds
  • Next requests: 200-300ms
  • After scaling up: Some pods hit, some miss (inconsistent latency)

The Debugging Multiplication

When something goes wrong, you need to check logs at every layer:

# 1. Check API Gateway logs
kubectl logs -n gateway api-gateway-7d8f9-xyz

# 2. Check Istio Ingress Gateway logs
kubectl logs -n istio-system istio-ingressgateway-abc123

# 3. Check your application pod logs
kubectl logs -n production user-service-8f7d6-xyz

# 4. Check Istio sidecar logs (same pod, different container)
kubectl logs -n production user-service-8f7d6-xyz -c istio-proxy

# 5. Check downstream service logs
kubectl logs -n production auth-service-5g4h3-def

# 6. Check downstream service's sidecar
kubectl logs -n production auth-service-5g4h3-def -c istio-proxy

# 7. Check database logs (if you have access)
# Usually in a different system entirely

# 8. Check cloud load balancer logs
# In AWS CloudWatch / GCP Cloud Logging / Azure Monitor

# 9. Check CDN logs
# In CloudFlare/Fastly/Akamai dashboard

You need access to 9+ different log sources. Each with:

  • Different query syntaxes
  • Different retention periods
  • Different access controls
  • Different time formats
  • Different log levels
  • Different structured logging formats

Now multiply this by the fact that logs aren’t synchronized—each system has clock drift. Correlating events requires:

// Propagating trace context through every layer
import (
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func HandleRequest(w http.ResponseWriter, r *http.Request) {
    // Extract trace context from incoming request
    ctx := otel.GetTextMapPropagator().Extract(
        r.Context(),
        propagation.HeaderCarrier(r.Header),
    )
    
    // Start a new span
    tracer := otel.Tracer("user-service")
    ctx, span := tracer.Start(ctx, "HandleRequest")
    defer span.End()
    
    // Propagate to downstream calls
    req, _ := http.NewRequestWithContext(ctx, "GET", "http://auth-service/validate", nil)
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
    
    // Make the call
    resp, err := http.DefaultClient.Do(req)
    // ...
}

And this is just for distributed tracing. You also need:

  • Request IDs (different from trace IDs)
  • User IDs (for user-specific debugging)
  • Session IDs (for session tracking)
  • Correlation IDs (for async operations)

Each must be propagated through every layer, logged at every step, and indexed in your log aggregation system.
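A minimal sketch of what that propagation looks like for just one of these IDs, a request ID carried via middleware and context (the header name and helper names are my own convention):

package middleware

import (
    "context"
    "crypto/rand"
    "encoding/hex"
    "net/http"
)

type ctxKey string

const requestIDKey ctxKey = "request_id"

// newRequestID generates a random 128-bit hex ID (no external package needed).
func newRequestID() string {
    b := make([]byte, 16)
    _, _ = rand.Read(b)
    return hex.EncodeToString(b)
}

// RequestIDMiddleware reads X-Request-ID (or generates one), stores it in the
// request context, and echoes it on the response so downstream layers and logs
// can correlate. Repeat the same dance for trace IDs, user IDs, session IDs...
func RequestIDMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        id := r.Header.Get("X-Request-ID")
        if id == "" {
            id = newRequestID()
        }
        ctx := context.WithValue(r.Context(), requestIDKey, id)
        w.Header().Set("X-Request-ID", id)
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

// RequestIDFrom pulls the ID back out for logging or outbound headers.
func RequestIDFrom(ctx context.Context) string {
    id, _ := ctx.Value(requestIDKey).(string)
    return id
}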

Logical vs Physical Layers: The Diagnosis Problem

There’s a critical distinction between logical abstraction (like modular code architecture) and physical abstraction (like network boundaries).

Logical layers add cognitive complexity but don’t add latency:

// Controller layer
func (c *UserController) GetUser(w http.ResponseWriter, r *http.Request) {
    userID := mux.Vars(r)["id"]
    user, err := c.service.GetUser(r.Context(), userID)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    json.NewEncoder(w).Encode(user)
}

// Service layer
func (s *UserService) GetUser(ctx context.Context, id string) (*User, error) {
    return s.repo.FindByID(ctx, id)
}

// Repository layer
func (r *UserRepository) FindByID(ctx context.Context, id string) (*User, error) {
    var user User
    err := r.db.GetContext(ctx, &user, "SELECT * FROM users WHERE id = ?", id)
    return &user, err
}

This is three logical layers (Controller → Service → Repository) but zero network hops. Everything runs in the same process. Debugging is straightforward—add breakpoints or log statements.

Physical layers add both complexity AND latency:

// Service A
func (s *ServiceA) ProcessOrder(ctx context.Context, orderID string) error {
    // Physical layer 1: Network call to auth service
    if err := s.authClient.Validate(ctx); err != nil {
        return err
    }
    
    // Physical layer 2: Network call to inventory service
    items, err := s.inventoryClient.GetItems(ctx, orderID)
    if err != nil {
        return err
    }
    
    // Physical layer 3: Network call to payment service
    if err := s.paymentClient.Charge(ctx, items.Total); err != nil {
        return err
    }
    
    // Physical layer 4: Network call to database
    return s.db.SaveOrder(ctx, orderID)
}

Each physical layer adds:

  • Network latency: 1-100ms per call
  • Network failures: timeouts, connection refused, DNS failures
  • Serialization: Marshal/unmarshal data (CPU + memory)
  • Authentication: Validate tokens/certificates
  • Observability overhead: Logging, metrics, tracing

When I started my career, debugging meant checking if the database query was slow. Now it means:

  1. Check if the request reached the API gateway (CloudWatch logs, different AWS account)
  2. Check if authentication passed (Auth service logs, different namespace)
  3. Check if rate limiting triggered (API gateway metrics)
  4. Check if the service mesh routed correctly (Istio access logs)
  5. Check if Kubernetes readiness probes passed (kubectl events)
  6. Check if the application pod received the request (app logs, may be on a different node)
  7. Check if the sidecar proxy was ready (istio-proxy logs)
  8. Check if downstream services responded (distributed tracing in Jaeger)
  9. Check database query performance (database slow query log)
  10. Finally check if your actual code has a bug (pprof, debugging)

My professor back in college taught us to use binary search for debugging—cut the problem space in half with each test. But when you have 10+ layers, you can’t easily bisect. You need:

  • Centralized log aggregation (ELK, Splunk, Loki)
  • Distributed tracing with correlation IDs (Jaeger, Zipkin)
  • Service mesh observability (Kiali, Grafana)
  • APM (Application Performance Monitoring) tools (Datadog, New Relic)
  • Kubernetes event logging
  • Network traffic analysis (Wireshark, tcpdump)

And this is for a service that just saves data to a database.

The Dependency Explosion: Transitive Complexity

The Go Modules Reality

There’s a famous joke that node_modules is the heaviest object in the universe. Go modules are lighter, but the problem persists:

$ go mod init myapp
$ go get github.com/gin-gonic/gin
$ go get gorm.io/gorm
$ go get gorm.io/driver/postgres
$ go get go.uber.org/zap
$ go get github.com/spf13/viper

$ go mod graph | wc -l
247

$ go list -m all
myapp
github.com/gin-gonic/gin v1.9.1
github.com/gin-contrib/sse v0.1.0
github.com/go-playground/validator/v10 v10.14.0
github.com/goccy/go-json v0.10.2
github.com/json-iterator/go v1.1.12
github.com/mattn/go-isatty v0.0.19
github.com/pelletier/go-toml/v2 v2.0.8
github.com/ugorji/go/codec v1.2.11
golang.org/x/net v0.10.0
golang.org/x/sys v0.8.0
golang.org/x/text v0.9.0
google.golang.org/protobuf v1.30.0
gopkg.in/yaml.v3 v3.0.1
... (234 more)

247 dependencies for a “simple” web service.

Let’s visualize what you’re actually depending on:

myapp
|-- github.com/gin-gonic/gin v1.9.1
|   |-- github.com/gin-contrib/sse v0.1.0
|   |-- github.com/go-playground/validator/v10 v10.14.0
|   |   |-- github.com/go-playground/universal-translator v0.18.1
|   |   |-- github.com/leodido/go-urn v1.2.4
|   |   +-- golang.org/x/crypto v0.9.0
|   |-- github.com/goccy/go-json v0.10.2
|   |-- github.com/json-iterator/go v1.1.12
|   |   |-- github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd
|   |   +-- github.com/modern-go/reflect2 v1.0.2
|   +-- ... (15 more)
|-- gorm.io/gorm v1.25.2
|   |-- github.com/jinzhu/inflection v1.0.0
|   |-- github.com/jinzhu/now v1.1.5
|   +-- ... (8 more)
|-- gorm.io/driver/postgres v1.5.2
|   |-- github.com/jackc/pgx/v5 v5.3.1
|   |   |-- github.com/jackc/pgpassfile v1.0.0
|   |   |-- github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a
|   |   +-- ... (12 more)
|   +-- ... (5 more)
+-- ... (200+ more)

Total unique packages: 247

The Update Nightmare

Now imagine you need to update one dependency:

$ go get -u github.com/gin-gonic/gin

go: github.com/gin-gonic/gin@v1.10.0 requires
    github.com/go-playground/validator/v10@v10.16.0 requires
        golang.org/x/crypto@v0.15.0 requires
            golang.org/x/sys@v0.14.0

go: myapp@v0.0.0 requires
    github.com/some-old-package@v1.2.3 requires
        golang.org/x/sys@v0.8.0

go: github.com/some-old-package@v1.2.3 is incompatible with golang.org/x/sys@v0.14.0

Translation: “One of your dependencies requires an older version of golang.org/x/sys that’s incompatible with what Gin needs. You’re stuck until some-old-package updates.”

Your options:

  1. Don’t upgrade (stay vulnerable to any security issues)
  2. Fork some-old-package and update it yourself
  3. Find an alternative library (and rewrite code)
  4. Use replace directive in go.mod (and hope nothing breaks):

// go.mod
module myapp

go 1.21

require (
    github.com/gin-gonic/gin v1.10.0
    github.com/some-old-package v1.2.3
)

// Force using compatible version (dangerous)
replace github.com/some-old-package => github.com/some-old-package v1.2.4-compatible

The Supply Chain Attack Surface

Every dependency is a potential security vulnerability:

Real incidents in the Go ecosystem:

  • github.com/golang/protobuf: Multiple CVEs requiring version updates
  • golang.org/x/crypto: SSH vulnerabilities requiring immediate patches
  • golang.org/x/net/http2: HTTP/2 rapid reset attack (CVE-2023-39325)
  • github.com/docker/docker: Container escape vulnerabilities
  • Compromised GitHub accounts: Attackers gaining access to maintainer accounts

The attack vectors:

  1. Direct compromise: Attacker gains push access to repository
  2. Typosquatting: Package named github.com/gin-gonig/gin vs github.com/gin-gonic/gin
  3. Dependency confusion: Internal package name conflicts with public one
  4. Transitive attacks: Compromise a dependency of a popular package
  5. Maintainer burnout: Unmaintained packages become vulnerable over time

Let’s say you’re using this Go code:

import (
    "github.com/gin-gonic/gin"
    _ "github.com/lib/pq"  // PostgreSQL driver
    "gorm.io/gorm"
)

You’re trusting:

  • The Gin framework maintainers (and their 15 dependencies)
  • The PostgreSQL driver maintainers
  • The GORM maintainers (and their 8 dependencies)
  • All their transitive dependencies (200+ packages)
  • The Go standard library maintainers
  • The Go module proxy (proxy.golang.org)
  • GitHub’s infrastructure
  • Your company’s internal proxy/mirror
  • The TLS certificate authorities

Any of these could be compromised, introducing malicious code into your application.

The Compatibility Matrix from Hell

Dependency upgrades create cascading nightmares. Upgrading Go from 1.20 to 1.21 means checking all 247 transitive dependencies for compatibility—their go.mod files, CI configs, and issue trackers. Inevitably, conflicts emerge: Package A supports Go 1.18-1.21, but Package B only works with 1.16-1.19 and hasn’t been updated in two years. Package C requires golang.org/x/sys v0.8.0, but Package A needs v0.14.0. Your simple upgrade becomes a multi-day investigation of what to fork, replace, or rewrite.

I’ve seen this pattern repeatedly: upgrading one dependency triggers a domino effect. A security patch in a logging library forces an HTTP framework update, which needs a new database driver, which conflicts with your metrics library. Each brings breaking API changes requiring code modifications.

You can’t ignore these updates. When a critical CVE drops, you have hours to patch. But that “simple” security fix might be incompatible with your stack, forcing emergency upgrades across everything while production is vulnerable.

The maintenance cost is relentless. Teams spend 20-30% of development time managing dependencies—reviewing Dependabot PRs, testing compatibility, fixing breaking changes. It’s a treadmill you can never leave. The alternative—pinning versions and ignoring updates—accumulates technical debt requiring eventual massive, risky “dependency catch-up” projects.

Every imported package adds to an ever-growing compatibility matrix that no human can fully comprehend. Each combination potentially has different bugs or incompatibilities—multiply this across Go versions, architectures (amd64/arm64), operating systems, CGO settings, race detector modes, and build tags.

The Continuous Vulnerability Treadmill

Using Dependabot or similar tools:

Week 1:
- 3 security vulnerabilities found
- Update github.com/gin-gonic/gin
- Update golang.org/x/net
- Update golang.org/x/crypto

Week 2:
- 2 new security vulnerabilities found
- Update gorm.io/gorm
- Update github.com/lib/pq

Week 3:
- 5 new security vulnerabilities found
- Update breaks API compatibility
- Spend 2 days fixing breaking changes
- Deploy, monitor, rollback, fix, deploy again

Week 4:
- 4 new security vulnerabilities found
- Team exhausted from constant updates
- Security team pressuring for compliance
- Product team pressuring for features

This never ends. The security treadmill is a permanent feature of modern software development.

I’ve seen teams that:

  • Spend 30% of development time updating dependencies
  • Have dozens of open Dependabot PRs that no one reviews
  • Pin all versions and ignore security updates (dangerous)
  • Create “update weeks” where the entire team does nothing but update dependencies

The Observability Complexity Tax

To manage all this complexity, we added… more complexity.

The Three Pillars (That Became Five)

The observability industry says you need:

  1. Metrics (Prometheus, Datadog, CloudWatch)
  2. Logs (ELK stack, Loki, Splunk)
  3. Traces (Jaeger, Zipkin, Tempo)
  4. Profiles (pprof, continuous profiling)
  5. Events (error tracking, alerting)

Each requires:

  • Installation (agents, sidecars, instrumentation)
  • Configuration (what to collect, retention, sampling)
  • Integration (SDK, auto-instrumentation, manual instrumentation)
  • Storage (expensive, grows infinitely)
  • Querying (learning PromQL, LogQL, TraceQL)
  • Alerting (thresholds, routing, escalation)
  • Cost management (easily $10K-$100K+ per month)

The Instrumentation Tax

To get observability, you instrument your code:

import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
    "go.uber.org/zap"
)

func (s *OrderService) ProcessOrder(ctx context.Context, order *Order) error {
    // Start tracing span
    tracer := otel.Tracer("order-service")
    ctx, span := tracer.Start(ctx, "ProcessOrder",
        trace.WithAttributes(
            attribute.String("order.id", order.ID),
            attribute.Float64("order.total", order.Total),
            attribute.String("user.id", order.UserID),
        ),
    )
    defer span.End()
    
    // Log the start
    s.logger.Info("Processing order",
        zap.String("order_id", order.ID),
        zap.Float64("total", order.Total),
        zap.String("user_id", order.UserID),
    )
    
    // Increment metric
    s.metrics.OrdersProcessed.Inc()
    
    // Start timer for duration metric
    timer := s.metrics.OrderProcessingDuration.Start()
    defer timer.ObserveDuration()
    
    // === Actual business logic starts here ===
    
    // Validate order (with nested span)
    ctx, validateSpan := tracer.Start(ctx, "ValidateOrder")
    if err := s.validator.Validate(ctx, order); err != nil {
        validateSpan.SetStatus(codes.Error, err.Error())
        validateSpan.RecordError(err)
        validateSpan.End()
        
        s.logger.Error("Order validation failed",
            zap.String("order_id", order.ID),
            zap.Error(err),
        )
        s.metrics.OrderValidationFailures.Inc()
        
        span.SetStatus(codes.Error, err.Error())
        span.RecordError(err)
        return fmt.Errorf("validation failed: %w", err)
    }
    validateSpan.End()
    
    // Process payment (with nested span)
    ctx, paymentSpan := tracer.Start(ctx, "ProcessPayment",
        trace.WithAttributes(
            attribute.Float64("payment.amount", order.Total),
        ),
    )
    if err := s.paymentClient.Charge(ctx, order.Total); err != nil {
        paymentSpan.SetStatus(codes.Error, err.Error())
        paymentSpan.RecordError(err)
        paymentSpan.End()
        
        s.logger.Error("Payment processing failed",
            zap.String("order_id", order.ID),
            zap.Float64("amount", order.Total),
            zap.Error(err),
        )
        s.metrics.PaymentFailures.Inc()
        
        span.SetStatus(codes.Error, err.Error())
        span.RecordError(err)
        return fmt.Errorf("payment failed: %w", err)
    }
    paymentSpan.End()
    
    // Update inventory (with nested span)
    ctx, inventorySpan := tracer.Start(ctx, "UpdateInventory")
    if err := s.inventoryClient.Reserve(ctx, order.Items); err != nil {
        inventorySpan.SetStatus(codes.Error, err.Error())
        inventorySpan.RecordError(err)
        inventorySpan.End()
        
        s.logger.Error("Inventory update failed",
            zap.String("order_id", order.ID),
            zap.Error(err),
        )
        s.metrics.InventoryFailures.Inc()
        
        span.SetStatus(codes.Error, err.Error())
        span.RecordError(err)
        
        // Compensating transaction: refund payment
        if refundErr := s.paymentClient.Refund(ctx, order.Total); refundErr != nil {
            s.logger.Error("Refund failed during compensation",
                zap.String("order_id", order.ID),
                zap.Error(refundErr),
            )
        }
        
        return fmt.Errorf("inventory failed: %w", err)
    }
    inventorySpan.End()
    
    // === Actual business logic ends here ===
    
    // Log success
    s.logger.Info("Order processed successfully",
        zap.String("order_id", order.ID),
    )
    
    // Record metrics
    s.metrics.OrdersSuccessful.Inc()
    s.metrics.OrderValue.Observe(order.Total)
    
    // Set span status
    span.SetStatus(codes.Ok, "Order processed")
    
    return nil
}

Count the instrumentation code vs business logic:

  • Lines of business logic: ~15
  • Lines of instrumentation: ~85
  • Ratio: 1:5.6

Instrumentation code is 5.6x larger than business logic. And this is a simplified example. Real production code has:

  • Metrics collection (counters, gauges, histograms)
  • Structured logging (with correlation IDs, user IDs, session IDs)
  • Custom span attributes
  • Error tracking integration
  • Performance profiling
  • Security audit logging

The business logic disappears in the observability boilerplate.

Compare this to how I wrote code in the 1990s:

// C code from 1995
int process_order(Order *order) {
    if (!validate_order(order)) {
        return ERROR_VALIDATION;
    }
    
    if (!charge_payment(order->total)) {
        return ERROR_PAYMENT;
    }
    
    if (!update_inventory(order->items)) {
        refund_payment(order->total);
        return ERROR_INVENTORY;
    }
    
    return SUCCESS;
}

12 lines. No instrumentation. Easy to understand. When something went wrong, you looked at error codes and maybe some log files.

Was it harder to debug? Sometimes. But the code was simpler, and the system had fewer moving parts.

The Path Forward: Pragmatic Abstraction

I’m not suggesting we abandon abstraction and return to writing assembly language. But we need to apply abstraction more judiciously:

1. Start Concrete, Refactor to Abstract

Follow the Rule of Three: write it once, write it twice, refactor on the third time. This ensures your abstraction is based on actual patterns, not speculative ones.

// First time: Write it directly
func GetUser(db *sql.DB, userID string) (*User, error) {
    var user User
    err := db.QueryRow(
        "SELECT id, name, email FROM users WHERE id = ?",
        userID,
    ).Scan(&user.ID, &user.Name, &user.Email)
    return &user, err
}

// Second time: Still write it directly
func GetOrder(db *sql.DB, orderID string) (*Order, error) {
    var order Order
    err := db.QueryRow(
        "SELECT id, user_id, total FROM orders WHERE id = ?",
        orderID,
    ).Scan(&order.ID, &order.UserID, &order.Total)
    return &order, err
}

// Third time: Now abstract (using sqlx so a whole struct can be scanned in one call)
type Repository struct {
    db *sqlx.DB
}

func (r *Repository) Get(ctx context.Context, dest interface{}, query string, args ...interface{}) error {
    // Common query logic: run the query and scan the row into dest
    return r.db.GetContext(ctx, dest, query, args...)
}

// Now use the abstraction
func (r *Repository) GetUser(ctx context.Context, userID string) (*User, error) {
    var user User
    err := r.Get(ctx, &user,
        "SELECT id, name, email FROM users WHERE id = ?",
        userID,
    )
    return &user, err
}

2. Minimize Physical Layers

Do you really need a service mesh for 5 services? Do you really need an API gateway when your load balancer can handle routing? Each physical layer should justify its existence with a clear, measurable benefit.

Questions to ask:

  • What problem does this layer solve?
  • Can we solve it with a logical layer instead?
  • What’s the latency cost?
  • What’s the operational complexity cost?
  • What’s the debugging cost?

3. Make Abstractions Observable

Every abstraction layer should provide visibility into what it’s doing:

// Bad: Black box
func (s *Service) ProcessData(data []byte) error {
    return s.processor.Process(data)
}

// Good: Observable
func (s *Service) ProcessData(ctx context.Context, data []byte) error {
    start := time.Now()
    defer func() {
        s.metrics.ProcessingDuration.Observe(time.Since(start).Seconds())
    }()
    
    s.logger.Debug("Processing data",
        zap.Int("size", len(data)),
    )
    
    if err := s.processor.Process(ctx, data); err != nil {
        s.logger.Error("Processing failed",
            zap.Error(err),
            zap.Int("size", len(data)),
        )
        s.metrics.ProcessingFailures.Inc()
        return fmt.Errorf("processing failed: %w", err)
    }
    
    s.metrics.ProcessingSuccesses.Inc()
    s.logger.Info("Processing completed",
        zap.Int("size", len(data)),
        zap.Duration("duration", time.Since(start)),
    )
    
    return nil
}

4. Coordination by Convention

Instead of requiring developers to manually configure 6+ timeout values, provide templates that are correct by default:

// Bad: Manual configuration
type ServerConfig struct {
    ReadTimeout              time.Duration
    WriteTimeout             time.Duration
    IdleTimeout              time.Duration
    ShutdownTimeout          time.Duration
    KubernetesGracePeriod    time.Duration
    IstioTerminationDrain    time.Duration
    PreStopDelay             time.Duration
}

// Good: Convention-based
type ServerConfig struct {
    // Single source of truth
    GracefulShutdownSeconds int // Default: 45
}

func (c *ServerConfig) Defaults() {
    if c.GracefulShutdownSeconds == 0 {
        c.GracefulShutdownSeconds = 45
    }
}

func (c *ServerConfig) KubernetesGracePeriod() int {
    // Calculated: shutdown + buffer
    return c.GracefulShutdownSeconds + 20
}

func (c *ServerConfig) IstioTerminationDrain() int {
    // Same as graceful shutdown
    return c.GracefulShutdownSeconds
}

func (c *ServerConfig) PreStopDelay() int {
    // Fixed value for LB updates
    return 15
}

func (c *ServerConfig) ApplicationShutdownTimeout() time.Duration {
    // Calculated: grace period - prestop - buffer
    return time.Duration(c.GracefulShutdownSeconds-5) * time.Second
}
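To see how the single knob fans out into all the values from the earlier shutdown timeline, here’s a quick usage sketch (assuming the ServerConfig above lives in the same package):

package main

import "fmt"

func main() {
    cfg := ServerConfig{}
    cfg.Defaults() // GracefulShutdownSeconds = 45

    fmt.Println(cfg.KubernetesGracePeriod())      // 65  -> terminationGracePeriodSeconds
    fmt.Println(cfg.IstioTerminationDrain())      // 45  -> sidecar.istio.io/terminationDrainDuration
    fmt.Println(cfg.PreStopDelay())               // 15  -> preStop sleep
    fmt.Println(cfg.ApplicationShutdownTimeout()) // 40s -> http.Server shutdown context
}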

5. Invest in Developer Experience

The complexity tax is paid in developer hours. Make it easier:

// Bad: Complex local setup
// 1. Install Docker
// 2. Install Kubernetes (minikube/kind)
// 3. Install Istio
// 4. Configure service mesh
// 5. Deploy database
// 6. Deploy auth service
// 7. Deploy your service
// 8. Configure networking
// 9. Finally, test your code

// Good: One command
$ make dev
# Starts all dependencies with docker-compose
# Configures everything automatically
# Provides real-time logs
# Hot-reloads on code changes

Makefile:

.PHONY: dev
dev:
	docker-compose up -d postgres redis
	go run cmd/server/main.go

.PHONY: test
test:
	go test -v ./...

.PHONY: lint
lint:
	golangci-lint run

.PHONY: build
build:
	go build -o bin/server cmd/server/main.go

6. Embrace Mechanical Empathy

Understand what your abstractions are doing. Profile your applications. Use observability tools. Don’t cargo-cult patterns without understanding their costs.

// Use pprof to understand what your code is actually doing
import (
    "log"
    "net/http"

    _ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
    // Enable profiling endpoint
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    
    // Your application code
    server.Run()
}

// Then analyze:
// CPU profile: go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// Heap profile: go tool pprof http://localhost:6060/debug/pprof/heap
// Goroutine profile: go tool pprof http://localhost:6060/debug/pprof/goroutine

Learn to read the profiles. Understand where time is spent. Question assumptions.

A Glimpse of Hope: WebAssembly?

There’s an interesting thought experiment: what if we could replace Docker, Kubernetes, and service meshes by compiling code to WebAssembly and injecting necessary capabilities as logical layers without network hops?

The Promise (Where Java Failed)

Java promised “Write Once, Run Anywhere” (WORA) in the 1990s. It failed. Why?

  • Heavy JVM runtime overhead
  • Platform-specific JNI libraries
  • GUI frameworks that looked different on each OS
  • “Write once, debug everywhere” became the joke

WebAssembly might actually deliver on this promise. It is a compact, stack-based virtual machine, and WASI (the WebAssembly System Interface) gives it a standardized system API in the spirit of POSIX. Solomon Hykes, creator of Docker, famously said:

“If WASM+WASI existed in 2008, we wouldn’t have needed to create Docker. That’s how important it is. WebAssembly on the server is the future of computing. A standardized system interface was the missing link. Let’s hope WASI is up to the task!” – Solomon Hykes, March 2019
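
To make the idea concrete, here is a minimal sketch of what that developer experience already looks like today: recent Go toolchains (1.21+) can target WASI directly, and the resulting module runs under any WASI runtime (wasmtime is used below purely as an example). Nothing in it depends on Docker, an image registry, or a sidecar:

// main.go - ordinary Go, no base image, no container
package main

import "fmt"

func main() {
    fmt.Println("hello from a WASI module")
}

// Illustrative build-and-run commands:
//   GOOS=wasip1 GOARCH=wasm go build -o app.wasm main.go
//   wasmtime app.wasm
//
// The artifact is a single .wasm file a few megabytes in size that a
// WASI runtime can start in milliseconds.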

Eliminating Network Hops

Current architecture: 9 network hops. WebAssembly architecture: 1-2 network hops.

What changes:

  • Container image (500MB) -> WASM binary (2-5MB)
  • Cold start (2-5 seconds) -> near-instant startup (<100ms)
  • Sidecars eliminated -> capabilities injected logically
  • 9 network hops -> 2-3 network hops
  • Coordination nightmare -> single runtime config

The Instrumentation Problem Solved

Remember the 85 lines of observability code for 15 lines of business logic? With WASM:

// Your code - just business logic
func ProcessOrder(order Order) error {
    validateOrder(order)
    chargePayment(order)
    saveOrder(order)
    return nil
}

// Runtime injects at deployment:
// - Authentication
// - Rate limiting  
// - Distributed tracing
// - Metrics
// - Logging
// All without code changes

What’s Missing?

WebAssembly isn’t ready yet. Critical gaps:

  • WASI maturity: Still evolving (Preview 2 landed only recently, and the API surface is still filling out)
  • Async I/O: Limited compared to native runtimes
  • Database drivers: Many don’t support WASM
  • Networking: WASI sockets still experimental
  • Ecosystem tooling: Debugging, profiling still primitive

But the trajectory is promising:

  • Cloudflare Workers, Fastly Compute@Edge (production WASM)
  • Major cloud providers investing heavily
  • CNCF projects (wasmCloud, Spin, WasmEdge)
  • Active development of Component Model and WASI

Why This Might Succeed (Unlike Java)

  1. Smaller runtime footprint (10-50MB vs 100-500MB JVM)
  2. True sandboxing (capability-based security, not just process isolation)
  3. No platform-specific dependencies (WASI standardizes system access)
  4. Native performance (AOT compilation, not JIT)
  5. Industry backing (Google, Microsoft, Mozilla, Fastly, Cloudflare)

The promise: compile once, run anywhere with the performance of native code and the security of containers—without the complexity. If WebAssembly fills these gaps, we could eliminate:

  • Docker images and registries
  • Kubernetes complexity
  • Service mesh overhead
  • Sidecar coordination nightmares
  • Most of the network hops we’ve accumulated

Conclusion: Abstraction as a Tool, Not a Goal

Abstraction should serve us, not the other way around. Every layer should earn its place by solving a problem better than the alternatives—considering both the benefits it provides and the complexity it introduces.

We’ve built systems so complex that:

  1. Learning to code takes 10x longer than it did in the 1980s
  2. New developers only understand top layers, lacking mechanical empathy
  3. Frameworks multiply faster than developers can learn them
  4. Network hops add latency, failure points, and debugging complexity
  5. Dependencies create supply chain vulnerabilities and compatibility nightmares
  6. Observability adds as much complexity as it solves
  7. Coordinating timeout values across layers causes production incidents
  8. Debugging requires access to 9+ different log sources

The industry will eventually swing back toward simplicity, as it always does. Monoliths are already making a comeback in certain contexts. “Majestic monoliths” are being celebrated. The pendulum swings. Until then, be ruthless about abstraction. Question every layer. Measure its costs. And remember:

The best code is not the most elegant or abstract—it’s the code that solves the problem clearly and can be understood by the team that has to maintain it.

In my career of writing software for over 30 years, I’ve learned one thing for certain: the code you write today will outlive your employment at the company. Make it simple enough that someone else can understand it when you’re gone. Make it observable enough that they can debug it when it breaks. And make it maintainable enough that they don’t curse your name when they have to change it.

September 29, 2025

Writing Post Mortems That Actually Make You Better: A Practitioner’s Guide

Filed under: Computing — admin @ 9:58 am

I’ve worked at organizations where engineers would sneak changes into production, bypassing CI/CD pipelines, hoping nobody would notice if something broke. I’ve also worked at places where engineers would openly discuss a failed experiment at standup and get help fixing it. The difference wasn’t the engineers—it was psychological safety. Research on psychological safety, particularly from high-stakes industries like healthcare, tells us something counterintuitive: teams with high psychological safety don’t have fewer incidents. They have better outcomes because people speak up about problems early.

Software engineering isn’t life-or-death medicine, but the principle holds. In blame cultures, I’ve watched talented engineers:

  • Deploy sketchy changes outside normal hours to avoid oversight
  • Ignore monitoring alerts, hoping they’re false positives
  • Blame infrastructure, legacy code, or “the previous team” rather than examining their contributions
  • Build elaborate workarounds instead of fixing root causes

These behaviors don’t just hurt morale—they actively degrade reliability. Post mortems in blame cultures become exercises in creative finger-pointing and CYA documentation.

In learning cultures, post mortems are gold mines of organizational knowledge. The rule of thumb I’ve seen work best: if you’re unsure whether something deserves a post mortem, write one anyway—at least internally. Not every post mortem needs wide distribution, and some (especially those with security implications) shouldn’t be shared externally. But the act of writing crystallizes learning.

The Real Problem: Post Mortem Theater

Here’s what nobody talks about: many organizations claim to value post mortems but treat them like bureaucratic checklists. I’ve seen hundreds of meticulously documented post mortems that somehow don’t prevent the same incidents from recurring. This is what I call “post mortem theater”—going through the motions without actual learning.

Shallow vs. Deep Analysis

Shallow analysis stops at the proximate cause:

  • “The database connection pool was exhausted”
  • “An engineer deployed buggy code”
  • “A dependency had high latency”

Deep analysis asks uncomfortable questions:

  • “Why don’t we load test with production-scale data? What makes it expensive to maintain realistic test environments?”
  • “Why did code review and automated tests miss this? What’s our philosophy on preventing bugs vs. recovering quickly?”
  • “Why do we tolerate single points of failure in our dependency chain? What would it take to build resilience?”

The difference between these approaches determines whether you’re learning or just documenting.

Narrow vs. Systemic Thinking

Narrow analysis fixes the immediate problem:

  • Add monitoring for connection pool utilization
  • Add a specific test case for the bug that escaped
  • Increase timeout values for the slow dependency

Systemic analysis asks meta questions:

  • “How do we systematically identify what we should be monitoring? Do we have a framework for this?”
  • “What patterns in our testing gaps led to this escape? Are we missing categories of testing?”
  • “What’s our philosophy on dependency management and resilience? Should we rethink our architecture?”

I’ve seen teams play post mortem bingo—hitting the same squares over and over. “No monitoring.” “Insufficient tests.” “Deployed during peak traffic.” “Rollback was broken.” When you see repeated patterns, you’re not learning from incidents—you’re collecting them. I have also written about common failures in distributed systems that can show up in recurring incidents if they are not properly addressed.

Understanding Complex System Failures

Modern systems fail in ways that defy simple “root cause” thinking. Consider a typical outage:

Surface story: Database connection pool exhausted
Deeper story:

  • A code change increased query volume 10x
  • Load testing used 1/10th production data and missed it
  • Connection pool monitoring didn’t exist
  • Alerts only monitored error rates, not resource utilization
  • Manual approval processes delayed rollback by 15 minutes
  • Staging environment configuration drifted from production

Which of these is the “root cause”? All of them. None of them individually would have caused an outage, but together they created a perfect storm. This is why I cringe when post mortems end with “human error” as the root cause. It’s not wrong—humans are involved in everything—but it’s useless. The question is: why was the error possible? What systemic changes make it impossible, or at least improbable?

You can think of this as the Swiss Cheese Model of failure: your system has multiple layers of defense (code review, testing, monitoring, gradual rollout, alerting, incident response). Each layer has holes. Most of the time, the holes don’t align and problems get caught. But occasionally, everything lines up perfectly and a problem slips through all layers. That’s your incident. This mental model is more useful than hunting for a single root cause because it focuses you on strengthening multiple layers of defense.
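A rough back-of-the-envelope illustration (numbers invented for the example): if each of five independent defense layers misses a given class of problem 10% of the time, a problem slips through all five only 0.1^5 = 0.001% of the time. Halving the miss rate of any single layer halves that product, which is why strengthening several imperfect layers usually pays off more than hunting for one perfect fix.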

When to Write a Post Mortem

Always write one:

  • Customer-facing service disruptions
  • SLA/SLO breaches
  • Security incidents (keep these separate, limited distribution)
  • Complete service outages
  • Incidents requiring emergency escalation or multiple teams

Strongly consider:

  • Near-misses that made you think “we got lucky”
  • Interesting edge cases with valuable lessons
  • Internal issues that disrupted other teams
  • Process breakdowns causing significant project delays

The litmus test: If you’re debating whether it needs a post mortem, write at least an internal one. The discipline of writing forces clarity of thinking.

Post Mortem Ownership

Who should own writing it? The post mortem belongs to the team that owns addressing the root cause, not necessarily the team that triggered the incident or the team that resolved it. If the root cause is initially unclear, it belongs with whoever is investigating; if the investigation reveals the root cause lies elsewhere, reassign it.

The Anatomy of an Effective Post Mortem

Title and Summary: The Elevator Pitch

Your title should state the customer-facing problem, not the cause.

Good: “Users unable to complete checkout for 23 minutes in US-EAST”
Bad: “Connection pool exhaustion caused outage”

Your summary should work as a standalone email to leadership. Include:

  • Brief service context (one sentence)
  • Timeline with time zones
  • Quantified customer impact
  • How long it lasted (from first customer impact to full recovery)
  • High-level cause
  • How it was resolved
  • Communication sent to customers (if applicable)

Timeline: The Narrative Spine

A good timeline tells the story of what happened, including system events and human decisions. Important: Your timeline should start with the first trigger that led to the problem (e.g., a deployment, a configuration change, a traffic spike), not just when your team got paged. The timeline should focus on the actual event start and end, not just your team’s perception of it.

All times PST

14:32 - Deployment begins in us-west-2
14:38 - Error rates spike to 15%
14:41 - Automated alerts fire
14:43 - On-call engineer begins investigation
14:47 - Customer support escalates: users reporting checkout failures
14:52 - Incident severity raised to SEV-1
15:03 - Root cause identified: connection pool exhaustion
15:07 - Rollback initiated
15:22 - Customer impact resolved, errors back to baseline

Key practices:

  • Start with the root trigger, not when you were notified
  • Consistent time zones throughout
  • Bold major milestones and customer-facing events
  • Include detection, escalation, and resolution times
  • No gaps longer than 10-15 minutes without explanation
  • Use roles (“on-call engineer”) not names
  • Include both what the system did and what people did

Metrics: Show, Don’t Just Tell

Visual evidence is crucial. Include graphs showing:

  • Error rates during the incident
  • The specific resource that failed (connections, CPU, memory)
  • Business impact metrics (orders, logins, API calls)
  • Comparison with normal operation

For complex incidents involving multiple services, include a simple architecture diagram showing the relevant components and their interactions. This helps readers understand the failure chain without needing deep knowledge of your system.

Make graphs comparable:

  • Same time range across all graphs
  • Label your axes with units (milliseconds, percentage, requests/second)
  • Vertical lines marking key events
  • Include context before and after the incident
  • Embed actual screenshots, not just links that will break

Don’t do this:

  • Include 20 graphs because you can
  • Use different time zones between graphs
  • Forget to explain what the graph shows and why it matters

Service Context and Glossary

If your service uses specialized terminology or acronyms, add a brief glossary section or spell out all acronyms on first use. Your post mortem should be readable by engineers from other teams. For complex incidents, consider including:

  • Brief architecture overview (what are the key components?)
  • Links to related items (monitoring dashboards, deployment records, related tickets)
  • Key metrics definitions if not standard

Customer Impact: Get Specific

Never write “some customers were affected” or “significant impact.” Quantify everything:

Instead of: “Users experienced errors”
Write: “23,000 checkout attempts failed over 23 minutes, representing approximately $89,000 in failed transactions”

Instead of: “API latency increased”
Write: “P95 latency increased from 200ms to 3.2 seconds, affecting 15,000 API calls”

If you can’t get exact numbers, explain why and provide estimates with clear caveats.

Root Cause Analysis: Going Deeper

Use the Five Whys technique, but don’t stop at five if you need more. Number the whys (don’t use bullets) so people can easily reference them in discussions (“Why #4 seems incomplete…”). Start with the customer-facing problem and keep asking why:

1. Why did customers see checkout errors?
   -> Application servers returned 500 errors

2. Why did application servers return 500 errors?
   -> They couldn't connect to the database

3. Why couldn't they connect?
   -> Connection pool was exhausted

4. Why was the pool exhausted?
   -> New code made 10x more queries per request

5. Why didn't we catch this in testing?
   -> Staging uses 1/10th production data

6. Why is staging data volume so different?
   -> We haven't prioritized staging environment investment

Branch your analysis for multiple contributing factors, and number the branches (primary chain 1.x, Branch A 2.x, Branch B 3.x, and so on) to maintain traceability:

Primary Chain (1.x):

Why did customers see checkout errors?
-> Application servers returned 500 errors
[...]

Branch A - Detection (2.x):

Why did detection take 12 minutes?
-> We only monitor error rates, not resource utilization
Why don't we monitor resource utilization?
-> We haven't established a framework for what to monitor

Branch B - Mitigation (3.x):

Why did rollback take 15 minutes after identifying the cause?
-> Manual approval was required for production rollbacks
Why is manual approval required during emergencies?
-> Our process doesn't distinguish between routine and emergency changes

Never stop at:

  • “Human error”
  • “Process failure”
  • “Legacy system”

Keep asking why until you reach actionable systemic changes.

Incident Response Analysis

This section examines how you handled the crisis during the incident, not how to prevent it. This is distinct from post-incident analysis (root causing) which happens after. Focus on the temporal sequence of events:

  • Detection: How did you discover the problem? Automated alerts, customer reports, accidental discovery? How could you have detected it sooner?
  • Diagnosis: How long from “something’s wrong” to “we know what’s wrong”? What information or tools would have accelerated diagnosis?
  • Mitigation: How long from diagnosis to resolution? What would have made recovery faster?
  • Blast Radius: What percentage of customers/systems were affected? How could you have reduced the blast radius? Consider:
    • Would cellular architecture have isolated the failure?
    • Could gradual rollout have limited impact?
    • Did failure cascade to dependent systems unnecessarily?
    • Would circuit breakers have prevented cascade?

For each phase, ask: “How could we have cut this time in half?” And for blast radius: “How could we have cut the affected population in half?”

Post-Incident Analysis vs Real-Time Response

Be clear about the temporal distinction in your post mortem:

  • Incident Response Analysis = What happened DURING the incident
    • How we detected, diagnosed, and mitigated
    • Time-critical decisions under pressure
    • Effectiveness of runbooks and procedures
  • Post-Incident Analysis = What happened AFTER to understand root cause
    • How we diagnosed the underlying cause
    • What investigation techniques we used
    • How long root cause analysis took

This distinction matters because improvements differ: incident response improvements help you recover faster from any incident; post-incident improvements help you understand failures more quickly.

Lessons Learned: Universal Insights

Number your lessons learned (not bullets) so they can be easily referenced and linked to action items. Lessons should be broadly applicable beyond your specific incident:
1. Bad lesson learned: “We need connection pool monitoring”
   Good lesson learned: “Services should monitor resource utilization for all constrained resources, not just error rates”
2. Bad lesson learned: “Load testing failed to catch this”
   Good lesson learned: “Test environments that don’t reflect production characteristics will systematically miss production-specific issues”

Connect each lesson to specific action items by number reference (e.g., “Lesson #2 -> Action Items #5, #6”).

Action Items: Making Change Happen

This is where post mortems prove their value. Number your action items and explicitly link them to the lessons learned they address. Every action item needs:

  • Clear description: Not “improve monitoring” but “Add CloudWatch alarms for RDS connection pool utilization with thresholds at 75% (warning) and 90% (critical)”
  • Specific owner: A person’s name, not a team name
  • Realistic deadline: Most should complete within 45 days
  • Priority level:
    • High for root cause fixes and issues that directly caused customer impact
    • Medium for improvements to detection/mitigation
    • Low for nice-to-have improvements
  • Link to lesson learned: “Addresses Lesson #2”
  • Avoid action items that start with “investigate.” That’s not an action item—it’s procrastination. Do the investigation during the post mortem process and commit to specific changes.

Note: Your lessons learned should be universal principles that other teams could apply. Your action items should be specific changes your team will make. If your lessons learned just restate your action items, you’re missing the bigger picture.

Common Patterns That Indicate Shallow Learning

When you see the same issues appearing in multiple post mortems, you have a systemic problem:

  • Repeated monitoring gaps -> You don’t have a framework for determining what to monitor
  • Repeated test coverage issues -> Your testing philosophy or practices need examination
  • Repeated “worked in staging, failed in prod” -> Your staging environment strategy is flawed
  • Repeated manual process errors -> You’re over-relying on human perfection
  • Repeated deployment-related incidents -> Your deployment pipeline needs investment

These patterns are your organization’s immune system telling you something. Listen to it.

Common Pitfalls

After reading hundreds of post mortems, here are the traps I see teams fall into:

  • Writing for Insiders Only: Your post mortem should be readable by someone from another team. Explain your system’s architecture briefly, spell out acronyms, and assume your reader is smart but unfamiliar with your specific service.
  • Action Items That Start with “Investigate”: “Investigate better monitoring” is not an action item – it’s a placeholder for thinking you haven’t done yet. During the post mortem process, do the investigation and commit to specific changes.
  • Stopping at “Human Error”: If your Five Whys ends with “the engineer made a mistake,” you haven’t gone deep enough. Why was that mistake possible? What system changes would prevent it?
  • The Boil-the-Ocean Action Plan: Post mortems aren’t the place for your three-year architecture wish list. Focus on targeted improvements that directly address the incident’s causes and can be completed within a few months.

Ownership and Follow-Through

Here’s something that separates good teams from great ones: they actually complete their post mortem action items.

  • Assign clear ownership: Every action item needs a specific person (not a team) responsible for completion. That person might delegate the work, but they own the outcome.
  • Set realistic deadlines: Most action items should be completed within 45 days. If something will take longer, either break it down or put it in your regular backlog instead.
  • Track relentlessly: Use whatever task tracking system your team prefers, but make action item completion visible. Review progress in your regular team meetings.
  • Close the loop: When action items are complete, update the post mortem with links to the changes made. Future readers (including future you) will thank you.

Making Post Mortems Part of Your Culture

  • Write them quickly: Create the draft within 24 hours while memory is fresh. Complete the full post mortem within 14 days.
  • Get outside review (critical step): An experienced engineer from another team—sometimes called a “Bar Raiser”—should review for quality before you publish. The reviewer should check:
    • Would someone from another team understand and learn from this?
    • Are the lessons learned actually actionable?
    • Did you dig deep enough in your root cause analysis?
    • Are your action items specific and owned?
    • Does the incident response analysis identify concrete improvements?
  • Draft status: Keep the post mortem in draft/review status for at least 24 hours to gather feedback from stakeholders. Account for holidays and time zones for distributed teams.
  • Make them visible: Share widely (except security-sensitive ones) so other teams learn from your mistakes.
  • Customer communication: For customer-facing incidents, document what communication was sent:
    • Status page updates
    • Support team briefings
    • Proactive customer notifications
    • Post-incident follow-up
  • Track action items relentlessly: Use whatever task system you have. Review progress in regular meetings.
  • Review for patterns: Monthly or quarterly, look across all post mortems for systemic issues.
  • Celebrate learning: In team meetings, highlight interesting insights from post mortems. Make clear that thorough post mortems are valued, not punishment.
  • Train your people: Writing good post mortems is a skill. Share examples of excellent ones and give feedback.

Security-Sensitive Post Mortems

Some incidents involve security implications, sensitive customer data, or information that shouldn’t be widely shared. These still need documentation, but with appropriate access controls:

  • Create a separate, access-controlled version
  • Document what happened and how to prevent it
  • Share lessons learned (without sensitive details) more broadly
  • Work with your security team on appropriate distribution

The learning is still valuable—it just needs careful handling.

The Long Game

Post mortems are how organizations build institutional memory. They’re how you avoid becoming that team that keeps making the same mistakes. They’re how you onboard new engineers to the reality of your systems. Most importantly, they’re how you shift from a culture of blame to a culture of learning.

When your next incident happens—and it will—remember you’re not just fixing a problem. You’re gathering intelligence about how your system really behaves under stress. You’re building your team’s capability to handle whatever comes next. Write the post mortem you wish you’d had during the incident. Be honest about what went wrong. Be specific about what you’ll change. Be generous in sharing what you learned.

Your future self, your teammates, and your customers will all benefit from it. And remember: if you’re not sure whether something deserves a post mortem, write one anyway. The discipline of analysis is never wasted.

September 17, 2025

Transaction Boundaries: The Foundation of Reliable Systems

Filed under: Computing,Concurrency — admin @ 11:19 am

Over the years, I have seen countless production issues due to improper transaction management. A typical example: an API requires changes to multiple database tables, and each update is wrapped in different methods without proper transaction boundaries. This works fine when everything goes smoothly, but due to database constraints or other issues, a secondary database update might fail. In too many cases, the code doesn’t handle proper rollback and just throws an error—leaving the database in an inconsistent state.

In other cases, I’ve debugged production bugs due to improper coordination between database updates and event queues, where we desperately needed atomic behavior. I used J2EE in the late 1990s and early 2000s, which provided support for two-phase commit (2PC) to coordinate multiple updates across resources. However, 2PC wasn’t a scalable solution. I then experimented with aspect-oriented programming like AspectJ to handle cross-cutting concerns like transaction management, but it resulted in more complex code that was difficult to debug and maintain.

Later, I moved to Java Spring, which provided annotations for transaction management. This was both efficient and elegant—the @Transactional annotation made transaction boundaries explicit without cluttering business logic. When I worked at a travel booking company where we had to coordinate flight reservations, hotel bookings, car rentals, and insurance through various vendor APIs, I built a transaction framework based on the command pattern and chain of responsibility. This worked well for issuing compensating transactions when a remote API call failed midway through our public API workflow.

However, when I moved to Go and Rust, I found a lack of these basic transaction management primitives. I often see bugs in Go and Rust codebases that could have been caught earlier—many implementations assume the happy path and don’t properly handle partial failures or rollback scenarios.

In this blog, I’ll share learnings from my experience across different languages and platforms. I’ll cover best practices for establishing proper transaction boundaries, from single-database ACID transactions to distributed SAGA patterns, with working examples in Java/Spring, Go, and Rust. The goal isn’t just to prevent data corruption—it’s to build systems you can reason about, debug, and trust.

The Happy Path Fallacy

Most developers write code assuming everything will work perfectly. Here’s a typical “happy path” implementation:

// This looks innocent but is fundamentally broken
public class OrderService {
    public void processOrder(Order order) {
        orderRepository.save(order);           // What if this succeeds...
        paymentService.chargeCard(order);      // ...but this fails?
        inventoryService.allocate(order);      // Now we have inconsistent state
        emailService.sendConfirmation(order);  // And this might never happen
    }
}

The problem isn’t just that operations can fail—it’s that partial failures leave your system in an undefined state. Without proper transaction boundaries, you’re essentially playing Russian roulette with your data integrity. In my experience analyzing production systems, I’ve found that most data corruption doesn’t come from dramatic failures or outages. It comes from these subtle, partial failures that happen during normal operation. A network timeout here, a service restart there, and suddenly your carefully designed system is quietly hemorrhaging data consistency.

Transaction Fundamentals

Before we dive into robust transaction management in our applications, we need to understand what databases actually provide and how they achieve consistency guarantees. Most developers treat transactions as a black box—call BEGIN, do some work, call COMMIT, and hope for the best. But understanding the underlying mechanisms is crucial for making informed decisions about isolation levels, recognizing performance implications, and debugging concurrency issues when they inevitably arise in production. Let’s examine the foundational concepts that every developer working with transactions should understand.

The ACID Foundation

Before diving into implementation patterns, let’s establish why ACID properties matter:

  • Atomicity: Either all operations in a transaction succeed, or none do
  • Consistency: The database remains in a valid state before and after the transaction
  • Isolation: Concurrent transactions don’t interfere with each other
  • Durability: Once committed, changes survive system failures

These aren’t academic concepts—they’re the guardrails that prevent your system from sliding into chaos. Let’s see how different languages and frameworks help us maintain these guarantees.
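
As a minimal sketch of what atomicity buys you in plain Go (using only database/sql; the accounts table, column names, and Postgres-style placeholders are illustrative assumptions), either both balance updates become visible or neither does:

package bank

import (
    "context"
    "database/sql"
    "fmt"
)

// TransferFunds moves amount from one account to another atomically.
func TransferFunds(ctx context.Context, db *sql.DB, from, to string, amount int64) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    // If we return before Commit, everything below is rolled back.
    defer tx.Rollback()

    res, err := tx.ExecContext(ctx,
        "UPDATE accounts SET balance = balance - $1 WHERE id = $2 AND balance >= $1",
        amount, from)
    if err != nil {
        return err
    }
    if n, _ := res.RowsAffected(); n == 0 {
        return fmt.Errorf("insufficient funds or unknown account %q", from)
    }

    if _, err := tx.ExecContext(ctx,
        "UPDATE accounts SET balance = balance + $1 WHERE id = $2",
        amount, to); err != nil {
        return err
    }

    // Durability: once Commit returns nil, the transfer survives crashes.
    return tx.Commit()
}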

Isolation Levels: The Hidden Performance vs Consistency Tradeoff

Most developers don’t realize that their database isn’t using the strictest isolation level by default. Most production databases (PostgreSQL, Oracle, SQL Server) default to READ COMMITTED, and MySQL/InnoDB defaults to REPEATABLE READ; none of them defaults to SERIALIZABLE. This creates subtle race conditions that can lead to double spending and other financial disasters.

// The double spending problem with default isolation
@Service
public class VulnerableAccountService {
    
    // This uses READ COMMITTED by default - DANGEROUS for financial operations!
    @Transactional
    public void withdrawFunds(String accountId, BigDecimal amount) {
        Account account = accountRepository.findById(accountId);
        
        // RACE CONDITION: Another transaction can modify balance here!
        if (account.getBalance().compareTo(amount) >= 0) {
            account.setBalance(account.getBalance().subtract(amount));
            accountRepository.save(account);
        } else {
            throw new InsufficientFundsException();
        }
    }
}

// What happens with concurrent requests:
// Thread 1: Read balance = $100, check passes
// Thread 2: Read balance = $100, check passes  
// Thread 1: Withdraw $100, balance = $0
// Thread 2: Withdraw $100, balance = -$100 (DOUBLE SPENDING!)

Database Default Isolation Levels

  • PostgreSQL: READ COMMITTED (vulnerable to the race above)
  • MySQL (InnoDB): REPEATABLE READ (better, but not perfect)
  • Oracle: READ COMMITTED (vulnerable)
  • SQL Server: READ COMMITTED (vulnerable)
  • H2/HSQLDB: READ COMMITTED (vulnerable)

The Right Way: Database Constraints + Proper Isolation

// Method 1: Database constraints (fastest)
@Entity
@Table(name = "accounts")
public class Account {
    @Id
    private String accountId;
    
    @Column(nullable = false)
    @Check(constraints = "balance >= 0") // Database prevents negative balance
    private BigDecimal balance;
    
    @Version
    private Long version;
}

@Service
public class SafeAccountService {
    
    // Let database constraint handle the race condition
    @Transactional
    public void withdrawFundsWithConstraint(String accountId, BigDecimal amount) {
        try {
            Account account = accountRepository.findById(accountId);
            account.setBalance(account.getBalance().subtract(amount));
            accountRepository.save(account); // Database throws exception if balance < 0
        } catch (DataIntegrityViolationException e) {
            throw new InsufficientFundsException("Withdrawal would result in negative balance");
        }
    }
    
    // Method 2: SERIALIZABLE isolation (most secure)
    @Transactional(isolation = Isolation.SERIALIZABLE)
    public void withdrawFundsSerializable(String accountId, BigDecimal amount) {
        Account account = accountRepository.findById(accountId);
        if (account.getBalance().compareTo(amount) >= 0) {
            account.setBalance(account.getBalance().subtract(amount));
            accountRepository.save(account);
        } else {
            throw new InsufficientFundsException();
        }
        // SERIALIZABLE guarantees no other transaction can interfere
    }
    
    // Method 3: Optimistic locking (good performance)
    @Transactional
    @Retryable(value = {OptimisticLockingFailureException.class}, maxAttempts = 3)
    public void withdrawFundsOptimistic(String accountId, BigDecimal amount) {
        Account account = accountRepository.findById(accountId);
        if (account.getBalance().compareTo(amount) >= 0) {
            account.setBalance(account.getBalance().subtract(amount));
            accountRepository.save(account); // Version check prevents race conditions
        } else {
            throw new InsufficientFundsException();
        }
    }
}

MVCC

Most developers don’t realize that modern databases achieve isolation levels through Multi-Version Concurrency Control (MVCC), not traditional locking. Understanding MVCC explains why certain isolation behaviors seem counterintuitive. Instead of locking rows for reads, databases maintain multiple versions of each row with timestamps. When you start a transaction, you get a consistent snapshot of the database as it existed at that moment.

// What actually happens under MVCC
@Transactional(isolation = Isolation.REPEATABLE_READ)
public void demonstrateMVCC() {
    // T1: Transaction starts, gets snapshot at time=100
    Account account = accountRepository.findById("123"); // Reads version at time=100
    
    // T2: Another transaction modifies the account (creates version at time=101)
    
    // T1: Reads same account again
    Account sameAccount = accountRepository.findById("123"); // Still reads version at time=100!
    
    assert account.getBalance().equals(sameAccount.getBalance()); // MVCC guarantees this
}

MVCC vs Traditional Locking

-- Traditional locking approach (not MVCC)
BEGIN TRANSACTION;
SELECT * FROM accounts WHERE id = '123' FOR SHARE; -- Acquires shared lock
-- Other transactions blocked from writing until this transaction ends

-- MVCC approach (PostgreSQL, MySQL InnoDB)
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM accounts WHERE id = '123'; -- No locks, reads from snapshot
-- Other transactions can write freely, creating new versions

MVCC delivers better performance and reduces deadlock contention compared to traditional locking, but it comes with cleanup overhead requirements (PostgreSQL VACUUM, MySQL purge operations). I have encountered numerous production issues where real-time queries or ETL jobs would suddenly degrade in performance due to aggressive background VACUUM operations on older PostgreSQL versions, though recent versions have significantly improved this behavior. MVCC can also lead to stale reads in long-running transactions, as they maintain their snapshot view even as the underlying data changes.

// MVCC write conflict example
@Transactional
@Retryable(value = {OptimisticLockingFailureException.class})
public void updateAccountMVCC(String accountId, BigDecimal newBalance) {
    Account account = accountRepository.findById(accountId);
    
    // If another transaction modified this account between our read
    // and write, MVCC will detect the conflict and retry
    account.setBalance(newBalance);
    accountRepository.save(account); // May throw OptimisticLockingFailureException
}

This is why PostgreSQL defaults to READ COMMITTED and why long-running analytical queries should use dedicated read replicas—MVCC snapshots can become expensive to maintain over time.
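
One pragmatic way to act on that in Go is to keep two connection pools and route heavy analytical queries to a replica so they never hold long snapshots on the primary. A minimal sketch (the DSNs, driver choice, and report query are illustrative assumptions):

package reporting

import (
    "context"
    "database/sql"
    "time"

    _ "github.com/lib/pq" // any database/sql Postgres driver works here
)

// Stores separates short transactional traffic from long-running reports.
type Stores struct {
    Primary *sql.DB // OLTP writes and short reads
    Replica *sql.DB // long analytical queries run here, off the primary
}

func Open(primaryDSN, replicaDSN string) (*Stores, error) {
    primary, err := sql.Open("postgres", primaryDSN)
    if err != nil {
        return nil, err
    }
    replica, err := sql.Open("postgres", replicaDSN)
    if err != nil {
        return nil, err
    }
    return &Stores{Primary: primary, Replica: replica}, nil
}

// DailyRevenue is an example report that deliberately targets the replica.
func (s *Stores) DailyRevenue(ctx context.Context, day time.Time) (int64, error) {
    var total int64
    err := s.Replica.QueryRowContext(ctx,
        "SELECT COALESCE(SUM(total_amount), 0) FROM orders WHERE created_at::date = $1",
        day.Format("2006-01-02")).Scan(&total)
    return total, err
}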

Java and Spring: The Gold Standard

Spring’s @Transactional annotation is probably the most elegant solution I’ve encountered for transaction management. It uses aspect-oriented programming to wrap methods in transaction boundaries, making the complexity invisible to business logic.

Basic Transaction Management

@Service
@Transactional
public class OrderService {
    
    @Autowired
    private OrderRepository orderRepository;
    
    @Autowired
    private PaymentService paymentService;
    
    @Autowired
    private InventoryService inventoryService;
    
    // All operations within this method are atomic
    public Order processOrder(CreateOrderRequest request) {
        Order order = new Order(request);
        order = orderRepository.save(order);
        
        // If any of these fail, everything rolls back
        Payment payment = paymentService.processPayment(
            order.getCustomerId(), 
            order.getTotalAmount()
        );
        
        inventoryService.reserveItems(order.getItems());
        order.setPaymentId(payment.getId());
        order.setStatus(OrderStatus.CONFIRMED);
        
        return orderRepository.save(order);
    }
}

Different Transaction Types

Spring provides fine-grained control over transaction behavior:

@Service
public class OrderService {
    
    // Read-only transactions can be optimized by the database
    @Transactional(readOnly = true)
    public List<Order> getOrderHistory(String customerId) {
        return orderRepository.findByCustomerId(customerId);
    }
    
    // Long-running operations need higher timeout
    @Transactional(timeout = 300) // 5 minutes
    public void processBulkOrders(List<CreateOrderRequest> requests) {
        for (CreateOrderRequest request : requests) {
            processOrder(request);
        }
    }
    
    // Critical operations need strict isolation
    @Transactional(isolation = Isolation.SERIALIZABLE)
    public void transferInventory(String fromLocation, String toLocation, 
                                String itemId, int quantity) {
        Item fromItem = inventoryRepository.findByLocationAndItem(fromLocation, itemId);
        Item toItem = inventoryRepository.findByLocationAndItem(toLocation, itemId);
        
        if (fromItem.getQuantity() < quantity) {
            throw new InsufficientInventoryException();
        }
        
        fromItem.setQuantity(fromItem.getQuantity() - quantity);
        toItem.setQuantity(toItem.getQuantity() + quantity);
        
        inventoryRepository.save(fromItem);
        inventoryRepository.save(toItem);
    }
    
    // Some operations should create new transactions
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void logAuditEvent(String event, String details) {
        AuditLog log = new AuditLog(event, details, Instant.now());
        auditRepository.save(log);
        // This commits immediately, independent of calling transaction
    }
    
    // Handle specific rollback conditions
    @Transactional(rollbackFor = {BusinessException.class, ValidationException.class})
    public void processComplexOrder(ComplexOrderRequest request) {
        // Business logic that might throw business exceptions
        validateOrderRules(request);
        Order order = createOrder(request);
        processPayment(order);
    }
}

Nested Transactions and Propagation

Understanding nested transactions is critical for building robust systems. In some cases, you want a child transaction to succeed regardless of whether the parent transaction succeeds—these are often called “autonomous transactions” or “independent transactions.” Audit logging is the classic example: running audit writes with REQUIRES_NEW propagation creates independent transactions that commit immediately, regardless of what happens to the parent transaction. Similarly, for notification services, you typically want notifications to be sent even if the business operation partially fails—users should know that something went wrong.

@Service
public class OrderProcessingService {
    
    @Autowired
    private OrderService orderService;
    
    @Autowired
    private NotificationService notificationService;
    
    @Transactional
    public void processOrderWithNotification(CreateOrderRequest request) {
        // This participates in the existing transaction
        Order order = orderService.processOrder(request);
        
        // This creates a new transaction that commits independently
        notificationService.sendOrderConfirmation(order);
        
        // If sending the notification fails, the order transaction can still
        // commit; and if the order transaction rolls back after this point,
        // the notification record has already been committed independently
    }
}

@Service
public class NotificationService {
    
    // Creates a new transaction - notifications are sent even if 
    // the main order processing fails later
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void sendOrderConfirmation(Order order) {
        NotificationRecord record = new NotificationRecord(
            order.getCustomerId(),
            "Order confirmed: " + order.getId(),
            NotificationType.ORDER_CONFIRMATION
        );
        notificationRepository.save(record);
        
        // Send actual notification asynchronously
        emailService.sendAsync(order.getCustomerEmail(), 
                              "Order Confirmation", 
                              generateOrderEmail(order));
    }
}

Go with GORM: Explicit Transaction Management

Go doesn’t have the luxury of annotations, so transaction management becomes more explicit. This actually has benefits—the transaction boundaries are clearly visible in the code.

Basic GORM Transactions

package services

import (
    "context"
    "fmt"
    "gorm.io/gorm"
)

type OrderService struct {
    db *gorm.DB
}

type Order struct {
    ID          uint   `gorm:"primarykey"`
    CustomerID  string
    TotalAmount int64
    Status      string
    PaymentID   string
    Items       []OrderItem `gorm:"foreignKey:OrderID"`
}

type OrderItem struct {
    ID       uint   `gorm:"primarykey"`
    OrderID  uint
    SKU      string
    Quantity int
    Price    int64
}

// Basic transaction with explicit rollback handling
func (s *OrderService) ProcessOrder(ctx context.Context, request CreateOrderRequest) (*Order, error) {
    tx := s.db.Begin()
    defer func() {
        if r := recover(); r != nil {
            tx.Rollback()
            panic(r)
        }
    }()

    order := &Order{
        CustomerID:  request.CustomerID,
        TotalAmount: request.TotalAmount,
        Status:      "PENDING",
    }

    // Save the order
    if err := tx.Create(order).Error; err != nil {
        tx.Rollback()
        return nil, fmt.Errorf("failed to create order: %w", err)
    }

    // Process payment
    paymentID, err := s.processPayment(ctx, tx, order)
    if err != nil {
        tx.Rollback()
        return nil, fmt.Errorf("payment failed: %w", err)
    }

    // Reserve inventory
    if err := s.reserveInventory(ctx, tx, request.Items); err != nil {
        tx.Rollback()
        return nil, fmt.Errorf("inventory reservation failed: %w", err)
    }

    // Update order with payment info
    order.PaymentID = paymentID
    order.Status = "CONFIRMED"
    if err := tx.Save(order).Error; err != nil {
        tx.Rollback()
        return nil, fmt.Errorf("failed to update order: %w", err)
    }

    if err := tx.Commit().Error; err != nil {
        return nil, fmt.Errorf("failed to commit transaction: %w", err)
    }

    return order, nil
}

Functional Transaction Wrapper

To reduce boilerplate, we can create a transaction wrapper:

// TransactionFunc represents a function that runs within a transaction
type TransactionFunc func(tx *gorm.DB) error

// WithTransaction wraps a function in a database transaction
func (s *OrderService) WithTransaction(fn TransactionFunc) error {
    tx := s.db.Begin()
    defer func() {
        if r := recover(); r != nil {
            tx.Rollback()
            panic(r)
        }
    }()

    if err := fn(tx); err != nil {
        tx.Rollback()
        return err
    }

    return tx.Commit().Error
}

// Now our business logic becomes cleaner
func (s *OrderService) ProcessOrderClean(ctx context.Context, request CreateOrderRequest) (*Order, error) {
    var order *Order
    
    err := s.WithTransaction(func(tx *gorm.DB) error {
        order = &Order{
            CustomerID:  request.CustomerID,
            TotalAmount: request.TotalAmount,
            Status:      "PENDING",
        }

        if err := tx.Create(order).Error; err != nil {
            return fmt.Errorf("failed to create order: %w", err)
        }

        paymentID, err := s.processPaymentInTx(ctx, tx, order)
        if err != nil {
            return fmt.Errorf("payment failed: %w", err)
        }

        if err := s.reserveInventoryInTx(ctx, tx, request.Items); err != nil {
            return fmt.Errorf("inventory reservation failed: %w", err)
        }

        order.PaymentID = paymentID
        order.Status = "CONFIRMED"
        
        return tx.Save(order).Error
    })

    return order, err
}

Context-Based Transaction Management

For more sophisticated transaction management, we can use context to pass transactions:

type contextKey string

const txKey contextKey = "transaction"

// WithTransactionContext creates a new context with a transaction
func WithTransactionContext(ctx context.Context, tx *gorm.DB) context.Context {
    return context.WithValue(ctx, txKey, tx)
}

// TxFromContext retrieves a transaction from context
func TxFromContext(ctx context.Context) (*gorm.DB, bool) {
    tx, ok := ctx.Value(txKey).(*gorm.DB)
    return tx, ok
}

// DBFrom returns either the transaction from context or the supplied main DB.
// Making this a package-level helper lets every service (OrderService,
// PaymentService, InventoryService) share it.
func DBFrom(ctx context.Context, db *gorm.DB) *gorm.DB {
    if tx, ok := TxFromContext(ctx); ok {
        return tx
    }
    return db
}

// Now services can automatically use transactions when available
func (s *PaymentService) ProcessPayment(ctx context.Context, customerID string, amount int64) (string, error) {
    db := DBFrom(ctx, s.db) // Uses transaction if available
    
    payment := &Payment{
        CustomerID: customerID,
        Amount:     amount,
        Status:     "PROCESSING",
    }
    
    if err := db.Create(payment).Error; err != nil {
        return "", err
    }
    
    // Simulate payment processing
    if amount > 100000 { // Reject large amounts for demo
        payment.Status = "FAILED"
        db.Save(payment)
        return "", fmt.Errorf("payment amount too large")
    }
    
    payment.Status = "COMPLETED"
    payment.TransactionID = generatePaymentID()
    
    if err := db.Save(payment).Error; err != nil {
        return "", err
    }
    
    return payment.TransactionID, nil
}

// Usage with context-based transactions
func (s *OrderService) ProcessOrderWithContext(ctx context.Context, request CreateOrderRequest) (*Order, error) {
    var order *Order
    
    err := s.WithTransaction(func(tx *gorm.DB) error {
        // Create context with transaction
        txCtx := WithTransactionContext(ctx, tx)
        
        order = &Order{
            CustomerID:  request.CustomerID,
            TotalAmount: request.TotalAmount,
            Status:      "PENDING",
        }

        if err := tx.Create(order).Error; err != nil {
            return err
        }

        // These services will automatically use the transaction
        paymentID, err := s.paymentService.ProcessPayment(txCtx, order.CustomerID, order.TotalAmount)
        if err != nil {
            return err
        }

        if err := s.inventoryService.ReserveItems(txCtx, request.Items); err != nil {
            return err
        }

        order.PaymentID = paymentID
        order.Status = "CONFIRMED"
        
        return tx.Save(order).Error
    })

    return order, err
}

Read-Only and Isolation Control

// Read-only operations can be optimized
func (s *OrderService) GetOrderHistory(ctx context.Context, customerID string) ([]Order, error) {
    var orders []Order
    
    // Use read-only transaction for consistency
    err := s.db.Transaction(func(tx *gorm.DB) error {
        return tx.Raw("SELECT * FROM orders WHERE customer_id = ? ORDER BY created_at DESC", 
                     customerID).Scan(&orders).Error
    }, &sql.TxOptions{ReadOnly: true})
    
    return orders, err
}

// Operations requiring specific isolation levels
func (s *InventoryService) TransferStock(ctx context.Context, fromSKU, toSKU string, quantity int) error {
    return s.db.Transaction(func(tx *gorm.DB) error {
        var fromItem, toItem InventoryItem
        
        // Lock rows to prevent concurrent modifications
        // (GORM v2 locking clause; requires the gorm.io/gorm/clause import)
        if err := tx.Clauses(clause.Locking{Strength: "UPDATE"}).
            Where("sku = ?", fromSKU).First(&fromItem).Error; err != nil {
            return err
        }
        
        if err := tx.Clauses(clause.Locking{Strength: "UPDATE"}).
            Where("sku = ?", toSKU).First(&toItem).Error; err != nil {
            return err
        }
        
        if fromItem.Quantity < quantity {
            return fmt.Errorf("insufficient inventory")
        }
        
        fromItem.Quantity -= quantity
        toItem.Quantity += quantity
        
        if err := tx.Save(&fromItem).Error; err != nil {
            return err
        }
        
        return tx.Save(&toItem).Error
        
    }, &sql.TxOptions{Isolation: sql.LevelSerializable})
}

Rust: Custom Transaction Annotations with Macros

Rust doesn’t have runtime annotations like Java, but we can create compile-time macros that provide similar functionality. This approach gives us zero-runtime overhead while maintaining clean syntax.

Building a Transaction Macro System

First, let’s create the macro infrastructure:

// src/transaction/mod.rs
use diesel::prelude::*;
use diesel::result::Error as DieselError;
use std::fmt;

#[derive(Debug)]
pub enum TransactionError {
    Database(DieselError),
    Business(String),
    Validation(String),
}

impl fmt::Display for TransactionError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            TransactionError::Database(e) => write!(f, "Database error: {}", e),
            TransactionError::Business(e) => write!(f, "Business error: {}", e),
            TransactionError::Validation(e) => write!(f, "Validation error: {}", e),
        }
    }
}

impl std::error::Error for TransactionError {}

// Needed so diesel's transaction machinery can convert its own
// commit/rollback errors into our error type
impl From<DieselError> for TransactionError {
    fn from(e: DieselError) -> Self {
        TransactionError::Database(e)
    }
}

pub type TransactionResult<T> = Result<T, TransactionError>;

// Macro for creating transactional functions.
// The connection identifier is supplied by the caller (as the first
// parameter) because macro_rules! hygiene would otherwise hide a `conn`
// binding created inside the macro from the caller-provided body.
#[macro_export]
macro_rules! transactional {
    (
        fn $name:ident($conn:ident $(, $param:ident: $param_type:ty)*) -> $return_type:ty {
            $($body:tt)*
        }
    ) => {
        fn $name($conn: &mut PgConnection $(, $param: $param_type)*) -> TransactionResult<$return_type> {
            $conn.transaction::<$return_type, TransactionError, _>(|$conn| {
                $($body)*
            })
        }
    };
}

// Macro for read-only transactions (uses diesel's transaction builder so
// the transaction is actually opened in READ ONLY mode)
#[macro_export]
macro_rules! read_only {
    (
        fn $name:ident($conn:ident $(, $param:ident: $param_type:ty)*) -> $return_type:ty {
            $($body:tt)*
        }
    ) => {
        fn $name($conn: &mut PgConnection $(, $param: $param_type)*) -> TransactionResult<$return_type> {
            $conn.build_transaction()
                .read_only()
                .run::<$return_type, TransactionError, _>(|$conn| {
                    $($body)*
                })
        }
    };
}

Using the Transaction Macros

// src/services/order_service.rs
use diesel::prelude::*;
use crate::transaction::*;
use crate::models::*;
use crate::schema::orders::dsl::*;

pub struct OrderService;

impl OrderService {
    // Transactional order processing with automatic rollback
    transactional! {
        fn process_order(conn, request: CreateOrderRequest) -> Order {
            // Create the order
            let new_order = NewOrder {
                customer_id: &request.customer_id,
                total_amount: request.total_amount,
                status: "PENDING",
            };
            
            let order: Order = diesel::insert_into(orders)
                .values(&new_order)
                .get_result(conn)
                .map_err(TransactionError::Database)?;

            // Process payment
            let new_payment_id = Self::process_payment_internal(conn, &order)
                .map_err(|e| TransactionError::Business(format!("Payment failed: {}", e)))?;

            // Reserve inventory
            Self::reserve_inventory_internal(conn, &request.items)
                .map_err(|e| TransactionError::Business(format!("Inventory reservation failed: {}", e)))?;

            // Update order with payment info
            let updated_order = diesel::update(orders.filter(id.eq(order.id)))
                .set((
                    payment_id.eq(&new_payment_id),
                    status.eq("CONFIRMED"),
                ))
                .get_result(conn)
                .map_err(TransactionError::Database)?;

            Ok(updated_order)
        }
    }

    // Read-only transaction for queries
    read_only! {
        fn get_order_history(conn, customer: String) -> Vec<Order> {
            let order_list = orders
                .filter(customer_id.eq(&customer))
                .order(created_at.desc())
                .load::<Order>(conn)
                .map_err(TransactionError::Database)?;
            
            Ok(order_list)
        }
    }

    // Helper functions that work within existing transactions
    fn process_payment_internal(conn: &mut PgConnection, order: &Order) -> Result<String, String> {
        use crate::schema::payments::dsl::*;
        
        let new_payment = NewPayment {
            customer_id: &order.customer_id,
            order_id: order.id,
            amount: order.total_amount,
            status: "PROCESSING",
        };
        
        let payment: Payment = diesel::insert_into(payments)
            .values(&new_payment)
            .get_result(conn)
            .map_err(|e| format!("Payment creation failed: {}", e))?;
        
        // Simulate payment processing logic
        if order.total_amount > 100000 {
            diesel::update(payments.filter(id.eq(payment.id)))
                .set(status.eq("FAILED"))
                .execute(conn)
                .map_err(|e| format!("Payment update failed: {}", e))?;
            
            return Err("Payment amount too large".to_string());
        }
        
        let new_txn_id = format!("txn_{}", uuid::Uuid::new_v4());
        
        diesel::update(payments.filter(id.eq(payment.id)))
            .set((
                status.eq("COMPLETED"),
                transaction_id.eq(&new_txn_id),
            ))
            .execute(conn)
            .map_err(|e| format!("Payment finalization failed: {}", e))?;
        
        Ok(new_txn_id)
    }

    fn reserve_inventory_internal(conn: &mut PgConnection, items: &[OrderItemRequest]) -> Result<(), String> {
        use crate::schema::inventory::dsl::*;
        
        for item in items {
            // Lock the inventory row for update
            let mut inventory_item: InventoryItem = inventory
                .filter(sku.eq(&item.sku))
                .for_update()
                .first(conn)
                .map_err(|e| format!("Inventory lookup failed: {}", e))?;
            
            if inventory_item.quantity < item.quantity {
                return Err(format!("Insufficient inventory for SKU: {}", item.sku));
            }
            
            inventory_item.quantity -= item.quantity;
            
            diesel::update(inventory.filter(sku.eq(&item.sku)))
                .set(quantity.eq(inventory_item.quantity))
                .execute(conn)
                .map_err(|e| format!("Inventory update failed: {}", e))?;
        }
        
        Ok(())
    }
}

Advanced Transaction Features in Rust

// More sophisticated transaction management with isolation levels
#[macro_export]
macro_rules! serializable_transaction {
    (
        fn $name:ident($($param:ident: $param_type:ty),*) -> $return_type:ty {
            $($body:tt)*
        }
    ) => {
        fn $name(conn: &mut PgConnection, $($param: $param_type),*) -> TransactionResult<$return_type> {
            conn.transaction::<$return_type, TransactionError, _>(|conn| {
                // SET TRANSACTION only takes effect inside an open transaction,
                // so issue it as the first statement of the block
                conn.batch_execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
                    .map_err(TransactionError::Database)?;

                $($body)*
            })
        }
    };
}

// Usage for operations requiring strict consistency
impl InventoryService {
    serializable_transaction! {
        // `qty` avoids shadowing the `quantity` column from the DSL
        fn transfer_stock(from_sku: String, to_sku: String, qty: i32) -> (InventoryItem, InventoryItem) {
            use crate::schema::inventory::dsl::*;
            
            // Lock both items in consistent order to prevent deadlocks
            let (first_sku, second_sku) = if from_sku < to_sku {
                (&from_sku, &to_sku)
            } else {
                (&to_sku, &from_sku)
            };
            
            let mut from_item: InventoryItem = inventory
                .filter(sku.eq(first_sku))
                .for_update()
                .first(conn)
                .map_err(TransactionError::Database)?;
                
            let mut to_item: InventoryItem = inventory
                .filter(sku.eq(second_sku))
                .for_update()
                .first(conn)
                .map_err(TransactionError::Database)?;
            
            // Ensure we have the right items
            if from_item.sku != from_sku {
                std::mem::swap(&mut from_item, &mut to_item);
            }
            
            if from_item.quantity < qty {
                return Err(TransactionError::Business(
                    "Insufficient inventory for transfer".to_string()
                ));
            }
            
            from_item.quantity -= qty;
            to_item.quantity += qty;
            
            let updated_from = diesel::update(inventory.filter(sku.eq(&from_sku)))
                .set(quantity.eq(from_item.quantity))
                .get_result(conn)
                .map_err(TransactionError::Database)?;
                
            let updated_to = diesel::update(inventory.filter(sku.eq(&to_sku)))
                .set(quantity.eq(to_item.quantity))
                .get_result(conn)
                .map_err(TransactionError::Database)?;
            
            Ok((updated_from, updated_to))
        }
    }
}

Async Transaction Support

For modern Rust applications using async/await:

// src/transaction/async_transaction.rs
use diesel_async::{AsyncPgConnection, AsyncConnection};
use diesel_async::pooled_connection::bb8::Pool;

#[macro_export]
macro_rules! async_transactional {
    (
        async fn $name:ident($($param:ident: $param_type:ty),*) -> $return_type:ty {
            $($body:tt)*
        }
    ) => {
        async fn $name(pool: &Pool<AsyncPgConnection>, $($param: $param_type),*) -> TransactionResult<$return_type> {
            let mut conn = pool.get().await
                .map_err(|e| TransactionError::Database(e.into()))?;
                
            conn.transaction::<$return_type, TransactionError, _>(|conn| {
                Box::pin(async move {
                    $($body)*
                })
            }).await
        }
    };
}

// Usage with async operations
impl OrderService {
    async_transactional! {
        async fn process_order_async(request: CreateOrderRequest) -> Order {
            // All the same logic as before, but with async/await support
            use crate::schema::orders::dsl::*;

            let new_order = NewOrder {
                customer_id: &request.customer_id,
                total_amount: request.total_amount,
                status: "PENDING",
            };
            
            let order: Order = diesel::insert_into(orders)
                .values(&new_order)
                .get_result(conn)
                .await
                .map_err(TransactionError::Database)?;

            // Process payment asynchronously
            let payment_id = Self::process_payment_async(conn, &order).await
                .map_err(|e| TransactionError::Business(format!("Payment failed: {}", e)))?;

            // Continue with order processing...
            
            Ok(order)
        }
    }
}

Multi-Database Transactions: Two-Phase Commit

I used J2EE and XA transactions extensively in the late 1990s and early 2000s when these standards were being defined by Sun Microsystems with major contributions from IBM, Oracle, and BEA Systems. While these technologies provided strong consistency guarantees, they added enormous complexity to applications and resulted in significant performance issues. The fundamental problem with 2PC is that it’s a blocking protocol—if the transaction coordinator fails during the commit phase, all participating databases remain locked until the coordinator recovers. I’ve seen production systems grind to a halt for hours because of coordinator failures. There are also edge cases that 2PC simply cannot handle, such as network partitions between the coordinator and participants, which led to the development of three-phase commit (3PC). In most cases, you should avoid distributed transactions entirely and use patterns like SAGA, event sourcing, or careful service boundaries instead.

Java XA Transactions

@Configuration
@EnableTransactionManagement
public class XATransactionConfig {
    
    // Note: for Atomikos to enlist these connections in XA transactions, the
    // XADataSource is normally wrapped in an AtomikosDataSourceBean
    @Bean
    @Primary
    public DataSource orderDataSource() {
        MysqlXADataSource xaDataSource = new MysqlXADataSource();
        xaDataSource.setURL("jdbc:mysql://localhost:3306/orders");
        xaDataSource.setUser("orders_user");
        xaDataSource.setPassword("orders_pass");
        return xaDataSource;
    }
    
    @Bean
    public DataSource inventoryDataSource() {
        MysqlXADataSource xaDataSource = new MysqlXADataSource();
        xaDataSource.setURL("jdbc:mysql://localhost:3306/inventory");
        xaDataSource.setUser("inventory_user");
        xaDataSource.setPassword("inventory_pass");
        return xaDataSource;
    }
    
    @Bean
    public JtaTransactionManager jtaTransactionManager() {
        JtaTransactionManager jtaTransactionManager = new JtaTransactionManager();
        jtaTransactionManager.setTransactionManager(atomikosTransactionManager());
        jtaTransactionManager.setUserTransaction(atomikosUserTransaction());
        return jtaTransactionManager;
    }
    
    @Bean(initMethod = "init", destroyMethod = "close")
    public UserTransactionManager atomikosTransactionManager() {
        UserTransactionManager transactionManager = new UserTransactionManager();
        transactionManager.setForceShutdown(false);
        return transactionManager;
    }
    
    @Bean
    public UserTransactionImp atomikosUserTransaction() throws SystemException {
        UserTransactionImp userTransactionImp = new UserTransactionImp();
        userTransactionImp.setTransactionTimeout(300);
        return userTransactionImp;
    }
}

@Service
public class DistributedOrderService {
    
    @Autowired
    @Qualifier("orderDataSource")
    private DataSource orderDataSource;
    
    @Autowired
    @Qualifier("inventoryDataSource")
    private DataSource inventoryDataSource;
    
    // XA transaction spans both databases; the connections must come from
    // JTA-aware data sources so they are enlisted in the global transaction
    @Transactional
    public void processDistributedOrder(CreateOrderRequest request) {
        // Operations on orders database
        try (Connection orderConn = orderDataSource.getConnection()) {
            PreparedStatement orderStmt = orderConn.prepareStatement(
                "INSERT INTO orders (customer_id, total_amount, status) VALUES (?, ?, ?)"
            );
            orderStmt.setString(1, request.getCustomerId());
            orderStmt.setBigDecimal(2, request.getTotalAmount());
            orderStmt.setString(3, "PENDING");
            orderStmt.executeUpdate();
        }
        
        // Operations on inventory database
        try (Connection inventoryConn = inventoryDataSource.getConnection()) {
            for (OrderItem item : request.getItems()) {
                PreparedStatement inventoryStmt = inventoryConn.prepareStatement(
                    "UPDATE inventory SET quantity = quantity - ? WHERE sku = ? AND quantity >= ?"
                );
                inventoryStmt.setInt(1, item.getQuantity());
                inventoryStmt.setString(2, item.getSku());
                inventoryStmt.setInt(3, item.getQuantity());
                
                int updatedRows = inventoryStmt.executeUpdate();
                if (updatedRows == 0) {
                    throw new InsufficientInventoryException("Not enough inventory for " + item.getSku());
                }
            }
        }
        
        // If we get here, both database operations succeeded
        // The XA transaction manager will coordinate the commit across both databases
    }
}

Go Distributed Transactions

Go doesn’t have built-in distributed transaction support, so we need to implement 2PC manually:

package distributed

import (
    "context"
    "database/sql"
    "fmt"
    "log"

    "github.com/google/uuid"
)

type TransactionManager struct {
    resources []XAResource
}

type XAResource interface {
    Prepare(ctx context.Context, txID string) error
    Commit(ctx context.Context, txID string) error
    Rollback(ctx context.Context, txID string) error
}

type DatabaseResource struct {
    db   *sql.DB
    name string
}

func (r *DatabaseResource) Prepare(ctx context.Context, txID string) error {
    tx, err := r.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    
    // Store transaction for later commit/rollback
    // In production, you'd need a proper transaction store
    transactionStore[txID+"-"+r.name] = tx
    
    return nil
}

func (r *DatabaseResource) Commit(ctx context.Context, txID string) error {
    tx, exists := transactionStore[txID+"-"+r.name]
    if !exists {
        return fmt.Errorf("transaction not found: %s", txID)
    }
    
    err := tx.Commit()
    delete(transactionStore, txID+"-"+r.name)
    return err
}

func (r *DatabaseResource) Rollback(ctx context.Context, txID string) error {
    tx, exists := transactionStore[txID+"-"+r.name]
    if !exists {
        return nil // Already rolled back
    }
    
    err := tx.Rollback()
    delete(transactionStore, txID+"-"+r.name)
    return err
}

// Global transaction store (in production, use Redis or similar)
var transactionStore = make(map[string]*sql.Tx)

func (tm *TransactionManager) ExecuteDistributedTransaction(ctx context.Context, fn func(txID string) error) error {
    txID := uuid.New().String()
    
    // Phase 1: Prepare all resources
    for _, resource := range tm.resources {
        if err := resource.Prepare(ctx, txID); err != nil {
            // Rollback all prepared resources
            tm.rollbackAll(ctx, txID)
            return fmt.Errorf("prepare failed: %w", err)
        }
    }
    
    // Execute business logic, passing the transaction ID so the caller can
    // look up the prepared per-resource transactions
    if err := fn(txID); err != nil {
        tm.rollbackAll(ctx, txID)
        return fmt.Errorf("business logic failed: %w", err)
    }
    
    // Phase 2: Commit all resources
    for _, resource := range tm.resources {
        if err := resource.Commit(ctx, txID); err != nil {
            log.Printf("Commit failed for txID %s: %v", txID, err)
            // In production, you'd need a recovery mechanism here
            return fmt.Errorf("commit failed: %w", err)
        }
    }
    
    return nil
}

func (tm *TransactionManager) rollbackAll(ctx context.Context, txID string) {
    for _, resource := range tm.resources {
        if err := resource.Rollback(ctx, txID); err != nil {
            log.Printf("Rollback failed for txID %s: %v", txID, err)
        }
    }
}

// Usage example
func ProcessDistributedOrder(ctx context.Context, request CreateOrderRequest) error {
    orderDB, _ := sql.Open("mysql", "orders_connection_string")
    inventoryDB, _ := sql.Open("mysql", "inventory_connection_string")
    
    tm := &TransactionManager{
        resources: []XAResource{
            &DatabaseResource{db: orderDB, name: "orders"},
            &DatabaseResource{db: inventoryDB, name: "inventory"},
        },
    }
    
    return tm.ExecuteDistributedTransaction(ctx, func(txID string) error {
        // Business logic goes here - use the prepared transactions
        orderTx := transactionStore[txID+"-orders"]
        inventoryTx := transactionStore[txID+"-inventory"]
        
        // Create order
        _, err := orderTx.Exec(
            "INSERT INTO orders (customer_id, total_amount, status) VALUES (?, ?, ?)",
            request.CustomerID, request.TotalAmount, "PENDING",
        )
        if err != nil {
            return err
        }
        
        // Update inventory
        for _, item := range request.Items {
            result, err := inventoryTx.Exec(
                "UPDATE inventory SET quantity = quantity - ? WHERE sku = ? AND quantity >= ?",
                item.Quantity, item.SKU, item.Quantity,
            )
            if err != nil {
                return err
            }
            
            rowsAffected, _ := result.RowsAffected()
            if rowsAffected == 0 {
                return fmt.Errorf("insufficient inventory for %s", item.SKU)
            }
        }
        
        return nil
    })
}
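
The comment in ExecuteDistributedTransaction ("you'd need a recovery mechanism here") hides the hardest part of 2PC: surviving a coordinator crash between the two phases. A minimal sketch of what that usually requires, staying in the same package and reusing the XAResource interface above (all names here are hypothetical): persist the commit/abort decision durably before Phase 2 begins, then replay unfinished decisions on restart.

// Hypothetical decision log: the coordinator records its commit/abort decision
// durably BEFORE telling any participant to commit, so a restarted coordinator
// can finish or roll back in-doubt transactions
type Decision string

const (
    DecisionCommit Decision = "COMMIT"
    DecisionAbort  Decision = "ABORT"
)

type DecisionLog interface {
    Record(txID string, d Decision) error
    Pending() (map[string]Decision, error) // decisions whose Phase 2 may be unfinished
}

// RecoverInDoubt replays logged decisions against all participants after a
// coordinator restart; Commit and Rollback must therefore be idempotent
func RecoverInDoubt(ctx context.Context, dlog DecisionLog, resources []XAResource) error {
    pending, err := dlog.Pending()
    if err != nil {
        return err
    }
    for txID, decision := range pending {
        for _, r := range resources {
            if decision == DecisionCommit {
                if err := r.Commit(ctx, txID); err != nil {
                    log.Printf("recovery commit failed for %s: %v", txID, err)
                }
            } else if err := r.Rollback(ctx, txID); err != nil {
                log.Printf("recovery rollback failed for %s: %v", txID, err)
            }
        }
    }
    return nil
}

Without this journal, a crash after some participants commit and others don't leaves the system in exactly the inconsistent state 2PC was supposed to prevent, which is why the earlier advice to prefer SAGA or event sourcing usually wins.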

Concurrency Control: Optimistic vs Pessimistic

Understanding when to use optimistic versus pessimistic concurrency control can make or break your application’s performance under load.

Pessimistic Locking: “Better Safe Than Sorry”

// Java/JPA pessimistic locking
@Service
public class AccountService {
    
    @Transactional
    public void transferFunds(String fromAccountId, String toAccountId, BigDecimal amount) {
        // Lock accounts in consistent order to prevent deadlocks
        String firstId = fromAccountId.compareTo(toAccountId) < 0 ? fromAccountId : toAccountId;
        String secondId = fromAccountId.compareTo(toAccountId) < 0 ? toAccountId : fromAccountId;
        
        // findByIdForUpdate is a repository method annotated with
        // @Lock(LockModeType.PESSIMISTIC_WRITE), which issues SELECT ... FOR UPDATE
        Account firstAccount = accountRepository.findByIdForUpdate(firstId);
        Account secondAccount = accountRepository.findByIdForUpdate(secondId);
        
        Account fromAccount = fromAccountId.equals(firstId) ? firstAccount : secondAccount;
        Account toAccount = fromAccountId.equals(firstId) ? secondAccount : firstAccount;
        
        if (fromAccount.getBalance().compareTo(amount) < 0) {
            throw new InsufficientFundsException();
        }
        
        fromAccount.setBalance(fromAccount.getBalance().subtract(amount));
        toAccount.setBalance(toAccount.getBalance().add(amount));
        
        accountRepository.save(fromAccount);
        accountRepository.save(toAccount);
    }
}
// Go pessimistic locking with GORM
func (s *AccountService) TransferFunds(ctx context.Context, fromAccountID, toAccountID string, amount int64) error {
    return s.WithTransaction(func(tx *gorm.DB) error {
        var fromAccount, toAccount Account
        
        // Lock accounts in consistent order
        firstID, secondID := fromAccountID, toAccountID
        if fromAccountID > toAccountID {
            firstID, secondID = toAccountID, fromAccountID
        }
        
        // Lock first account (GORM v2: clause.Locking issues SELECT ... FOR UPDATE;
        // requires the "gorm.io/gorm/clause" import)
        if err := tx.Clauses(clause.Locking{Strength: "UPDATE"}).
            Where("id = ?", firstID).First(&fromAccount).Error; err != nil {
            return err
        }
        
        // Lock second account
        if err := tx.Clauses(clause.Locking{Strength: "UPDATE"}).
            Where("id = ?", secondID).First(&toAccount).Error; err != nil {
            return err
        }
        
        // Ensure we have the correct accounts
        if fromAccount.ID != fromAccountID {
            fromAccount, toAccount = toAccount, fromAccount
        }
        
        if fromAccount.Balance < amount {
            return fmt.Errorf("insufficient funds")
        }
        
        fromAccount.Balance -= amount
        toAccount.Balance += amount
        
        if err := tx.Save(&fromAccount).Error; err != nil {
            return err
        }
        
        return tx.Save(&toAccount).Error
    })
}

Optimistic Locking: “Hope for the Best, Handle the Rest”

// JPA optimistic locking with version fields
@Entity
public class Account {
    @Id
    private String id;
    
    private BigDecimal balance;
    
    @Version
    private Long version; // JPA automatically manages this
    
    // getters and setters...
}

@Service
public class OptimisticAccountService {
    
    @Transactional
    @Retryable(value = {OptimisticLockingFailureException.class}, maxAttempts = 3)
    public void transferFunds(String fromAccountId, String toAccountId, BigDecimal amount) {
        Account fromAccount = accountRepository.findById(fromAccountId).orElseThrow();
        Account toAccount = accountRepository.findById(toAccountId).orElseThrow();
        
        if (fromAccount.getBalance().compareTo(amount) < 0) {
            throw new InsufficientFundsException();
        }
        
        fromAccount.setBalance(fromAccount.getBalance().subtract(amount));
        toAccount.setBalance(toAccount.getBalance().add(amount));
        
        // If either account was modified by another transaction,
        // OptimisticLockingFailureException will be thrown
        accountRepository.save(fromAccount);
        accountRepository.save(toAccount);
    }
}
// Rust optimistic locking with version fields
#[derive(Queryable, Identifiable, AsChangeset)]
#[diesel(table_name = accounts)]
pub struct Account {
    pub id: String,
    pub balance: i64,
    pub version: i32,
}

impl AccountService {
    transactional! {
        fn transfer_funds_optimistic(from_account_id: String, to_account_id: String, amount: i64) -> () {
            use crate::schema::accounts::dsl::*;
            
            // Read current versions
            let from_account: Account = accounts
                .filter(id.eq(&from_account_id))
                .first(conn)
                .map_err(TransactionError::Database)?;
                
            let to_account: Account = accounts
                .filter(id.eq(&to_account_id))
                .first(conn)
                .map_err(TransactionError::Database)?;
            
            if from_account.balance < amount {
                return Err(TransactionError::Business("Insufficient funds".to_string()));
            }
            
            // Update with version check
            let from_updated = diesel::update(
                accounts
                    .filter(id.eq(&from_account_id))
                    .filter(version.eq(from_account.version))
            )
            .set((
                balance.eq(from_account.balance - amount),
                version.eq(from_account.version + 1)
            ))
            .execute(conn)
            .map_err(TransactionError::Database)?;
            
            if from_updated == 0 {
                return Err(TransactionError::Business("Concurrent modification detected".to_string()));
            }
            
            let to_updated = diesel::update(
                accounts
                    .filter(id.eq(&to_account_id))
                    .filter(version.eq(to_account.version))
            )
            .set((
                balance.eq(to_account.balance + amount),
                version.eq(to_account.version + 1)
            ))
            .execute(conn)
            .map_err(TransactionError::Database)?;
            
            if to_updated == 0 {
                return Err(TransactionError::Business("Concurrent modification detected".to_string()));
            }
            
            Ok(())
        }
    }
}
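
The Java and Rust versions above cover optimistic locking; for symmetry, here is a minimal Go sketch of the same compare-and-swap idea with GORM, assuming the Account model also carries a Version column and reusing the WithTransaction helper from the pessimistic example (the method and field names are assumptions): the UPDATE carries the expected version in its WHERE clause, and zero affected rows means another transaction committed first.

// Go optimistic locking with GORM: version check as a compare-and-swap
func (s *AccountService) TransferFundsOptimistic(fromID, toID string, amount int64) error {
    return s.WithTransaction(func(tx *gorm.DB) error {
        var from, to Account
        if err := tx.First(&from, "id = ?", fromID).Error; err != nil {
            return err
        }
        if err := tx.First(&to, "id = ?", toID).Error; err != nil {
            return err
        }
        if from.Balance < amount {
            return fmt.Errorf("insufficient funds")
        }

        // UPDATE ... WHERE id = ? AND version = ?; zero rows = a concurrent writer won
        res := tx.Model(&Account{}).
            Where("id = ? AND version = ?", from.ID, from.Version).
            Updates(map[string]interface{}{"balance": from.Balance - amount, "version": from.Version + 1})
        if res.Error != nil {
            return res.Error
        }
        if res.RowsAffected == 0 {
            return fmt.Errorf("concurrent modification detected, retry the transfer")
        }

        res = tx.Model(&Account{}).
            Where("id = ? AND version = ?", to.ID, to.Version).
            Updates(map[string]interface{}{"balance": to.Balance + amount, "version": to.Version + 1})
        if res.Error != nil {
            return res.Error
        }
        if res.RowsAffected == 0 {
            return fmt.Errorf("concurrent modification detected, retry the transfer")
        }
        return nil
    })
}

Returning an error on a failed version check pushes the retry decision to the caller, mirroring what @Retryable does in the Java version.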

Distributed Transactions: SAGA Pattern

When 2PC becomes too heavyweight or you’re dealing with services that don’t support XA transactions, the SAGA pattern provides an elegant alternative using compensating transactions.

Command Pattern for Compensating Transactions

I initially applied this design pattern at a travel booking company in mid 2000 where we had to integrate with numerous external vendors—airline companies, hotels, car rental agencies, insurance providers, and activity booking services. Each vendor had different APIs, response times, and failure modes, but we needed to present customers with a single, atomic booking experience. The command pattern worked exceptionally well for this scenario. When a customer booked a vacation package, we’d execute a chain of commands: reserve flight, book hotel, rent car, purchase insurance. If any step failed midway through, we could automatically issue compensating transactions to undo the previous successful reservations. This approach delivered both excellent performance (operations could run in parallel where possible) and high reliability.

// Base interfaces for SAGA operations
public interface SagaCommand<T> {
    T execute() throws Exception;
    void compensate(T result) throws Exception;
}

public class SagaOrchestrator {
    
    public class SagaExecution<T> {
        private final SagaCommand<T> command;
        private T result;
        private boolean executed = false;
        
        public SagaExecution(SagaCommand<T> command) {
            this.command = command;
        }
        
        public T execute() throws Exception {
            result = command.execute();
            executed = true;
            return result;
        }
        
        public void compensate() throws Exception {
            if (executed && result != null) {
                command.compensate(result);
            }
        }
    }
    
    private final List<SagaExecution<?>> executions = new ArrayList<>();
    
    public <T> T execute(SagaCommand<T> command) throws Exception {
        SagaExecution<T> execution = new SagaExecution<>(command);
        executions.add(execution);
        return execution.execute();
    }
    
    public void compensateAll() {
        // Compensate in reverse order
        for (int i = executions.size() - 1; i >= 0; i--) {
            try {
                executions.get(i).compensate();
            } catch (Exception e) {
                log.error("Compensation failed", e);
                // In production, you'd need dead letter queue handling
            }
        }
    }
}

// Concrete command implementations
public class CreateOrderCommand implements SagaCommand<Order> {
    private final OrderService orderService;
    private final CreateOrderRequest request;
    
    public CreateOrderCommand(OrderService orderService, CreateOrderRequest request) {
        this.orderService = orderService;
        this.request = request;
    }
    
    @Override
    public Order execute() throws Exception {
        return orderService.createOrder(request);
    }
    
    @Override
    public void compensate(Order order) throws Exception {
        orderService.cancelOrder(order.getId());
    }
}

public class ProcessPaymentCommand implements SagaCommand<Payment> {
    private final PaymentService paymentService;
    private final String customerId;
    private final BigDecimal amount;
    
    // constructor omitted for brevity
    
    @Override
    public Payment execute() throws Exception {
        return paymentService.processPayment(customerId, amount);
    }
    
    @Override
    public void compensate(Payment payment) throws Exception {
        paymentService.refundPayment(payment.getId());
    }
}

public class ReserveInventoryCommand implements SagaCommand<List<InventoryReservation>> {
    private final InventoryService inventoryService;
    private final List<OrderItem> items;
    
    // constructor omitted for brevity
    
    @Override
    public List<InventoryReservation> execute() throws Exception {
        return inventoryService.reserveItems(items);
    }
    
    @Override
    public void compensate(List<InventoryReservation> reservations) throws Exception {
        for (InventoryReservation reservation : reservations) {
            inventoryService.releaseReservation(reservation.getId());
        }
    }
}

// Usage in service
@Service
public class SagaOrderService {
    
    public void processOrderWithSaga(CreateOrderRequest request) throws Exception {
        SagaOrchestrator saga = new SagaOrchestrator();
        
        try {
            // Execute commands in sequence
            Order order = saga.execute(new CreateOrderCommand(orderService, request));
            
            Payment payment = saga.execute(new ProcessPaymentCommand(
                paymentService, order.getCustomerId(), order.getTotalAmount()
            ));
            
            List<InventoryReservation> reservations = saga.execute(
                new ReserveInventoryCommand(inventoryService, request.getItems())
            );
            
            // If we get here, everything succeeded
            orderService.confirmOrder(order.getId());
            
        } catch (Exception e) {
            // Compensate all executed commands
            saga.compensateAll();
            throw e;
        }
    }
}

Persistent SAGA with State Machine

// SAGA state management
@Entity
public class SagaTransaction {
    @Id
    private String id;
    
    @Enumerated(EnumType.STRING)
    private SagaStatus status;
    
    private String currentStep;
    
    @ElementCollection
    private List<String> completedSteps = new ArrayList<>();
    
    @ElementCollection
    private List<String> compensatedSteps = new ArrayList<>();
    
    private String contextData; // JSON serialized context
    
    // getters/setters...
}

public enum SagaStatus {
    STARTED, IN_PROGRESS, COMPLETED, COMPENSATING, COMPENSATED, FAILED
}

@Component
public class PersistentSagaOrchestrator {
    
    @Autowired
    private SagaTransactionRepository sagaRepo;
    
    // Persist saga progress in its own transaction (e.g., REQUIRES_NEW for the
    // state updates) so completed-step bookkeeping survives a failing step
    @Transactional
    public void executeSaga(String sagaId, List<SagaStep> steps) {
        SagaTransaction saga = sagaRepo.findById(sagaId)
            .orElse(new SagaTransaction(sagaId));
        
        try {
            for (SagaStep step : steps) {
                if (saga.getCompletedSteps().contains(step.getName())) {
                    continue; // Already completed
                }
                
                saga.setCurrentStep(step.getName());
                saga.setStatus(SagaStatus.IN_PROGRESS);
                sagaRepo.save(saga);
                
                // Execute step
                step.execute();
                
                saga.getCompletedSteps().add(step.getName());
                sagaRepo.save(saga);
            }
            
            saga.setStatus(SagaStatus.COMPLETED);
            sagaRepo.save(saga);
            
        } catch (Exception e) {
            compensateSaga(sagaId);
            throw e;
        }
    }
    
    @Transactional
    public void compensateSaga(String sagaId) {
        SagaTransaction saga = sagaRepo.findById(sagaId)
            .orElseThrow(() -> new IllegalArgumentException("SAGA not found"));
        
        saga.setStatus(SagaStatus.COMPENSATING);
        sagaRepo.save(saga);
        
        // Compensate in reverse order
        List<String> stepsToCompensate = new ArrayList<>(saga.getCompletedSteps());
        Collections.reverse(stepsToCompensate);
        
        for (String stepName : stepsToCompensate) {
            if (saga.getCompensatedSteps().contains(stepName)) {
                continue;
            }
            
            try {
                SagaStep step = findStepByName(stepName);
                step.compensate();
                saga.getCompensatedSteps().add(stepName);
                sagaRepo.save(saga);
            } catch (Exception e) {
                log.error("Compensation failed for step: " + stepName, e);
                saga.setStatus(SagaStatus.FAILED);
                sagaRepo.save(saga);
                return;
            }
        }
        
        saga.setStatus(SagaStatus.COMPENSATED);
        sagaRepo.save(saga);
    }
}

Go SAGA Implementation

// SAGA state machine in Go
package saga

import (
    "context"
    "encoding/json"
    "fmt"
    "time"
    
    "github.com/google/uuid"
    "gorm.io/gorm"
)

type SagaStatus string

const (
    StatusStarted     SagaStatus = "STARTED"
    StatusInProgress  SagaStatus = "IN_PROGRESS"
    StatusCompleted   SagaStatus = "COMPLETED"
    StatusCompensating SagaStatus = "COMPENSATING"
    StatusCompensated SagaStatus = "COMPENSATED"
    StatusFailed      SagaStatus = "FAILED"
)

type SagaTransaction struct {
    ID               string     `gorm:"primarykey"`
    Status           SagaStatus
    CurrentStep      string
    CompletedSteps   string // JSON array
    CompensatedSteps string // JSON array
    ContextData      string // JSON context
    CreatedAt        time.Time
    UpdatedAt        time.Time
}

type SagaStep interface {
    Name() string
    Execute(ctx context.Context, sagaContext map[string]interface{}) error
    Compensate(ctx context.Context, sagaContext map[string]interface{}) error
}

type SagaOrchestrator struct {
    db *gorm.DB
}

func NewSagaOrchestrator(db *gorm.DB) *SagaOrchestrator {
    return &SagaOrchestrator{db: db}
}

func (o *SagaOrchestrator) ExecuteSaga(ctx context.Context, sagaID string, steps []SagaStep, context map[string]interface{}) error {
    return o.db.Transaction(func(tx *gorm.DB) error {
        var saga SagaTransaction
        if err := tx.First(&saga, "id = ?", sagaID).Error; err != nil {
            if err == gorm.ErrRecordNotFound {
                // Create new saga
                contextJSON, _ := json.Marshal(context)
                saga = SagaTransaction{
                    ID:          sagaID,
                    Status:      StatusStarted,
                    ContextData: string(contextJSON),
                }
                if err := tx.Create(&saga).Error; err != nil {
                    return err
                }
            } else {
                return err
            }
        }
        
        // Parse completed steps
        var completedSteps []string
        if saga.CompletedSteps != "" {
            json.Unmarshal([]byte(saga.CompletedSteps), &completedSteps)
        }
        
        completedMap := make(map[string]bool)
        for _, step := range completedSteps {
            completedMap[step] = true
        }
        
        // Execute steps
        for _, step := range steps {
            if completedMap[step.Name()] {
                continue // Already completed
            }
            
            saga.CurrentStep = step.Name()
            saga.Status = StatusInProgress
            if err := tx.Save(&saga).Error; err != nil {
                return err
            }
            
            // Parse saga context
            var sagaContext map[string]interface{}
            json.Unmarshal([]byte(saga.ContextData), &sagaContext)
            
            // Execute step
            if err := step.Execute(ctx, sagaContext); err != nil {
                // Compensate the completed steps, then surface the original failure
                if cErr := o.compensateSaga(ctx, tx, sagaID, steps[:len(completedSteps)]); cErr != nil {
                    return cErr
                }
                return fmt.Errorf("saga step %s failed: %w", step.Name(), err)
            }
            
            // Mark step as completed
            completedSteps = append(completedSteps, step.Name())
            completedJSON, _ := json.Marshal(completedSteps)
            saga.CompletedSteps = string(completedJSON)
            
            // Update context if modified
            updatedContext, _ := json.Marshal(sagaContext)
            saga.ContextData = string(updatedContext)
            
            if err := tx.Save(&saga).Error; err != nil {
                return err
            }
        }
        
        saga.Status = StatusCompleted
        return tx.Save(&saga).Error
    })
}

func (o *SagaOrchestrator) compensateSaga(ctx context.Context, tx *gorm.DB, sagaID string, completedSteps []SagaStep) error {
    var saga SagaTransaction
    if err := tx.First(&saga, "id = ?", sagaID).Error; err != nil {
        return err
    }
    
    saga.Status = StatusCompensating
    if err := tx.Save(&saga).Error; err != nil {
        return err
    }
    
    // Parse compensated steps
    var compensatedSteps []string
    if saga.CompensatedSteps != "" {
        json.Unmarshal([]byte(saga.CompensatedSteps), &compensatedSteps)
    }
    
    compensatedMap := make(map[string]bool)
    for _, step := range compensatedSteps {
        compensatedMap[step] = true
    }
    
    // Compensate in reverse order
    for i := len(completedSteps) - 1; i >= 0; i-- {
        step := completedSteps[i]
        
        if compensatedMap[step.Name()] {
            continue
        }
        
        var sagaContext map[string]interface{}
        json.Unmarshal([]byte(saga.ContextData), &sagaContext)
        
        if err := step.Compensate(ctx, sagaContext); err != nil {
            saga.Status = StatusFailed
            tx.Save(&saga)
            return fmt.Errorf("compensation failed for step %s: %w", step.Name(), err)
        }
        
        compensatedSteps = append(compensatedSteps, step.Name())
        compensatedJSON, _ := json.Marshal(compensatedSteps)
        saga.CompensatedSteps = string(compensatedJSON)
        
        if err := tx.Save(&saga).Error; err != nil {
            return err
        }
    }
    
    saga.Status = StatusCompensated
    return tx.Save(&saga).Error
}

// Concrete step implementations
type CreateOrderStep struct {
    orderService *OrderService
    request      CreateOrderRequest
}

func (s *CreateOrderStep) Name() string {
    return "CREATE_ORDER"
}

func (s *CreateOrderStep) Execute(ctx context.Context, sagaContext map[string]interface{}) error {
    order, err := s.orderService.CreateOrder(ctx, s.request)
    if err != nil {
        return err
    }
    
    // Store order ID in context for later steps
    sagaContext["orderId"] = order.ID
    return nil
}

func (s *CreateOrderStep) Compensate(ctx context.Context, sagaContext map[string]interface{}) error {
    if orderID, exists := sagaContext["orderId"]; exists {
        return s.orderService.CancelOrder(ctx, orderID.(uint))
    }
    return nil
}

// Usage
func ProcessOrderWithSaga(ctx context.Context, orchestrator *SagaOrchestrator, request CreateOrderRequest) error {
    sagaID := uuid.New().String()
    
    steps := []SagaStep{
        &CreateOrderStep{orderService, request},
        &ProcessPaymentStep{paymentService, request.CustomerID, request.TotalAmount},
        &ReserveInventoryStep{inventoryService, request.Items},
    }
    
    context := map[string]interface{}{
        "customerId": request.CustomerID,
        "requestId":  request.RequestID,
    }
    
    return orchestrator.ExecuteSaga(ctx, sagaID, steps, context)
}
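
The step list above references ProcessPaymentStep and ReserveInventoryStep without defining them; the payment step follows exactly the same shape as CreateOrderStep. A minimal sketch, where PaymentService.ProcessPayment, RefundPayment, and the string payment ID are assumptions:

// Hypothetical payment step: charge the customer and stash the payment ID in the
// saga context so the compensation can refund it later
type ProcessPaymentStep struct {
    paymentService *PaymentService
    customerID     string
    amount         int64
}

func (s *ProcessPaymentStep) Name() string {
    return "PROCESS_PAYMENT"
}

func (s *ProcessPaymentStep) Execute(ctx context.Context, sagaContext map[string]interface{}) error {
    payment, err := s.paymentService.ProcessPayment(ctx, s.customerID, s.amount)
    if err != nil {
        return err
    }
    sagaContext["paymentId"] = payment.ID
    return nil
}

func (s *ProcessPaymentStep) Compensate(ctx context.Context, sagaContext map[string]interface{}) error {
    if paymentID, exists := sagaContext["paymentId"]; exists {
        return s.paymentService.RefundPayment(ctx, paymentID.(string))
    }
    return nil
}

ReserveInventoryStep follows the same pattern: Execute records the reservation IDs in the saga context, and Compensate releases them in reverse.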

The Dual-Write Problem: Database + Events

One of the most insidious transaction problems occurs when you need to both update the database and publish an event. I’ve debugged countless production issues where customers received order confirmations but no order existed in the database, or orders were created but notification events never fired.

The Anti-Pattern: Sequential Operations

// THIS IS FUNDAMENTALLY BROKEN - DON'T DO THIS
@Service
public class BrokenOrderService {
    
    @Transactional
    public void processOrder(CreateOrderRequest request) {
        Order order = orderRepository.save(new Order(request));
        
        // DANGER: Event published outside transaction boundary
        eventPublisher.publishEvent(new OrderCreatedEvent(order));
        
        // What if this line throws an exception?
        // Event is already published but transaction will rollback!
    }
    
    // ALSO BROKEN: Event first, then database
    @Transactional  
    public void processOrderEventFirst(CreateOrderRequest request) {
        Order order = new Order(request);
        
        // DANGER: Event published before persistence
        eventPublisher.publishEvent(new OrderCreatedEvent(order));
        
        // What if database save fails?
        // Event consumers will query for order that doesn't exist!
        orderRepository.save(order);
    }
}

Solution 1: Transactional Outbox Pattern

I have used the outbox pattern in a number of applications, especially for sending notifications to users: instead of sending an event directly to a queue, the messages are stored in the database and then relayed to an external service such as the Apple Push Notification service (APNs) or Firebase Cloud Messaging (FCM).

// Outbox event entity
@Entity
@Table(name = "outbox_events")
public class OutboxEvent {
    @Id
    private String id;
    
    @Column(name = "event_type")
    private String eventType;
    
    @Column(name = "payload", columnDefinition = "TEXT")
    private String payload;
    
    @Column(name = "created_at")
    private Instant createdAt;
    
    @Column(name = "processed_at")
    private Instant processedAt;
    
    @Enumerated(EnumType.STRING)
    private OutboxStatus status;
    
    // constructors, getters, setters...
}

public enum OutboxStatus {
    PENDING, PROCESSED, FAILED
}

// Outbox repository
@Repository
public interface OutboxEventRepository extends JpaRepository<OutboxEvent, String> {
    @Query("SELECT e FROM OutboxEvent e WHERE e.status = :status ORDER BY e.createdAt ASC")
    List<OutboxEvent> findByStatusOrderByCreatedAt(@Param("status") OutboxStatus status);
    
    @Modifying
    @Query("UPDATE OutboxEvent e SET e.status = :status, e.processedAt = :processedAt WHERE e.id = :id")
    void updateStatus(@Param("id") String id, @Param("status") OutboxStatus status, @Param("processedAt") Instant processedAt);
}

// Corrected order service using outbox
@Service
public class TransactionalOrderService {
    
    @Autowired
    private OrderRepository orderRepository;
    
    @Autowired
    private OutboxEventRepository outboxRepository;
    
    @Transactional
    public void processOrder(CreateOrderRequest request) {
        // 1. Process business logic
        Order order = new Order(request);
        order = orderRepository.save(order);
        
        // 2. Store event in same transaction
        OutboxEvent event = new OutboxEvent(
            UUID.randomUUID().toString(),
            "OrderCreated",
            serializeEvent(new OrderCreatedEvent(order)),
            Instant.now(),
            OutboxStatus.PENDING
        );
        outboxRepository.save(event);
        
        // Both order and event are committed atomically!
    }
    
    private String serializeEvent(Object event) {
        try {
            return objectMapper.writeValueAsString(event);
        } catch (JsonProcessingException e) {
            throw new RuntimeException("Event serialization failed", e);
        }
    }
}

// Event relay service
@Component
public class OutboxEventRelay {
    
    @Autowired
    private OutboxEventRepository outboxRepository;
    
    @Autowired
    private ApplicationEventPublisher eventPublisher;
    
    @Scheduled(fixedDelay = 1000) // Poll every second
    @Transactional
    public void processOutboxEvents() {
        List<OutboxEvent> pendingEvents = outboxRepository
            .findByStatusOrderByCreatedAt(OutboxStatus.PENDING);
        
        for (OutboxEvent outboxEvent : pendingEvents) {
            try {
                // Deserialize and publish the event
                Object event = deserializeEvent(outboxEvent.getEventType(), outboxEvent.getPayload());
                eventPublisher.publishEvent(event);
                
                // Mark as processed
                outboxRepository.updateStatus(
                    outboxEvent.getId(), 
                    OutboxStatus.PROCESSED, 
                    Instant.now()
                );
                
            } catch (Exception e) {
                log.error("Failed to process outbox event: " + outboxEvent.getId(), e);
                outboxRepository.updateStatus(
                    outboxEvent.getId(), 
                    OutboxStatus.FAILED, 
                    Instant.now()
                );
            }
        }
    }
}

Go Implementation: Outbox Pattern with GORM

// Outbox event model
type OutboxEvent struct {
    ID          string     `gorm:"primarykey"`
    EventType   string     `gorm:"not null"`
    Payload     string     `gorm:"type:text;not null"`
    CreatedAt   time.Time
    ProcessedAt *time.Time
    Status      OutboxStatus `gorm:"type:varchar(20);default:'PENDING'"`
}

type OutboxStatus string

const (
    OutboxStatusPending   OutboxStatus = "PENDING"
    OutboxStatusProcessed OutboxStatus = "PROCESSED" 
    OutboxStatusFailed    OutboxStatus = "FAILED"
)

// Service with outbox pattern
type OrderService struct {
    db           *gorm.DB
    eventRelay   *OutboxEventRelay
}

func (s *OrderService) ProcessOrder(ctx context.Context, request CreateOrderRequest) (*Order, error) {
    var order *Order
    
    err := s.db.Transaction(func(tx *gorm.DB) error {
        // 1. Create order
        order = &Order{
            CustomerID:  request.CustomerID,
            TotalAmount: request.TotalAmount,
            Status:      "CONFIRMED",
        }
        
        if err := tx.Create(order).Error; err != nil {
            return fmt.Errorf("failed to create order: %w", err)
        }
        
        // 2. Store event in same transaction
        eventPayload, err := json.Marshal(OrderCreatedEvent{
            OrderID:     order.ID,
            CustomerID:  order.CustomerID,
            TotalAmount: order.TotalAmount,
        })
        if err != nil {
            return fmt.Errorf("failed to serialize event: %w", err)
        }
        
        outboxEvent := &OutboxEvent{
            ID:        uuid.New().String(),
            EventType: "OrderCreated",
            Payload:   string(eventPayload),
            CreatedAt: time.Now(),
            Status:    OutboxStatusPending,
        }
        
        if err := tx.Create(outboxEvent).Error; err != nil {
            return fmt.Errorf("failed to store outbox event: %w", err)
        }
        
        return nil
    })
    
    return order, err
}

// Event relay service
type OutboxEventRelay struct {
    db            *gorm.DB
    eventPublisher EventPublisher
    ticker         *time.Ticker
    done           chan bool
}

func NewOutboxEventRelay(db *gorm.DB, publisher EventPublisher) *OutboxEventRelay {
    return &OutboxEventRelay{
        db:             db,
        eventPublisher: publisher,
        ticker:         time.NewTicker(1 * time.Second),
        done:           make(chan bool),
    }
}

func (r *OutboxEventRelay) Start(ctx context.Context) {
    go func() {
        for {
            select {
            case <-r.ticker.C:
                r.processOutboxEvents(ctx)
            case <-r.done:
                return
            case <-ctx.Done():
                return
            }
        }
    }()
}

func (r *OutboxEventRelay) processOutboxEvents(ctx context.Context) {
    var events []OutboxEvent
    
    // Find pending events
    if err := r.db.Where("status = ?", OutboxStatusPending).
        Order("created_at ASC").
        Limit(100).
        Find(&events).Error; err != nil {
        log.Printf("Failed to fetch outbox events: %v", err)
        return
    }
    
    for _, event := range events {
        if err := r.processEvent(ctx, event); err != nil {
            log.Printf("Failed to process event %s: %v", event.ID, err)
            
            // Mark as failed
            now := time.Now()
            r.db.Model(&event).Updates(OutboxEvent{
                Status:      OutboxStatusFailed,
                ProcessedAt: &now,
            })
        } else {
            // Mark as processed
            now := time.Now()
            r.db.Model(&event).Updates(OutboxEvent{
                Status:      OutboxStatusProcessed,
                ProcessedAt: &now,
            })
        }
    }
}

func (r *OutboxEventRelay) processEvent(ctx context.Context, event OutboxEvent) error {
    // Deserialize and publish the event
    switch event.EventType {
    case "OrderCreated":
        var orderEvent OrderCreatedEvent
        if err := json.Unmarshal([]byte(event.Payload), &orderEvent); err != nil {
            return err
        }
        return r.eventPublisher.Publish(ctx, orderEvent)
        
    default:
        return fmt.Errorf("unknown event type: %s", event.EventType)
    }
}
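
One property of this relay worth making explicit: if the publish succeeds but the status update fails (a crash between the two, a transient DB error), the event is delivered again on the next tick. The outbox gives you at-least-once delivery, so consumers have to deduplicate. A minimal sketch of an idempotent consumer, assuming a processed_events table keyed by event ID (the table, types, and handler name are illustrative; requires the "gorm.io/gorm/clause" import):

// ProcessedEvent records which event IDs this consumer has already handled
type ProcessedEvent struct {
    EventID     string `gorm:"primarykey"`
    ProcessedAt time.Time
}

// HandleOrderCreated applies side effects and the dedup record in one local
// transaction, so a redelivered event is skipped instead of applied twice
func HandleOrderCreated(db *gorm.DB, eventID string, event OrderCreatedEvent) error {
    return db.Transaction(func(tx *gorm.DB) error {
        // INSERT ... ON CONFLICT DO NOTHING; zero affected rows = duplicate delivery
        res := tx.Clauses(clause.OnConflict{DoNothing: true}).
            Create(&ProcessedEvent{EventID: eventID, ProcessedAt: time.Now()})
        if res.Error != nil {
            return res.Error
        }
        if res.RowsAffected == 0 {
            return nil // already processed: acknowledge and move on
        }

        // Side effects go here (update a read model, enqueue a push, etc.)
        // inside the same tx so they commit atomically with the dedup record
        return nil
    })
}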

Rust Implementation: Outbox with Diesel

// Outbox event model
use diesel::prelude::*;
use serde::{Deserialize, Serialize};
use chrono::{DateTime, Utc};

#[derive(Debug, Clone, Queryable, Insertable, Serialize, Deserialize)]
#[diesel(table_name = outbox_events)]
pub struct OutboxEvent {
    pub id: String,
    pub event_type: String,
    pub payload: String,
    pub created_at: DateTime<Utc>,
    pub processed_at: Option<DateTime<Utc>>,
    pub status: OutboxStatus,
}

#[derive(Debug, Clone, Serialize, Deserialize, diesel_derive_enum::DbEnum)]
#[ExistingTypePath = "crate::schema::sql_types::OutboxStatus"]
pub enum OutboxStatus {
    Pending,
    Processed,
    Failed,
}

// Service with outbox
impl OrderService {
    transactional! {
        fn process_order_with_outbox(request: CreateOrderRequest) -> Order {
            use crate::schema::orders::dsl::*;
            use crate::schema::outbox_events::dsl::*;
            
            // 1. Create order
            let new_order = NewOrder {
                customer_id: &request.customer_id,
                total_amount: request.total_amount,
                status: "CONFIRMED",
            };
            
            let order: Order = diesel::insert_into(orders)
                .values(&new_order)
                .get_result(conn)
                .map_err(TransactionError::Database)?;
            
            // 2. Create event in same transaction
            let event_payload = serde_json::to_string(&OrderCreatedEvent {
                order_id: order.id,
                customer_id: order.customer_id.clone(),
                total_amount: order.total_amount,
            }).map_err(|e| TransactionError::Business(format!("Event serialization failed: {}", e)))?;
            
            let outbox_event = OutboxEvent {
                id: uuid::Uuid::new_v4().to_string(),
                event_type: "OrderCreated".to_string(),
                payload: event_payload,
                created_at: Utc::now(),
                processed_at: None,
                status: OutboxStatus::Pending,
            };
            
            diesel::insert_into(outbox_events)
                .values(&outbox_event)
                .execute(conn)
                .map_err(TransactionError::Database)?;
            
            Ok(order)
        }
    }
}

// Event relay service
pub struct OutboxEventRelay {
    pool: Pool<ConnectionManager<PgConnection>>,
    event_publisher: Arc<dyn EventPublisher>,
}

impl OutboxEventRelay {
    pub async fn start(&self, mut shutdown: tokio::sync::broadcast::Receiver<()>) {
        let mut interval = tokio::time::interval(Duration::from_secs(1));
        
        loop {
            tokio::select! {
                _ = interval.tick() => {
                    if let Err(e) = self.process_outbox_events().await {
                        tracing::error!("Failed to process outbox events: {}", e);
                    }
                }
                _ = shutdown.recv() => {
                    tracing::info!("Outbox relay shutting down");
                    break;
                }
            }
        }
    }
    
    async fn process_outbox_events(&self) -> Result<(), Box<dyn std::error::Error>> {
        use crate::schema::outbox_events::dsl::*;
        
        let mut conn = self.pool.get()?;
        
        let pending_events: Vec<OutboxEvent> = outbox_events
            .filter(status.eq(OutboxStatus::Pending))
            .order(created_at.asc())
            .limit(100)
            .load(&mut conn)?;
        
        for event in pending_events {
            match self.process_single_event(&event).await {
                Ok(_) => {
                    // Mark as processed
                    let now = Utc::now();
                    diesel::update(outbox_events.filter(id.eq(&event.id)))
                        .set((
                            status.eq(OutboxStatus::Processed),
                            processed_at.eq(Some(now))
                        ))
                        .execute(&mut conn)?;
                }
                Err(e) => {
                    tracing::error!("Failed to process event {}: {}", event.id, e);
                    let now = Utc::now();
                    diesel::update(outbox_events.filter(id.eq(&event.id)))
                        .set((
                            status.eq(OutboxStatus::Failed),
                            processed_at.eq(Some(now))
                        ))
                        .execute(&mut conn)?;
                }
            }
        }
        
        Ok(())
    }
}

Solution 2: Change Data Capture (CDC)

I have used this pattern extensively for high-throughput systems where polling-based outbox patterns couldn’t keep up with the data volume. My first implementation was in the early 2000s when building an intelligent traffic management system, where I used CDC to synchronize data between two subsystems—all database changes were captured and published to a JMS queue, with consumers updating their local databases in near real-time. In subsequent projects, I used CDC to publish database changes directly to Kafka topics, enabling downstream services to build analytical systems, populate data lakes, and power real-time reporting dashboards.

Because CDC eliminates the polling overhead entirely, this approach scales to millions of transactions per day while maintaining sub-second latency for downstream consumers.

// CDC-based event publisher using Debezium
@Component
public class CDCEventHandler {
    
    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;
    
    @KafkaListener(topics = "dbserver.public.orders")
    public void handleOrderChange(ConsumerRecord<String, String> record) {
        try {
            // Parse CDC event
            JsonNode changeEvent = objectMapper.readTree(record.value());
            String operation = changeEvent.get("op").asText(); // c=create, u=update, d=delete
            
            if ("c".equals(operation)) {
                JsonNode after = changeEvent.get("after");
                
                OrderCreatedEvent event = OrderCreatedEvent.builder()
                    .orderId(after.get("id").asLong())
                    .customerId(after.get("customer_id").asText())
                    .totalAmount(after.get("total_amount").asLong())
                    .build();
                
                // Publish to business event topic
                kafkaTemplate.send("order-events", 
                    event.getOrderId().toString(), 
                    objectMapper.writeValueAsString(event));
            }
            
        } catch (Exception e) {
            log.error("Failed to process CDC event", e);
        }
    }
}
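
The listener above assumes a Debezium connector is already streaming the orders table into the dbserver.public.orders topic. Registering one is a single call to the Kafka Connect REST API; the sketch below targets the Debezium PostgreSQL connector with 1.x-era property names (2.x renames database.server.name to topic.prefix), so treat the exact keys, ports, and credentials as assumptions to check against your deployment.

// Register a Debezium PostgreSQL connector with Kafka Connect (sketch)
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    connector := map[string]interface{}{
        "name": "orders-connector",
        "config": map[string]string{
            "connector.class":      "io.debezium.connector.postgresql.PostgresConnector",
            "database.hostname":    "localhost",
            "database.port":        "5432",
            "database.user":        "orders_user",
            "database.password":    "orders_pass",
            "database.dbname":      "orders",
            "database.server.name": "dbserver",      // topic names become dbserver.<schema>.<table>
            "table.include.list":   "public.orders", // capture only the orders table
        },
    }

    body, _ := json.Marshal(connector)
    resp, err := http.Post("http://localhost:8083/connectors", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("connector registration:", resp.Status)
}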

Solution 3: Event Sourcing

I have often used this pattern with financial systems in conjunction with CQRS, where all changes are stored as immutable events and can be replayed if needed. This approach is particularly valuable in financial contexts because it provides a complete audit trail—every account balance change, every trade, every adjustment can be traced back to its originating event.

For systems where events are the source of truth:

// Event store as the primary persistence
@Entity
public class EventStore {
    @Id
    private String eventId;
    
    private String aggregateId;
    private String eventType;
    private String eventData;
    private Long version;
    private Instant timestamp;
    
    // getters, setters...
}

@Service
public class EventSourcedOrderService {
    
    @Transactional
    public void processOrder(CreateOrderRequest request) {
        String orderId = UUID.randomUUID().toString();
        
        // Store event - this IS the transaction
        OrderCreatedEvent event = new OrderCreatedEvent(orderId, request);
        
        EventStore eventRecord = new EventStore(
            UUID.randomUUID().toString(),
            orderId,
            "OrderCreated",
            objectMapper.writeValueAsString(event),
            1L,
            Instant.now()
        );
        
        eventStoreRepository.save(eventRecord);
        
        // Publish within the transaction; consumers should use
        // @TransactionalEventListener (AFTER_COMMIT is the default phase) so they
        // only react once the event row has actually been committed
        eventPublisher.publishEvent(event);
        
        // Read model is updated asynchronously via event handlers
    }
}
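
The real payoff of treating the event store as the source of truth is that read models become disposable: they can be rebuilt at any time by replaying an aggregate's events in version order. A minimal replay sketch in Go, where EventRecord mirrors the EventStore entity above, the OrderCreatedEvent type is reused from the Go outbox example, and the read-model shape plus the non-OrderCreated event types are assumptions:

// Rebuild an order read model by replaying its events in version order (sketch)
type EventRecord struct {
    EventID     string
    AggregateID string
    EventType   string
    EventData   string // JSON payload
    Version     int64
}

type OrderReadModel struct {
    OrderID     string
    Status      string
    TotalAmount int64
}

func ReplayOrder(db *gorm.DB, aggregateID string) (*OrderReadModel, error) {
    var events []EventRecord
    if err := db.Where("aggregate_id = ?", aggregateID).
        Order("version ASC").
        Find(&events).Error; err != nil {
        return nil, err
    }

    model := &OrderReadModel{OrderID: aggregateID}
    for _, e := range events {
        switch e.EventType {
        case "OrderCreated":
            var ev OrderCreatedEvent
            if err := json.Unmarshal([]byte(e.EventData), &ev); err != nil {
                return nil, err
            }
            model.Status = "PENDING"
            model.TotalAmount = ev.TotalAmount
        case "OrderConfirmed":
            model.Status = "CONFIRMED"
        case "OrderCancelled":
            model.Status = "CANCELLED"
        }
    }
    return model, nil
}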

Database Connection Pooling & Transaction Isolation

One of the most overlooked sources of transaction problems is the mismatch between application threads and database connections. I’ve debugged production deadlocks that turned out to be caused by connection pool starvation rather than actual database contention.

The Thread-Pool vs Connection-Pool Mismatch

// DANGEROUS CONFIGURATION - This will cause deadlocks
@Configuration
public class ProblematicDataSourceConfig {
    
    @Bean
    public DataSource dataSource() {
        HikariConfig config = new HikariConfig();
        config.setMaximumPoolSize(10);        // Only 10 connections
        config.setConnectionTimeout(30000);   // 30 second timeout
        return new HikariDataSource(config);
    }
    
    @Bean
    public Executor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(50);         // 50 threads!
        executor.setMaxPoolSize(100);         // Up to 100 threads!
        return executor;
    }
}

// This service will cause connection pool starvation
@Service
public class ProblematicBulkService {
    
    @Async
    @Transactional
    public CompletableFuture<Void> processBatch(List<Order> orders) {
        // 100 threads trying to get 10 connections = deadlock
        for (Order order : orders) {
            // Long-running transaction holds connection
            processOrder(order);
            
            // Calling another transactional method = needs another connection
            auditService.logOrderProcessing(order); // DEADLOCK RISK!
        }
        return CompletableFuture.completedFuture(null);
    }
}

Correct Connection Pool Configuration

@Configuration
public class OptimalDataSourceConfig {
    
    @Value("${app.database.max-connections:20}")
    private int maxConnections;
    
    @Bean
    public DataSource dataSource() {
        HikariConfig config = new HikariConfig();
        
        // Rule of thumb: connections >= active threads + buffer
        config.setMaximumPoolSize(maxConnections);
        config.setMinimumIdle(5);
        
        // Prevent connection hoarding
        config.setConnectionTimeout(5000);     // 5 seconds
        config.setIdleTimeout(300000);         // 5 minutes
        config.setMaxLifetime(1200000);        // 20 minutes
        
        // Detect leaked connections
        config.setLeakDetectionThreshold(60000); // 1 minute
        
        return new HikariDataSource(config);
    }
    
    @Bean
    public Executor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        // Keep thread pool smaller than connection pool
        executor.setCorePoolSize(maxConnections - 5);
        executor.setMaxPoolSize(maxConnections);
        executor.setQueueCapacity(100);
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        return executor;
    }
}

// Connection-aware bulk processing
@Service
public class OptimalBulkService {
    
    @Transactional
    public void processBatchOptimized(List<Order> orders) {
        int batchSize = 50; // Tune based on connection pool size
        
        for (int i = 0; i < orders.size(); i += batchSize) {
            List<Order> batch = orders.subList(i, Math.min(i + batchSize, orders.size()));
            
            // Process batch in single transaction
            processBatchInternal(batch);
            
            // Flush and clear to prevent memory issues
            entityManager.flush();
            entityManager.clear();
        }
    }
    
    private void processBatchInternal(List<Order> batch) {
        // Bulk operations to reduce connection time
        orderRepository.saveAll(batch);
        
        // Batch audit logging
        List<AuditLog> auditLogs = batch.stream()
            .map(order -> new AuditLog("ORDER_PROCESSED", order.getId()))
            .collect(Collectors.toList());
        auditRepository.saveAll(auditLogs);
    }
}

ETL Transaction Boundaries: The Performance Killer

Object-Relational (OR) mapping and automated transactions simplify development, but I have encountered countless performance issues where the same patterns were used for ETL processing or importing data. For example, I’ve seen code where developers used Hibernate to import millions of records, and it took hours to process data that should have completed in minutes. In other cases, I saw transactions committed after inserting each record—an import job that should have taken an hour ended up running for days.

These performance issues stem from misunderstanding the fundamental differences between OLTP (Online Transaction Processing) and OLAP/ETL (Online Analytical Processing) workloads. OLTP patterns optimize for individual record operations with strict ACID guarantees, while ETL patterns optimize for bulk operations with relaxed consistency requirements. Care must be taken to understand these tradeoffs and choose the right approach for each scenario.

// WRONG - Each record effectively commits on its own (SLOW!)
@Service
public class SlowETLService {
    
    public void importOrders(InputStream csvFile) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(csvFile))) {
            String line;
            while ((line = reader.readLine()) != null) {
                processOrderRecord(line); // Each call = its own commit!
            }
        } catch (IOException e) {
            throw new UncheckedIOException("CSV import failed", e);
        }
    }
    
    // Note: @Transactional on a private, self-invoked method is silently ignored by
    // Spring's proxy, so each save() below runs in its own repository-level
    // transaction anyway - thousands of commits either way = DISASTER
    @Transactional
    private void processOrderRecord(String csvLine) {
        Order order = parseOrderFromCsv(csvLine);
        orderRepository.save(order);
    }
}

// CORRECT - Batch processing with optimal transaction boundaries
@Service 
public class FastETLService {
    
    @Value("${etl.batch-size:1000}")
    private int batchSize;
    
    @Autowired
    private OrderRepository orderRepository;
    
    // Programmatic transaction boundaries work even for self-invoked methods,
    // where a @Transactional annotation would be silently ignored
    @Autowired
    private TransactionTemplate transactionTemplate;
    
    public void importOrdersOptimized(InputStream csvFile) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(csvFile))) {
            List<Order> batch = new ArrayList<>(batchSize);
            String line;
            
            while ((line = reader.readLine()) != null) {
                Order order = parseOrderFromCsv(line);
                batch.add(order);
                
                if (batch.size() >= batchSize) {
                    processBatch(batch);
                    batch.clear();
                }
            }
            
            // Process remaining records
            if (!batch.isEmpty()) {
                processBatch(batch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException("CSV import failed", e);
        }
    }
    
    private void processBatch(List<Order> orders) {
        // Single transaction for the entire batch
        transactionTemplate.executeWithoutResult(status -> {
            orderRepository.saveAll(orders);
            
            // Bulk validation and error handling
            validateBatch(orders);
            
            // Bulk audit logging
            createAuditEntries(orders);
        });
    }
}
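
Even with saveAll(), Hibernate still issues one INSERT per entity unless JDBC batching is enabled, so fixing the transaction boundary alone won’t give you bulk-insert performance. A minimal sketch, assuming Spring Boot with Hibernate (the batch size of 50 is illustrative):

@Configuration
public class HibernateBatchConfig {
    
    // Enable JDBC batching so batched saves turn into batched INSERT statements
    @Bean
    public HibernatePropertiesCustomizer hibernateBatchCustomizer() {
        return props -> {
            props.put("hibernate.jdbc.batch_size", 50);     // group up to 50 statements per JDBC batch
            props.put("hibernate.order_inserts", true);     // order by entity type to maximize batching
            props.put("hibernate.order_updates", true);
            props.put("hibernate.jdbc.batch_versioned_data", true);
        };
    }
}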

Go Connection Pool Management

// Connection pool configuration with proper sizing
func setupDatabase() *gorm.DB {
    dsn := "host=localhost user=postgres password=secret dbname=orders port=5432 sslmode=disable"
    
    db, err := gorm.Open(postgres.Open(dsn), &gorm.Config{})
    if err != nil {
        log.Fatal("Failed to connect to database:", err)
    }
    
    sqlDB, err := db.DB()
    if err != nil {
        log.Fatal("Failed to get SQL DB:", err)
    }
    
    // Critical: Match pool size to application concurrency
    sqlDB.SetMaxOpenConns(20)        // Maximum connections
    sqlDB.SetMaxIdleConns(5)         // Idle connections
    sqlDB.SetConnMaxLifetime(time.Hour) // Connection lifetime
    sqlDB.SetConnMaxIdleTime(time.Minute * 30) // Idle timeout
    
    return db
}

// ETL with proper batching
type ETLService struct {
    db        *gorm.DB
    batchSize int
}

func (s *ETLService) ImportOrders(csvFile io.Reader) error {
    scanner := bufio.NewScanner(csvFile)
    batch := make([]*Order, 0, s.batchSize)
    
    for scanner.Scan() {
        order, err := s.parseOrderFromCSV(scanner.Text())
        if err != nil {
            log.Printf("Failed to parse order: %v", err)
            continue
        }
        
        batch = append(batch, order)
        
        if len(batch) >= s.batchSize {
            if err := s.processBatch(batch); err != nil {
                return fmt.Errorf("batch processing failed: %w", err)
            }
            batch = batch[:0] // Reset slice but keep capacity
        }
    }
    
    // Process remaining orders
    if len(batch) > 0 {
        return s.processBatch(batch)
    }
    
    return scanner.Err()
}

func (s *ETLService) processBatch(orders []*Order) error {
    return s.db.Transaction(func(tx *gorm.DB) error {
        // Batch insert for performance
        if err := tx.CreateInBatches(orders, len(orders)).Error; err != nil {
            return err
        }
        
        // Bulk audit logging in same transaction
        auditLogs := make([]*AuditLog, len(orders))
        for i, order := range orders {
            auditLogs[i] = &AuditLog{
                Action:    "ORDER_IMPORTED",
                OrderID:   order.ID,
                Timestamp: time.Now(),
            }
        }
        
        return tx.CreateInBatches(auditLogs, len(auditLogs)).Error
    })
}

Deadlock Detection and Resolution

Deadlocks are inevitable in high-concurrency systems. The key is detecting and handling them gracefully:

Database-Level Deadlock Prevention

@Service
public class DeadlockAwareAccountService {
    
    @Retryable(
        value = {CannotAcquireLockException.class, DeadlockLoserDataAccessException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 100, multiplier = 2, random = true)
    )
    @Transactional(isolation = Isolation.READ_COMMITTED)
    public void transferFunds(String fromAccountId, String toAccountId, BigDecimal amount) {
        
        // CRITICAL: Always acquire locks in consistent order to prevent deadlocks
        String firstId = fromAccountId.compareTo(toAccountId) < 0 ? fromAccountId : toAccountId;
        String secondId = fromAccountId.compareTo(toAccountId) < 0 ? toAccountId : fromAccountId;
        
        // Lock accounts in alphabetical order
        Account firstAccount = accountRepository.findByIdForUpdate(firstId);
        Account secondAccount = accountRepository.findByIdForUpdate(secondId);
        
        // Determine which is from/to after locking
        Account fromAccount = fromAccountId.equals(firstId) ? firstAccount : secondAccount;
        Account toAccount = fromAccountId.equals(firstId) ? secondAccount : firstAccount;
        
        if (fromAccount.getBalance().compareTo(amount) < 0) {
            throw new InsufficientFundsException();
        }
        
        fromAccount.setBalance(fromAccount.getBalance().subtract(amount));
        toAccount.setBalance(toAccount.getBalance().add(amount));
        
        accountRepository.save(fromAccount);
        accountRepository.save(toAccount);
    }
}

// Custom repository with explicit locking
@Repository
public interface AccountRepository extends JpaRepository<Account, String> {
    
    @Query("SELECT a FROM Account a WHERE a.id = :id")
    @Lock(LockModeType.PESSIMISTIC_WRITE)
    Account findByIdForUpdate(@Param("id") String id);
    
    // Timeout-based locking to prevent indefinite waits
    @QueryHints({@QueryHint(name = "javax.persistence.lock.timeout", value = "5000")})
    @Query("SELECT a FROM Account a WHERE a.id = :id")
    @Lock(LockModeType.PESSIMISTIC_WRITE)
    Account findByIdForUpdateWithTimeout(@Param("id") String id);
}
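
One easy-to-miss detail: @Retryable and @Backoff come from Spring Retry and are silently ignored unless retry support is enabled on a configuration class, for example:

@Configuration
@EnableRetry // without this, the @Retryable annotation above does nothing
public class RetryConfig {
}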

Application-Level Deadlock Detection

// Deadlock-aware service with timeout and retry
type AccountService struct {
    db           *gorm.DB
    lockTimeout  time.Duration
    retryAttempts int
}

func (s *AccountService) TransferFundsWithDeadlockHandling(
    ctx context.Context, 
    fromAccountID, toAccountID string, 
    amount int64,
) error {
    
    for attempt := 0; attempt < s.retryAttempts; attempt++ {
        err := s.attemptTransfer(ctx, fromAccountID, toAccountID, amount)
        
        if err == nil {
            return nil // Success
        }
        
        // Check if it's a deadlock or timeout error
        if s.isRetryableError(err) {
            // Exponential backoff with jitter
            backoff := time.Duration(attempt+1) * 100 * time.Millisecond
            jitter := time.Duration(rand.Intn(100)) * time.Millisecond
            
            select {
            case <-time.After(backoff + jitter):
                continue // Retry
            case <-ctx.Done():
                return ctx.Err()
            }
        }
        
        return err // Non-retryable error
    }
    
    return fmt.Errorf("transfer failed after %d attempts", s.retryAttempts)
}

func (s *AccountService) attemptTransfer(
    ctx context.Context, 
    fromAccountID, toAccountID string, 
    amount int64,
) error {
    
    // Set transaction timeout
    timeoutCtx, cancel := context.WithTimeout(ctx, s.lockTimeout)
    defer cancel()
    
    return s.db.WithContext(timeoutCtx).Transaction(func(tx *gorm.DB) error {
        // Lock in consistent order
        firstID, secondID := fromAccountID, toAccountID
        if fromAccountID > toAccountID {
            firstID, secondID = toAccountID, fromAccountID
        }
        
        var fromAccount, toAccount Account
        
        // Lock first account (GORM v2: "gorm:query_option" is a v1-era setting; use clause.Locking instead)
        if err := tx.Clauses(clause.Locking{Strength: "UPDATE", Options: "NOWAIT"}).
            Where("id = ?", firstID).First(&fromAccount).Error; err != nil {
            return err
        }
        
        // Lock second account
        if err := tx.Clauses(clause.Locking{Strength: "UPDATE", Options: "NOWAIT"}).
            Where("id = ?", secondID).First(&toAccount).Error; err != nil {
            return err
        }
        
        // Ensure we have the right accounts
        if fromAccount.ID != fromAccountID {
            fromAccount, toAccount = toAccount, fromAccount
        }
        
        if fromAccount.Balance < amount {
            return fmt.Errorf("insufficient funds")
        }
        
        fromAccount.Balance -= amount
        toAccount.Balance += amount
        
        if err := tx.Save(&fromAccount).Error; err != nil {
            return err
        }
        
        return tx.Save(&toAccount).Error
    })
}

func (s *AccountService) isRetryableError(err error) bool {
    errStr := strings.ToLower(err.Error())
    
    // Common deadlock/timeout indicators
    retryablePatterns := []string{
        "deadlock detected",
        "lock timeout",
        "could not obtain lock",
        "serialization failure",
        "concurrent update",
    }
    
    for _, pattern := range retryablePatterns {
        if strings.Contains(errStr, pattern) {
            return true
        }
    }
    
    return false
}

CQRS: Separating Read and Write Transaction Models

I have also used the Command Query Responsibility Segregation (CQRS) pattern in some financial applications; it allows different transaction strategies for reads versus writes:

Java CQRS Implementation

// Command side - strict ACID transactions
@Service
public class OrderCommandService {
    
    @Autowired
    private OrderRepository orderRepository;
    
    @Autowired
    private EventPublisher eventPublisher;
    
    @Transactional(isolation = Isolation.SERIALIZABLE)
    public void createOrder(CreateOrderCommand command) {
        // Validate command
        validateOrderCommand(command);
        
        Order order = new Order(command);
        order = orderRepository.save(order);
        
        // Publish event for read model update
        eventPublisher.publish(new OrderCreatedEvent(order));
    }
    
    @Transactional(isolation = Isolation.READ_COMMITTED)
    public void updateOrderStatus(UpdateOrderStatusCommand command) {
        Order order = orderRepository.findById(command.getOrderId())
            .orElseThrow(() -> new OrderNotFoundException());
        
        order.updateStatus(command.getNewStatus());
        orderRepository.save(order);
        
        eventPublisher.publish(new OrderStatusChangedEvent(order));
    }
}

// Query side - optimized for reads, eventual consistency
@Service
public class OrderQueryService {
    
    @Autowired
    private OrderReadModelRepository readModelRepository;
    
    // Read-only transaction for consistency within the query
    @Transactional(readOnly = true)
    public OrderSummary getOrderSummary(String customerId) {
        List<OrderReadModel> orders = readModelRepository
            .findByCustomerId(customerId);
        
        return OrderSummary.builder()
            .totalOrders(orders.size())
            .totalAmount(orders.stream()
                .mapToLong(OrderReadModel::getTotalAmount)
                .sum())
            .recentOrders(orders.stream()
                .sorted((o1, o2) -> o2.getCreatedAt().compareTo(o1.getCreatedAt()))
                .limit(10)
                .collect(Collectors.toList()))
            .build();
    }
    
    // No transaction needed for simple lookups
    public List<OrderReadModel> searchOrders(OrderSearchCriteria criteria) {
        return readModelRepository.search(criteria);
    }
}

// Event handler updates read model asynchronously
@Component
public class OrderReadModelUpdater {
    
    @EventHandler
    @Async
    @Transactional
    public void handle(OrderCreatedEvent event) {
        OrderReadModel readModel = OrderReadModel.builder()
            .orderId(event.getOrderId())
            .customerId(event.getCustomerId())
            .totalAmount(event.getTotalAmount())
            .status(event.getStatus())
            .createdAt(event.getCreatedAt())
            .build();
        
        readModelRepository.save(readModel);
    }
    
    @EventHandler
    @Async
    @Transactional
    public void handle(OrderStatusChangedEvent event) {
        readModelRepository.updateStatus(
            event.getOrderId(), 
            event.getNewStatus(),
            event.getUpdatedAt()
        );
    }
}

Data Locality and Performance Considerations

Data locality is critical to performance. I have seen a number of issues where the data source was not colocated with the data-processing APIs, resulting in higher network latency and poor throughput:

Multi-Region Database Strategy

@Configuration
public class MultiRegionDataSourceConfig {
    
    // Primary database in same region
    @Bean
    @Primary
    @ConfigurationProperties("spring.datasource.primary")
    public DataSource primaryDataSource() {
        return DataSourceBuilder.create().build();
    }
    
    // Read replica in same availability zone
    @Bean
    @ConfigurationProperties("spring.datasource.read-replica")
    public DataSource readReplicaDataSource() {
        return DataSourceBuilder.create().build();
    }
    
    // Cross-region backup for disaster recovery
    @Bean
    @ConfigurationProperties("spring.datasource.cross-region")
    public DataSource crossRegionDataSource() {
        return DataSourceBuilder.create().build();
    }
}

@Service
public class LocalityAwareOrderService {
    
    @Autowired
    @Qualifier("primaryDataSource")
    private DataSource writeDataSource;
    
    @Autowired
    @Qualifier("readReplicaDataSource") 
    private DataSource readDataSource;
    
    // Writes go to primary in same region
    @Transactional("primaryTransactionManager")
    public void createOrder(CreateOrderRequest request) {
        // Fast writes - same AZ latency ~1ms
        Order order = new Order(request);
        orderRepository.save(order);
    }
    
    // Reads from local replica
    @Transactional(value = "readReplicaTransactionManager", readOnly = true)
    public List<Order> getOrderHistory(String customerId) {
        // Even faster reads - same rack latency ~0.1ms
        return orderRepository.findByCustomerId(customerId);
    }
    
    // Critical path optimization
    @Transactional(timeout = 5) // Fail fast if network issues
    public void processUrgentOrder(UrgentOrderRequest request) {
        // Use connection pooling and keep-alive for predictable latency
        processOrder(request);
    }
}
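
The service above refers to primaryTransactionManager and readReplicaTransactionManager beans that aren’t shown. A minimal sketch of how they might be wired, assuming plain DataSource transactions (with JPA repositories you would configure a JpaTransactionManager per persistence unit instead):

@Configuration
public class MultiRegionTransactionConfig {
    
    @Bean
    @Primary
    public PlatformTransactionManager primaryTransactionManager(
            @Qualifier("primaryDataSource") DataSource dataSource) {
        // Write transactions bound to the in-region primary
        return new DataSourceTransactionManager(dataSource);
    }
    
    @Bean
    public PlatformTransactionManager readReplicaTransactionManager(
            @Qualifier("readReplicaDataSource") DataSource dataSource) {
        // Read-only transactions bound to the local replica
        return new DataSourceTransactionManager(dataSource);
    }
}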

Go with Regional Database Selection

type RegionalDatabaseConfig struct {
    primaryDB    *gorm.DB
    readReplica  *gorm.DB
    crossRegion  *gorm.DB
}

func NewRegionalDatabaseConfig(region string) *RegionalDatabaseConfig {
    // Select database endpoints based on current region
    primaryDSN := fmt.Sprintf("host=%s-primary.db user=app", region)
    readDSN := fmt.Sprintf("host=%s-read.db user=app", region)
    backupDSN := fmt.Sprintf("host=%s-backup.db user=app", getBackupRegion(region))
    
    return &RegionalDatabaseConfig{
        primaryDB:   connectWithLatencyOptimization(primaryDSN),
        readReplica: connectWithLatencyOptimization(readDSN),
        crossRegion: connectWithLatencyOptimization(backupDSN),
    }
}

func connectWithLatencyOptimization(dsn string) *gorm.DB {
    db, err := gorm.Open(postgres.Open(dsn), &gorm.Config{})
    if err != nil {
        log.Fatal("Database connection failed:", err)
    }
    
    sqlDB, _ := db.DB()
    
    // Optimize for low-latency
    sqlDB.SetMaxOpenConns(20)
    sqlDB.SetMaxIdleConns(20)        // Keep connections alive
    sqlDB.SetConnMaxIdleTime(0)      // Never close idle connections
    sqlDB.SetConnMaxLifetime(time.Hour) // Rotate connections hourly
    
    return db
}

type OrderService struct {
    config *RegionalDatabaseConfig
}

func (s *OrderService) CreateOrder(ctx context.Context, request CreateOrderRequest) error {
    // Use primary database for writes
    return s.config.primaryDB.WithContext(ctx).Transaction(func(tx *gorm.DB) error {
        order := &Order{
            CustomerID:  request.CustomerID,
            TotalAmount: request.TotalAmount,
            Status:      "PENDING",
        }
        
        return tx.Create(order).Error
    })
}

func (s *OrderService) GetOrderHistory(ctx context.Context, customerID string) ([]Order, error) {
    var orders []Order
    
    // Use read replica for queries - better performance
    err := s.config.readReplica.WithContext(ctx).
        Where("customer_id = ?", customerID).
        Order("created_at DESC").
        Find(&orders).Error
        
    return orders, err
}

Understanding Consistency: ACID vs CAP vs Linearizability

One of the most confusing aspects of distributed systems is that “consistency” means different things in different contexts. The C in ACID has nothing to do with the C in CAP, and this distinction is crucial for designing reliable systems.

The Three Faces of Consistency

// ACID Consistency: Data integrity constraints
@Entity
public class BankAccount {
    @Id
    private String accountId;
    
    @Column(nullable = false)
    @Min(0) // ACID Consistency: Balance cannot be negative
    private BigDecimal balance;
    
    @Version
    private Long version;
}

// CAP Consistency: Linearizability across distributed nodes
@Service
public class DistributedAccountService {
    
    // This requires coordination across all replicas
    @Transactional
    public void withdraw(String accountId, BigDecimal amount) {
        // All replicas must see this change at the same logical time
        // = CAP Consistency (Linearizability)
        
        BankAccount account = accountRepository.findById(accountId)
            .orElseThrow(() -> new IllegalStateException("Account not found: " + accountId));
        if (account.getBalance().compareTo(amount) < 0) {
            throw new InsufficientFundsException(); // ACID Consistency violated
        }
        account.setBalance(account.getBalance().subtract(amount));
        accountRepository.save(account); // Maintains ACID invariants
    }
}

ACID Consistency ensures your data satisfies business rules and constraints – balances can’t be negative, foreign keys must reference valid records, etc. It’s about data integrity within a single database.

CAP Consistency (Linearizability) means all nodes in a distributed system see the same data at the same time – as if there’s only one copy of the data. It’s about coordination across multiple machines.

Consistency Models in Distributed Systems

Here’s how different NoSQL systems handle consistency:

DynamoDB: Tunable Consistency

I have used DynamoDB extensively in a number of systems, especially when I worked for a large cloud provider. DynamoDB offers both strongly consistent and eventually consistent reads; strongly consistent reads are served by the leader node, which limits throughput and scalability and costs roughly twice as much per read as eventually consistent reads. We had to evaluate each use case carefully to choose consistency settings that balanced performance requirements against cost.

Additionally, NoSQL databases like DynamoDB don’t support foreign key constraints, and multi-item ACID transactions are far more limited than in relational databases (DynamoDB’s TransactWriteItems covers only a bounded number of items per request), though global secondary indexes are supported. This means you often have to handle compensating transactions yourself and maintain referential integrity in application code, as shown later in this post.

// DynamoDB allows you to choose consistency per read operation
@Service
public class DynamoConsistencyService {
    
    @Autowired
    private DynamoDbClient dynamoClient; // AWS SDK v2
    
    // Eventually consistent read - faster, cheaper, may be stale
    public Order getOrder(String orderId) {
        GetItemRequest request = GetItemRequest.builder()
            .tableName("Orders")
            .key(Map.of("orderId", AttributeValue.builder().s(orderId).build()))
            .consistentRead(false) // Eventually consistent
            .build();
            
        return mapToOrder(dynamoClient.getItem(request).item());
    }
    
    // Strongly consistent read - slower, more expensive, always latest
    public Order getOrderStrong(String orderId) {
        GetItemRequest request = GetItemRequest.builder()
            .tableName("Orders")
            .key(Map.of("orderId", AttributeValue.builder().s(orderId).build()))
            .consistentRead(true) // Strongly consistent = linearizable
            .build();
            
        return mapToOrder(dynamoClient.getItem(request).item());
    }
    
    // Writes always go through the leader; DynamoDB replicates them internally.
    // In quorum-based stores (Cassandra, Dynamo-style systems) you tune this yourself:
    // W + R > N guarantees overlap. For example, with N=3, W=2, R=2 every read quorum
    // intersects every write quorum, so at least one replica returns the latest write.
    public void updateOrderStatus(String orderId, String newStatus) {
        UpdateItemRequest request = UpdateItemRequest.builder()
            .tableName("Orders")
            .key(Map.of("orderId", AttributeValue.builder().s(orderId).build()))
            .updateExpression("SET #status = :status")
            .expressionAttributeNames(Map.of("#status", "status")) // "status" is a reserved word
            .expressionAttributeValues(Map.of(":status", AttributeValue.builder().s(newStatus).build()))
            .build();
            
        dynamoClient.updateItem(request);
    }
    
    private Order mapToOrder(Map<String, AttributeValue> item) {
        // Mapping from DynamoDB attributes to the domain object elided for brevity
        return Order.fromItem(item); // hypothetical mapping helper
    }
}

Distributed Locking with DynamoDB

Building on my experience implementing distributed locks with DynamoDB, here’s how to prevent double spending in a distributed NoSQL environment:

// Distributed lock implementation for financial operations
@Service
public class DynamoDistributedLockService {
    
    @Autowired
    private DynamoDbClient dynamoClient; // AWS SDK v2
    
    private static final String LOCK_TABLE = "distributed_locks";
    private static final int LOCK_TIMEOUT_SECONDS = 30;
    
    public boolean acquireTransferLock(String fromAccount, String toAccount, String requestId) {
        // Always lock accounts in consistent order to prevent deadlocks
        List<String> sortedAccounts = Arrays.asList(fromAccount, toAccount)
            .stream()
            .sorted()
            .collect(Collectors.toList());
        
        String lockKey = String.join(":", sortedAccounts);
        long expiryTime = System.currentTimeMillis() + (LOCK_TIMEOUT_SECONDS * 1000);
        
        try {
            PutItemRequest request = PutItemRequest.builder()
                .tableName(LOCK_TABLE)
                .item(Map.of(
                    "lock_key", AttributeValue.builder().s(lockKey).build(),
                    "owner", AttributeValue.builder().s(requestId).build(),
                    "expires_at", AttributeValue.builder().n(String.valueOf(expiryTime)).build(),
                    "accounts", AttributeValue.builder().ss(sortedAccounts).build()
                ))
                .conditionExpression("attribute_not_exists(lock_key) OR expires_at < :current_time")
                .expressionAttributeValues(Map.of(
                    ":current_time", AttributeValue.builder().n(String.valueOf(System.currentTimeMillis())).build()
                ))
                .build();
                
            dynamoClient.putItem(request);
            return true;
            
        } catch (ConditionalCheckFailedException e) {
            return false; // Lock already held
        }
    }
    
    public void releaseLock(String fromAccount, String toAccount, String requestId) {
        List<String> sortedAccounts = Arrays.asList(fromAccount, toAccount)
            .stream()
            .sorted()
            .collect(Collectors.toList());
        
        String lockKey = String.join(":", sortedAccounts);
        
        DeleteItemRequest request = DeleteItemRequest.builder()
            .tableName(LOCK_TABLE)
            .key(Map.of("lock_key", AttributeValue.builder().s(lockKey).build()))
            .conditionExpression("owner = :request_id")
            .expressionAttributeValues(Map.of(
                ":request_id", AttributeValue.builder().s(requestId).build()
            ))
            .build();
            
        try {
            dynamoClient.deleteItem(request);
        } catch (ConditionalCheckFailedException e) {
            log.warn("Failed to release lock - may have been taken by another process: {}", lockKey);
        }
    }
}
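
As a safety net, you can also enable DynamoDB TTL on the lock table so that locks left behind by crashed processes are eventually purged by the service itself. A hypothetical one-time setup sketch (note that TTL expects an epoch-seconds attribute, whereas expires_at above is stored in milliseconds, so a separate attribute such as expires_at_seconds would be needed):

@Service
public class LockTableSetup {
    
    @Autowired
    private DynamoDbClient dynamoClient;
    
    // One-time setup: DynamoDB's TTL sweeper deletes expired lock items in the background
    public void enableLockExpiry() {
        dynamoClient.updateTimeToLive(UpdateTimeToLiveRequest.builder()
            .tableName("distributed_locks")
            .timeToLiveSpecification(TimeToLiveSpecification.builder()
                .attributeName("expires_at_seconds") // hypothetical epoch-seconds attribute
                .enabled(true)
                .build())
            .build());
    }
}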

// Usage in financial service
@Service
public class DynamoFinancialService {
    
    @Autowired
    private DynamoDistributedLockService lockService;
    
    @Autowired
    private DynamoDbClient dynamoClient; // used for the transactional balance updates below
    
    public TransferResult transferFunds(String fromAccount, String toAccount, BigDecimal amount) {
        String requestId = UUID.randomUUID().toString();
        
        // Acquire distributed lock to prevent concurrent transfers
        if (!lockService.acquireTransferLock(fromAccount, toAccount, requestId)) {
            return TransferResult.rejected("Another transfer in progress for these accounts");
        }
        
        try {
            // Now safe to perform transfer with DynamoDB transactions
            List<TransactWriteItem> actions = new ArrayList<>();
            
            // Conditional update for from account
            actions.add(TransactWriteItem.builder()
                .update(Update.builder()
                    .tableName("accounts")
                    .key(Map.of("account_id", AttributeValue.builder().s(fromAccount).build()))
                    .updateExpression("SET balance = balance - :amount")
                    .conditionExpression("balance >= :amount") // Prevent negative balance
                    .expressionAttributeValues(Map.of(
                        ":amount", AttributeValue.builder().n(amount.toString()).build()
                    ))
                    .build())
                .build());
            
            // Conditional update for to account  
            actions.add(TransactWriteItem.builder()
                .update(Update.builder()
                    .tableName("accounts")
                    .key(Map.of("account_id", AttributeValue.builder().s(toAccount).build()))
                    .updateExpression("SET balance = balance + :amount")
                    .expressionAttributeValues(Map.of(
                        ":amount", AttributeValue.builder().n(amount.toString()).build()
                    ))
                    .build())
                .build());
            
            // Execute atomic transaction
            TransactWriteItemsRequest txRequest = TransactWriteItemsRequest.builder()
                .transactItems(actions)
                .build();
                
            dynamoClient.transactWriteItems(txRequest);
            return TransferResult.success();
            
        } catch (TransactionCanceledException e) {
            return TransferResult.rejected("Transfer failed - likely insufficient funds");
        } finally {
            lockService.releaseLock(fromAccount, toAccount, requestId);
        }
    }
}

The CAP Theorem in Practice

// Real-world CAP tradeoffs in microservices
@Service
public class CAPAwareOrderService {
    
    // Partition tolerance is mandatory in distributed systems
    // So you choose: Consistency OR Availability
    
    // CP System: Choose Consistency over Availability
    @CircuitBreaker(name = "createOrder", fallbackMethod = "fallbackCreateOrder") // Resilience4j requires a name
    public OrderResponse createOrderCP(CreateOrderRequest request) {
        try {
            // All replicas must acknowledge - may fail if partition occurs
            return orderService.createOrderWithQuorum(request);
        } catch (PartitionException e) {
            // System becomes unavailable during partition
            throw new ServiceUnavailableException("Cannot guarantee consistency during partition");
        }
    }
    
    // AP System: Choose Availability over Consistency  
    public OrderResponse createOrderAP(CreateOrderRequest request) {
        try {
            // Try strong consistency first
            return orderService.createOrderWithQuorum(request);
        } catch (PartitionException e) {
            // Fall back to available replicas - may create conflicts
            log.warn("Partition detected, falling back to eventual consistency");
            return orderService.createOrderEventual(request);
        }
    }
    
    // Fallback for CP system
    public OrderResponse fallbackCreateOrder(CreateOrderRequest request, Exception ex) {
        // Return cached response or friendly error
        return OrderResponse.builder()
            .status("DEFERRED")
            .message("Order will be processed when system recovers")
            .build();
    }
}

This distinction between ACID consistency and CAP consistency is fundamental to designing distributed systems. ACID gives you data integrity within a single node, while CAP forces you to choose between strong consistency and availability when networks partition. Understanding these tradeoffs lets you make informed architectural decisions based on your business requirements.

CAP Theorem and Financial Consistency

The CAP theorem creates fundamental challenges for financial systems. You cannot have both strong consistency and availability during network partitions, which forces difficult architectural decisions:

// CP System: Prioritize Consistency over Availability
@Service
public class FinancialConsistencyService {
    
    @CircuitBreaker(name = "fundsTransfer", fallbackMethod = "rejectTransaction") // Resilience4j requires a name
    @Transactional(isolation = Isolation.SERIALIZABLE)
    public TransferResult transferFunds(String fromAccount, String toAccount, BigDecimal amount) {
        // Requires consensus across all replicas
        // May become unavailable during partitions
        
        if (!distributedLockService.acquireLock(fromAccount, toAccount)) {
            throw new ServiceUnavailableException("Cannot acquire distributed lock");
        }
        
        try {
            // This operation requires strong consistency across all nodes
            Account from = accountService.getAccountWithQuorum(fromAccount);
            Account to = accountService.getAccountWithQuorum(toAccount);
            
            if (from.getBalance().compareTo(amount) < 0) {
                throw new InsufficientFundsException();
            }
            
            // Both updates must succeed on majority of replicas
            accountService.updateWithQuorum(fromAccount, from.getBalance().subtract(amount));
            accountService.updateWithQuorum(toAccount, to.getBalance().add(amount));
            
            return TransferResult.success();
            
        } finally {
            distributedLockService.releaseLock(fromAccount, toAccount);
        }
    }
    
    public TransferResult rejectTransaction(String fromAccount, String toAccount, 
                                          BigDecimal amount, Exception ex) {
        // During partitions, reject rather than risk double spending
        return TransferResult.rejected("Cannot guarantee consistency during network partition");
    }
}

NoSQL Transaction Limitations: DynamoDB and Compensating Transactions

NoSQL databases often lack ACID guarantees, requiring different strategies:

DynamoDB Optimistic Concurrency

// DynamoDB with optimistic locking using version fields
@DynamoDBTable(tableName = "Orders")
public class DynamoOrder {
    
    @DynamoDBHashKey
    private String orderId;
    
    @DynamoDBAttribute
    private String customerId;
    
    @DynamoDBAttribute
    private Long totalAmount;
    
    @DynamoDBAttribute
    private String status;
    
    @DynamoDBVersionAttribute
    private Long version; // DynamoDB handles optimistic locking
    
    // getters, setters...
}

@Service
public class DynamoOrderService {
    
    @Autowired
    private DynamoDBMapper dynamoMapper;
    
    @Autowired
    private AmazonDynamoDB dynamoClient; // low-level client (SDK v1) for multi-item transactions
    
    // Optimistic concurrency with retry
    @Retryable(
        value = {ConditionalCheckFailedException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 100, multiplier = 2)
    )
    public void updateOrderStatus(String orderId, String newStatus) {
        try {
            // Get current version
            DynamoOrder order = dynamoMapper.load(DynamoOrder.class, orderId);
            if (order == null) {
                throw new OrderNotFoundException();
            }
            
            // Update with version check
            order.setStatus(newStatus);
            dynamoMapper.save(order); // Fails if version changed
            
        } catch (ConditionalCheckFailedException e) {
            // Another process modified the record - retry
            throw e;
        }
    }
    
    // Multi-item transaction using DynamoDB's transaction API (limited to a bounded number of items)
    public void transferOrderItems(String fromOrderId, String toOrderId, List<String> itemIds) {
        // Read both orders first to capture their current versions
        DynamoOrder fromOrder = dynamoMapper.load(DynamoOrder.class, fromOrderId);
        DynamoOrder toOrder = dynamoMapper.load(DynamoOrder.class, toOrderId);
        
        fromOrder.removeItems(itemIds);
        toOrder.addItems(itemIds);
        
        List<TransactWriteItem> actions = new ArrayList<>();
        
        // Update the source order, guarded by its version to detect concurrent modification
        actions.add(new TransactWriteItem().withUpdate(new Update()
            .withTableName("Orders")
            .withKey(Collections.singletonMap("orderId", new AttributeValue(fromOrderId)))
            .withUpdateExpression("SET #items = :items, version = version + :inc")
            .withConditionExpression("version = :currentVersion")
            .withExpressionAttributeNames(Collections.singletonMap("#items", "items"))
            .withExpressionAttributeValues(Map.of(
                ":items", new AttributeValue().withSS(fromOrder.getItems()),
                ":inc", new AttributeValue().withN("1"),
                ":currentVersion", new AttributeValue().withN(fromOrder.getVersion().toString())))));
        
        // Update the destination order with the same version guard
        actions.add(new TransactWriteItem().withUpdate(new Update()
            .withTableName("Orders")
            .withKey(Collections.singletonMap("orderId", new AttributeValue(toOrderId)))
            .withUpdateExpression("SET #items = :items, version = version + :inc")
            .withConditionExpression("version = :currentVersion")
            .withExpressionAttributeNames(Collections.singletonMap("#items", "items"))
            .withExpressionAttributeValues(Map.of(
                ":items", new AttributeValue().withSS(toOrder.getItems()),
                ":inc", new AttributeValue().withN("1"),
                ":currentVersion", new AttributeValue().withN(toOrder.getVersion().toString())))));
        
        // Execute both updates atomically - either both succeed or neither does
        dynamoClient.transactWriteItems(new TransactWriteItemsRequest().withTransactItems(actions));
    }
}

Compensating Transactions for NoSQL

// SAGA pattern for NoSQL databases without transactions
@Service
public class NoSQLOrderSaga {
    
    @Autowired
    private DynamoOrderService orderService;
    
    @Autowired
    private MongoInventoryService inventoryService;
    
    @Autowired
    private CassandraPaymentService paymentService;
    
    public void processOrderWithCompensation(CreateOrderRequest request) {
        CompensatingTransactionManager saga = new CompensatingTransactionManager();
        
        try {
            // Step 1: Create order in DynamoDB
            String orderId = saga.execute("CREATE_ORDER", 
                () -> orderService.createOrder(request),
                (orderId) -> orderService.deleteOrder(orderId)
            );
            
            // Step 2: Reserve inventory in MongoDB  
            String reservationId = saga.execute("RESERVE_INVENTORY",
                () -> inventoryService.reserveItems(request.getItems()),
                (reservationId) -> inventoryService.releaseReservation(reservationId)
            );
            
            // Step 3: Process payment in Cassandra
            String paymentId = saga.execute("PROCESS_PAYMENT",
                () -> paymentService.processPayment(request.getPaymentInfo()),
                (paymentId) -> paymentService.refundPayment(paymentId)
            );
            
            // All steps successful - confirm order
            orderService.confirmOrder(orderId, paymentId, reservationId);
            
        } catch (Exception e) {
            // Compensate all executed steps
            saga.compensateAll();
            throw new OrderProcessingException("Order processing failed", e);
        }
    }
}

// Generic compensating transaction manager
public class CompensatingTransactionManager {
    
    private final List<CompensatingAction> executedActions = new ArrayList<>();
    
    public <T> T execute(String stepName, Supplier<T> action, Consumer<T> compensation) {
        try {
            T result = action.get();
            executedActions.add(new CompensatingAction<>(stepName, result, compensation));
            return result;
        } catch (Exception e) {
            log.error("Step {} failed: {}", stepName, e.getMessage());
            throw e;
        }
    }
    
    public void compensateAll() {
        // Compensate in reverse order
        for (int i = executedActions.size() - 1; i >= 0; i--) {
            CompensatingAction action = executedActions.get(i);
            try {
                action.compensate();
                log.info("Compensated step: {}", action.getStepName());
            } catch (Exception e) {
                log.error("Compensation failed for step: {}", action.getStepName(), e);
                // In production, you'd send to dead letter queue
            }
        }
    }
    
    @AllArgsConstructor
    private static class CompensatingAction<T> {
        private final String stepName;
        private final T result;
        private final Consumer<T> compensationAction;
        
        public void compensate() {
            compensationAction.accept(result);
        }
        
        public String getStepName() {
            return stepName;
        }
    }
}

Real-World Considerations

After implementing transaction management across dozens of production systems, here are the lessons that only come from battle scars:

Performance vs. Consistency Tradeoffs

// Different strategies for different use cases
@Service
public class OrderService {
    
    // High-consistency financial operations
    @Transactional(isolation = Isolation.SERIALIZABLE)
    public void processPayment(PaymentRequest request) {
        // Strict ACID guarantees
    }
    
    // Analytics operations can be eventually consistent
    @Transactional(readOnly = true, isolation = Isolation.READ_COMMITTED)
    public OrderAnalytics generateAnalytics(String customerId) {
        // Faster reads, acceptable if slightly stale
    }
    
    // Bulk operations need careful batching
    @Transactional
    public void processBulkOrders(List<CreateOrderRequest> requests) {
        int batchSize = 100;
        for (int i = 0; i < requests.size(); i += batchSize) {
            int end = Math.min(i + batchSize, requests.size());
            List<CreateOrderRequest> batch = requests.subList(i, end);
            
            for (CreateOrderRequest request : batch) {
                processOrder(request);
            }
            
            // Flush changes to avoid memory issues
            entityManager.flush();
            entityManager.clear();
        }
    }
}

Monitoring and Observability

Transaction boundaries are invisible until they fail. Proper monitoring is crucial:

@Component
public class TransactionMetrics {
    
    private final MeterRegistry meterRegistry;
    
    public TransactionMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }
    
    @EventListener
    public void handleTransactionCommit(TransactionCommitEvent event) {
        // Record the duration carried by the (custom) event; starting and stopping a
        // Timer.Sample inside this handler would measure the handler, not the transaction
        Timer.builder("transaction.duration")
                .tag("status", "commit")
                .tag("name", event.getTransactionName())
                .register(meterRegistry)
                .record(event.getDuration());
    }
    
    @EventListener
    public void handleTransactionRollback(TransactionRollbackEvent event) {
        Counter.builder("transaction.rollback")
                .tag("reason", event.getRollbackReason())
                .register(meterRegistry)
                .increment();
    }
}
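
Connection pool starvation (covered earlier) is also much easier to catch when the pool itself is instrumented. A minimal sketch, assuming HikariCP with Micrometer; Spring Boot wires this automatically when Micrometer is on the classpath, so the explicit configuration is shown only for clarity:

@Configuration
public class PoolMetricsConfig {
    
    // Publishes hikaricp.connections.* gauges (active, idle, pending) to the meter registry
    @Bean
    public DataSource dataSource(MeterRegistry meterRegistry) {
        HikariConfig config = new HikariConfig();
        // JDBC URL, credentials and full pool sizing omitted for brevity
        config.setMaximumPoolSize(20);
        config.setMetricsTrackerFactory(new MicrometerMetricsTrackerFactory(meterRegistry));
        return new HikariDataSource(config);
    }
}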

Testing Transaction Boundaries

@TestConfiguration
static class TransactionTestConfig {
    
    @Bean
    @Primary
    public PlatformTransactionManager testTransactionManager() {
        // Use test-specific transaction manager that tracks state
        return new TestTransactionManager();
    }
}

@Test
public void testTransactionRollbackOnFailure() {
    // Given
    CreateOrderRequest request = createValidOrderRequest();
    
    // Mock payment service to fail
    when(paymentService.processPayment(any(), any()))
        .thenThrow(new PaymentException("Payment failed"));
    
    // When
    assertThatThrownBy(() -> orderService.processOrder(request))
        .isInstanceOf(PaymentException.class);
    
    // Then
    assertThat(orderRepository.count()).isEqualTo(0); // No order created
    assertThat(inventoryRepository.findBySku("TEST_SKU").getQuantity())
        .isEqualTo(100); // Inventory not decremented
}

Conclusion

Transaction management isn’t just about preventing data corruption—it’s about building systems that you can reason about, debug, and trust. Whether you’re using Spring’s elegant annotations, Go’s explicit transaction handling, or building custom macros in Rust, the key principles remain the same:

  1. Make transaction boundaries explicit – Either through annotations, function signatures, or naming conventions
  2. Fail fast and fail clearly – Don’t let partial failures create zombie states
  3. Design for compensation – In distributed systems, rollback isn’t always possible
  4. Monitor transaction health – You can’t improve what you don’t measure
  5. Test failure scenarios – Happy path testing doesn’t catch transaction bugs

The techniques we’ve covered—from basic ACID transactions to sophisticated SAGA patterns—form a spectrum of tools. Choose the right tool for your consistency requirements, performance needs, and operational complexity. Remember, the best transaction strategy is the one that lets you sleep soundly at night, knowing your data is safe and your system is predictable.
