Shahzad Bhatti: Welcome to my ramblings and rants!

August 25, 2025

Beyond Vibe Coding: Using TLA+ and Executable Specifications with Claude

Filed under: Computing, Uncategorized — admin @ 9:45 pm

TL;DR: The Problem and Solution

Problem: AI-assisted coding fails when modifying existing systems because we give AI vague specifications.

Solution: Use TLA+ formal specifications as precise contracts that Claude can implement reliably.

Result: Transform Claude from a code generator into a reliable engineering partner that reasons about complex systems.

After months of using Claude for development, I discovered most AI-assisted coding fails not because the AI isn’t smart enough, but because we’re asking it to work from vague specifications. This post shows you how to move beyond “vibe coding” using executable specifications that turn Claude into a reliable engineering partner.

Here’s what changes when you use TLA+ with Claude:

Before (Vibe Coding):

  • “Create a task management API”
  • Claude guesses at requirements
  • Inconsistent behavior across edge cases
  • Bugs in corner cases

After (TLA+ Specifications):

  • Precise mathematical specification
  • Claude implements exactly what you specified
  • All edge cases defined upfront
  • Properties verified before deployment

The Vibe Coding Problem

AI assistants like Claude are primarily trained on greenfield development patterns. They excel at:

  • Writing new functions from scratch
  • Implementing well-known algorithms
  • Creating boilerplate code

But they struggle with:

  • Understanding implicit behavioral contracts in existing code
  • Maintaining invariants across system modifications
  • Reasoning about state transitions and edge cases
  • Preserving non-functional requirements (performance, security, etc.)

The solution isn’t better prompts – it’s better specifications.

Enter Executable Specifications

An executable specification is a formal description of system behavior that can be:

  1. Verified – Checked for logical consistency
  2. Validated – Tested against real-world scenarios
  3. Executed – Run to generate test cases or even implementations

I’ve tried many approaches to precise specifications over the years:

UML and Model Driven Development (2000s-2010s): In the early 2000s I used tools like Rational Rose and Visual Paradigm that promised complete code generation from UML models. The reality was different:

  • Visual complexity: UML diagrams became unwieldy for anything non-trivial
  • Tool lock-in: Proprietary formats and expensive tooling
  • Impedance mismatch: The gap between UML models and real code was huge
  • Maintenance nightmare: Keeping models and code synchronized was nearly impossible
  • Limited expressiveness: UML couldn’t capture complex behavioral contracts

BDD and Gherkin (mid-2000s): In the mid-2000s I moved to BDD and Gherkin, which were better than UML for behavioral specifications but still limited:

  • Structured natural language: Readable but not truly executable
  • No logical reasoning: Couldn’t catch design contradictions
  • Testing focused: Good for acceptance criteria, poor for system design

TLA+ (present): Takes executable specifications to their logical conclusion:

  • Mathematical precision: Eliminates ambiguity completely
  • Model checking: Explores all possible execution paths
  • Tool independence: Plain text specifications, open source tools
  • Behavioral focus: Designed specifically for concurrent and distributed systems

Why TLA+ with Claude?

The magic happens when you combine TLA+’s precision with Claude’s implementation capabilities:

  1. TLA+ eliminates ambiguity – There’s only one way to interpret a formal specification
  2. Claude can read TLA+ – It understands the formal syntax and can translate it to code
  3. Verification catches design flaws – TLA+ model checking finds edge cases you’d miss
  4. Generated traces become tests – TLA+ execution paths become your test suite

Setting Up Your Claude and TLA+ Environment

Installing Claude Desktop

First, let’s get Claude running on your machine:

# Install via Homebrew (macOS)
brew install --cask claude

# Or download directly from Anthropic
# https://claude.ai/download
After installation, a few optional setup steps make Claude more effective for specification-driven work:

  • Set up project-specific contexts in ~/.claude/
  • Create TLA+ syntax rules for better code generation
  • Configure memory settings for specification patterns

Configuring Your Workspace

Once installed, I recommend creating a dedicated workspace structure. Here’s what works for me:

# Create a Claude workspace directory
mkdir -p ~/claude-workspace/{projects,templates,context}

# Add a context file for your coding standards
cat > ~/claude-workspace/context/coding-standards.md << 'EOF'
# My Coding Standards

- Use descriptive variable names
- Functions should do one thing well
- Write tests for all new features
- Handle errors explicitly
- Document complex logic
EOF

Installing TLA+ Tools

Choose based on your workflow:

  • GUI users: TLA+ Toolbox for visual model checking
  • CLI users: tla2tools.jar for CI integration
  • Both: VS Code extension for syntax highlighting
# Download TLA+ Tools from https://github.com/tlaplus/tlaplus/releases
# Or use Homebrew on macOS
brew install --cask tla-plus-toolbox

# For command-line usage (recommended for CI)
wget https://github.com/tlaplus/tlaplus/releases/download/v1.8.0/tla2tools.jar

VS Code Extension

Install the TLA+ extension for syntax highlighting and basic validation:

code --install-extension alygin.vscode-tlaplus

Your First TLA+ Specification

Let’s start with a simple example to understand the syntax:

--------------------------- MODULE SimpleCounter ---------------------------
VARIABLE counter

Init == counter = 0

Increment == counter' = counter + 1

Decrement == counter' = counter - 1

Next == Increment \/ Decrement

Spec == Init /\ [][Next]_counter

TypeInvariant == counter \in Int

=============================================================================

This specification defines:

  • State: A counter variable
  • Initial condition: Counter starts at 0
  • Actions: Increment or decrement operations
  • Next state relation: Either action can occur
  • Invariant: Counter is always an integer
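
Before moving to a real system, it helps to see how even this tiny spec maps onto code. Here is a minimal Go sketch of that mapping (my own illustration, not something generated by Claude): each TLA+ action becomes a method and the invariant becomes a check.

package counter

import "fmt"

// Counter holds the single state variable from the SimpleCounter module.
type Counter struct {
	value int
}

// Increment mirrors the Increment action: counter' = counter + 1.
func (c *Counter) Increment() { c.value++ }

// Decrement mirrors the Decrement action: counter' = counter - 1.
func (c *Counter) Decrement() { c.value-- }

// CheckTypeInvariant mirrors TypeInvariant == counter \in Int. For a plain
// int field Go's type system already guarantees this; richer specs turn
// such invariants into real runtime checks, as in the task example below.
func (c *Counter) CheckTypeInvariant() error {
	return nil
}

// String is a convenience for logging the current state.
func (c *Counter) String() string {
	return fmt.Sprintf("counter=%d", c.value)
}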

Real-World Example: Task Management API

Now let’s build something real. We’ll create a task management API using TLA+ specifications that Claude can implement in Go.

Step 1: Define the System State

First, we model what our system looks like (TaskManagement.tla):

--------------------------- MODULE TaskManagement ---------------------------
EXTENDS Integers, Sequences, FiniteSets, TLC

CONSTANTS
    Users,          \* Set of users
    MaxTasks,       \* Maximum number of tasks
    MaxTime,        \* Maximum time value for simulation
    Titles,         \* Set of possible task titles
    Descriptions    \* Set of possible task descriptions

VARIABLES
    tasks,          \* Function from task ID to task record
    userTasks,      \* Function from user ID to set of task IDs
    nextTaskId,     \* Counter for generating unique task IDs
    currentUser,    \* Currently authenticated user
    clock,          \* Global clock for timestamps
    sessions        \* Active user sessions

\* Task states enumeration with valid transitions
TaskStates == {"pending", "in_progress", "completed", "cancelled", "blocked"}

\* Priority levels
Priorities == {"low", "medium", "high", "critical"}

\* Valid state transitions
ValidTransitions == {
    <<"pending", "in_progress">>,
    <<"pending", "cancelled">>,
    <<"pending", "blocked">>,
    <<"in_progress", "completed">>,
    <<"in_progress", "cancelled">>,
    <<"in_progress", "blocked">>,
    <<"in_progress", "pending">>,      \* Allow reverting to pending
    <<"blocked", "pending">>,
    <<"blocked", "in_progress">>,
    <<"blocked", "cancelled">>
}


TaskRecord == [
    id: Nat,
    title: STRING,
    description: STRING,
    status: TaskStates,
    priority: Priorities,
    assignee: Users,
    createdAt: Nat,
    dueDate: Nat \cup {NULL}
]

\* Type invariants
TypeInvariant == 
    /\ tasks \in [Nat -> TaskRecord]
    /\ userTasks \in [Users -> SUBSET Nat]
    /\ nextTaskId \in Nat
    /\ currentUser \in Users \cup {NULL}

Step 2: Define System Actions

Now we specify what operations are possible (TaskManagement.tla):

\* System initialization
Init ==
    /\ tasks = [i \in {} |-> CHOOSE x : FALSE]  \* Empty function
    /\ userTasks = [u \in Users |-> {}]
    /\ nextTaskId = 1
    /\ currentUser = NULL
    /\ clock = 0
    /\ sessions = [u \in Users |-> FALSE]

\* User authentication
Authenticate(user) ==
    /\ user \in Users
    /\ ~sessions[user]  \* User not already logged in
    /\ currentUser' = user
    /\ sessions' = [sessions EXCEPT ![user] = TRUE]
    /\ UNCHANGED <<tasks, userTasks, nextTaskId, clock>>

\* Create a new task
CreateTask(title, description, priority, dueDate) ==
    /\ currentUser # NULL
    /\ nextTaskId <= MaxTasks
    /\ LET newTask == [
           id |-> nextTaskId,
           title |-> title,
           description |-> description,
           status |-> "pending",
           priority |-> priority,
           assignee |-> currentUser,
           createdAt |-> nextTaskId, \* Simplified timestamp
           dueDate |-> dueDate
       ] IN
       /\ tasks' = tasks @@ (nextTaskId :> newTask)
       /\ userTasks' = [userTasks EXCEPT ![currentUser] = @ \cup {nextTaskId}]
       /\ nextTaskId' = nextTaskId + 1
       /\ UNCHANGED <<currentUser, clock, sessions>>

\* Update task status
UpdateTaskStatus(taskId, newStatus) ==
    /\ currentUser # NULL
    /\ taskId \in DOMAIN tasks
    /\ taskId \in userTasks[currentUser]
    /\ newStatus \in TaskStates
    /\ tasks' = [tasks EXCEPT ![taskId].status = newStatus]
    /\ UNCHANGED <<userTasks, nextTaskId, currentUser, clock, sessions>>

\* Delete a task
DeleteTask(taskId) ==
    /\ currentUser # NULL
    /\ taskId \in DOMAIN tasks
    /\ taskId \in userTasks[currentUser]
    /\ tasks' = [id \in (DOMAIN tasks \ {taskId}) |-> tasks[id]]
    /\ userTasks' = [userTasks EXCEPT ![currentUser] = @ \ {taskId}]
    /\ UNCHANGED <<nextTaskId, currentUser, clock, sessions>>

Step 3: Safety and Liveness Properties

TLA+ shines when defining system properties (TaskManagement.tla):

\* Safety properties
NoOrphanTasks ==
    \A taskId \in DOMAIN tasks :
        \E user \in Users : taskId \in GetUserTasks(user)

TaskOwnership ==
    \A taskId \in DOMAIN tasks :
        tasks[taskId].assignee \in Users /\
        taskId \in GetUserTasks(tasks[taskId].assignee)

ValidTaskIds ==
    \A taskId \in DOMAIN tasks : 
        /\ taskId < nextTaskId
        /\ taskId >= 1

NoDuplicateTaskIds ==
    \A t1, t2 \in DOMAIN tasks :
        t1 = t2 \/ tasks[t1].id # tasks[t2].id

ValidStateTransitionsInvariant ==
    \A taskId \in DOMAIN tasks :
        tasks[taskId].status \in TaskStates

ConsistentTimestamps ==
    \A taskId \in DOMAIN tasks :
        /\ tasks[taskId].createdAt <= tasks[taskId].updatedAt
        /\ tasks[taskId].updatedAt <= clock

NoCyclicDependencies ==
    LET
        \* Transitive closure of dependencies
        RECURSIVE TransitiveDeps(_)
        TransitiveDeps(taskId) ==
            IF ~TaskExists(taskId) THEN {}
            ELSE LET directDeps == tasks[taskId].dependencies IN
                 directDeps \cup 
                 UNION {TransitiveDeps(dep) : dep \in directDeps}
    IN
    \A taskId \in DOMAIN tasks :
        taskId \notin TransitiveDeps(taskId)

AuthenticationRequired ==
    \* All task operations require authentication
    \A taskId \in DOMAIN tasks :
        tasks[taskId].createdBy \in Users

SafetyInvariant ==
    /\ NoOrphanTasks
    /\ TaskOwnership
    /\ ValidTaskIds
    /\ NoDuplicateTaskIds
    /\ ValidStateTransitionsInvariant
    /\ ConsistentTimestamps
    /\ NoCyclicDependencies
    /\ AuthenticationRequired

\* Next state relation
Next ==
    \/ AdvanceTime
    \/ \E user \in Users : Authenticate(user)
    \/ Logout
    \/ \E t \in Titles, d \in Descriptions, p \in Priorities, 
         u \in Users, dd \in 0..MaxTime \cup {NULL},
         tags \in SUBSET {"bug", "feature", "enhancement", "documentation"},
         deps \in SUBSET DOMAIN tasks :
       CreateTask(t, d, p, u, dd, tags, deps)
    \/ \E taskId \in DOMAIN tasks, newStatus \in TaskStates :
       UpdateTaskStatus(taskId, newStatus)
    \/ \E taskId \in DOMAIN tasks, newPriority \in Priorities :
       UpdateTaskPriority(taskId, newPriority)
    \/ \E taskId \in DOMAIN tasks, newAssignee \in Users :
       ReassignTask(taskId, newAssignee)
    \/ \E taskId \in DOMAIN tasks, t \in Titles, 
         d \in Descriptions, dd \in 0..MaxTime \cup {NULL} :
       UpdateTaskDetails(taskId, t, d, dd)
    \/ \E taskId \in DOMAIN tasks : DeleteTask(taskId)
    \/ CheckDependencies
    \/ \E taskIds \in SUBSET DOMAIN tasks, newStatus \in TaskStates :
       taskIds # {} /\ BulkUpdateStatus(taskIds, newStatus)

\* Properties to check
THEOREM TypeCorrectness == Spec => []TypeInvariant
THEOREM SafetyHolds == Spec => []SafetyInvariant
THEOREM LivenessHolds == Spec => (EventualCompletion /\ FairProgress)
THEOREM NoDeadlock == Spec => []<>Next
THEOREM Termination == Spec => <>(\A taskId \in DOMAIN tasks : 
                                    tasks[taskId].status \in {"completed", "cancelled"})
=============================================================================

Step 4: Model Checking and Trace Generation

Now we can run TLA+ model checking to verify our specification (TaskManagement.cfg):

\* Model configuration for TaskManagementImproved module
SPECIFICATION Spec

\* Constants definition
CONSTANTS
    Users = {alice, bob, charlie}
    MaxTasks = 5
    MaxTime = 20
    Titles = {task1, task2, task3, task4, task5}
    Descriptions = {desc1, desc2, desc3}

\* Model values for special constants
CONSTANT
    NULL = NULL
    EMPTY_STRING = EMPTY_STRING

\* Initial state constraint
CONSTRAINT
    /\ nextTaskId <= MaxTasks + 1
    /\ clock <= MaxTime
    /\ Cardinality(DOMAIN tasks) <= MaxTasks

\* State space reduction (optional, for faster checking)
ACTION_CONSTRAINT
    \* Limit number of active sessions
    /\ Cardinality({u \in Users : sessions[u] = TRUE}) <= 2
    \* Prevent creating too many tasks at once
    /\ nextTaskId <= MaxTasks

\* Invariants to check
INVARIANT TypeInvariant
INVARIANT SafetyInvariant
INVARIANT NoOrphanTasks
INVARIANT TaskOwnership
INVARIANT ValidTaskIds
INVARIANT NoDuplicateTaskIds
INVARIANT ValidStateTransitionsInvariant
INVARIANT ConsistentTimestamps
INVARIANT NoCyclicDependencies
INVARIANT AuthenticationRequired

\* Properties to check
PROPERTY EventualCompletion
PROPERTY FairProgress
PROPERTY EventualUnblocking
PROPERTY EventualAuthentication
PROPERTY NoStarvation

\* Check for deadlocks
CHECK_DEADLOCK TRUE

\* View for debugging (optional)
VIEW <<nextTaskId, Cardinality(DOMAIN tasks), clock>>

\* Alias for better state visualization
ALIAS TaskSummary == [
    totalTasks |-> Cardinality(DOMAIN tasks),
    pendingTasks |-> Cardinality({t \in DOMAIN tasks : tasks[t].status = "pending"}),
    inProgressTasks |-> Cardinality({t \in DOMAIN tasks : tasks[t].status = "in_progress"}),
    completedTasks |-> Cardinality({t \in DOMAIN tasks : tasks[t].status = "completed"}),
    blockedTasks |-> Cardinality({t \in DOMAIN tasks : tasks[t].status = "blocked"}),
    activeSessions |-> {u \in Users : sessions[u] = TRUE},
    currentUser |-> currentUser,
    time |-> clock
]
# Run the model checker
java -jar tla2tools.jar -config TaskManagement.cfg TaskManagement.tla

TLA+ will explore all possible execution paths and either:

  1. Verify that all properties hold
  2. Generate counterexamples showing property violations

Step 5: Claude Implementation with TLA+ Context

Now comes the magic. Instead of vague natural language, we give Claude the precise TLA+ specification:

My prompt to Claude:

Using this TLA+ specification for a task management system, implement a Go REST API 
that satisfies all the specified behaviors and properties.

[Include the full TLA+ specification here]

Requirements:
1. Implement all actions defined in the TLA+ spec
2. Ensure all safety properties hold at runtime
3. Use Go's type system to enforce the data structures
4. Include comprehensive tests that cover the execution traces
5. Add proper error handling for violation of preconditions

Focus on behavioral correctness over performance optimization.

Claude generates something like (task.go):

// Package domain contains the core business entities matching the TLA+ specification
package domain

import (
	"fmt"
	"time"
)

// TaskID represents a unique task identifier (maps to TLA+ task ID)
type TaskID int

// UserID represents a user identifier (maps to TLA+ Users set)
type UserID string

// TaskStatus represents the state of a task (maps to TLA+ TaskStates)
type TaskStatus string

const (
	StatusPending    TaskStatus = "pending"
	StatusInProgress TaskStatus = "in_progress"
	StatusCompleted  TaskStatus = "completed"
	StatusCancelled  TaskStatus = "cancelled"
	StatusBlocked    TaskStatus = "blocked"
)

// Priority represents task priority levels (maps to TLA+ Priorities)
type Priority string

const (
	PriorityLow      Priority = "low"
	PriorityMedium   Priority = "medium"
	PriorityHigh     Priority = "high"
	PriorityCritical Priority = "critical"
)

// Tag represents task categories (maps to TLA+ tags subset)
type Tag string

const (
	TagBug           Tag = "bug"
	TagFeature       Tag = "feature"
	TagEnhancement   Tag = "enhancement"
	TagDocumentation Tag = "documentation"
)

// Task represents a task entity (maps to TLA+ task record)
type Task struct {
	ID           TaskID            `json:"id"`
	Title        string            `json:"title"`
	Description  string            `json:"description"`
	Status       TaskStatus        `json:"status"`
	Priority     Priority          `json:"priority"`
	Assignee     UserID            `json:"assignee"`
	CreatedBy    UserID            `json:"created_by"`
	CreatedAt    time.Time         `json:"created_at"`
	UpdatedAt    time.Time         `json:"updated_at"`
	DueDate      *time.Time        `json:"due_date,omitempty"`
	Tags         []Tag             `json:"tags"`
	Dependencies map[TaskID]bool   `json:"dependencies"`
}

// ValidTransition represents a valid state transition (maps to TLA+ ValidTransitions)
type ValidTransition struct {
	From TaskStatus
	To   TaskStatus
}

// ValidTransitions defines all allowed state transitions
var ValidTransitions = map[ValidTransition]bool{
	{StatusPending, StatusInProgress}:    true,
	{StatusPending, StatusCancelled}:     true,
	{StatusPending, StatusBlocked}:       true,
	{StatusInProgress, StatusCompleted}:  true,
	{StatusInProgress, StatusCancelled}:  true,
	{StatusInProgress, StatusBlocked}:    true,
	{StatusInProgress, StatusPending}:    true, // Allow reverting
	{StatusBlocked, StatusPending}:       true,
	{StatusBlocked, StatusInProgress}:    true,
	{StatusBlocked, StatusCancelled}:     true,
}

// IsValidTransition checks if a state transition is valid (maps to TLA+ IsValidTransition)
func IsValidTransition(from, to TaskStatus) bool {
	return ValidTransitions[ValidTransition{From: from, To: to}]
}

// CanDelete checks if a task can be deleted (only completed or cancelled)
func (t *Task) CanDelete() bool {
	return t.Status == StatusCompleted || t.Status == StatusCancelled
}

// IsBlocked checks if task should be blocked based on dependencies
func (t *Task) IsBlocked(allTasks map[TaskID]*Task) bool {
	if len(t.Dependencies) == 0 {
		return false
	}
	
	for depID := range t.Dependencies {
		if dep, exists := allTasks[depID]; exists {
			if dep.Status != StatusCompleted {
				return true
			}
		}
	}
	return false
}

// ShouldUnblock checks if a blocked task can be unblocked
func (t *Task) ShouldUnblock(allTasks map[TaskID]*Task) bool {
	if t.Status != StatusBlocked {
		return false
	}
	
	for depID := range t.Dependencies {
		if dep, exists := allTasks[depID]; exists {
			if dep.Status != StatusCompleted {
				return false
			}
		}
	}
	return true
}

// Validate performs domain validation on the task
func (t *Task) Validate() error {
	if t.Title == "" {
		return fmt.Errorf("task title cannot be empty")
	}
	if t.Description == "" {
		return fmt.Errorf("task description cannot be empty")
	}
	if !isValidStatus(t.Status) {
		return fmt.Errorf("invalid task status: %s", t.Status)
	}
	if !isValidPriority(t.Priority) {
		return fmt.Errorf("invalid task priority: %s", t.Priority)
	}
	if t.Assignee == "" {
		return fmt.Errorf("task must have an assignee")
	}
	if t.CreatedBy == "" {
		return fmt.Errorf("task must have a creator")
	}
	if t.CreatedAt.After(t.UpdatedAt) {
		return fmt.Errorf("created time cannot be after updated time")
	}
	for _, tag := range t.Tags {
		if !isValidTag(tag) {
			return fmt.Errorf("invalid tag: %s", tag)
		}
	}
	return nil
}

func isValidStatus(status TaskStatus) bool {
	switch status {
	case StatusPending, StatusInProgress, StatusCompleted, StatusCancelled, StatusBlocked:
		return true
	default:
		return false
	}
}

func isValidPriority(priority Priority) bool {
	switch priority {
	case PriorityLow, PriorityMedium, PriorityHigh, PriorityCritical:
		return true
	default:
		return false
	}
}

func isValidTag(tag Tag) bool {
	switch tag {
	case TagBug, TagFeature, TagEnhancement, TagDocumentation:
		return true
	default:
		return false
	}
}
// Package usecase implements the TLA+ actions as use cases
package usecase

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
	
	"github.com/bhatti/sample-task-management/internal/domain"
	"github.com/bhatti/sample-task-management/internal/repository"
)

// TaskUseCase implements task-related TLA+ actions
type TaskUseCase struct {
	uow              repository.UnitOfWork
	invariantChecker InvariantChecker
}

// InvariantChecker interface for runtime invariant validation
type InvariantChecker interface {
	CheckAllInvariants(state *domain.SystemState) error
	CheckTaskInvariants(task *domain.Task, state *domain.SystemState) error
	CheckTransitionInvariant(from, to domain.TaskStatus) error
}

// NewTaskUseCase creates a new task use case
func NewTaskUseCase(uow repository.UnitOfWork, checker InvariantChecker) *TaskUseCase {
	return &TaskUseCase{
		uow:              uow,
		invariantChecker: checker,
	}
}

// Authenticate implements TLA+ Authenticate action
func (uc *TaskUseCase) Authenticate(userID domain.UserID) (*domain.Session, error) {
	// Preconditions from TLA+:
	// - user \in Users
	// - ~sessions[user]
	
	user, err := uc.uow.Users().GetUser(userID)
	if err != nil {
		return nil, fmt.Errorf("user not found: %w", err)
	}
	
	// Check if user already has an active session
	existingSession, _ := uc.uow.Sessions().GetSessionByUser(userID)
	if existingSession != nil && existingSession.IsValid() {
		return nil, fmt.Errorf("user %s already has an active session", userID)
	}
	
	// Create new session
	token := generateToken()
	session := &domain.Session{
		UserID:    user.ID,
		Token:     token,
		Active:    true,
		CreatedAt: time.Now(),
		ExpiresAt: time.Now().Add(24 * time.Hour),
	}
	
	// Update state
	if err := uc.uow.Sessions().CreateSession(session); err != nil {
		return nil, fmt.Errorf("failed to create session: %w", err)
	}
	
	if err := uc.uow.SystemState().SetCurrentUser(&userID); err != nil {
		return nil, fmt.Errorf("failed to set current user: %w", err)
	}
	
	// Check invariants
	state, _ := uc.uow.SystemState().GetSystemState()
	if err := uc.invariantChecker.CheckAllInvariants(state); err != nil {
		uc.uow.Rollback()
		return nil, fmt.Errorf("invariant violation: %w", err)
	}
	
	return session, nil
}


// CreateTask implements TLA+ CreateTask action
func (uc *TaskUseCase) CreateTask(
	title, description string,
	priority domain.Priority,
	assignee domain.UserID,
	dueDate *time.Time,
	tags []domain.Tag,
	dependencies []domain.TaskID,
) (*domain.Task, error) {
	// Preconditions from TLA+:
	// - currentUser # NULL
	// - currentUser \in Users
	// - nextTaskId <= MaxTasks
	// - deps \subseteq DOMAIN tasks
	// - \A dep \in deps : tasks[dep].status # "cancelled"
	
	currentUser, err := uc.uow.SystemState().GetCurrentUser()
	if err != nil || currentUser == nil {
		return nil, fmt.Errorf("authentication required")
	}
	
	// Check max tasks limit
	nextID, err := uc.uow.SystemState().GetNextTaskID()
	if err != nil {
		return nil, fmt.Errorf("failed to get next task ID: %w", err)
	}
	
	if nextID > domain.MaxTasks {
		return nil, fmt.Errorf("maximum number of tasks (%d) reached", domain.MaxTasks)
	}
	
	// Validate dependencies
	allTasks, err := uc.uow.Tasks().GetAllTasks()
	if err != nil {
		return nil, fmt.Errorf("failed to get tasks: %w", err)
	}
	
	depMap := make(map[domain.TaskID]bool)
	for _, depID := range dependencies {
		depTask, exists := allTasks[depID]
		if !exists {
			return nil, fmt.Errorf("dependency task %d does not exist", depID)
		}
		if depTask.Status == domain.StatusCancelled {
			return nil, fmt.Errorf("cannot depend on cancelled task %d", depID)
		}
		depMap[depID] = true
	}
	
	// Check for cyclic dependencies
	if err := uc.checkCyclicDependencies(nextID, depMap, allTasks); err != nil {
		return nil, err
	}
	
	// Determine initial status based on dependencies
	status := domain.StatusPending
	if len(dependencies) > 0 {
		// Check if all dependencies are completed
		allCompleted := true
		for depID := range depMap {
			if allTasks[depID].Status != domain.StatusCompleted {
				allCompleted = false
				break
			}
		}
		if !allCompleted {
			status = domain.StatusBlocked
		}
	}
	
	// Create task
	task := &domain.Task{
		ID:           nextID,
		Title:        title,
		Description:  description,
		Status:       status,
		Priority:     priority,
		Assignee:     assignee,
		CreatedBy:    *currentUser,
		CreatedAt:    time.Now(),
		UpdatedAt:    time.Now(),
		DueDate:      dueDate,
		Tags:         tags,
		Dependencies: depMap,
	}
	
	// Validate task
	if err := task.Validate(); err != nil {
		return nil, fmt.Errorf("task validation failed: %w", err)
	}
	
	// Save task
	if err := uc.uow.Tasks().CreateTask(task); err != nil {
		return nil, fmt.Errorf("failed to create task: %w", err)
	}
	
	// Increment next task ID
	if _, err := uc.uow.SystemState().IncrementNextTaskID(); err != nil {
		return nil, fmt.Errorf("failed to increment task ID: %w", err)
	}
	
	// Check invariants
	state, _ := uc.uow.SystemState().GetSystemState()
	if err := uc.invariantChecker.CheckAllInvariants(state); err != nil {
		uc.uow.Rollback()
		return nil, fmt.Errorf("invariant violation after task creation: %w", err)
	}
	
	return task, nil
}

// UpdateTaskStatus implements TLA+ UpdateTaskStatus action
func (uc *TaskUseCase) UpdateTaskStatus(taskID domain.TaskID, newStatus domain.TaskStatus) error {
	// Preconditions from TLA+:
	// - currentUser # NULL
	// - TaskExists(taskId)
	// - taskId \in GetUserTasks(currentUser)
	// - IsValidTransition(tasks[taskId].status, newStatus)
	// - newStatus = "in_progress" => all dependencies completed
	
	currentUser, err := uc.uow.SystemState().GetCurrentUser()
	if err != nil || currentUser == nil {
		return fmt.Errorf("authentication required")
	}
	
	task, err := uc.uow.Tasks().GetTask(taskID)
	if err != nil {
		return fmt.Errorf("task not found: %w", err)
	}
	
	// Check user owns the task
	userTasks, err := uc.uow.SystemState().GetUserTasks(*currentUser)
	if err != nil {
		return fmt.Errorf("failed to get user tasks: %w", err)
	}
	
	hasTask := false
	for _, id := range userTasks {
		if id == taskID {
			hasTask = true
			break
		}
	}
	
	if !hasTask {
		return fmt.Errorf("user does not have access to task %d", taskID)
	}
	
	// Check valid transition
	if !domain.IsValidTransition(task.Status, newStatus) {
		return fmt.Errorf("invalid transition from %s to %s", task.Status, newStatus)
	}
	
	// Check dependencies if moving to in_progress
	if newStatus == domain.StatusInProgress {
		allTasks, _ := uc.uow.Tasks().GetAllTasks()
		for depID := range task.Dependencies {
			if depTask, exists := allTasks[depID]; exists {
				if depTask.Status != domain.StatusCompleted {
					return fmt.Errorf("cannot start task: dependency %d is not completed", depID)
				}
			}
		}
	}
	
	// Update status
	task.Status = newStatus
	task.UpdatedAt = time.Now()
	
	if err := uc.uow.Tasks().UpdateTask(task); err != nil {
		return fmt.Errorf("failed to update task: %w", err)
	}
	
	// Check invariants
	state, _ := uc.uow.SystemState().GetSystemState()
	if err := uc.invariantChecker.CheckAllInvariants(state); err != nil {
		uc.uow.Rollback()
		return fmt.Errorf("invariant violation: %w", err)
	}
	
	return nil
}
...
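
The use cases above lean on the InvariantChecker interface. The repository ships its own implementation; as a rough sketch of the idea, each TLA+ safety property becomes a plain function over the system state. The SystemState fields below are assumptions made for illustration, not the repository's exact layout.

// Package invariants sketches runtime checks for the TLA+ safety properties.
package invariants

import (
	"fmt"

	"github.com/bhatti/sample-task-management/internal/domain"
)

// SystemState is the runtime analogue of the TLA+ variables tasks and
// userTasks. Its shape here is assumed for illustration only.
type SystemState struct {
	Tasks      map[domain.TaskID]*domain.Task
	UserTasks  map[domain.UserID]map[domain.TaskID]bool
	NextTaskID domain.TaskID
}

// CheckTaskOwnership mirrors TaskOwnership: every task's assignee is a known
// user and the task appears in that user's task set.
func CheckTaskOwnership(s *SystemState) error {
	for id, task := range s.Tasks {
		owned, ok := s.UserTasks[task.Assignee]
		if !ok || !owned[id] {
			return fmt.Errorf("task %d not registered under assignee %s", id, task.Assignee)
		}
	}
	return nil
}

// CheckValidTaskIds mirrors ValidTaskIds: 1 <= taskId < nextTaskId.
func CheckValidTaskIds(s *SystemState) error {
	for id := range s.Tasks {
		if id < 1 || id >= s.NextTaskID {
			return fmt.Errorf("task id %d outside [1, %d)", id, s.NextTaskID)
		}
	}
	return nil
}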

Step 6: TLA+ Generated Tests

The real power comes when we use TLA+ execution traces to generate comprehensive tests:

My prompt to Claude:

Generate Go tests that verify the implementation satisfies the TLA+ specification.
Create test cases that:

1. Test all TLA+ actions with valid preconditions
2. Test safety property violations
3. Test edge cases from the TLA+ model boundary conditions
4. Use property-based testing where appropriate

Include tests that would catch the execution traces TLA+ model checker explores.

Claude generates (invariant_test.go):

// Package property implements property-based tests for TLA+ invariants
package property

import (
	"math/rand"
	"testing"
	"time"

	"github.com/bhatti/sample-task-management/internal/domain"
	"github.com/bhatti/sample-task-management/internal/infrastructure/memory"
	"github.com/bhatti/sample-task-management/internal/usecase"
	"github.com/bhatti/sample-task-management/pkg/invariants"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// TestInvariantsHoldAfterOperations verifies invariants hold after each operation
func TestInvariantsHoldAfterOperations(t *testing.T) {
	repo := memory.NewMemoryRepository()
	uow := memory.NewMemoryUnitOfWork(repo)
	checker := invariants.NewInvariantChecker()
	uc := usecase.NewTaskUseCase(uow, checker)

	// Setup initial users
	users := []domain.UserID{"alice", "bob", "charlie"}
	for _, userID := range users {
		user := &domain.User{
			ID:       userID,
			Name:     string(userID),
			Email:    string(userID) + "@example.com",
			JoinedAt: time.Now(),
		}
		require.NoError(t, repo.CreateUser(user))
	}

	// Property: Invariants hold after authentication
	t.Run("InvariantsAfterAuthentication", func(t *testing.T) {
		for _, userID := range users {
			session, err := uc.Authenticate(userID)
			assert.NoError(t, err)
			assert.NotNil(t, session)

			state, _ := repo.GetSystemState()
			assert.NoError(t, checker.CheckAllInvariants(state))

			// Cleanup
			_ = uc.Logout(userID)
		}
	})

	// Property: Invariants hold after task creation
	t.Run("InvariantsAfterTaskCreation", func(t *testing.T) {
		uc.Authenticate("alice")

		for i := 0; i < 10; i++ {
			task, err := uc.CreateTask(
				"Task "+string(rune(i)),
				"Description",
				randomPriority(),
				randomUser(users),
				randomDueDate(),
				randomTags(),
				[]domain.TaskID{}, // No dependencies initially
			)

			assert.NoError(t, err)
			assert.NotNil(t, task)

			state, _ := repo.GetSystemState()
			assert.NoError(t, checker.CheckAllInvariants(state))
		}
	})

	// Property: Invariants hold after status transitions
	t.Run("InvariantsAfterStatusTransitions", func(t *testing.T) {
		uc.Authenticate("alice")

		// Create a task
		task, _ := uc.CreateTask(
			"Test Task",
			"Description",
			domain.PriorityMedium,
			"alice",
			nil,
			[]domain.Tag{domain.TagFeature},
			[]domain.TaskID{},
		)

		// Valid transitions
		validTransitions := []domain.TaskStatus{
			domain.StatusInProgress,
			domain.StatusCompleted,
		}

		for _, status := range validTransitions {
			err := uc.UpdateTaskStatus(task.ID, status)
			if err == nil {
				state, _ := repo.GetSystemState()
				assert.NoError(t, checker.CheckAllInvariants(state))
			}
		}
	})

	// Property: No cyclic dependencies can be created
	t.Run("NoCyclicDependencies", func(t *testing.T) {
		uc.Authenticate("alice")

		// Create tasks with potential cycles
		task1, _ := uc.CreateTask("Task1", "Desc", domain.PriorityLow, "alice", nil, nil, []domain.TaskID{})
		task2, _ := uc.CreateTask("Task2", "Desc", domain.PriorityLow, "alice", nil, nil, []domain.TaskID{task1.ID})
		task3, _ := uc.CreateTask("Task3", "Desc", domain.PriorityLow, "alice", nil, nil, []domain.TaskID{task2.ID})

		// Task4 depends on task3 and task1: a diamond, not a cycle, so creation succeeds
		_, err := uc.CreateTask("Task4", "Desc", domain.PriorityLow, "alice", nil, nil,
			[]domain.TaskID{task3.ID, task1.ID})
		assert.NoError(t, err)

		// The NoCyclicDependencies invariant must still hold across the task graph
		state, _ := repo.GetSystemState()
		assert.NoError(t, checker.CheckAllInvariants(state))
	})
}

// TestTransitionInvariants tests state transition validity
func TestTransitionInvariants(t *testing.T) {
	checker := invariants.NewInvariantChecker()

	// Test all valid transitions
	validTransitions := []struct {
		from domain.TaskStatus
		to   domain.TaskStatus
	}{
		{domain.StatusPending, domain.StatusInProgress},
		{domain.StatusPending, domain.StatusCancelled},
		{domain.StatusInProgress, domain.StatusCompleted},
		{domain.StatusInProgress, domain.StatusCancelled},
		{domain.StatusBlocked, domain.StatusPending},
		{domain.StatusBlocked, domain.StatusCancelled},
	}

	for _, trans := range validTransitions {
		t.Run(string(trans.from)+"_to_"+string(trans.to), func(t *testing.T) {
			err := checker.CheckTransitionInvariant(trans.from, trans.to)
			assert.NoError(t, err)
		})
	}

	// Test invalid transitions
	invalidTransitions := []struct {
		from domain.TaskStatus
		to   domain.TaskStatus
	}{
		{domain.StatusCompleted, domain.StatusPending},
		{domain.StatusCompleted, domain.StatusInProgress},
		{domain.StatusCancelled, domain.StatusInProgress},
		{domain.StatusPending, domain.StatusCompleted}, // Must go through in_progress
	}

	for _, trans := range invalidTransitions {
		t.Run("Invalid_"+string(trans.from)+"_to_"+string(trans.to), func(t *testing.T) {
			err := checker.CheckTransitionInvariant(trans.from, trans.to)
			assert.Error(t, err)
		})
	}
}

// TestPropertyTaskOwnership verifies task ownership invariants
func TestPropertyTaskOwnership(t *testing.T) {
	repo := memory.NewMemoryRepository()
	uow := memory.NewMemoryUnitOfWork(repo)
	checker := invariants.NewInvariantChecker()
	uc := usecase.NewTaskUseCase(uow, checker)

	// Setup users
	users := []domain.UserID{"alice", "bob"}
	for _, userID := range users {
		user := &domain.User{
			ID:       userID,
			Name:     string(userID),
			Email:    string(userID) + "@example.com",
			JoinedAt: time.Now(),
		}
		repo.CreateUser(user)
	}

	// Property: Task reassignment maintains ownership invariants
	t.Run("ReassignmentMaintainsOwnership", func(t *testing.T) {
		uc.Authenticate("alice")

		// Create task assigned to Alice
		task, err := uc.CreateTask(
			"Test Task",
			"Description",
			domain.PriorityHigh,
			"alice",
			nil,
			[]domain.Tag{domain.TagBug},
			[]domain.TaskID{},
		)
		require.NoError(t, err)

		// Check initial ownership
		state, _ := repo.GetSystemState()
		assert.NoError(t, checker.CheckAllInvariants(state))

		aliceTasks := state.GetUserTasks("alice")
		assert.Contains(t, aliceTasks, task.ID)

		// Reassign to Bob
		err = uc.ReassignTask(task.ID, "bob")
		require.NoError(t, err)

		// Check ownership after reassignment
		state, _ = repo.GetSystemState()
		assert.NoError(t, checker.CheckAllInvariants(state))

		aliceTasks = state.GetUserTasks("alice")
		bobTasks := state.GetUserTasks("bob")
		assert.NotContains(t, aliceTasks, task.ID)
		assert.Contains(t, bobTasks, task.ID)
	})
}

// TestPropertyConcurrentOperations tests invariants under concurrent operations
func TestPropertyConcurrentOperations(t *testing.T) {
	repo := memory.NewMemoryRepository()
	uow := memory.NewMemoryUnitOfWork(repo)
	checker := invariants.NewInvariantChecker()

	// Setup users
	users := []domain.UserID{"user1", "user2", "user3"}
	for _, userID := range users {
		user := &domain.User{
			ID:       userID,
			Name:     string(userID),
			Email:    string(userID) + "@example.com",
			JoinedAt: time.Now(),
		}
		repo.CreateUser(user)
	}

	// Run concurrent operations
	done := make(chan bool, len(users))

	for _, userID := range users {
		go func(uid domain.UserID) {
			uc := usecase.NewTaskUseCase(uow, checker)

			// Authenticate
			uc.Authenticate(uid)

			// Create multiple tasks
			for i := 0; i < 5; i++ {
				uc.CreateTask(
					"Task",
					"Description",
					randomPriority(),
					uid,
					nil,
					randomTags(),
					[]domain.TaskID{},
				)

				// Random delay
				time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond)
			}

			done <- true
		}(userID)
	}

	// Wait for all goroutines
	for i := 0; i < len(users); i++ {
		<-done
	}

	// Check invariants after concurrent operations
	state, _ := repo.GetSystemState()
	assert.NoError(t, checker.CheckAllInvariants(state))
}

// Helper functions

func randomPriority() domain.Priority {
	priorities := []domain.Priority{
		domain.PriorityLow,
		domain.PriorityMedium,
		domain.PriorityHigh,
		domain.PriorityCritical,
	}
	return priorities[rand.Intn(len(priorities))]
}

func randomUser(users []domain.UserID) domain.UserID {
	return users[rand.Intn(len(users))]
}

func randomDueDate() *time.Time {
	if rand.Float32() < 0.5 {
		return nil
	}
	due := time.Now().Add(time.Duration(rand.Intn(30)) * 24 * time.Hour)
	return &due
}

func randomTags() []domain.Tag {
	allTags := []domain.Tag{
		domain.TagBug,
		domain.TagFeature,
		domain.TagEnhancement,
		domain.TagDocumentation,
	}

	numTags := rand.Intn(len(allTags) + 1)
	if numTags == 0 {
		return nil
	}

	tags := make([]domain.Tag, 0, numTags)
	used := make(map[domain.Tag]bool)

	for len(tags) < numTags {
		tag := allTags[rand.Intn(len(allTags))]
		if !used[tag] {
			tags = append(tags, tag)
			used[tag] = true
		}
	}

	return tags
}

Advanced TLA+ Patterns with Claude

Modeling Concurrent Operations

One of TLA+’s strengths is modeling concurrent systems. Let’s extend our specification to handle concurrent task updates:

\* Concurrent task updates with conflict resolution
ConcurrentUpdateTask(taskId, newStatus, version) ==
    /\ currentUser # NULL
    /\ taskId \in DOMAIN tasks
    /\ taskId \in userTasks[currentUser]
    /\ tasks[taskId].version = version  \* Optimistic concurrency control
    /\ tasks' = [tasks EXCEPT ![taskId] = [
                     @ EXCEPT 
                     !.status = newStatus,
                     !.version = @ + 1,
                     !.lastModified = currentUser
                 ]]
    /\ UNCHANGED <<userTasks, nextTaskId, currentUser>>

Prompt to Claude:

Implement optimistic concurrency control for the task updates based on this 
TLA+ specification. Include version tracking and conflict detection.
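
Here is a hedged sketch of what such an implementation might look like. The Version field, TaskStore, and ErrVersionConflict names are illustrative assumptions, not taken from the sample repository.

// Sketch of optimistic concurrency control mirroring ConcurrentUpdateTask.
package usecase

import (
	"errors"
	"fmt"
	"sync"
)

var ErrVersionConflict = errors.New("task was modified by another request")

type VersionedTask struct {
	ID      int
	Status  string
	Version int
}

type TaskStore struct {
	mu    sync.Mutex
	tasks map[int]*VersionedTask
}

func NewTaskStore() *TaskStore {
	return &TaskStore{tasks: make(map[int]*VersionedTask)}
}

// UpdateStatus applies the update only if the caller's expectedVersion matches,
// mirroring the precondition tasks[taskId].version = version in the spec.
func (s *TaskStore) UpdateStatus(id int, newStatus string, expectedVersion int) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	task, ok := s.tasks[id]
	if !ok {
		return fmt.Errorf("task %d not found", id)
	}
	if task.Version != expectedVersion {
		return ErrVersionConflict // caller should re-read the task and retry
	}
	task.Status = newStatus
	task.Version++ // !.version = @ + 1
	return nil
}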

Modeling Complex Business Rules

TLA+ excels at capturing complex business logic:

\* Business rule: High priority tasks cannot be cancelled directly
ValidStatusTransition(currentStatus, newStatus, priority) ==
    \/ newStatus = currentStatus
    \/ /\ currentStatus = "pending" 
       /\ newStatus \in {"in_progress", "cancelled"}
    \/ /\ currentStatus = "in_progress"
       /\ newStatus \in {"completed", "pending"}
    \/ /\ currentStatus = "in_progress"
       /\ newStatus = "cancelled"
       /\ priority # "high"  \* High priority tasks cannot be cancelled
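
Translated into code, this rule is a small extension of the IsValidTransition helper shown earlier. A sketch, reusing the domain types from the Go example above (the function name itself is my illustration, not the repository's exact API):

package domain

// CanTransition sketches the ValidStatusTransition business rule: the usual
// state machine, plus "high priority tasks cannot be cancelled while in progress".
func CanTransition(current, next TaskStatus, priority Priority) bool {
	if next == current {
		return true
	}
	switch current {
	case StatusPending:
		return next == StatusInProgress || next == StatusCancelled
	case StatusInProgress:
		switch next {
		case StatusCompleted, StatusPending:
			return true
		case StatusCancelled:
			return priority != PriorityHigh // the business rule from the spec
		}
	}
	return false
}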

Lessons Learned

After applying this TLA+ approach to several experimental projects, here are the key insights:

1. Start Small

Begin with core actions and properties. TLA+ specifications can grow complex quickly, so start with the essential behaviors:

\* Start with basic CRUD
Init, CreateTask, UpdateTask, DeleteTask

\* Add complexity incrementally  
Authentication, Authorization, Concurrency, Business Rules

Avoid Initially: Complex distributed systems, performance-critical algorithms

Graduate To: Multi-service interactions, complex business logic

2. Properties Drive Design

Writing TLA+ properties often reveals design flaws before implementation:

\* This property might fail, revealing a design issue
ConsistencyProperty == 
    \A user \in Users:
        \A taskId \in userTasks[user]:
            /\ taskId \in DOMAIN tasks
            /\ tasks[taskId].assignee = user
            /\ tasks[taskId].status # "deleted"  \* Soft delete consideration

3. Model Checking Finds Edge Cases

TLA+ model checking explores execution paths you’d never think to test:

# TLA+ finds this counterexample:
# Step 1: User1 creates Task1
# Step 2: User1 deletes Task1  
# Step 3: User2 creates Task2 (gets same ID due to reuse)
# Step 4: User1 tries to update Task1 -> Security violation!

This led to using UUIDs instead of incrementing integers for task IDs.
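
A sketch of that fix, assuming the github.com/google/uuid package (the repository may use a different generator):

package domain

import "github.com/google/uuid"

// TaskID becomes an opaque unique string instead of an incrementing int,
// so a deleted task's ID can never be reused by a later task.
type TaskID string

// NewTaskID returns a fresh, globally unique task identifier.
func NewTaskID() TaskID {
	return TaskID(uuid.NewString())
}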

4. Generated Tests Are Comprehensive

TLA+ execution traces become your regression test suite. When Claude implements based on TLA+ specs, you get:

  • Complete coverage – All specification paths tested
  • Edge case detection – Boundary conditions from model checking
  • Behavioral contracts – Tests verify actual system properties
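
As a sketch of what that looks like in practice, a TLC execution trace can be transcribed step by step into a regression test against the use cases shown earlier. The particular steps below are illustrative, not an actual counterexample from the model run.

package property

import (
	"testing"

	"github.com/bhatti/sample-task-management/internal/domain"
	"github.com/bhatti/sample-task-management/internal/infrastructure/memory"
	"github.com/bhatti/sample-task-management/internal/usecase"
	"github.com/bhatti/sample-task-management/pkg/invariants"
	"github.com/stretchr/testify/require"
)

// TestReplayTrace performs each action of one trace in order and asserts the
// safety invariants after every step, exactly as the model checker does.
func TestReplayTrace(t *testing.T) {
	repo := memory.NewMemoryRepository()
	uow := memory.NewMemoryUnitOfWork(repo)
	checker := invariants.NewInvariantChecker()
	uc := usecase.NewTaskUseCase(uow, checker)

	repo.CreateUser(&domain.User{ID: "alice", Name: "alice", Email: "alice@example.com"})

	assertInvariants := func() {
		state, err := repo.GetSystemState()
		require.NoError(t, err)
		require.NoError(t, checker.CheckAllInvariants(state))
	}

	// Step 1: Authenticate(alice)
	_, err := uc.Authenticate("alice")
	require.NoError(t, err)
	assertInvariants()

	// Step 2: CreateTask(...)
	task, err := uc.CreateTask("Trace task", "From step 2", domain.PriorityLow,
		"alice", nil, nil, nil)
	require.NoError(t, err)
	assertInvariants()

	// Step 3: UpdateTaskStatus(taskId, "in_progress")
	require.NoError(t, uc.UpdateTaskStatus(task.ID, domain.StatusInProgress))
	assertInvariants()

	// Step 4: UpdateTaskStatus(taskId, "completed")
	require.NoError(t, uc.UpdateTaskStatus(task.ID, domain.StatusCompleted))
	assertInvariants()
}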

Documentation Generation

Prompt to Claude:

Generate API documentation from this TLA+ specification that includes:
1. Endpoint descriptions derived from TLA+ actions
2. Request/response schemas from TLA+ data structures  
3. Error conditions from TLA+ preconditions
4. Behavioral guarantees from TLA+ properties

Code Review Guidelines

With TLA+ specifications, code reviews become more focused:

  1. Does implementation satisfy the TLA+ spec?
  2. Are all preconditions checked?
  3. Do safety properties hold?
  4. Are error conditions handled as specified?

Comparing Specification Approaches

| Approach | Precision | AI Effectiveness | Maintenance | Learning Curve | Tool Complexity | Code Generation |
|---|---|---|---|---|---|---|
| Vibe Coding | Low | Inconsistent | High | Low | Low | N/A |
| UML/MDD | Medium | Poor | Very High | High | Very High | Brittle |
| BDD/Gherkin | Medium | Better | Medium | Medium | Low | Limited |
| TLA+ Specs | High | Excellent | Low | High | Low | Reliable |

Tools and Resources

Essential TLA+ Resources

  • Learn TLA+: https://learntla.com – Interactive tutorial
  • TLA+ Video Course: Leslie Lamport’s official course
  • Practical TLA+: Hillel Wayne’s book – focus on software systems
  • TLA+ Examples: https://github.com/tlaplus/Examples

Common Mistakes

1. Avoid These Mistakes

Wrong: writing TLA+ like code

\* Wrong - this looks like pseudocode
CreateTask == 
    if currentUser != null then
        task = new Task()

Right: writing TLA+ as mathematical relations

\* Right - mathematical specification  
CreateTask == 
    /\ currentUser # NULL
    /\ tasks' = tasks @@ (nextTaskId :> newTask)

Wrong: asking Claude to "fix the TLA+ to match the code"

The spec is the truth – fix the code to match the spec

Right: asking Claude to "implement this TLA+ specification correctly"

Wrong: specification scope creep, i.e. starting with the entire system architecture
Right: an incremental approach, beginning with one core workflow and expanding gradually

2. Claude Integration Pitfalls

? “Fix the spec to match my code”: Treating specifications as documentation ? “Fix the code to match the spec”: Specifications are the source of truth

3. The Context Overload Trap

Problem: Dumping too much information at once
Solution: Break complex features into smaller, focused requests

4. The “Fix My Test” Antipattern

Problem: When tests fail, asking Claude to modify the test instead of the code
Solution: Always fix the implementation, not the test (unless the test is genuinely wrong)

5. The Blind Trust Mistake

Problem: Accepting generated code without understanding it
Solution: Always review and understand the code before committing

Proven Patterns

1. Save effective prompts:

# ~/.claude/tla-prompts/implementation.md
Implement [language] code that satisfies this TLA+ specification:

[SPEC]

Requirements:
- All TLA+ actions become functions/methods
- All preconditions become runtime checks  
- All data structures match TLA+ types
- Include comprehensive tests covering specification traces

Create specification templates:

--------------------------- MODULE [ModuleName] ---------------------------
EXTENDS Integers, Sequences, FiniteSets

CONSTANTS [Constants]

VARIABLES [StateVariables]

[TypeDefinitions]

Init == [InitialConditions]

[Actions]

Next == [ActionDisjunction]

Spec == Init /\ [][Next]_[StateVariables]

[SafetyProperties]

[LivenessProperties]

=============================================================================

2. The “Explain First” Pattern

Before asking Claude to implement something complex, I ask for an explanation:

Explain how you would implement real-time task updates using WebSockets. 
What are the trade-offs between Socket.io and native WebSockets?
What state management challenges should I consider?

3. The “Progressive Enhancement” Pattern

Start simple, then add complexity:

1. First: "Create a basic task model with CRUD operations"
2. Then: "Add validation and error handling"
3. Then: "Add authentication and authorization"
4. Finally: "Add real-time updates and notifications"

4. The “Code Review” Pattern

After implementation, I ask Claude to review its own code:

Review the task API implementation for:
- Security vulnerabilities
- Performance issues
- Code style consistency
- Missing error cases

Be critical and suggest improvements.

What’s Next

As I’ve developed this TLA+/Claude workflow, I’ve realized we’re approaching something profound: specifications as the primary artifact. Instead of writing code and hoping it’s correct, we’re defining correct behavior formally and letting AI generate the implementation. This inverts the traditional relationship between specification and code.

Implications for Software Engineering

  1. Design-first development becomes natural
  2. Bug prevention replaces bug fixing
  3. Refactoring becomes re-implementation from stable specs
  4. Documentation is always up-to-date (it’s the spec)

I’m currently experimenting with:

  • TLA+ to test case generation – Automated comprehensive testing
  • Multi-language implementations – Same spec, different languages
  • Specification composition – Building larger systems from verified components
  • Quint specifications – A modern executable specification language with simpler syntax than TLA+

Conclusion: The End of Vibe Coding

After using TLA+ with Claude, I can’t go back to vibe coding. The precision, reliability, and confidence that comes from executable specifications has transformed how I build software. The complete working example—TLA+ specs, Go implementation, comprehensive tests, and CI/CD pipeline—is available at github.com/bhatti/sample-task-management.

Yes, there’s a learning curve. Yes, writing TLA+ specifications takes time upfront. But the payoff—in terms of correctness, maintainability, and development speed—is extraordinary. Claude becomes not just a code generator, but a reliable engineering partner that can reason about complex systems precisely because we’ve given it precise specifications to work from. We’re moving from “code and hope” to “specify and know”—and that changes everything.


August 16, 2025

The Complete Guide to gRPC Load Balancing in Kubernetes and Istio

Filed under: Computing, Web Services — admin @ 12:05 pm

TL;DR – The Test Results Matrix

| Configuration | Load Balancing | Why |
|---|---|---|
| Local gRPC | None | Single server instance |
| Kubernetes + gRPC | None | Connection-level LB only |
| Kubernetes + Istio | Perfect | L7 proxy with request-level LB |
| Client-side LB | Limited | Requires multiple endpoints |
| kubectl port-forward + Istio | None | Bypasses service mesh |

Complete test suite: github.com/bhatti/grpc-lb-test


Introduction: The gRPC Load Balancing Problem

When you deploy a gRPC service in Kubernetes with multiple replicas, you expect load balancing. You won’t get it. This guide tests every possible configuration to prove why, and shows exactly how to fix it. According to the official gRPC documentation:

“gRPC uses HTTP/2, which multiplexes multiple calls on a single TCP connection. This means that once the connection is established, all gRPC calls will go to the same backend.”


Complete Test Matrix

We’ll test 6 different configurations:

  1. Baseline: Local Testing (Single server)
  2. Kubernetes without Istio (Standard deployment)
  3. Kubernetes with Istio (Service mesh solution)
  4. Client-side Load Balancing (gRPC built-in)
  5. Advanced Connection Testing (Multiple connections)
  6. Real-time Monitoring (Live traffic analysis)

Prerequisites

git clone https://github.com/bhatti/grpc-lb-test
cd grpc-lb-test

# Build all components
make build

Test 1: Baseline – Local Testing

Purpose: Establish baseline behavior with a single server.

# Terminal 1: Start local server
./bin/server

# Terminal 2: Test with basic client
./bin/client -target localhost:50051 -requests 50

Expected Result:

Load Distribution Results:
Server: unknown-1755316152
Pod: unknown (IP: unknown)
Requests: 50 (100.0%)
████████████████████
Total servers hit: 1
WARNING: All requests went to a single server!
This indicates NO load balancing is happening.
Analysis: This confirms our client implementation works correctly and establishes the baseline.


Test 2: Kubernetes Without Istio

Purpose: Prove that standard Kubernetes doesn’t provide gRPC request-level load balancing.

Deploy the Service

# Deploy 5 replicas without Istio
./scripts/test-without-istio.sh

The k8s/without-istio/deployment.yaml creates:

  • 5 gRPC server replicas
  • Standard Kubernetes Service
  • No Istio annotations

Test Results

Load Distribution Results:
================================
Server: grpc-echo-server-5b657689db-gh5z5-1755316388
  Pod: grpc-echo-server-5b657689db-gh5z5 (IP: 10.1.4.148)
  Requests: 30 (100.0%)
  ████████████████████

Total servers hit: 1
WARNING: All requests went to a single server!
   This indicates NO load balancing is happening.

Connection Analysis:
Without Istio, gRPC maintains a single TCP connection to the Kubernetes Service IP.
The kube-proxy performs L4 load balancing, but gRPC reuses the same connection.

Cleaning up...
deployment.apps "grpc-echo-server" deleted
service "grpc-echo-service" deleted
./scripts/test-without-istio.sh: line 57: 17836 Terminated: 15
kubectl port-forward service/grpc-echo-service 50051:50051 > /dev/null 2>&1

RESULT: No load balancing observed - all requests went to single pod!

Why This Happens

The Kubernetes Service documentation explains:

“For each Service, kube-proxy installs iptables rules which capture traffic to the Service’s clusterIP and port, and redirect that traffic to one of the Service’s backend endpoints.”

Kubernetes Services perform L4 (connection-level) load balancing, but gRPC maintains persistent connections.

Connection Analysis

Run the analysis tool to see connection behavior:

./bin/analysis -target localhost:50051 -requests 100 -test-scenarios true

Result:

NO LOAD BALANCING: All requests to single server

Connection Reuse Analysis:
  Average requests per connection: 1.00
  Low connection reuse (many short connections)

Connection analysis complete!

Test 3: Kubernetes With Istio

Purpose: Demonstrate how Istio’s L7 proxy solves the load balancing problem.

Install Istio

./scripts/install-istio.sh

This follows Istio’s official installation guide:

istioctl install --set profile=demo -y
kubectl label namespace default istio-injection=enabled

Deploy With Istio

./scripts/test-with-istio.sh

The k8s/with-istio/deployment.yaml includes:

annotations:
  sidecar.istio.io/inject: "true"
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-echo-service
spec:
  host: grpc-echo-service
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 100
        maxRequestsPerConnection: 10
    loadBalancer:
      simple: ROUND_ROBIN

Critical Testing Gotcha

Wrong way (what most people do):

kubectl port-forward service/grpc-echo-service 50051:50051
./bin/client -target localhost:50051 -requests 50
# Result: Still no load balancing!

According to Istio’s architecture docs, kubectl port-forward bypasses the Envoy sidecar proxy.

Correct Testing Method

Test from inside the service mesh:

./scripts/test-with-istio.sh

Test Results With Istio

Load Distribution Results:
================================

Server: grpc-echo-server-579dfbc76b-m2v7x-1755357769
  Pod: grpc-echo-server-579dfbc76b-m2v7x (IP: 10.1.4.237)
  Requests: 10 (20.0%)
  ████████

Server: grpc-echo-server-579dfbc76b-fkgkk-1755357769
  Pod: grpc-echo-server-579dfbc76b-fkgkk (IP: 10.1.4.240)
  Requests: 10 (20.0%)
  ████████

Server: grpc-echo-server-579dfbc76b-bsjdv-1755357769
  Pod: grpc-echo-server-579dfbc76b-bsjdv (IP: 10.1.4.241)
  Requests: 10 (20.0%)
  ████████

Server: grpc-echo-server-579dfbc76b-dw2m7-1755357770
  Pod: grpc-echo-server-579dfbc76b-dw2m7 (IP: 10.1.4.236)
  Requests: 10 (20.0%)
  ████████

Server: grpc-echo-server-579dfbc76b-x85jm-1755357769
  Pod: grpc-echo-server-579dfbc76b-x85jm (IP: 10.1.4.238)
  Requests: 10 (20.0%)
  ████████

Total unique servers: 5

Load balancing detected across 5 servers!
   Expected requests per server: 10.0
   Distribution variance: 0.00

How Istio Solves This

From Istio’s traffic management documentation:

“Envoy proxies are deployed as sidecars to services, logically augmenting the services with traffic management capabilities… Envoy proxies are the only Istio components that interact with data plane traffic.”

Istio’s solution:

  1. Envoy sidecar intercepts all traffic
  2. Performs L7 (application-level) load balancing
  3. Maintains connection pools to all backends
  4. Routes each request independently

Test 4: Client-Side Load Balancing

Purpose: Test gRPC’s built-in client-side load balancing capabilities.

Standard Client-Side LB

./scripts/test-client-lb.sh

The cmd/client-lb/main.go implements gRPC’s native load balancing:

conn, err := grpc.Dial(
    "dns:///"+target,
    grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
    grpc.WithTransportCredentials(insecure.NewCredentials()),
)

Results and Limitations

 Load Distribution Results:
================================
Server: grpc-echo-server-5b657689db-g9pbw-1755359830
  Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
  Requests: 10 (100.0%)
  ????????????????????

???? Total servers hit: 1
??  WARNING: All requests went to a single server!
   This indicates NO load balancing is happening.
? Normal client works - service is accessible

???? Test 2: Client-side round-robin (from inside cluster)
?????????????????????????????????????????????????????
Creating test pod inside cluster for proper DNS resolution...
pod "client-lb-test" deleted
./scripts/test-client-lb.sh: line 71: 48208 Terminated: 15          kubectl port-forward service/grpc-echo-service 50051:50051 > /dev/null 2>&1

??  Client-side LB limitation explanation:
   gRPC client-side round-robin expects multiple A records
   But Kubernetes Services return only one ClusterIP
   Result: 'no children to pick from' error

???? What happens with client-side LB:
   1. Client asks DNS for: grpc-echo-service
   2. DNS returns: 10.105.177.23 (single IP)
   3. gRPC round-robin needs: multiple IPs for load balancing
   4. Result: Error 'no children to pick from'

? This proves client-side LB doesn't work with K8s Services!

???? Test 3: Demonstrating the DNS limitation
?????????????????????????????????????????????
What gRPC client-side LB sees:
   Service name: grpc-echo-service:50051
   DNS resolution: 10.105.177.23:50051
   Available endpoints: 1 (needs multiple for round-robin)

What gRPC client-side LB needs:
   Multiple A records from DNS, like:
   grpc-echo-service ? 10.1.4.241:50051
   grpc-echo-service ? 10.1.4.240:50051
   grpc-echo-service ? 10.1.4.238:50051
   (But Kubernetes Services don't provide this)

???? Test 4: Alternative - Multiple connections
????????????????????????????????????????????
Testing alternative approach with multiple connections...

???? Configuration:
   Target: localhost:50052
   API: grpc.Dial
   Load Balancing: round-robin
   Multi-endpoint: true
   Requests: 20

???? Using multi-endpoint resolver

???? Sending 20 unary requests...

✅ Request 1 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
✅ Request 2 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
✅ Request 3 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
✅ Request 4 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
✅ Request 5 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
✅ Request 6 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
✅ Request 7 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
✅ Request 8 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
✅ Request 9 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
✅ Request 10 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
✅ Request 11 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)

✅ Successful requests: 20/20

Load Distribution Results:
================================

Server: grpc-echo-server-5b657689db-g9pbw-1755359830
  Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
  Requests: 20 (100.0%)
  ████████████████████████████████████████

Total unique servers: 1

⚠️  WARNING: All requests went to a single server!
   This indicates NO load balancing is happening.
   This is expected for gRPC without Istio or special configuration.
✅ Multi-connection approach works!
   (This simulates multiple endpoints for testing)

===============================================================
                         SUMMARY
===============================================================

KEY FINDINGS:
   • Standard gRPC client: Works (uses single connection)
   • Client-side round-robin: Fails (needs multiple IPs)
   • Kubernetes DNS: Returns single ClusterIP only
   • Alternative: Multiple connections can work

CONCLUSION:
   Client-side load balancing doesn't work with standard
   Kubernetes Services because they provide only one IP address.
   This proves why Istio (L7 proxy) is needed for gRPC load balancing!

Why this fails: Kubernetes Services provide a single ClusterIP, not multiple IPs for DNS resolution.

From the gRPC load balancing documentation:

“The gRPC client will use the list of IP addresses returned by the name resolver and distribute RPCs among them.”

Alternative: Multiple Connections

Start five server instances on different ports:

# Terminal 1
GRPC_PORT=50051 ./bin/server

# Terminal 2  
GRPC_PORT=50052 ./bin/server

# Terminal 3
GRPC_PORT=50053 ./bin/server

# Terminal 4
GRPC_PORT=50054 ./bin/server

# Terminal 5
GRPC_PORT=50055 ./bin/server

The cmd/client-v2/main.go implements manual connection management:

./bin/client-v2 -target localhost:50051 -requests 50 -multi-endpoint
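
The exact client code isn't reproduced here, but the core idea is simple: keep one ClientConn per explicitly listed endpoint and rotate across them on every request, sidestepping DNS entirely. A minimal sketch under that assumption (package, type, and helper names are illustrative, not the actual cmd/client-v2 source):

package lbclient

import (
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// multiConnPool round-robins across explicitly configured endpoints.
// Because each endpoint gets its own ClientConn, no DNS-based resolver
// is involved, which is why this "works" where dns:/// round_robin
// against a single ClusterIP does not.
type multiConnPool struct {
	conns []*grpc.ClientConn
	next  uint64
}

func newMultiConnPool(endpoints []string) (*multiConnPool, error) {
	p := &multiConnPool{}
	for _, ep := range endpoints {
		conn, err := grpc.Dial(ep, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			return nil, err
		}
		p.conns = append(p.conns, conn)
	}
	return p, nil
}

// pick returns the next connection in round-robin order; each request
// builds its stub from pick(), which is what yields the even split below.
func (p *multiConnPool) pick() *grpc.ClientConn {
	n := atomic.AddUint64(&p.next, 1)
	return p.conns[int(n)%len(p.conns)]
}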

Results:

Load Distribution Results:
================================

Server: unknown-1755360953
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ████████

Server: unknown-1755360963
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ████████

Server: unknown-1755360970
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ████████

Server: unknown-1755360980
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ████████

Server: unknown-1755360945
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ████████

Total unique servers: 5

✅ Load balancing detected across 5 servers!
   Expected requests per server: 10.0
   Distribution variance: 0.00

Test 5: Advanced Connection Testing

Purpose: Analyze connection patterns and performance implications.

Multiple Connection Strategy

./bin/advanced-client \
  -target localhost:50051 \
  -requests 1000 \
  -clients 10 \
  -connections 5

Results:

Detailed Load Distribution Results:
=====================================
Test Duration: 48.303709ms
Total Requests: 1000
Failed Requests: 0
Requests/sec: 20702.34

Server Distribution:

Server: unknown-1755360945
  Pod: unknown (IP: unknown)
  Requests: 1000 (100.0%)
  First seen: 09:18:51.842
  Last seen: 09:18:51.874
  ████████████████████████████████████████

Analysis:
Total unique servers: 1
Average requests per server: 1000.00
Standard deviation: 0.00

⚠️  WARNING: All requests went to a single server!
   This indicates NO load balancing is happening.
   This is expected behavior for gRPC without Istio.

Even sophisticated connection pooling can’t overcome the fundamental issue:
• Multiple connections to SAME endpoint = same server
• Advanced client techniques ≠ load balancing
• Connection management ≠ request distribution

Performance Comparison

./scripts/benchmark.sh

Key Insights:
• Single server: High performance, no load balancing
• Multiple connections: Same performance, still no LB
• Kubernetes: Small overhead, still no LB
• Istio: Small additional overhead, but enables LB
• Client-side LB: Complex setup, limited effectiveness


Official Documentation References

gRPC Load Balancing

From the official gRPC blog:

“Load balancing within gRPC happens on a per-call basis, not a per-connection basis. In other words, even if all requests come from a single client, we want to distribute them across all servers.”

The problem: Standard deployments don’t achieve per-call balancing.

Istio’s Solution

From Istio’s service mesh documentation:

“Istio’s data plane is composed of a set of intelligent proxies (Envoy) deployed as sidecars. These proxies mediate and control all network communication between microservices.”

Kubernetes Service Limitations

From Kubernetes networking concepts:

“kube-proxy… only supports TCP and UDP… doesn’t understand HTTP and doesn’t provide load balancing for HTTP requests.”


Complete Test Results Summary

After running comprehensive tests across all possible gRPC load balancing configurations, here are the definitive results that prove the fundamental limitations and solutions:

Core Test Matrix Results

Configuration                | Load Balancing | Servers Hit    | Distribution      | Key Insight
Local gRPC                   | ❌ None        | 1/1 (100%)     | Single server     | Baseline behavior confirmed
Kubernetes + gRPC            | ❌ None        | 1/5 (100%)     | Single pod        | K8s Services don’t solve it
Kubernetes + Istio           | ✅ Perfect     | 5/5 (20% each) | Even distribution | Istio enables true LB
Client-side LB               | ❌ Failed      | 1/5 (100%)     | Single pod        | DNS limitation fatal
kubectl port-forward + Istio | ❌ None        | 1/5 (100%)     | Single pod        | Testing methodology matters
Advanced multi-connection    | ❌ None        | 1/1 (100%)     | Single endpoint   | Complex ≠ effective

Detailed Test Scenario Analysis

Scenario 1: Baseline Tests

Local single server:     ✅ PASS - 50 requests → 1 server (100%)
Local multiple conn:     ✅ PASS - 1000 requests → 1 server (100%)

Insight: Confirms gRPC’s connection persistence behavior. Multiple connections to same endpoint don’t change distribution.

Scenario 2: Kubernetes Standard Deployment

K8s without Istio:      ✅ PASS - 50 requests → 1 pod (100%)
Expected behavior:      NO load balancing
Actual behavior:        NO load balancing

Insight: Standard Kubernetes deployment with 5 replicas provides zero request-level load balancing for gRPC services.

Scenario 3: Istio Service Mesh

K8s with Istio (port-forward):  ⚠️  BYPASS - 50 requests → 1 pod (100%)
K8s with Istio (in-mesh):       ✅ SUCCESS - 50 requests → 5 pods (20% each)

Insight: Istio provides perfect load balancing when tested correctly. Port-forward testing gives false negatives.

Scenario 4: Client-Side Approaches

DNS round-robin:        ❌ FAIL - "no children to pick from"
Multi-endpoint client:  PARTIAL - Works with manual endpoint management
Advanced connections:   ❌ FAIL - Still single endpoint limitation

Insight: Client-side solutions are complex, fragile, and limited in Kubernetes environments.

Deep Technical Analysis

The DNS Problem (Root Cause)

Our testing revealed the fundamental architectural issue:

# What Kubernetes provides
nslookup grpc-echo-service
→ 10.105.177.23 (single ClusterIP)

# What gRPC client-side LB needs  
nslookup grpc-echo-service
→ 10.1.4.241, 10.1.4.242, 10.1.4.243, 10.1.4.244, 10.1.4.245 (multiple IPs)

Impact: This single vs. multiple IP difference makes client-side load balancing architecturally impossible with standard Kubernetes Services.

Connection Persistence Evidence

Our advanced client test with 1000 requests, 10 concurrent clients, and 5 connections:

Test Duration: 48ms
Requests/sec: 20,702
Servers Hit: 1 (100%)
Connection Reuse: Perfect (efficient but unbalanced)

Conclusion: Even sophisticated connection management can’t overcome the single-endpoint limitation.

Istio’s L7 Magic

Comparing the same test scenario:

# Without Istio
50 requests → grpc-echo-server-abc123 (100%)

# With Istio  
50 requests → 5 different pods (20% each)
Distribution variance: 0.00 (perfect)

Technical Detail: Istio’s Envoy sidecar performs request-level routing, creating independent routing decisions for each gRPC call.

Performance Impact Analysis

Based on our benchmark results:

Configuration  | Req/s   | Overhead | Load Balancing | Production Suitable
Local baseline | ~25,000 | 0%       | None           | ❌ Not scalable
K8s standard   | ~22,000 | 12%      | None           | ❌ Unbalanced
K8s + Istio    | ~20,000 | 20%      | Perfect        | ✅ Recommended
Client-side    | ~23,000 | 8%       | Complex        | ⚠️ Maintenance burden

Insight: Istio’s 20% performance overhead is a reasonable trade-off for enabling proper load balancing and gaining a production-ready service mesh.


Production Recommendations

For Development Teams:

  1. Standard Kubernetes deployment of gRPC services will not load balance
  2. Istio is the proven solution for production gRPC load balancing
  3. Client-side approaches add complexity without solving the fundamental issue
  4. Testing methodology critically affects results (avoid port-forward for Istio tests)

For Architecture Decisions:

  1. Plan for Istio if deploying multiple gRPC services
  2. Accept the 20% performance cost for operational benefits
  3. Avoid client-side load balancing in Kubernetes environments
  4. Use proper testing practices to validate service mesh behavior

For Production Readiness:

  1. Istio + DestinationRules provide enterprise-grade gRPC load balancing
  2. Monitoring and observability come built-in with Istio
  3. Circuit breaking and retry policies integrate seamlessly
  4. Zero client-side complexity reduces maintenance burden

Primary Recommendation: Istio Service Mesh

Our testing proves Istio is the only solution that provides reliable gRPC load balancing in Kubernetes:

# Production-tested DestinationRule configuration
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-service-production
spec:
  host: grpc-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 30s
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10  # Tested: Ensures request distribution
    loadBalancer:
      simple: LEAST_REQUEST  # Better than ROUND_ROBIN for varying request costs
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Why this configuration works:

  • maxRequestsPerConnection: 10 – Forces connection rotation (tested in our scenario)
  • LEAST_REQUEST – Better performance than round-robin for real workloads
  • outlierDetection – Automatic failure handling (something client-side LB can’t provide)

Expected results based on our testing:

  • ✅ Perfect 20% distribution across 5 replicas
  • ✅ ~20% performance overhead (trade-off worth it)
  • ✅ Built-in observability and monitoring
  • ✅ Zero client-side complexity

Configuration Best Practices

1. Enable Istio Injection Properly

# Enable for entire namespace (recommended)
kubectl label namespace production istio-injection=enabled

# Or per-deployment (more control)
metadata:
  annotations:
    sidecar.istio.io/inject: "true"

2. Validate Load Balancing is Working

# WRONG: This will show false negatives
kubectl port-forward service/grpc-service 50051:50051

# CORRECT: Test from inside the mesh
kubectl run test-client --rm -it --restart=Never \
  --image=your-grpc-client \
  --annotations="sidecar.istio.io/inject=true" \
  -- ./client -target grpc-service:50051 -requests 100

3. Monitor Distribution Quality

# Check Envoy stats for load balancing
kubectl exec deployment/grpc-service -c istio-proxy -- \
  curl localhost:15000/stats | grep upstream_rq_

⚠️ What NOT to Do (Based on Our Test Failures)

1. Don’t Rely on Standard Kubernetes Services

# This WILL NOT load balance gRPC traffic
apiVersion: v1
kind: Service
metadata:
  name: grpc-service
spec:
  ports:
  - port: 50051
  selector:
    app: grpc-server
# Result: 100% traffic to single pod (proven in our tests)

2. Don’t Use Client-Side Load Balancing

// This approach FAILS in Kubernetes (tested and failed)
conn, err := grpc.Dial(
    "dns:///grpc-service:50051",
    grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
)
// Error: "no children to pick from" (proven in our tests)

3. Don’t Implement Complex Connection Pooling

// This adds complexity without solving the core issue
type LoadBalancedClient struct {
    conns []grpc.ClientConnInterface
    next  int64
}
// Still results in 100% traffic to single endpoint (proven in our tests)

Alternative Solutions (If Istio Not Available)

If you absolutely cannot use Istio, here are the only viable alternatives (with significant caveats):

Option 1: External Load Balancer with HTTP/2 Support

# Use nginx/envoy/haproxy outside Kubernetes
apiVersion: v1
kind: Service
metadata:
  name: grpc-service-lb
spec:
  type: LoadBalancer
  ports:
  - port: 50051
    targetPort: 50051

Limitations: Requires external infrastructure, loss of Kubernetes-native benefits

Option 2: Headless Service + Custom Service Discovery

apiVersion: v1
kind: Service
metadata:
  name: grpc-service-headless
spec:
  clusterIP: None  # Headless service
  ports:
  - port: 50051
  selector:
    app: grpc-server

Limitations: Complex client implementation, manual health checking
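
The reason this option can work at all is that a headless Service makes the cluster DNS return one A record per ready Pod, giving gRPC's round_robin policy the multiple addresses it was missing in the earlier tests. A minimal client sketch under that assumption (service name, namespace, and port are illustrative):

package lbclient

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialHeadless connects through the dns:/// resolver against a headless
// Service, so round_robin finally has more than one address to pick from.
func dialHeadless() (*grpc.ClientConn, error) {
	return grpc.Dial(
		"dns:///grpc-service-headless.default.svc.cluster.local:50051",
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
}

Even then, the DNS resolver only re-resolves on connection failures or its own refresh triggers, so scale-ups and Pod churn can leave the address list stale for a while, which is part of the manual health-checking burden noted above.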


Conclusion

After testing every possible gRPC load balancing configuration in Kubernetes, the evidence is clear and definitive:

  • Standard Kubernetes + gRPC = Zero load balancing (100% traffic to single pod)
  • The problem is architectural, not implementation
  • Client-side solutions fail due to DNS limitations (“no children to pick from”)
  • Complex workarounds add overhead without solving the core issue

Istio is the Proven Solution

The evidence overwhelmingly supports Istio as the production solution:

  • ✅ Perfect load balancing: 20% distribution across 5 pods (0.00 variance)
  • ✅ Reasonable overhead: 20% performance cost for complete solution
  • ✅ Production features: Circuit breaking, retries, observability included
  • ✅ Zero client complexity: Works transparently with existing gRPC clients

Critical Testing Insight

Our testing revealed a major pitfall that leads to incorrect conclusions:

  • kubectl port-forward bypasses Istio → false negative results
  • Most developers get wrong results when testing Istio + gRPC
  • Always test from inside the service mesh for accurate results

Full test suite and results →

August 15, 2025

Building Robust Error Handling with gRPC and REST APIs

Filed under: Computing,Web Services — admin @ 2:23 pm

Introduction

Error handling is often an afterthought in API development, yet it’s one of the most critical aspects of a good developer experience. A cryptic error message like { "error": "An error occurred" } can lead to hours of frustrating debugging. In this guide, we will build a robust, production-grade error-handling framework for a Go application that serves both gRPC and a REST/HTTP gateway, based on RFC 9457 (Problem Details for HTTP APIs), the successor to the now-obsolete RFC 7807.

Tenets

The following are the tenets of a great API error:

  1. Structured: machine-readable, not just a string.
  2. Actionable: explains to the developer why the error occurred and, if possible, how to fix it.
  3. Consistent: all errors, from validation to authentication to server faults, follow the same format.
  4. Secure: never leaks sensitive internal information like stack traces or database schemas.

Our North Star for HTTP errors will be the Problem Details for HTTP APIs (RFC 9457/7807):

{
  "type": "https://example.com/docs/errors/validation-failed",
  "title": "Validation Failed",
  "status": 400,
  "detail": "The request body failed validation.",
  "instance": "/v1/todos",
  "invalid_params": [
    {
      "field": "title",
      "reason": "must not be empty"
    }
  ]
}

We will adapt this model for gRPC by embedding a similar structure in the gRPC status details, creating a single source of truth for all errors.

API Design

Let’s start by defining our TODO API in Protocol Buffers:

syntax = "proto3";

package todo.v1;

import "google/api/annotations.proto";
import "google/api/field_behavior.proto";
import "google/api/resource.proto";
import "google/protobuf/timestamp.proto";
import "google/protobuf/field_mask.proto";
import "buf/validate/validate.proto";

option go_package = "github.com/bhatti/todo-api-errors/api/proto/todo/v1;todo";

// TodoService provides task management operations
service TodoService {
  // CreateTask creates a new task
  rpc CreateTask(CreateTaskRequest) returns (Task) {
    option (google.api.http) = {
      post: "/v1/tasks"
      body: "*"
    };
  }

  // GetTask retrieves a specific task
  rpc GetTask(GetTaskRequest) returns (Task) {
    option (google.api.http) = {
      get: "/v1/{name=tasks/*}"
    };
  }

  // ListTasks retrieves all tasks
  rpc ListTasks(ListTasksRequest) returns (ListTasksResponse) {
    option (google.api.http) = {
      get: "/v1/tasks"
    };
  }

  // UpdateTask updates an existing task
  rpc UpdateTask(UpdateTaskRequest) returns (Task) {
    option (google.api.http) = {
      patch: "/v1/{task.name=tasks/*}"
      body: "task"
    };
  }

  // DeleteTask removes a task
  rpc DeleteTask(DeleteTaskRequest) returns (DeleteTaskResponse) {
    option (google.api.http) = {
      delete: "/v1/{name=tasks/*}"
    };
  }

  // BatchCreateTasks creates multiple tasks at once
  rpc BatchCreateTasks(BatchCreateTasksRequest) returns (BatchCreateTasksResponse) {
    option (google.api.http) = {
      post: "/v1/tasks:batchCreate"
      body: "*"
    };
  }
}

// Task represents a TODO item
message Task {
  option (google.api.resource) = {
    type: "todo.example.com/Task"
    pattern: "tasks/{task}"
    singular: "task"
    plural: "tasks"
  };

  // Resource name of the task
  string name = 1 [
    (google.api.field_behavior) = IDENTIFIER,
    (google.api.field_behavior) = OUTPUT_ONLY
  ];

  // Task title
  string title = 2 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).string = {
      min_len: 1
      max_len: 200
    }
  ];

  // Task description
  string description = 3 [
    (google.api.field_behavior) = OPTIONAL,
    (buf.validate.field).string = {
      max_len: 1000
    }
  ];

  // Task status
  Status status = 4 [
    (google.api.field_behavior) = REQUIRED
  ];

  // Task priority
  Priority priority = 5 [
    (google.api.field_behavior) = OPTIONAL
  ];

  // Due date for the task
  google.protobuf.Timestamp due_date = 6 [
    (google.api.field_behavior) = OPTIONAL,
    (buf.validate.field).timestamp = {
      gt_now: true
    }
  ];

  // Task creation time
  google.protobuf.Timestamp create_time = 7 [
    (google.api.field_behavior) = OUTPUT_ONLY
  ];

  // Task last update time
  google.protobuf.Timestamp update_time = 8 [
    (google.api.field_behavior) = OUTPUT_ONLY
  ];

  // User who created the task
  string created_by = 9 [
    (google.api.field_behavior) = OUTPUT_ONLY
  ];

  // Tags associated with the task
  repeated string tags = 10 [
    (buf.validate.field).repeated = {
      max_items: 10
      items: {
        string: {
          pattern: "^[a-z0-9-]+$"
          max_len: 50
        }
      }
    }
  ];
}

// Task status enumeration
enum Status {
  STATUS_UNSPECIFIED = 0;
  STATUS_PENDING = 1;
  STATUS_IN_PROGRESS = 2;
  STATUS_COMPLETED = 3;
  STATUS_CANCELLED = 4;
}

// Task priority enumeration
enum Priority {
  PRIORITY_UNSPECIFIED = 0;
  PRIORITY_LOW = 1;
  PRIORITY_MEDIUM = 2;
  PRIORITY_HIGH = 3;
  PRIORITY_CRITICAL = 4;
}

// CreateTaskRequest message
message CreateTaskRequest {
  // Task to create
  Task task = 1 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).required = true
  ];
}

// GetTaskRequest message
message GetTaskRequest {
  // Resource name of the task
  string name = 1 [
    (google.api.field_behavior) = REQUIRED,
    (google.api.resource_reference) = {
      type: "todo.example.com/Task"
    },
    (buf.validate.field).string = {
      pattern: "^tasks/[a-zA-Z0-9-]+$"
    }
  ];
}

// ListTasksRequest message
message ListTasksRequest {
  // Maximum number of tasks to return
  int32 page_size = 1 [
    (buf.validate.field).int32 = {
      gte: 0
      lte: 1000
    }
  ];

  // Page token for pagination
  string page_token = 2;

  // Filter expression
  string filter = 3;

  // Order by expression
  string order_by = 4;
}

// ListTasksResponse message
message ListTasksResponse {
  // List of tasks
  repeated Task tasks = 1;

  // Token for next page
  string next_page_token = 2;

  // Total number of tasks
  int32 total_size = 3;
}

// UpdateTaskRequest message
message UpdateTaskRequest {
  // Task to update
  Task task = 1 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).required = true
  ];

  // Fields to update
  google.protobuf.FieldMask update_mask = 2 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).required = true
  ];
}

// DeleteTaskRequest message
message DeleteTaskRequest {
  // Resource name of the task
  string name = 1 [
    (google.api.field_behavior) = REQUIRED,
    (google.api.resource_reference) = {
      type: "todo.example.com/Task"
    }
  ];
}

// DeleteTaskResponse message
message DeleteTaskResponse {
  // Confirmation message
  string message = 1;
}

// BatchCreateTasksRequest message
message BatchCreateTasksRequest {
  // Tasks to create
  repeated CreateTaskRequest requests = 1 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).repeated = {
      min_items: 1
      max_items: 100
    }
  ];
}

// BatchCreateTasksResponse message
message BatchCreateTasksResponse {
  // Created tasks
  repeated Task tasks = 1;
}
To keep the error model reusable across services, we define it in its own proto package (errors.v1); this ErrorDetail message is what we will embed in the details field of the gRPC status:

syntax = "proto3";

package errors.v1;

import "google/protobuf/timestamp.proto";
import "google/protobuf/any.proto";

option go_package = "github.com/bhatti/todo-api-errors/api/proto/errors/v1;errors";

// ErrorDetail provides a structured, machine-readable error payload.
// It is designed to be embedded in the `details` field of a `google.rpc.Status` message.
message ErrorDetail {
  // A unique, application-specific error code.
  string code = 1;
  // A short, human-readable summary of the problem type.
  string title = 2;
  // A human-readable explanation specific to this occurrence of the problem.
  string detail = 3;
  // A list of validation errors, useful for INVALID_ARGUMENT responses.
  repeated FieldViolation field_violations = 4;
  // Optional trace ID for request correlation
  string trace_id = 5;
  // Optional timestamp when the error occurred
  google.protobuf.Timestamp timestamp = 6;
  // Optional instance path where the error occurred
  string instance = 7;
  // Optional extensions for additional error context
  map<string, google.protobuf.Any> extensions = 8;
}

// Describes a single validation failure.
message FieldViolation {
  // The path to the field that failed validation, e.g., "title".
  string field = 1;
  // A developer-facing description of the validation rule that failed.
  string description = 2;
  // Application-specific error code for this validation failure
  string code = 3;
}

// AppErrorCode defines a list of standardized, application-specific error codes.
enum AppErrorCode {
  APP_ERROR_CODE_UNSPECIFIED = 0;

  // Validation failures
  VALIDATION_FAILED = 1;
  REQUIRED_FIELD = 2;
  TOO_SHORT = 3;
  TOO_LONG = 4;
  INVALID_FORMAT = 5;
  MUST_BE_FUTURE = 6;
  INVALID_VALUE = 7;
  DUPLICATE_TAG = 8;
  INVALID_TAG_FORMAT = 9;
  OVERDUE_COMPLETION = 10;
  EMPTY_BATCH = 11;
  BATCH_TOO_LARGE = 12;
  DUPLICATE_TITLE = 13;

  // Resource errors
  RESOURCE_NOT_FOUND = 1001;
  RESOURCE_CONFLICT = 1002;

  // Authentication and authorization
  AUTHENTICATION_FAILED = 2001;
  PERMISSION_DENIED = 2002;

  // Rate limiting and service availability
  RATE_LIMIT_EXCEEDED = 3001;
  SERVICE_UNAVAILABLE = 3002;

  // Internal errors
  INTERNAL_ERROR = 9001;
}

Error Handling Implementation

Now let’s implement our error handling framework:

package errors

import (
	"fmt"

	errorspb "github.com/bhatti/todo-api-errors/api/proto/errors/v1"
	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/types/known/anypb"
	"google.golang.org/protobuf/types/known/timestamppb"
)

// AppError is our custom error type using protobuf definitions.
type AppError struct {
	GRPCCode        codes.Code
	AppCode         errorspb.AppErrorCode
	Title           string
	Detail          string
	FieldViolations []*errorspb.FieldViolation
	TraceID         string
	Instance        string
	Extensions      map[string]*anypb.Any
	CausedBy        error // For internal logging
}

func (e *AppError) Error() string {
	return fmt.Sprintf("gRPC Code: %s, App Code: %s, Title: %s, Detail: %s", e.GRPCCode, e.AppCode, e.Title, e.Detail)
}

// ToGRPCStatus converts our AppError into a gRPC status.Status.
func (e *AppError) ToGRPCStatus() *status.Status {
	st := status.New(e.GRPCCode, e.Title)

	errorDetail := &errorspb.ErrorDetail{
		Code:            e.AppCode.String(),
		Title:           e.Title,
		Detail:          e.Detail,
		FieldViolations: e.FieldViolations,
		TraceId:         e.TraceID,
		Timestamp:       timestamppb.Now(),
		Instance:        e.Instance,
		Extensions:      e.Extensions,
	}

	// For validation errors, we also attach the standard BadRequest detail
	// so that gRPC-Gateway and other standard tools can understand it.
	if e.GRPCCode == codes.InvalidArgument && len(e.FieldViolations) > 0 {
		br := &errdetails.BadRequest{}
		for _, fv := range e.FieldViolations {
			br.FieldViolations = append(br.FieldViolations, &errdetails.BadRequest_FieldViolation{
				Field:       fv.Field,
				Description: fv.Description,
			})
		}
		st, _ = st.WithDetails(br, errorDetail)
		return st
	}

	st, _ = st.WithDetails(errorDetail)
	return st
}

// Helper functions for creating common errors

func NewValidationFailed(violations []*errorspb.FieldViolation, traceID string) *AppError {
	return &AppError{
		GRPCCode:        codes.InvalidArgument,
		AppCode:         errorspb.AppErrorCode_VALIDATION_FAILED,
		Title:           "Validation Failed",
		Detail:          fmt.Sprintf("The request contains %d validation errors", len(violations)),
		FieldViolations: violations,
		TraceID:         traceID,
	}
}

func NewNotFound(resource string, id string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.NotFound,
		AppCode:  errorspb.AppErrorCode_RESOURCE_NOT_FOUND,
		Title:    "Resource Not Found",
		Detail:   fmt.Sprintf("%s with ID '%s' was not found.", resource, id),
		TraceID:  traceID,
	}
}

func NewConflict(resource, reason string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.AlreadyExists,
		AppCode:  errorspb.AppErrorCode_RESOURCE_CONFLICT,
		Title:    "Resource Conflict",
		Detail:   fmt.Sprintf("Conflict creating %s: %s", resource, reason),
		TraceID:  traceID,
	}
}

func NewInternal(message string, traceID string, causedBy error) *AppError {
	return &AppError{
		GRPCCode: codes.Internal,
		AppCode:  errorspb.AppErrorCode_INTERNAL_ERROR,
		Title:    "Internal Server Error",
		Detail:   message,
		TraceID:  traceID,
		CausedBy: causedBy,
	}
}

func NewPermissionDenied(resource, action string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.PermissionDenied,
		AppCode:  errorspb.AppErrorCode_PERMISSION_DENIED,
		Title:    "Permission Denied",
		Detail:   fmt.Sprintf("You don't have permission to %s %s", action, resource),
		TraceID:  traceID,
	}
}

func NewServiceUnavailable(message string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.Unavailable,
		AppCode:  errorspb.AppErrorCode_SERVICE_UNAVAILABLE,
		Title:    "Service Unavailable",
		Detail:   message,
		TraceID:  traceID,
	}
}

func NewRequiredField(field, message string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.InvalidArgument,
		AppCode:  errorspb.AppErrorCode_VALIDATION_FAILED,
		Title:    "Validation Failed",
		Detail:   "The request contains validation errors",
		FieldViolations: []*errorspb.FieldViolation{
			{
				Field:       field,
				Code:        errorspb.AppErrorCode_REQUIRED_FIELD.String(),
				Description: message,
			},
		},
		TraceID: traceID,
	}
}
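
On the consuming side, a gRPC client can recover the same structured payload from the returned error by inspecting the status details. A minimal sketch (package name and logging are illustrative):

package client

import (
	"log"

	errorspb "github.com/bhatti/todo-api-errors/api/proto/errors/v1"
	"google.golang.org/grpc/status"
)

// logAppError unpacks our ErrorDetail from a gRPC error, if one is attached.
func logAppError(err error) {
	st, ok := status.FromError(err)
	if !ok {
		log.Printf("non-gRPC error: %v", err)
		return
	}
	for _, d := range st.Details() {
		detail, ok := d.(*errorspb.ErrorDetail)
		if !ok {
			continue
		}
		log.Printf("[%s] %s: %s (trace=%s)", detail.Code, detail.Title, detail.Detail, detail.TraceId)
		for _, fv := range detail.FieldViolations {
			log.Printf("  %s: %s (%s)", fv.Field, fv.Description, fv.Code)
		}
	}
}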

Validation Framework

Let’s implement validation that returns all errors at once:

package validation

import (
	"errors"
	"fmt"
	"regexp"
	"strings"

	"buf.build/gen/go/bufbuild/protovalidate/protocolbuffers/go/buf/validate"
	"buf.build/go/protovalidate"
	errorspb "github.com/bhatti/todo-api-errors/api/proto/errors/v1"
	todopb "github.com/bhatti/todo-api-errors/api/proto/todo/v1"
	apperrors "github.com/bhatti/todo-api-errors/internal/errors"
	"google.golang.org/protobuf/proto"
)

var pv protovalidate.Validator

func init() {
	var err error
	pv, err = protovalidate.New()
	if err != nil {
		panic(fmt.Sprintf("failed to initialize protovalidator: %v", err))
	}
}

// ValidateRequest checks a proto message and returns an AppError with all violations.
func ValidateRequest(req proto.Message, traceID string) error {
	if err := pv.Validate(req); err != nil {
		var validationErrs *protovalidate.ValidationError
		if errors.As(err, &validationErrs) {
			var violations []*errorspb.FieldViolation
			for _, violation := range validationErrs.Violations {
				fieldPath := ""
				if violation.Proto.GetField() != nil {
					fieldPath = formatFieldPath(violation.Proto.GetField())
				}

				ruleId := violation.Proto.GetRuleId()
				message := violation.Proto.GetMessage()

				violations = append(violations, &errorspb.FieldViolation{
					Field:       fieldPath,
					Description: message,
					Code:        mapConstraintToCode(ruleId),
				})
			}
			return apperrors.NewValidationFailed(violations, traceID)
		}
		return apperrors.NewInternal("Validation failed", traceID, err)
	}
	return nil
}

// ValidateTask performs additional business logic validation
func ValidateTask(task *todopb.Task, traceID string) error {
	var violations []*errorspb.FieldViolation

	// Proto validation first
	if err := ValidateRequest(task, traceID); err != nil {
		if appErr, ok := err.(*apperrors.AppError); ok {
			violations = append(violations, appErr.FieldViolations...)
		}
	}

	// Additional business rules
	if task.Status == todopb.Status_STATUS_COMPLETED && task.DueDate != nil {
		if task.UpdateTime != nil && task.UpdateTime.AsTime().After(task.DueDate.AsTime()) {
			violations = append(violations, &errorspb.FieldViolation{
				Field:       "due_date",
				Code:        errorspb.AppErrorCode_OVERDUE_COMPLETION.String(),
				Description: "Task was completed after the due date",
			})
		}
	}

	// Validate tags format
	for i, tag := range task.Tags {
		if !isValidTag(tag) {
			violations = append(violations, &errorspb.FieldViolation{
				Field:       fmt.Sprintf("tags[%d]", i),
				Code:        errorspb.AppErrorCode_INVALID_TAG_FORMAT.String(),
				Description: fmt.Sprintf("Tag '%s' must be lowercase letters, numbers, and hyphens only", tag),
			})
		}
	}

	// Check for duplicate tags
	tagMap := make(map[string]bool)
	for i, tag := range task.Tags {
		if tagMap[tag] {
			violations = append(violations, &errorspb.FieldViolation{
				Field:       fmt.Sprintf("tags[%d]", i),
				Code:        errorspb.AppErrorCode_DUPLICATE_TAG.String(),
				Description: fmt.Sprintf("Tag '%s' appears multiple times", tag),
			})
		}
		tagMap[tag] = true
	}

	if len(violations) > 0 {
		return apperrors.NewValidationFailed(violations, traceID)
	}

	return nil
}

// ValidateBatchCreateTasks validates batch operations
func ValidateBatchCreateTasks(req *todopb.BatchCreateTasksRequest, traceID string) error {
	var violations []*errorspb.FieldViolation

	// Check batch size
	if len(req.Requests) == 0 {
		violations = append(violations, &errorspb.FieldViolation{
			Field:       "requests",
			Code:        errorspb.AppErrorCode_EMPTY_BATCH.String(),
			Description: "Batch must contain at least one task",
		})
	}

	if len(req.Requests) > 100 {
		violations = append(violations, &errorspb.FieldViolation{
			Field:       "requests",
			Code:        errorspb.AppErrorCode_BATCH_TOO_LARGE.String(),
			Description: fmt.Sprintf("Batch size %d exceeds maximum of 100", len(req.Requests)),
		})
	}

	// Validate each task
	for i, createReq := range req.Requests {
		if createReq.Task == nil {
			violations = append(violations, &errorspb.FieldViolation{
				Field:       fmt.Sprintf("requests[%d].task", i),
				Code:        errorspb.AppErrorCode_REQUIRED_FIELD.String(),
				Description: "Task is required",
			})
			continue
		}

		// Validate task
		if err := ValidateTask(createReq.Task, traceID); err != nil {
			if appErr, ok := err.(*apperrors.AppError); ok {
				for _, violation := range appErr.FieldViolations {
					violation.Field = fmt.Sprintf("requests[%d].task.%s", i, violation.Field)
					violations = append(violations, violation)
				}
			}
		}
	}

	// Check for duplicate titles
	titleMap := make(map[string][]int)
	for i, createReq := range req.Requests {
		if createReq.Task != nil && createReq.Task.Title != "" {
			titleMap[createReq.Task.Title] = append(titleMap[createReq.Task.Title], i)
		}
	}

	for title, indices := range titleMap {
		if len(indices) > 1 {
			for _, idx := range indices {
				violations = append(violations, &errorspb.FieldViolation{
					Field:       fmt.Sprintf("requests[%d].task.title", idx),
					Code:        errorspb.AppErrorCode_DUPLICATE_TITLE.String(),
					Description: fmt.Sprintf("Title '%s' is used by multiple tasks in the batch", title),
				})
			}
		}
	}

	if len(violations) > 0 {
		return apperrors.NewValidationFailed(violations, traceID)
	}

	return nil
}

// Helper functions
func formatFieldPath(fieldPath *validate.FieldPath) string {
	if fieldPath == nil {
		return ""
	}

	// Build field path from elements
	var parts []string
	for _, element := range fieldPath.GetElements() {
		if element.GetFieldName() != "" {
			parts = append(parts, element.GetFieldName())
		} else if element.GetFieldNumber() != 0 {
			parts = append(parts, fmt.Sprintf("field_%d", element.GetFieldNumber()))
		}
	}

	return strings.Join(parts, ".")
}

func mapConstraintToCode(ruleId string) string {
	switch {
	case strings.Contains(ruleId, "required"):
		return errorspb.AppErrorCode_REQUIRED_FIELD.String()
	case strings.Contains(ruleId, "min_len"):
		return errorspb.AppErrorCode_TOO_SHORT.String()
	case strings.Contains(ruleId, "max_len"):
		return errorspb.AppErrorCode_TOO_LONG.String()
	case strings.Contains(ruleId, "pattern"):
		return errorspb.AppErrorCode_INVALID_FORMAT.String()
	case strings.Contains(ruleId, "gt_now"):
		return errorspb.AppErrorCode_MUST_BE_FUTURE.String()
	case ruleId == "":
		return errorspb.AppErrorCode_VALIDATION_FAILED.String()
	default:
		return errorspb.AppErrorCode_INVALID_VALUE.String()
	}
}

var validTagPattern = regexp.MustCompile(`^[a-z0-9-]+$`)

func isValidTag(tag string) bool {
	return len(tag) <= 50 && validTagPattern.MatchString(tag)
}

Error Handler Middleware

Now let’s create middleware to handle errors consistently:

package middleware

import (
	"context"
	"errors"
	"log"

	apperrors "github.com/bhatti/todo-api-errors/internal/errors"
	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

// UnaryErrorInterceptor translates application errors into gRPC statuses.
func UnaryErrorInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	resp, err := handler(ctx, req)
	if err == nil {
		return resp, nil
	}

	var appErr *apperrors.AppError
	if errors.As(err, &appErr) {
		if appErr.CausedBy != nil {
			log.Printf("ERROR: %s, Original cause: %v", appErr.Title, appErr.CausedBy)
		}
		return nil, appErr.ToGRPCStatus().Err()
	}

	if _, ok := status.FromError(err); ok {
		return nil, err // Already a gRPC status
	}

	log.Printf("UNEXPECTED ERROR: %v", err)
	return nil, apperrors.NewInternal("An unexpected error occurred", "", err).ToGRPCStatus().Err()
}
The interceptor above covers the gRPC path. For the REST side, we add an HTTP middleware plus a custom grpc-gateway error handler that turns the embedded ErrorDetail into an RFC 9457 problem+json response:

package middleware

import (
	"context"
	"encoding/json"
	"net/http"
	"runtime/debug"
	"time"

	errorspb "github.com/bhatti/todo-api-errors/api/proto/errors/v1"
	apperrors "github.com/bhatti/todo-api-errors/internal/errors"
	"github.com/google/uuid"
	"github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
	"go.opentelemetry.io/otel/trace"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/encoding/protojson"
)

// HTTPErrorHandler handles errors for HTTP endpoints
func HTTPErrorHandler(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Add trace ID to context
		traceID := r.Header.Get("X-Trace-ID")
		if traceID == "" {
			traceID = uuid.New().String()
		}
		ctx := context.WithValue(r.Context(), "traceID", traceID)
		r = r.WithContext(ctx)

		// Create response wrapper to intercept errors
		wrapped := &responseWriter{
			ResponseWriter: w,
			request:        r,
			traceID:        traceID,
		}

		// Handle panics
		defer func() {
			if err := recover(); err != nil {
				handlePanic(wrapped, err)
			}
		}()

		// Process request
		next.ServeHTTP(wrapped, r)
	})
}

// responseWriter wraps http.ResponseWriter to intercept errors
type responseWriter struct {
	http.ResponseWriter
	request    *http.Request
	traceID    string
	statusCode int
	written    bool
}

func (w *responseWriter) WriteHeader(code int) {
	if !w.written {
		w.statusCode = code
		w.ResponseWriter.WriteHeader(code)
		w.written = true
	}
}

func (w *responseWriter) Write(b []byte) (int, error) {
	if !w.written {
		w.WriteHeader(http.StatusOK)
	}
	return w.ResponseWriter.Write(b)
}

// handlePanic converts panics to proper error responses
func handlePanic(w *responseWriter, recovered interface{}) {
	// Log stack trace
	debug.PrintStack()

	appErr := apperrors.NewInternal("An unexpected error occurred. Please try again later.", w.traceID, nil)
	writeErrorResponse(w, appErr)
}

// CustomHTTPError handles gRPC gateway error responses
func CustomHTTPError(ctx context.Context, mux *runtime.ServeMux,
	marshaler runtime.Marshaler, w http.ResponseWriter, r *http.Request, err error) {

	// Extract trace ID
	traceID := r.Header.Get("X-Trace-ID")
	if traceID == "" {
		if span := trace.SpanFromContext(ctx); span.SpanContext().IsValid() {
			traceID = span.SpanContext().TraceID().String()
		} else {
			traceID = uuid.New().String()
		}
	}

	// Convert gRPC error to HTTP response
	st, _ := status.FromError(err)

	// Check if we have our custom error detail in status details
	for _, detail := range st.Details() {
		if errorDetail, ok := detail.(*errorspb.ErrorDetail); ok {
			// Update the error detail with current request context
			errorDetail.TraceId = traceID
			errorDetail.Instance = r.URL.Path

			// Convert to JSON and write response
			w.Header().Set("Content-Type", "application/problem+json")
			w.WriteHeader(runtime.HTTPStatusFromCode(st.Code()))

			// Create a simplified JSON response that matches RFC 7807
			response := map[string]interface{}{
				"type":      getTypeForCode(errorDetail.Code),
				"title":     errorDetail.Title,
				"status":    runtime.HTTPStatusFromCode(st.Code()),
				"detail":    errorDetail.Detail,
				"instance":  errorDetail.Instance,
				"traceId":   errorDetail.TraceId,
				"timestamp": errorDetail.Timestamp,
			}

			// Add field violations if present
			if len(errorDetail.FieldViolations) > 0 {
				violations := make([]map[string]interface{}, len(errorDetail.FieldViolations))
				for i, fv := range errorDetail.FieldViolations {
					violations[i] = map[string]interface{}{
						"field":   fv.Field,
						"code":    fv.Code,
						"message": fv.Description,
					}
				}
				response["errors"] = violations
			}

			// Add extensions if present
			if len(errorDetail.Extensions) > 0 {
				extensions := make(map[string]interface{})
				for k, v := range errorDetail.Extensions {
					// Convert Any to JSON
					if jsonBytes, err := protojson.Marshal(v); err == nil {
						var jsonData interface{}
						if err := json.Unmarshal(jsonBytes, &jsonData); err == nil {
							extensions[k] = jsonData
						}
					}
				}
				if len(extensions) > 0 {
					response["extensions"] = extensions
				}
			}

			if err := json.NewEncoder(w).Encode(response); err != nil {
				http.Error(w, `{"error": "Failed to encode error response"}`, 500)
			}
			return
		}
	}

	// Fallback: create new error response
	fallbackErr := apperrors.NewInternal(st.Message(), traceID, nil)
	fallbackErr.GRPCCode = st.Code()
	writeAppErrorResponse(w, fallbackErr, r.URL.Path)
}

// Helper functions
func getTypeForCode(code string) string {
	switch code {
	case errorspb.AppErrorCode_VALIDATION_FAILED.String():
		return "https://api.example.com/errors/validation-failed"
	case errorspb.AppErrorCode_RESOURCE_NOT_FOUND.String():
		return "https://api.example.com/errors/resource-not-found"
	case errorspb.AppErrorCode_RESOURCE_CONFLICT.String():
		return "https://api.example.com/errors/resource-conflict"
	case errorspb.AppErrorCode_PERMISSION_DENIED.String():
		return "https://api.example.com/errors/permission-denied"
	case errorspb.AppErrorCode_INTERNAL_ERROR.String():
		return "https://api.example.com/errors/internal-error"
	case errorspb.AppErrorCode_SERVICE_UNAVAILABLE.String():
		return "https://api.example.com/errors/service-unavailable"
	default:
		return "https://api.example.com/errors/unknown"
	}
}

func writeErrorResponse(w http.ResponseWriter, err error) {
	if appErr, ok := err.(*apperrors.AppError); ok {
		writeAppErrorResponse(w, appErr, "")
	} else {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func writeAppErrorResponse(w http.ResponseWriter, appErr *apperrors.AppError, instance string) {
	statusCode := runtime.HTTPStatusFromCode(appErr.GRPCCode)

	response := map[string]interface{}{
		"type":      getTypeForCode(appErr.AppCode.String()),
		"title":     appErr.Title,
		"status":    statusCode,
		"detail":    appErr.Detail,
		"traceId":   appErr.TraceID,
		"timestamp": time.Now(),
	}

	if instance != "" {
		response["instance"] = instance
	}

	if len(appErr.FieldViolations) > 0 {
		violations := make([]map[string]interface{}, len(appErr.FieldViolations))
		for i, fv := range appErr.FieldViolations {
			violations[i] = map[string]interface{}{
				"field":   fv.Field,
				"code":    fv.Code,
				"message": fv.Description,
			}
		}
		response["errors"] = violations
	}

	w.Header().Set("Content-Type", "application/problem+json")
	w.WriteHeader(statusCode)
	json.NewEncoder(w).Encode(response)
}

Service Implementation

Now let’s implement our TODO service with proper error handling:

package service

import (
	"context"
	"fmt"
	todopb "github.com/bhatti/todo-api-errors/api/proto/todo/v1"
	"github.com/bhatti/todo-api-errors/internal/errors"
	"github.com/bhatti/todo-api-errors/internal/repository"
	"github.com/bhatti/todo-api-errors/internal/validation"
	"github.com/google/uuid"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
	"google.golang.org/protobuf/types/known/fieldmaskpb"
	"google.golang.org/protobuf/types/known/timestamppb"
	"strings"
)

var tracer = otel.Tracer("todo-service")

// TodoService implements the TODO API
type TodoService struct {
	todopb.UnimplementedTodoServiceServer
	repo repository.TodoRepository
}

// NewTodoService creates a new TODO service
func NewTodoService(repo repository.TodoRepository) (*TodoService, error) {
	return &TodoService{
		repo: repo,
	}, nil
}

// CreateTask creates a new task
func (s *TodoService) CreateTask(ctx context.Context, req *todopb.CreateTaskRequest) (*todopb.Task, error) {
	ctx, span := tracer.Start(ctx, "CreateTask")
	defer span.End()

	// Get trace ID for error responses
	traceID := span.SpanContext().TraceID().String()

	// Validate request
	if req.Task == nil {
		return nil, errors.NewRequiredField("task", "Task object is required", traceID)
	}

	// Validate task fields using the new validation package
	if err := validation.ValidateTask(req.Task, traceID); err != nil {
		span.SetAttributes(attribute.String("validation.error", err.Error()))
		return nil, err
	}

	// Check for duplicate title
	existing, err := s.repo.GetTaskByTitle(ctx, req.Task.Title)
	if err != nil && !repository.IsNotFound(err) {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	if existing != nil {
		return nil, errors.NewConflict("task", "A task with this title already exists", traceID)
	}

	// Generate task ID
	taskID := uuid.New().String()
	task := &todopb.Task{
		Name:        fmt.Sprintf("tasks/%s", taskID),
		Title:       req.Task.Title,
		Description: req.Task.Description,
		Status:      req.Task.Status,
		Priority:    req.Task.Priority,
		DueDate:     req.Task.DueDate,
		Tags:        req.Task.Tags,
		CreateTime:  timestamppb.Now(),
		UpdateTime:  timestamppb.Now(),
		CreatedBy:   s.getUserFromContext(ctx),
	}

	// Set defaults
	if task.Status == todopb.Status_STATUS_UNSPECIFIED {
		task.Status = todopb.Status_STATUS_PENDING
	}
	if task.Priority == todopb.Priority_PRIORITY_UNSPECIFIED {
		task.Priority = todopb.Priority_PRIORITY_MEDIUM
	}

	// Save to repository
	if err := s.repo.CreateTask(ctx, task); err != nil {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	span.SetAttributes(
		attribute.String("task.id", taskID),
		attribute.String("task.title", task.Title),
	)

	return task, nil
}

// GetTask retrieves a specific task
func (s *TodoService) GetTask(ctx context.Context, req *todopb.GetTaskRequest) (*todopb.Task, error) {
	ctx, span := tracer.Start(ctx, "GetTask")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate request using the new validation package
	if err := validation.ValidateRequest(req, traceID); err != nil {
		return nil, err
	}

	// Extract task ID
	parts := strings.Split(req.Name, "/")
	if len(parts) != 2 || parts[0] != "tasks" {
		return nil, errors.NewRequiredField("name", "Task name must be in format 'tasks/{id}'", traceID)
	}

	taskID := parts[1]
	span.SetAttributes(attribute.String("task.id", taskID))

	// Get from repository
	task, err := s.repo.GetTask(ctx, taskID)
	if err != nil {
		if repository.IsNotFound(err) {
			return nil, errors.NewNotFound("Task", taskID, traceID)
		}
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	// Check permissions
	if !s.canAccessTask(ctx, task) {
		return nil, errors.NewPermissionDenied("task", "read", traceID)
	}

	return task, nil
}

// ListTasks retrieves all tasks
func (s *TodoService) ListTasks(ctx context.Context, req *todopb.ListTasksRequest) (*todopb.ListTasksResponse, error) {
	ctx, span := tracer.Start(ctx, "ListTasks")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate request using the new validation package
	if err := validation.ValidateRequest(req, traceID); err != nil {
		return nil, err
	}

	// Default page size
	pageSize := req.PageSize
	if pageSize == 0 {
		pageSize = 50
	}
	if pageSize > 1000 {
		pageSize = 1000
	}

	span.SetAttributes(
		attribute.Int("page.size", int(pageSize)),
		attribute.String("filter", req.Filter),
	)

	// Parse filter
	filter, err := s.parseFilter(req.Filter)
	if err != nil {
		return nil, errors.NewRequiredField("filter", fmt.Sprintf("Failed to parse filter: %v", err), traceID)
	}

	// Get tasks from repository
	tasks, nextPageToken, err := s.repo.ListTasks(ctx, repository.ListOptions{
		PageSize:  int(pageSize),
		PageToken: req.PageToken,
		Filter:    filter,
		OrderBy:   req.OrderBy,
		UserID:    s.getUserFromContext(ctx),
	})

	if err != nil {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	// Get total count
	totalSize, err := s.repo.CountTasks(ctx, filter, s.getUserFromContext(ctx))
	if err != nil {
		// Log but don't fail the request
		span.RecordError(err)
		totalSize = -1
	}

	return &todopb.ListTasksResponse{
		Tasks:         tasks,
		NextPageToken: nextPageToken,
		TotalSize:     int32(totalSize),
	}, nil
}

// UpdateTask updates an existing task
func (s *TodoService) UpdateTask(ctx context.Context, req *todopb.UpdateTaskRequest) (*todopb.Task, error) {
	ctx, span := tracer.Start(ctx, "UpdateTask")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate request
	if req.Task == nil {
		return nil, errors.NewRequiredField("task", "Task object is required", traceID)
	}

	if req.UpdateMask == nil || len(req.UpdateMask.Paths) == 0 {
		return nil, errors.NewRequiredField("update_mask", "Update mask must specify which fields to update", traceID)
	}

	// Extract task ID
	parts := strings.Split(req.Task.Name, "/")
	if len(parts) != 2 || parts[0] != "tasks" {
		return nil, errors.NewRequiredField("task.name", "Invalid task name format", traceID)
	}

	taskID := parts[1]
	span.SetAttributes(attribute.String("task.id", taskID))

	// Get existing task
	existing, err := s.repo.GetTask(ctx, taskID)
	if err != nil {
		if repository.IsNotFound(err) {
			return nil, errors.NewNotFound("Task", taskID, traceID)
		}
		return nil, s.handleRepositoryError(err, traceID)
	}

	// Check permissions
	if !s.canModifyTask(ctx, existing) {
		return nil, errors.NewPermissionDenied("task", "update", traceID)
	}

	// Apply updates based on field mask
	updated := s.applyFieldMask(existing, req.Task, req.UpdateMask)
	updated.UpdateTime = timestamppb.Now()

	// Validate updated task using the new validation package
	if err := validation.ValidateTask(updated, traceID); err != nil {
		return nil, err
	}

	// Save to repository
	if err := s.repo.UpdateTask(ctx, updated); err != nil {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	return updated, nil
}

// DeleteTask removes a task
func (s *TodoService) DeleteTask(ctx context.Context, req *todopb.DeleteTaskRequest) (*todopb.DeleteTaskResponse, error) {
	ctx, span := tracer.Start(ctx, "DeleteTask")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate request using the new validation package
	if err := validation.ValidateRequest(req, traceID); err != nil {
		return nil, err
	}

	// Extract task ID
	parts := strings.Split(req.Name, "/")
	if len(parts) != 2 || parts[0] != "tasks" {
		return nil, errors.NewRequiredField("name", "Invalid task name format", traceID)
	}

	taskID := parts[1]
	span.SetAttributes(attribute.String("task.id", taskID))

	// Get existing task to check permissions
	existing, err := s.repo.GetTask(ctx, taskID)
	if err != nil {
		if repository.IsNotFound(err) {
			return nil, errors.NewNotFound("Task", taskID, traceID)
		}
		return nil, s.handleRepositoryError(err, traceID)
	}

	// Check permissions
	if !s.canModifyTask(ctx, existing) {
		return nil, errors.NewPermissionDenied("task", "delete", traceID)
	}

	// Delete from repository
	if err := s.repo.DeleteTask(ctx, taskID); err != nil {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	return &todopb.DeleteTaskResponse{
		Message: fmt.Sprintf("Task %s deleted successfully", req.Name),
	}, nil
}

// BatchCreateTasks creates multiple tasks at once
func (s *TodoService) BatchCreateTasks(ctx context.Context, req *todopb.BatchCreateTasksRequest) (*todopb.BatchCreateTasksResponse, error) {
	ctx, span := tracer.Start(ctx, "BatchCreateTasks")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate batch request using the new validation package
	if err := validation.ValidateBatchCreateTasks(req, traceID); err != nil {
		span.SetAttributes(attribute.String("validation.error", err.Error()))
		return nil, err
	}

	// Process each task
	var created []*todopb.Task
	var batchErrors []string

	for i, createReq := range req.Requests {
		task, err := s.CreateTask(ctx, createReq)
		if err != nil {
			// Collect errors for batch response
			batchErrors = append(batchErrors, fmt.Sprintf("Task %d: %s", i, err.Error()))
			continue
		}
		created = append(created, task)
	}

	// If all tasks failed, return error
	if len(created) == 0 && len(batchErrors) > 0 {
		return nil, errors.NewInternal("All batch operations failed", traceID, nil)
	}

	// Return partial success
	response := &todopb.BatchCreateTasksResponse{
		Tasks: created,
	}

	// Add partial errors to response metadata if any
	if len(batchErrors) > 0 {
		span.SetAttributes(
			attribute.Int("batch.total", len(req.Requests)),
			attribute.Int("batch.success", len(created)),
			attribute.Int("batch.failed", len(batchErrors)),
		)
	}

	return response, nil
}

// Helper methods

func (s *TodoService) handleRepositoryError(err error, traceID string) error {
	if repository.IsConnectionError(err) {
		return errors.NewServiceUnavailable("Unable to connect to the database. Please try again later.", traceID)
	}

	// Log internal error details
	span := trace.SpanFromContext(context.Background())
	if span != nil {
		span.RecordError(err)
	}

	return errors.NewInternal("An unexpected error occurred while processing your request", traceID, err)
}

func (s *TodoService) getUserFromContext(ctx context.Context) string {
	// In a real implementation, this would extract user info from auth context
	if user, ok := ctx.Value("user").(string); ok {
		return user
	}
	return "anonymous"
}

func (s *TodoService) canAccessTask(ctx context.Context, task *todopb.Task) bool {
	// In a real implementation, check if user can access this task
	user := s.getUserFromContext(ctx)
	return user == task.CreatedBy || user == "admin"
}

func (s *TodoService) canModifyTask(ctx context.Context, task *todopb.Task) bool {
	// In a real implementation, check if user can modify this task
	user := s.getUserFromContext(ctx)
	return user == task.CreatedBy || user == "admin"
}

func (s *TodoService) parseFilter(filter string) (map[string]interface{}, error) {
	// Simple filter parser - in production, use a proper parser
	parsed := make(map[string]interface{})

	if filter == "" {
		return parsed, nil
	}

	// Example: "status=COMPLETED AND priority=HIGH"
	parts := strings.Split(filter, " AND ")
	for _, part := range parts {
		kv := strings.Split(strings.TrimSpace(part), "=")
		if len(kv) != 2 {
			return nil, fmt.Errorf("invalid filter expression: %s", part)
		}

		key := strings.TrimSpace(kv[0])
		value := strings.Trim(strings.TrimSpace(kv[1]), "'\"")

		// Validate filter keys
		switch key {
		case "status", "priority", "created_by":
			parsed[key] = value
		default:
			return nil, fmt.Errorf("unknown filter field: %s", key)
		}
	}

	return parsed, nil
}

func (s *TodoService) applyFieldMask(existing, update *todopb.Task, mask *fieldmaskpb.FieldMask) *todopb.Task {
	result := *existing

	for _, path := range mask.Paths {
		switch path {
		case "title":
			result.Title = update.Title
		case "description":
			result.Description = update.Description
		case "status":
			result.Status = update.Status
		case "priority":
			result.Priority = update.Priority
		case "due_date":
			result.DueDate = update.DueDate
		case "tags":
			result.Tags = update.Tags
		}
	}
	return &result
}
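
A quick way to see the pieces working together is a unit test against the in-memory repository. This hypothetical sketch (test file and package name are assumptions; the constructors follow the imports used elsewhere in this post) asserts that an invalid CreateTask surfaces our *AppError with InvalidArgument and per-field violations:

package service_test

import (
	"context"
	"errors"
	"testing"

	todopb "github.com/bhatti/todo-api-errors/api/proto/todo/v1"
	apperrors "github.com/bhatti/todo-api-errors/internal/errors"
	"github.com/bhatti/todo-api-errors/internal/repository"
	"github.com/bhatti/todo-api-errors/internal/service"
	"google.golang.org/grpc/codes"
)

func TestCreateTask_ReturnsStructuredValidationError(t *testing.T) {
	svc, err := service.NewTodoService(repository.NewInMemoryRepository())
	if err != nil {
		t.Fatalf("failed to create service: %v", err)
	}

	// An empty title violates the min_len: 1 constraint declared in the proto.
	_, err = svc.CreateTask(context.Background(), &todopb.CreateTaskRequest{
		Task: &todopb.Task{Title: ""},
	})

	var appErr *apperrors.AppError
	if !errors.As(err, &appErr) {
		t.Fatalf("expected *apperrors.AppError, got %v", err)
	}
	if appErr.GRPCCode != codes.InvalidArgument {
		t.Fatalf("expected InvalidArgument, got %s", appErr.GRPCCode)
	}
	for _, fv := range appErr.FieldViolations {
		t.Logf("violation: field=%s code=%s: %s", fv.Field, fv.Code, fv.Description)
	}
}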

Server Implementation

Now let’s put it all together in our server:

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	todopb "github.com/bhatti/todo-api-errors/api/proto/todo/v1"
	"github.com/bhatti/todo-api-errors/internal/middleware"
	"github.com/bhatti/todo-api-errors/internal/monitoring"
	"github.com/bhatti/todo-api-errors/internal/repository"
	"github.com/bhatti/todo-api-errors/internal/service"

	"github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/reflection"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/encoding/protojson"
)

func main() {
	// Initialize monitoring
	if err := monitoring.InitOpenTelemetryMetrics(); err != nil {
		log.Printf("Failed to initialize OpenTelemetry metrics: %v", err)
		// Continue without OpenTelemetry - Prometheus will still work
	}

	// Initialize repository
	repo := repository.NewInMemoryRepository()

	// Initialize service
	todoService, err := service.NewTodoService(repo)
	if err != nil {
		log.Fatalf("Failed to create service: %v", err)
	}

	// Start gRPC server
	grpcPort := ":50051"
	go func() {
		if err := startGRPCServer(grpcPort, todoService); err != nil {
			log.Fatalf("Failed to start gRPC server: %v", err)
		}
	}()

	// Start HTTP gateway
	httpPort := ":8080"
	go func() {
		if err := startHTTPGateway(httpPort, grpcPort); err != nil {
			log.Fatalf("Failed to start HTTP gateway: %v", err)
		}
	}()

	// Start metrics server
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		if err := http.ListenAndServe(":9090", nil); err != nil {
			log.Printf("Failed to start metrics server: %v", err)
		}
	}()

	log.Printf("TODO API server started")
	log.Printf("gRPC server listening on %s", grpcPort)
	log.Printf("HTTP gateway listening on %s", httpPort)
	log.Printf("Metrics available at :9090/metrics")

	// Wait for interrupt signal
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
	<-sigCh

	log.Println("Shutting down...")
}

func startGRPCServer(port string, todoService todopb.TodoServiceServer) error {
	lis, err := net.Listen("tcp", port)
	if err != nil {
		return fmt.Errorf("failed to listen: %w", err)
	}

	// Create gRPC server with interceptors - now using the new UnaryErrorInterceptor
	opts := []grpc.ServerOption{
		grpc.ChainUnaryInterceptor(
			middleware.UnaryErrorInterceptor, // Using new protobuf-based error interceptor
			loggingInterceptor(),
			recoveryInterceptor(),
		),
	}

	server := grpc.NewServer(opts...)

	// Register service
	todopb.RegisterTodoServiceServer(server, todoService)

	// Register reflection for debugging
	reflection.Register(server)

	return server.Serve(lis)
}

func startHTTPGateway(httpPort, grpcPort string) error {
	ctx := context.Background()

	// Create gRPC connection
	conn, err := grpc.DialContext(
		ctx,
		"localhost"+grpcPort,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		return fmt.Errorf("failed to dial gRPC server: %w", err)
	}

	// Create gateway mux with custom error handler
	mux := runtime.NewServeMux(
		runtime.WithErrorHandler(middleware.CustomHTTPError), // Using new protobuf-based error handler
		runtime.WithMarshalerOption(runtime.MIMEWildcard, &runtime.JSONPb{
			MarshalOptions: protojson.MarshalOptions{
				UseProtoNames:   true,
				EmitUnpopulated: false,
			},
			UnmarshalOptions: protojson.UnmarshalOptions{
				DiscardUnknown: true,
			},
		}),
	)

	// Register service handler
	if err := todopb.RegisterTodoServiceHandler(ctx, mux, conn); err != nil {
		return fmt.Errorf("failed to register service handler: %w", err)
	}

	// Create HTTP server with middleware
	handler := middleware.HTTPErrorHandler( // Using new protobuf-based HTTP error handler
		corsMiddleware(
			authMiddleware(
				loggingHTTPMiddleware(mux),
			),
		),
	)

	server := &http.Server{
		Addr:         httpPort,
		Handler:      handler,
		ReadTimeout:  10 * time.Second,
		WriteTimeout: 10 * time.Second,
		IdleTimeout:  120 * time.Second,
	}

	return server.ListenAndServe()
}

// Middleware implementations

func loggingInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		start := time.Now()

		// Call handler
		resp, err := handler(ctx, req)

		// Log request
		duration := time.Since(start)
		statusCode := "OK"
		if err != nil {
			statusCode = status.Code(err).String()
		}

		log.Printf("gRPC: %s %s %s %v", info.FullMethod, statusCode, duration, err)

		return resp, err
	}
}

func recoveryInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (resp interface{}, err error) {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("Recovered from panic: %v", r)
				monitoring.RecordPanicRecovery(ctx)
				err = status.Error(codes.Internal, "Internal server error")
			}
		}()

		return handler(ctx, req)
	}
}

func loggingHTTPMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		// Wrap response writer to capture status
		wrapped := &statusResponseWriter{ResponseWriter: w, statusCode: http.StatusOK}

		// Process request
		next.ServeHTTP(wrapped, r)

		// Log request
		duration := time.Since(start)
		log.Printf("HTTP: %s %s %d %v", r.Method, r.URL.Path, wrapped.statusCode, duration)
	})
}

func corsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", "*")
		w.Header().Set("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE, OPTIONS, PATCH")
		w.Header().Set("Access-Control-Allow-Headers", "Content-Type, Authorization, X-Trace-ID")

		if r.Method == "OPTIONS" {
			w.WriteHeader(http.StatusOK)
			return
		}

		next.ServeHTTP(w, r)
	})
}

func authMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Simple auth for demo - in production use proper authentication
		authHeader := r.Header.Get("Authorization")
		if authHeader == "" {
			authHeader = "Bearer anonymous"
		}

		// Extract user from token
		user := "anonymous"
		if len(authHeader) > 7 && authHeader[:7] == "Bearer " {
			user = authHeader[7:]
		}

		// Add user to context
		ctx := context.WithValue(r.Context(), "user", user)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

type statusResponseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (w *statusResponseWriter) WriteHeader(code int) {
	w.statusCode = code
	w.ResponseWriter.WriteHeader(code)
}

Example API Usage

Let’s see our error handling in action with some example requests:

Example 1: Validation Error with Multiple Issues

Request with multiple validation errors

curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"task": {
"title": "",
"description": "This description is wayyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy too long…",
"status": "INVALID_STATUS",
"tags": ["INVALID TAG", "tag-1", "tag-1"]
}
}'

Response

< HTTP/1.1 422 Unprocessable Entity
< Content-Type: application/problem+json
{
  "detail": "The request contains 5 validation errors",
  "errors": [
    {
      "code": "TOO_SHORT",
      "field": "title",
      "message": "value length must be at least 1 characters"
    },
    {
      "code": "TOO_LONG",
      "field": "description",
      "message": "value length must be at most 100 characters"
    },
    {
      "code": "INVALID_FORMAT",
      "field": "tags",
      "message": "value does not match regex pattern `^[a-z0-9-]+$`"
    },
    {
      "code": "INVALID_TAG_FORMAT",
      "field": "tags[0]",
      "message": "Tag 'INVALID TAG' must be lowercase letters, numbers, and hyphens only"
    },
    {
      "code": "DUPLICATE_TAG",
      "field": "tags[2]",
      "message": "Tag 'tag-1' appears multiple times"
    }
  ],
  "instance": "/v1/tasks",
  "status": 400,
  "timestamp": {
    "seconds": 1755288524,
    "nanos": 484865000
  },
  "title": "Validation Failed",
  "traceId": "eb4bfb3f-9397-4547-8618-ce9952a16067",
  "type": "https://api.example.com/errors/validation-failed"
}

Example 2: Not Found Error

Request for non-existent task

curl http://localhost:8080/v1/tasks/non-existent-id

Response

< HTTP/1.1 404 Not Found
< Content-Type: application/problem+json
{
  "detail": "Task with ID 'non-existent-id' was not found.",
  "instance": "/v1/tasks/non-existent-id",
  "status": 404,
  "timestamp": {
    "seconds": 1755288565,
    "nanos": 904607000
  },
  "title": "Resource Not Found",
  "traceId": "6ce00cd8-d0b7-47f1-b6f6-9fc1375c26a4",
  "type": "https://api.example.com/errors/resource-not-found"
}

Example 3: Conflict Error

curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"task": {
"title": "Existing Task Title"
}
}'

curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"task": {
"title": "Existing Task Title"
}
}'

Response

< HTTP/1.1 409 Conflict
< Content-Type: application/problem+json
{
  "detail": "Conflict creating task: A task with this title already exists",
  "instance": "/v1/tasks",
  "status": 409,
  "timestamp": {
    "seconds": 1755288593,
    "nanos": 594458000
  },
  "title": "Resource Conflict",
  "traceId": "ed2e78d2-591d-492a-8d71-6b6843ce86f7",
  "type": "https://api.example.com/errors/resource-conflict"
}

Example 4: Service Unavailable (Transient Error)

When database is down

curl http://localhost:8080/v1/tasks

Response

HTTP/1.1 503 Service Unavailable
Content-Type: application/problem+json
Retry-After: 30
{
  "type": "https://api.example.com/errors/service-unavailable",
  "title": "Service Unavailable",
  "status": 503,
  "detail": "Database connection pool exhausted. Please try again later.",
  "instance": "/v1/tasks",
  "traceId": "db-pool-001",
  "timestamp": "2025-08-15T10:30:00Z",
  "extensions": {
    "retryable": true,
    "retryAfter": "2025-08-15T10:30:30Z",
    "maxRetries": 3,
    "backoffType": "exponential",
    "backoffMs": 1000,
    "errorCategory": "database"
  }
}
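On the client side, the retryable hint and Retry-After header shown above can drive a simple retry loop. Here is a minimal sketch; the URL, retry limit, and backoff values are illustrative and not part of the server code in this post:

package main

import (
    "fmt"
    "log"
    "math/rand"
    "net/http"
    "strconv"
    "time"
)

// getWithRetry retries 503 responses, honoring the Retry-After header when
// present and otherwise falling back to jittered exponential backoff.
func getWithRetry(url string, maxRetries int) (*http.Response, error) {
    backoff := time.Second
    for attempt := 0; ; attempt++ {
        resp, err := http.Get(url)
        if err != nil {
            return nil, err
        }
        if resp.StatusCode != http.StatusServiceUnavailable || attempt >= maxRetries {
            return resp, nil
        }
        resp.Body.Close()

        // Prefer the server's Retry-After value; otherwise use jittered backoff.
        wait := backoff + time.Duration(rand.Int63n(int64(backoff)))
        if ra := resp.Header.Get("Retry-After"); ra != "" {
            if secs, convErr := strconv.Atoi(ra); convErr == nil {
                wait = time.Duration(secs) * time.Second
            }
        }
        log.Printf("got 503, retrying in %v (attempt %d)", wait, attempt+1)
        time.Sleep(wait)
        backoff *= 2
    }
}

func main() {
    resp, err := getWithRetry("http://localhost:8080/v1/tasks", 3)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    fmt.Println("final status:", resp.Status)
}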

Best Practices Summary

Our implementation demonstrates several key best practices:

1. Consistent Error Format

All errors follow RFC 9457 (Problem Details) format, providing:

  • Machine-readable type URIs
  • Human-readable titles and details
  • HTTP status codes
  • Request tracing
  • Extensible metadata

2. Comprehensive Validation

  • All validation errors are returned at once, not one by one
  • Clear field paths for nested objects
  • Descriptive error codes and messages
  • Support for batch operations with partial success

3. Security-Conscious Design

  • No sensitive information in error messages
  • Internal errors are logged but not exposed
  • Generic messages for authentication failures
  • Request IDs for support without exposing internals

4. Developer Experience

  • Clear, actionable error messages
  • Helpful suggestions for fixing issues
  • Consistent error codes across protocols
  • Rich metadata for debugging

5. Protocol Compatibility

  • Seamless translation between gRPC and HTTP
  • Proper status code mapping
  • Preservation of error details across protocols

6. Observability

  • Structured logging with trace IDs
  • Prometheus metrics for monitoring
  • OpenTelemetry integration
  • Error categorization for analysis

Conclusion

This comprehensive guide demonstrates how to build robust error handling for modern APIs. By treating errors as a first-class feature of our API, we’ve achieved several key benefits:

  • Consistency: All errors, regardless of their source, are presented to clients in a predictable format.
  • Clarity: Developers consuming our API get clear, actionable feedback, helping them debug and integrate faster.
  • Developer Ergonomics: Our internal service code is cleaner, as handlers focus on business logic while the middleware handles the boilerplate of error conversion.
  • Security: We have a clear separation between internal error details (for logging) and public error responses, preventing leaks.

Additional Resources

You can find the full source code for this example in this GitHub repository.

July 17, 2025

Zero-Downtime Services with Lifecycle Management on Kubernetes and Istio

Filed under: Computing,Web Services — admin @ 3:12 pm

Introduction

In the world of cloud-native applications, service lifecycle management is often an afterthought—until it causes a production outage. Whether you’re running gRPC or REST APIs on Kubernetes with Istio, proper lifecycle management is the difference between smooth deployments and 3 AM incident calls. Consider these scenarios:

  • Your service takes 45 seconds to warm up its cache, but Kubernetes kills it after 30 seconds of startup wait.
  • During deployments, clients receive connection errors as pods terminate abruptly.
  • A hiccup in a database or dependent service causes your entire service mesh to cascade fail.
  • Your service mesh sidecar shuts down before your application is terminated or drops in-flight requests.
  • A critical service receives SIGKILL during transaction processing, leaving data in inconsistent states.
  • After a regional outage, services restart but data drift goes undetected for hours.
  • Your RTO target is 15 seconds, but services take 30 seconds just to start up properly.

These aren’t edge cases—they’re common problems that proper lifecycle management solves. More critically, unsafe shutdowns can cause data corruption, financial losses, and breach compliance requirements. This guide covers what you need to know about building services that start safely, shut down gracefully, and handle failures intelligently.

The Hidden Complexity of Service Lifecycles

Modern microservices don’t exist in isolation. A typical request might flow through:

Typical Request Flow.

Each layer adds complexity to startup and shutdown sequences. Without proper coordination, you’ll experience:

  • Startup race conditions: Application tries to make network calls before the sidecar proxy is ready
  • Shutdown race conditions: Sidecar terminates while the application is still processing requests
  • Premature traffic: Load balancer routes traffic before the application is truly ready
  • Dropped connections: Abrupt shutdowns leave clients hanging
  • Data corruption: In-flight transactions get interrupted, leaving databases in inconsistent states
  • Compliance violations: Financial services may face regulatory penalties for data integrity failures

Core Concepts: The Three Types of Health Checks

Kubernetes provides three distinct probe types, each serving a specific purpose (a minimal handler sketch follows the three lists below):

1. Liveness Probe: “Is the process alive?”

  • Detects deadlocks and unrecoverable states
  • Should be fast and simple (e.g., HTTP GET /healthz)
  • Failure triggers container restart
  • Common mistake: Making this check too complex

2. Readiness Probe: “Can the service handle traffic?”

  • Validates all critical dependencies are available
  • Prevents routing traffic to pods that aren’t ready
  • Should perform “deep” checks of dependencies
  • Common mistake: Using the same check as liveness

3. Startup Probe: “Is the application still initializing?”

  • Provides grace period for slow-starting containers
  • Disables liveness/readiness probes until successful
  • Prevents restart loops during initialization
  • Common mistake: Not using it for slow-starting apps
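To make these distinctions concrete, here is a minimal sketch of separate liveness and readiness handlers over plain HTTP. The checkDatabase helper, endpoint paths, and port are illustrative stand-ins; a gRPC health-protocol version appears later in this post:

package main

import (
    "context"
    "log"
    "net/http"
    "sync/atomic"
)

var shuttingDown atomic.Bool // flipped to true once SIGTERM is received

// checkDatabase is a hypothetical stand-in for a real dependency check
// (connection-pool ping, downstream service call, etc.).
func checkDatabase(ctx context.Context) bool { return true }

func main() {
    // Liveness: cheap check that the process is responsive.
    http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        if shuttingDown.Load() {
            http.Error(w, "shutting down", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    // Readiness: deeper check that critical dependencies can serve traffic.
    http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
        if shuttingDown.Load() || !checkDatabase(r.Context()) {
            http.Error(w, "not ready", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    log.Fatal(http.ListenAndServe(":8090", nil))
}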

The Hidden Dangers of Unsafe Shutdowns

While graceful shutdown is ideal, it’s not always possible. Kubernetes will send SIGKILL after the termination grace period, and infrastructure failures can terminate pods instantly. This creates serious risks:

Data Corruption Scenarios

Financial Transaction Example:

// DANGEROUS: Non-atomic operation
func (s *PaymentService) ProcessPayment(req *PaymentRequest) error {
    // Step 1: Debit source account
    if err := s.debitAccount(req.FromAccount, req.Amount); err != nil {
        return err
    }
    
    // DANGER: a SIGKILL here leaves money debited but never credited
    // Step 2: Credit destination account  
    if err := s.creditAccount(req.ToAccount, req.Amount); err != nil {
        // Money is lost! Source debited but destination not credited
        return err
    }
    
    // Step 3: Record transaction
    return s.recordTransaction(req)
}

E-commerce Inventory Example:

// DANGEROUS: Race condition during shutdown
func (s *InventoryService) ReserveItem(req *ReserveRequest) error {
    // Check availability
    if s.getStock(req.ItemID) < req.Quantity {
        return ErrInsufficientStock
    }
    
    // DANGER: a SIGKILL here can cause a double-reservation
    // Another request might see the same stock level
    
    // Reserve the item
    return s.updateStock(req.ItemID, -req.Quantity)
}

RTO/RPO Impact

Recovery Time Objective (RTO): How quickly can we restore service?

  • Poor lifecycle management increases startup time
  • Services may need manual intervention to reach consistent state
  • Cascading failures extend recovery time across the entire system

Recovery Point Objective (RPO): How much data can we afford to lose?

  • Unsafe shutdowns can corrupt recent transactions
  • Without idempotency, replay of messages may create duplicates
  • Data inconsistencies may not be detected until much later

The Anti-Entropy Solution

Since graceful shutdown isn’t always possible, production systems need reconciliation processes to detect and repair inconsistencies:

// Anti-entropy pattern for data consistency
type ReconciliationService struct {
    paymentDB    PaymentDatabase
    accountDB    AccountDatabase
    auditLog     AuditLogger
    alerting     AlertingService
}

func (r *ReconciliationService) ReconcilePayments(ctx context.Context) error {
    // Find payments without matching account entries
    orphanedPayments, err := r.paymentDB.FindOrphanedPayments(ctx)
    if err != nil {
        return err
    }
    
    for _, payment := range orphanedPayments {
        // Check if this was a partial transaction
        sourceDebit, _ := r.accountDB.GetTransaction(payment.FromAccount, payment.ID)
        destCredit, _ := r.accountDB.GetTransaction(payment.ToAccount, payment.ID)
        
        switch {
        case sourceDebit != nil && destCredit == nil:
            // Complete the transaction
            if err := r.creditAccount(payment.ToAccount, payment.Amount); err != nil {
                r.alerting.SendAlert("Failed to complete orphaned payment", payment.ID)
                continue
            }
            r.auditLog.RecordReconciliation("completed_payment", payment.ID)
            
        case sourceDebit == nil && destCredit != nil:
            // Reverse the credit
            if err := r.debitAccount(payment.ToAccount, payment.Amount); err != nil {
                r.alerting.SendAlert("Failed to reverse orphaned credit", payment.ID)
                continue
            }
            r.auditLog.RecordReconciliation("reversed_credit", payment.ID)
            
        default:
            // Both or neither exist - needs investigation
            r.alerting.SendAlert("Ambiguous payment state", payment.ID)
        }
    }
    
    return nil
}

// Run reconciliation periodically
func (r *ReconciliationService) Start(ctx context.Context) {
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()
    
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if err := r.ReconcilePayments(ctx); err != nil {
                log.Printf("Reconciliation failed: %v", err)
            }
        }
    }
}

Building a Resilient Service: Complete Example

Let’s build a production-ready service that demonstrates all best practices. We’ll create two versions: one with anti-patterns (bad-service) and one with best practices (good-service).

Sequence diagram of a typical API with proper Kubernetes and Istio configuration.

The Application Code

//go:generate protoc --go_out=. --go_opt=paths=source_relative --go-grpc_out=. --go-grpc_opt=paths=source_relative api/demo.proto

package main

import (
    "context"
    "flag"
    "fmt"
    "log"
    "net"
    "net/http"
    "os"
    "os/signal"
    "sync/atomic"
    "syscall"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    health "google.golang.org/grpc/health/grpc_health_v1"
    "google.golang.org/grpc/status"
)

// Service represents our application with health state
type Service struct {
    isHealthy         atomic.Bool
    isShuttingDown    atomic.Bool
    activeRequests    atomic.Int64
    dependencyHealthy atomic.Bool
}

// HealthChecker implements the gRPC health checking protocol
type HealthChecker struct {
    svc *Service
}

func (h *HealthChecker) Check(ctx context.Context, req *health.HealthCheckRequest) (*health.HealthCheckResponse, error) {
    service := req.GetService()
    
    // Liveness: Simple check - is the process responsive?
    if service == "" || service == "liveness" {
        if h.svc.isShuttingDown.Load() {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        return &health.HealthCheckResponse{
            Status: health.HealthCheckResponse_SERVING,
        }, nil
    }
    
    // Readiness: Deep check - can we handle traffic?
    if service == "readiness" {
        // Check application health
        if !h.svc.isHealthy.Load() {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        
        // Check critical dependencies
        if !h.svc.dependencyHealthy.Load() {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        
        // Check if shutting down
        if h.svc.isShuttingDown.Load() {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        
        return &health.HealthCheckResponse{
            Status: health.HealthCheckResponse_SERVING,
        }, nil
    }
    
    // Synthetic readiness: Complex business logic check for monitoring
    if service == "synthetic-readiness" {
        // Simulate a complex health check that validates business logic
        // This would make actual API calls, database queries, etc.
        if !h.performSyntheticCheck(ctx) {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        return &health.HealthCheckResponse{
            Status: health.HealthCheckResponse_SERVING,
        }, nil
    }
    
    return nil, status.Errorf(codes.NotFound, "unknown service: %s", service)
}

func (h *HealthChecker) performSyntheticCheck(ctx context.Context) bool {
    // In a real service, this would:
    // 1. Create a test transaction
    // 2. Query the database
    // 3. Call dependent services
    // 4. Validate the complete flow works
    return h.svc.isHealthy.Load() && h.svc.dependencyHealthy.Load()
}

func (h *HealthChecker) Watch(req *health.HealthCheckRequest, server health.Health_WatchServer) error {
    return status.Error(codes.Unimplemented, "watch not implemented")
}

// DemoServiceServer implements your business logic
type DemoServiceServer struct {
    UnimplementedDemoServiceServer
    svc *Service
}

func (s *DemoServiceServer) ProcessRequest(ctx context.Context, req *ProcessRequest) (*ProcessResponse, error) {
    s.svc.activeRequests.Add(1)
    defer s.svc.activeRequests.Add(-1)
    
    // Simulate processing
    select {
    case <-ctx.Done():
        return nil, ctx.Err()
    case <-time.After(100 * time.Millisecond):
        return &ProcessResponse{
            Result: fmt.Sprintf("Processed: %s", req.GetData()),
        }, nil
    }
}

func main() {
    var (
        port         = flag.Int("port", 8080, "gRPC port")
        mgmtPort     = flag.Int("mgmt-port", 8090, "Management port")
        startupDelay = flag.Duration("startup-delay", 10*time.Second, "Startup delay")
    )
    flag.Parse()
    
    svc := &Service{}
    svc.dependencyHealthy.Store(true) // Assume healthy initially
    
    // Management endpoints for testing
    mux := http.NewServeMux()
    mux.HandleFunc("/toggle-health", func(w http.ResponseWriter, r *http.Request) {
        current := svc.dependencyHealthy.Load()
        svc.dependencyHealthy.Store(!current)
        fmt.Fprintf(w, "Dependency health toggled to: %v\n", !current)
    })
    mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "active_requests %d\n", svc.activeRequests.Load())
        fmt.Fprintf(w, "is_healthy %v\n", svc.isHealthy.Load())
        fmt.Fprintf(w, "is_shutting_down %v\n", svc.isShuttingDown.Load())
    })
    
    mgmtServer := &http.Server{
        Addr:    fmt.Sprintf(":%d", *mgmtPort),
        Handler: mux,
    }
    
    // Start management server
    go func() {
        log.Printf("Management server listening on :%d", *mgmtPort)
        if err := mgmtServer.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatalf("Management server failed: %v", err)
        }
    }()
    
    // Simulate slow startup
    log.Printf("Starting application (startup delay: %v)...", *startupDelay)
    time.Sleep(*startupDelay)
    svc.isHealthy.Store(true)
    log.Println("Application initialized and ready")
    
    // Setup gRPC server
    lis, err := net.Listen("tcp", fmt.Sprintf(":%d", *port))
    if err != nil {
        log.Fatalf("Failed to listen: %v", err)
    }
    
    grpcServer := grpc.NewServer()
    RegisterDemoServiceServer(grpcServer, &DemoServiceServer{svc: svc})
    health.RegisterHealthServer(grpcServer, &HealthChecker{svc: svc})
    
    // Start gRPC server
    go func() {
        log.Printf("gRPC server listening on :%d", *port)
        if err := grpcServer.Serve(lis); err != nil {
            log.Fatalf("gRPC server failed: %v", err)
        }
    }()
    
    // Wait for shutdown signal
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
    sig := <-sigCh
    
    log.Printf("Received signal: %v, starting graceful shutdown...", sig)
    
    // Graceful shutdown sequence
    svc.isShuttingDown.Store(true)
    svc.isHealthy.Store(false) // Fail readiness immediately
    
    // Stop accepting new requests
    grpcServer.GracefulStop()
    
    // Wait for active requests to complete
    timeout := time.After(30 * time.Second)
    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()
    
    for {
        select {
        case <-timeout:
            log.Println("Shutdown timeout reached, forcing exit")
            os.Exit(1)
        case <-ticker.C:
            active := svc.activeRequests.Load()
            if active == 0 {
                log.Println("All requests completed")
                goto shutdown
            }
            log.Printf("Waiting for %d active requests to complete...", active)
        }
    }
    
shutdown:
    // Cleanup
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    mgmtServer.Shutdown(ctx)
    
    log.Println("Graceful shutdown complete")
}

Kubernetes Manifests: Anti-Patterns vs Best Practices

Bad Service (Anti-Patterns)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bad-service
  namespace: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: bad-service
  template:
    metadata:
      labels:
        app: bad-service
      # MISSING: Critical Istio annotations!
    spec:
      # DEFAULT: Only 30s grace period
      containers:
      - name: app
        image: myregistry/demo-service:latest
        ports:
        - containerPort: 8080
          name: grpc
        - containerPort: 8090
          name: mgmt
        args: ["--startup-delay=45s"]  # Longer than default probe timeout!
        
        # ANTI-PATTERN: Identical liveness and readiness probes
        livenessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080"]
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 3  # Will fail after 40s total
          
        readinessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080"]  # Same as liveness!
          initialDelaySeconds: 10
          periodSeconds: 10
        
        # MISSING: No startup probe for slow initialization
        # MISSING: No preStop hook for graceful shutdown

Good Service (Best Practices)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: good-service
  namespace: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: good-service
  template:
    metadata:
      labels:
        app: good-service
      annotations:
        # Critical for Istio/Envoy sidecar lifecycle management
        sidecar.istio.io/holdApplicationUntilProxyStarts: "true"
        proxy.istio.io/config: |
          proxyMetadata:
            EXIT_ON_ZERO_ACTIVE_CONNECTIONS: "true"
        sidecar.istio.io/proxyCPU: "100m"
        sidecar.istio.io/proxyMemory: "128Mi"
    spec:
      # Extended grace period: preStop (15s) + app shutdown (30s) + buffer (20s)
      terminationGracePeriodSeconds: 65
      
      containers:
      - name: app
        image: myregistry/demo-service:latest
        ports:
        - containerPort: 8080
          name: grpc
        - containerPort: 8090
          name: mgmt
        args: ["--startup-delay=45s"]
        
        # Resource management for predictable performance
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        
        # Startup probe for slow initialization
        startupProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080", "-service=readiness"]
          initialDelaySeconds: 0
          periodSeconds: 5
          failureThreshold: 24  # 5s * 24 = 120s total startup time
          successThreshold: 1
        
        # Simple liveness check
        livenessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080", "-service=liveness"]
          initialDelaySeconds: 0  # Startup probe handles initialization
          periodSeconds: 10
          failureThreshold: 3
          timeoutSeconds: 5
        
        # Deep readiness check
        readinessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080", "-service=readiness"]
          initialDelaySeconds: 0
          periodSeconds: 5
          failureThreshold: 2
          successThreshold: 1
          timeoutSeconds: 5
        
        # Graceful shutdown coordination
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]  # Allow LB to drain
        
        # Environment variables for cloud provider integration
        env:
        - name: CLOUD_PROVIDER
          value: "auto-detect"  # Works with GCP, AWS, Azure
        - name: ENABLE_PROFILING
          value: "true"

Istio Service Mesh: Beyond Basic Lifecycle Management

While proper health checks and graceful shutdown are foundational, Istio adds critical production-grade capabilities that dramatically improve fault tolerance:

Automatic Retries and Circuit Breaking

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service
  namespace: demo
spec:
  host: payment-service.demo.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 2
    outlierDetection:            # Istio's circuit-breaking mechanism
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
---
# Retries are configured on the VirtualService, not the DestinationRule
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service
  namespace: demo
spec:
  hosts:
  - payment-service.demo.svc.cluster.local
  http:
  - route:
    - destination:
        host: payment-service.demo.svc.cluster.local
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,gateway-error,connect-failure,refused-stream
      retryRemoteLocalities: true

Key Benefits for Production Systems

  1. Automatic Request Retries: If a pod fails or becomes unavailable, Istio automatically retries requests to healthy instances
  2. Circuit Breaking: Prevents cascading failures by temporarily cutting off traffic to unhealthy services
  3. Load Balancing: Distributes traffic intelligently across healthy pods
  4. Mutual TLS: Secures service-to-service communication without code changes
  5. Observability: Provides detailed metrics, tracing, and logging for all inter-service communication
  6. Canary Deployments: Enables safe rollouts with automatic traffic shifting
  7. Rate Limiting: Protects services from being overwhelmed
  8. Timeout Management: Prevents hanging requests with configurable timeouts

Termination Grace Period Calculation

The critical formula for calculating termination grace periods:

terminationGracePeriodSeconds = preStop delay + application shutdown timeout + buffer

Examples:
- Simple service: 10s + 20s + 5s = 35s
- Complex service: 15s + 45s + 5s = 65s
- Batch processor: 30s + 120s + 10s = 160s

Important: Services requiring more than 90-120 seconds to shut down should be re-architected using checkpoint-and-resume patterns.

Advanced Patterns for Production

1. Idempotency: Handling Duplicate Requests

Critical for production: When pods restart or network issues occur, clients may retry requests. Without idempotency, this can cause duplicate transactions, corrupted state, or financial losses. This is mandatory for all state-modifying operations.

package idempotency

import (
    "context"
    "crypto/sha256"
    "encoding/hex"
    "time"
    "sync"
    "errors"
)

var (
    ErrDuplicateRequest = errors.New("duplicate request detected")
    ErrProcessingInProgress = errors.New("request is currently being processed")
)

// IdempotencyStore tracks request execution with persistence
type IdempotencyStore struct {
    mu        sync.RWMutex
    records   map[string]*Record
    persister PersistenceLayer // Database or Redis for durability
}

type Record struct {
    Key         string
    Response    interface{}
    Error       error
    Status      ProcessingStatus
    ExpiresAt   time.Time
    CreatedAt   time.Time
    ProcessedAt *time.Time
}

type ProcessingStatus int

const (
    StatusPending ProcessingStatus = iota
    StatusProcessing
    StatusCompleted
    StatusFailed
)

// ProcessIdempotent ensures exactly-once processing semantics
func (s *IdempotencyStore) ProcessIdempotent(
    ctx context.Context,
    key string,
    ttl time.Duration,
    fn func() (interface{}, error),
) (interface{}, error) {
    // Check if we've seen this request before
    s.mu.RLock()
    record, exists := s.records[key]
    s.mu.RUnlock()
    
    if exists {
        switch record.Status {
        case StatusCompleted:
            if time.Now().Before(record.ExpiresAt) {
                return record.Response, record.Error
            }
        case StatusProcessing:
            return nil, ErrProcessingInProgress
        case StatusFailed:
            if time.Now().Before(record.ExpiresAt) {
                return record.Response, record.Error
            }
        }
    }
    
    // Mark as processing (re-check under the write lock to avoid racing with a
    // concurrent request that uses the same key)
    record = &Record{
        Key:       key,
        Status:    StatusProcessing,
        ExpiresAt: time.Now().Add(ttl),
        CreatedAt: time.Now(),
    }
    
    s.mu.Lock()
    if existing, ok := s.records[key]; ok && existing.Status == StatusProcessing {
        s.mu.Unlock()
        return nil, ErrProcessingInProgress
    }
    s.records[key] = record
    s.mu.Unlock()
    
    // Persist the processing state
    if err := s.persister.Save(ctx, record); err != nil {
        return nil, err
    }
    
    // Execute the function
    response, err := fn()
    processedAt := time.Now()
    
    // Update record with result
    s.mu.Lock()
    record.Response = response
    record.Error = err
    record.ProcessedAt = &processedAt
    if err != nil {
        record.Status = StatusFailed
    } else {
        record.Status = StatusCompleted
    }
    s.mu.Unlock()
    
    // Persist the final state
    s.persister.Save(ctx, record)
    
    return response, err
}

// Example: Idempotent payment processing
func (s *PaymentService) ProcessPayment(ctx context.Context, req *PaymentRequest) (*PaymentResponse, error) {
    // Generate idempotency key from request
    key := generateIdempotencyKey(req)
    
    result, err := s.idempotencyStore.ProcessIdempotent(
        ctx,
        key,
        24*time.Hour, // Keep records for 24 hours
        func() (interface{}, error) {
            // Atomic transaction processing
            return s.processPaymentTransaction(ctx, req)
        },
    )
    
    if err != nil {
        return nil, err
    }
    return result.(*PaymentResponse), nil
}

// Atomic transaction processing
func (s *PaymentService) processPaymentTransaction(ctx context.Context, req *PaymentRequest) (*PaymentResponse, error) {
    // Use database transaction for atomicity
    tx, err := s.db.BeginTx(ctx, nil)
    if err != nil {
        return nil, err
    }
    defer tx.Rollback()
    
    // Step 1: Validate accounts
    if err := s.validateAccounts(ctx, tx, req); err != nil {
        return nil, err
    }
    
    // Step 2: Process payment atomically
    paymentID, err := s.executePayment(ctx, tx, req)
    if err != nil {
        return nil, err
    }
    
    // Step 3: Commit transaction
    if err := tx.Commit(); err != nil {
        return nil, err
    }
    
    return &PaymentResponse{
        PaymentID: paymentID,
        Status:    "completed",
        Timestamp: time.Now(),
    }, nil
}
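Two pieces referenced above are not shown: the PersistenceLayer interface and generateIdempotencyKey. Minimal sketches follow; the method set, the RequestID field, and the key derivation are assumptions to adapt to your own schema, and the key function additionally needs the "fmt" import:

// PersistenceLayer is a minimal sketch of the durability layer referenced above;
// a real implementation would back it with a database or Redis.
type PersistenceLayer interface {
    Save(ctx context.Context, record *Record) error
    Load(ctx context.Context, key string) (*Record, error)
}

// generateIdempotencyKey derives a stable key from the request. RequestID is a
// hypothetical client-supplied unique identifier; without one, distinct but
// identical-looking payments would collide on the same key.
func generateIdempotencyKey(req *PaymentRequest) string {
    h := sha256.New()
    fmt.Fprintf(h, "%v|%v|%v|%v", req.RequestID, req.FromAccount, req.ToAccount, req.Amount)
    return hex.EncodeToString(h.Sum(nil))
}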

2. Checkpoint and Resume: Long-Running Operations

For operations that may exceed the termination grace period, implement checkpointing:

package checkpoint

import (
    "context"
    "log"
    "time"
)

type CheckpointStore interface {
    Save(ctx context.Context, id string, state interface{}) error
    Load(ctx context.Context, id string, state interface{}) error
    Delete(ctx context.Context, id string) error
}

type BatchProcessor struct {
    store          CheckpointStore
    checkpointFreq int
}

type BatchState struct {
    JobID      string    `json:"job_id"`
    TotalItems int       `json:"total_items"`
    Processed  int       `json:"processed"`
    LastItem   string    `json:"last_item"`
    StartedAt  time.Time `json:"started_at"`
}

func (p *BatchProcessor) ProcessBatch(ctx context.Context, jobID string, items []string) error {
    // Try to resume from checkpoint
    state := &BatchState{JobID: jobID}
    if err := p.store.Load(ctx, jobID, state); err == nil {
        log.Printf("Resuming job %s from item %d", jobID, state.Processed)
        items = items[state.Processed:]
    } else {
        // New job
        state = &BatchState{
            JobID:      jobID,
            TotalItems: len(items),
            Processed:  0,
            StartedAt:  time.Now(),
        }
    }
    
    // Process items with periodic checkpointing
    for i, item := range items {
        select {
        case <-ctx.Done():
            // Save progress before shutting down
            state.LastItem = item
            return p.store.Save(ctx, jobID, state)
        default:
            // Process item
            if err := p.processItem(ctx, item); err != nil {
                return err
            }
            
            state.Processed++
            state.LastItem = item
            
            // Checkpoint periodically
            if state.Processed%p.checkpointFreq == 0 {
                if err := p.store.Save(ctx, jobID, state); err != nil {
                    log.Printf("Failed to checkpoint: %v", err)
                }
            }
        }
    }
    
    // Job completed, remove checkpoint
    return p.store.Delete(ctx, jobID)
}

3. Circuit Breaker Pattern for Dependencies

Protect your service from cascading failures:

package circuitbreaker

import (
    "context"
    "errors"
    "log"
    "sync"
    "time"
)

// ErrCircuitOpen is returned when the breaker is open and calls are rejected immediately.
var ErrCircuitOpen = errors.New("circuit breaker is open")

type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    mu              sync.RWMutex
    state           State
    failures        int
    successes       int
    lastFailureTime time.Time
    
    maxFailures      int
    resetTimeout     time.Duration
    halfOpenRequests int
}

func (cb *CircuitBreaker) Call(ctx context.Context, fn func() error) error {
    cb.mu.RLock()
    state := cb.state
    cb.mu.RUnlock()
    
    if state == StateOpen {
        // Check if we should transition to half-open
        cb.mu.Lock()
        if time.Since(cb.lastFailureTime) > cb.resetTimeout {
            cb.state = StateHalfOpen
            cb.successes = 0
            state = StateHalfOpen
        }
        cb.mu.Unlock()
    }
    
    if state == StateOpen {
        return ErrCircuitOpen
    }
    
    err := fn()
    
    cb.mu.Lock()
    defer cb.mu.Unlock()
    
    if err != nil {
        cb.failures++
        cb.lastFailureTime = time.Now()
        
        if cb.failures >= cb.maxFailures {
            cb.state = StateOpen
            log.Printf("Circuit breaker opened after %d failures", cb.failures)
        }
        return err
    }
    
    if state == StateHalfOpen {
        cb.successes++
        if cb.successes >= cb.halfOpenRequests {
            cb.state = StateClosed
            cb.failures = 0
            log.Println("Circuit breaker closed")
        }
    }
    
    return nil
}
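Here is a hypothetical usage sketch: guarding an HTTP call to a dependency so that repeated failures open the circuit and callers fail fast instead of queueing behind timeouts. The inventory URL is illustrative, and the package would also need "fmt" and "net/http" imports:

// checkInventory wraps a downstream HTTP call with the CircuitBreaker above.
func checkInventory(ctx context.Context, cb *CircuitBreaker, url string) error {
    return cb.Call(ctx, func() error {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return err
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode >= 500 {
            return fmt.Errorf("inventory service returned %d", resp.StatusCode)
        }
        return nil
    })
}

Callers can compare the returned error against ErrCircuitOpen to translate an open circuit into a transient, retryable error (for example, gRPC codes.Unavailable) rather than an internal failure.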

Testing Your Implementation

Manual Testing Guide

Test 1: Startup Race Condition

Setup:

# Deploy both services
kubectl apply -f k8s/bad-service.yaml
kubectl apply -f k8s/good-service.yaml

# Watch pods in separate terminal
watch kubectl get pods -n demo

Test the bad service:

# Force restart
kubectl delete pod -l app=bad-service -n demo

# Observe: Pod will enter CrashLoopBackOff due to liveness probe
# killing it before 45s startup completes

Test the good service:

# Force restart
kubectl delete pod -l app=good-service -n demo

# Observe: Pod stays in 0/1 Ready state for ~45s, then becomes ready
# No restarts occur thanks to startup probe

Test 2: Data Consistency Under Failure

Setup:

# Deploy payment service with reconciliation enabled
kubectl apply -f k8s/payment-service.yaml

# Start payment traffic generator
kubectl run payment-generator --image=payment-client:latest \
  --restart=Never --rm -it -- \
  --target=payment-service.demo.svc.cluster.local:8080 \
  --rate=10 --duration=60s

Simulate SIGKILL during transactions:

# In another terminal, kill pods abruptly
while true; do
  kubectl delete pod -l app=payment-service -n demo --force --grace-period=0
  sleep 30
done

Verify reconciliation:

# Check for data inconsistencies
kubectl logs -l app=payment-service -n demo | grep "inconsistency"

# Monitor reconciliation metrics
kubectl port-forward svc/payment-service 8090:8090
curl http://localhost:8090/metrics | grep consistency

Test 3: RTO/RPO Validation

Disaster Recovery Simulation:

# Simulate regional failure
kubectl patch deployment payment-service -n demo \
  --patch '{"spec":{"replicas":0}}'

# Measure RTO - time to restore service
start_time=$(date +%s)
kubectl patch deployment payment-service -n demo \
  --patch '{"spec":{"replicas":3}}'

# Wait for all pods to be ready
kubectl wait --for=condition=ready pod -l app=payment-service -n demo --timeout=900s
end_time=$(date +%s)
rto=$((end_time - start_time))

echo "RTO: ${rto} seconds"
if [ $rto -le 900 ]; then
  echo "? RTO target met (15 minutes)"
else
  echo "? RTO target exceeded"
fi

Test 4: Istio Resilience Features

Automatic Retry Testing:

# Deploy with fault injection
kubectl apply -f istio/fault-injection.yaml

# Generate requests with chaos header
for i in {1..100}; do
  grpcurl -H "x-chaos-test: true" -plaintext \
    payment-service.demo.svc.cluster.local:8080 \
    PaymentService/ProcessPayment \
    -d '{"amount": 100, "currency": "USD"}'
done

# Check Istio metrics for retry behavior
kubectl exec -n istio-system deployment/istiod -- \
  pilot-agent request GET stats/prometheus | grep retry

Monitoring and Observability

RTO/RPO Considerations

Recovery Time Objective (RTO): Target time to restore service after an outage.
Recovery Point Objective (RPO): Maximum acceptable data loss.

Your service lifecycle design directly impacts these critical business metrics:

package monitoring

import (
    "time"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // RTO-related metrics
    ServiceStartupTime = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name: "service_startup_duration_seconds",
        Help: "Time from pod start to service ready",
        Buckets: []float64{1, 5, 10, 30, 60, 120, 300, 600}, // Up to 10 minutes
    }, []string{"service", "version"})
    
    ServiceRecoveryTime = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name: "service_recovery_duration_seconds", 
        Help: "Time to recover from failure state",
        Buckets: []float64{1, 5, 10, 30, 60, 300, 900}, // Up to 15 minutes
    }, []string{"service", "failure_type"})
    
    // RPO-related metrics
    LastCheckpointAge = promauto.NewGaugeVec(prometheus.GaugeOpts{
        Name: "last_checkpoint_age_seconds",
        Help: "Age of last successful checkpoint",
    }, []string{"service", "checkpoint_type"})
    
    DataConsistencyChecks = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "data_consistency_checks_total",
        Help: "Total number of consistency checks performed",
    }, []string{"service", "check_type", "status"})
    
    InconsistencyDetected = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "data_inconsistencies_detected_total",
        Help: "Total number of data inconsistencies detected",
    }, []string{"service", "inconsistency_type", "severity"})
)

Grafana Dashboard

{
  "dashboard": {
    "title": "Service Lifecycle - Business Impact",
    "panels": [
      {
        "title": "RTO Compliance",
        "description": "Percentage of recoveries meeting RTO target (15 minutes)",
        "targets": [{
          "expr": "100 * (histogram_quantile(0.95, service_recovery_duration_seconds_bucket) <= 900)"
        }],
        "thresholds": [
          {"value": 95, "color": "green"},
          {"value": 90, "color": "yellow"},
          {"value": 0, "color": "red"}
        ]
      },
      {
        "title": "RPO Risk Assessment",
        "description": "Data at risk based on checkpoint age",
        "targets": [{
          "expr": "last_checkpoint_age_seconds / 60"
        }],
        "unit": "minutes"
      },
      {
        "title": "Data Consistency Status",
        "targets": [{
          "expr": "rate(data_inconsistencies_detected_total[5m])"
        }]
      }
    ]
  }
}

Production Readiness Checklist

Before deploying to production, ensure your service meets these criteria:

Application Layer

  • [ ] Implements separate liveness and readiness endpoints
  • [ ] Readiness checks validate all critical dependencies
  • [ ] Graceful shutdown drains in-flight requests
  • [ ] Idempotency for all state-modifying operations
  • [ ] Anti-entropy/reconciliation processes implemented
  • [ ] Circuit breakers for external dependencies
  • [ ] Checkpoint-and-resume for long-running operations
  • [ ] Structured logging with correlation IDs
  • [ ] Metrics for startup, shutdown, and health status

Kubernetes Configuration

  • [ ] Startup probe for slow-initializing services
  • [ ] Distinct liveness and readiness probes
  • [ ] Calculated terminationGracePeriodSeconds based on actual shutdown time
  • [ ] PreStop hooks for load balancer draining
  • [ ] Resource requests and limits defined
  • [ ] PodDisruptionBudget for availability (see the sketch below)
  • [ ] Anti-affinity rules for high availability
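PodDisruptionBudget and anti-affinity are the only items in this checklist not shown earlier in the post; a minimal sketch for the good-service deployment might look like this, where minAvailable and the topology key are assumptions to adjust for your environment:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: good-service
  namespace: demo
spec:
  minAvailable: 1              # keep at least one pod during voluntary disruptions
  selector:
    matchLabels:
      app: good-service
---
# Added under the Deployment's pod template spec:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: good-service
        topologyKey: kubernetes.io/hostname   # spread replicas across nodes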

Service Mesh Integration

  • [ ] Istio sidecar lifecycle annotations (holdApplicationUntilProxyStarts)
  • [ ] Istio automatic retry policies configured
  • [ ] Circuit breaker configuration in DestinationRule
  • [ ] Distributed tracing enabled
  • [ ] mTLS for service-to-service communication

Data Integrity & Recovery

  • [ ] RTO/RPO metrics tracked and alerting configured
  • [ ] Reconciliation processes tested with Game Day exercises
  • [ ] Chaos engineering tests validate failure scenarios
  • [ ] Synthetic monitoring for end-to-end business flows
  • [ ] Backup and restore procedures documented and tested

Common Pitfalls and Solutions

1. My service keeps restarting during deployment:

Symptom: Pods enter CrashLoopBackOff during rollout

Common Causes:

  • Liveness probe starts before application is ready
  • Startup time exceeds probe timeout
  • Missing startup probe

Solution:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30  # 30 * 10s = 5 minutes
  periodSeconds: 10

2. Data corruption during pod restarts:

Symptom: Inconsistent database state after deployments

Common Causes:

  • Non-atomic operations
  • Missing idempotency
  • No reconciliation processes

Solution:

// Implement atomic operations with database transactions
tx, err := db.BeginTx(ctx, nil)
if err != nil {
    return err
}
defer tx.Rollback()

// All operations within transaction
if err := processPayment(tx, req); err != nil {
    return err // Automatic rollback
}

return tx.Commit()

3. Service mesh sidecar issues:

Symptom: ECONNREFUSED errors on startup

Common Causes:

  • Application starts before sidecar is ready
  • Sidecar terminates before application

Solution:

annotations:
  sidecar.istio.io/holdApplicationUntilProxyStarts: "true"
  proxy.istio.io/config: |
    proxyMetadata:
      EXIT_ON_ZERO_ACTIVE_CONNECTIONS: "true"

Conclusion

Service lifecycle management is not just about preventing outages—it’s about building systems that are predictable, observable, and resilient to the inevitable failures that occur in distributed systems. This allows:

  • Zero-downtime deployments: Services gracefully handle rollouts without data loss.
  • Improved reliability: Proper health checks prevent cascading failures.
  • Better observability: Clear signals about service state and data consistency.
  • Faster recovery: Services self-heal from transient failures.
  • Data integrity: Idempotency and reconciliation prevent corruption.
  • Compliance readiness: Meet RTO/RPO requirements for disaster recovery.
  • Financial protection: Prevent duplicate transactions and data corruption that could cost millions.

The difference between a service that “works on my machine” and one that thrives in production lies in these details. Whether you’re running on GKE, EKS, or AKS, these patterns form the foundation of production-ready microservices.

Want to test these patterns yourself? The complete code examples and deployment manifests are available on GitHub.

July 11, 2025

Building Resilient, Interactive Playbooks with Formicary

Filed under: Computing,Technology — admin @ 8:16 pm

In any complex operational environment, the most challenging processes are often those that can’t be fully automated. A CI/CD pipeline might be 99% automated, but that final push to production requires a sign-off. A disaster recovery plan might be scripted, but you need a human to make the final call to failover. These “human-in-the-loop” scenarios are where rigid automation fails and manual checklists introduce risk.

Formicary is a distributed orchestration engine designed to bridge this gap. It allows you to codify your entire operational playbook—from automated scripts to manual verification steps—into a single, version-controlled workflow. This post will guide you through Formicary‘s core concepts and demonstrate how to build two powerful, real-world playbooks:

  1. A Secure CI/CD Pipeline that builds, scans, and deploys to staging, then pauses for manual approval before promoting to production.
  2. A Semi-Automated Disaster Recovery Playbook that uses mocked Infrastructure as Code (IaC) to provision a new environment and waits for an operator’s go-ahead before failing over.

Formicary Features and Architecture

Formicary combines robust workflow capabilities with practical CI/CD features in a self-hosted, extensible platform.

Core Features

  • Declarative Workflows: Define complex jobs as a Directed Acyclic Graph (DAG) in a single, human-readable YAML file. Your entire playbook is version-controlled code.
  • Versatile Executors: A task is not tied to a specific runtime. Use the method that fits the job: KUBERNETES, DOCKER, SHELL, or even HTTP API calls.
  • Advanced Flow Control: Go beyond simple linear stages. Use on_exit_code to branch your workflow based on a script’s result, create polling “sensor” tasks, and define robust retry logic.
  • Manual Approval Gates: Explicitly define MANUAL tasks that pause the workflow and require human intervention to proceed via the UI or API.
  • Security Built-in: Manage secrets with database-level encryption and automatic log redaction. An RBAC model controls user access.

Architecture in a Nutshell

Formicary operates on a leader-follower model. The Queen server acts as the control plane, while one or more Ant workers form the execution plane.

  • Queen Server: The central orchestrator. It manages job definitions, schedules pending jobs based on priority, and tracks the state of all workers and executions.
  • Ant Workers: The workhorses. They register with the Queen, advertising their capabilities (e.g., supported executors and tags like gpu-enabled). They pick up tasks from the message queue and execute them.
  • Backend: Formicary relies on a database (like Postgres or MySQL) for state, a message queue (like Go Channels, Redis or Pulsar) for communication, and an S3-compatible object store for artifacts.

Getting Started: A Local Formicary Environment

The quickest way to get started is with the provided Docker Compose setup.

Prerequisites

  • Docker & Docker Compose
  • A local Kubernetes cluster (like Docker Desktop’s Kubernetes, Minikube, or k3s) with its kubeconfig file correctly set up. The embedded Ant worker will use this to run Kubernetes tasks.

Installation Steps

  1. Clone the Repository: git clone https://github.com/bhatti/formicary.git && cd formicary
  2. Launch the System: run docker-compose up. This command starts the Queen server, a local Ant worker, Redis, and MinIO object storage.
  3. Explore the Dashboard:
    Once the services are running, open your browser to http://localhost:7777.

Example 1: Secure CI/CD with Manual Production Deploy

Our goal is to build a CI/CD pipeline for a Go application that:

  1. Builds the application binary.
  2. Runs static analysis (gosec) and saves the report.
  3. Deploys automatically to a staging environment.
  4. Pauses for manual verification.
  5. If approved, deploys to production.

Here is the complete playbook definition:

job_type: secure-go-cicd
description: Build, scan, and deploy a Go application with a manual production gate.
tasks:
- task_type: build
  method: KUBERNETES
  container:
    image: golang:1.24-alpine
  script:
    - echo "Building Go binary..."
    - go build -o my-app ./...
  artifacts:
    paths: [ "my-app" ]
  on_completed: security-scan

- task_type: security-scan
  method: KUBERNETES
  container:
    image: securego/gosec:latest
  allow_failure: true # We want the report even if it finds issues
  script:
    - echo "Running SAST scan with gosec..."
    # The -no-fail flag keeps gosec from exiting non-zero when it finds issues.
    - gosec -no-fail -fmt=sarif -out=gosec-report.sarif ./...
  artifacts:
    paths: [ "gosec-report.sarif" ]
  on_completed: deploy-staging

- task_type: deploy-staging
  method: KUBERNETES
  dependencies: [ "build" ]
  container:
    image: alpine:latest
  script:
    - echo "Deploying ./my-app to staging..."
    - sleep 5 # Simulate deployment work
    - echo "Staging deployment complete. Endpoint: http://staging.example.com"
  on_completed: verify-production-deploy

- task_type: verify-production-deploy
  method: MANUAL
  description: "Staging deployment complete. A security scan report is available as an artifact. Please verify the staging environment and the report before promoting to production."
  on_exit_code:
    APPROVED: promote-production
    REJECTED: rollback-staging

- task_type: promote-production
  method: KUBERNETES
  dependencies: [ "build" ]
  container:
    image: alpine:latest
  script:
    - echo "PROMOTING ./my-app TO PRODUCTION! This is a critical, irreversible step."
  on_completed: cleanup

- task_type: rollback-staging
  method: KUBERNETES
  container:
    image: alpine:latest
  script:
    - echo "Deployment was REJECTED. Rolling back staging environment now."
  on_completed: cleanup

- task_type: cleanup
  method: KUBERNETES
  always_run: true
  container:
    image: alpine:latest
  script:
    - echo "Pipeline finished."

Executing the Playbook

  1. Upload the Job Definition: curl -X POST http://localhost:7777/api/jobs/definitions -H "Content-Type: application/yaml" --data-binary @playbooks/secure-ci-cd.yaml
  2. Submit the Job Request: curl -X POST http://localhost:7777/api/jobs/requests -H "Content-Type: application/json" -d '{"job_type": "secure-go-cicd"}'
  3. Monitor and Approve:
    • Go to the dashboard. You will see the job run through build, security-scan, and deploy-staging.
    • The job will then enter the MANUAL_APPROVAL_REQUIRED state.
    • On the job’s detail page, you will see an “Approve” button next to the verify-production-deploy task.
    • To approve via the API, get the Job Request ID and the Task Execution ID from the UI or API, then call the task approval endpoint with those IDs.

Once approved, the playbook will proceed to promote-production and run the final cleanup step.

Example 2: Semi-Automated Disaster Recovery Playbook

Now for a more critical scenario: failing over a service to a secondary region. This playbook uses mocked IaC steps and pauses for the crucial final decision.

job_type: aws-region-failover
description: A playbook to provision and failover to a secondary region.
tasks:
- task_type: check-primary-status
  method: KUBERNETES
  container:
    image: alpine:latest
  script:
    - echo "Pinging primary region endpoint... it's down! Initiating failover procedure."
    - exit 1 # Simulate failure to trigger the 'on_failed' path
  on_completed: no-op # This path is not taken in our simulation
  on_failed: provision-secondary-infra

- task_type: provision-secondary-infra
  method: KUBERNETES
  container:
    image: hashicorp/terraform:light
  script:
    - echo "Simulating 'terraform apply' to provision DR infrastructure in us-west-2..."
    - sleep 10 # Simulate time for infra to come up
    - echo "Terraform apply complete. Outputting simulated state file."
    - echo '{"aws_instance.dr_server": {"id": "i-12345dr"}}' > terraform.tfstate
  artifacts:
    paths: [ "terraform.tfstate" ]
  on_completed: verify-failover

- task_type: verify-failover
  method: MANUAL
  description: "Secondary infrastructure in us-west-2 has been provisioned. The terraform.tfstate file is available as an artifact. Please VERIFY COSTS and readiness. Approve to switch live traffic."
  on_exit_code:
    APPROVED: switch-dns
    REJECTED: teardown-secondary-infra

- task_type: switch-dns
  method: KUBERNETES
  container:
    image: amazon/aws-cli
  script:
    - echo "CRITICAL: Switching production DNS records to the us-west-2 environment..."
    - sleep 5
    - echo "DNS failover complete. Traffic is now routed to the DR region."
  on_completed: notify-completion

- task_type: teardown-secondary-infra
  method: KUBERNETES
  container:
    image: hashicorp/terraform:light
  script:
    - echo "Failover REJECTED. Simulating 'terraform destroy' for secondary infrastructure..."
    - sleep 10
    - echo "Teardown complete."
  on_completed: notify-completion

- task_type: notify-completion
  method: KUBERNETES
  always_run: true
  container:
    image: alpine:latest
  script:
    - echo "Disaster recovery playbook has concluded."

Executing the DR Playbook

The execution flow is similar to the first example. An operator would trigger this job, wait for the provision-secondary-infra task to complete, download and review the terraform.tfstate artifact, and then make the critical “Approve” or “Reject” decision.

Conclusion

Formicary helps you turn your complex operational processes into reliable, trackable workflows that run automatically. It uses containers to execute tasks and includes manual approval checkpoints, so you can automate your work with confidence. This approach reduces human mistakes while making sure people stay in charge of the important decisions.

June 14, 2025

Feature Flag Anti-Patterns: Learnings from Outages

Filed under: Computing — admin @ 9:54 pm

Feature flags are key components of modern infrastructure for shipping faster, testing in production, and reducing risk. However, they can also be a fast track to complex outages if not handled with discipline. Google’s recent major outage serves as a case study, and I’ve seen similar issues arise from missteps with feature flags. The core of Google’s incident revolved around a new code path in their “Service Control” system that should have been protected with a feature flag but wasn’t. This path, designed for an additional quota policy check, went directly to production without flag protection. When a policy change with unintended blank fields was replicated globally within seconds, it triggered the untested code path, causing a null pointer dereference that crashed binaries globally. This incident perfectly illustrates why feature flags aren’t just nice-to-have; they’re essential guardrails against exactly these kinds of global outages. Google also didn’t implement proper error handling, and the system didn’t use randomized exponential backoff, which resulted in a “thundering herd” effect that prolonged recovery.

Let’s dive into common anti-patterns I’ve observed and how we can avoid them:

Anti-Pattern 1: Inadequate Testing & Error Handling

This is perhaps the most common and dangerous anti-pattern. It involves deploying code behind a feature flag without comprehensively testing all states of that flag (on, off) and the various conditions that interact with the flagged feature. It also includes neglecting robust error handling within the flagged code and failing to default flags to “off” in production. For example, Google’s Service Control binary crashed due to a null pointer dereference when a new policy was propagated globally: the code path had not been adequately tested with empty input and lacked proper error handling. I’ve seen similar issues where teams didn’t exercise the flag-protected code path in a test environment, so the problems only manifested in production. In other cases, the flag was accidentally left ON by default for production, leading to immediate issues upon deployment. The Google incident report also mentions the problematic code “did not have appropriate error handling.” If the code within your feature flag assumes perfect conditions, it’s a ticking time bomb. These issues can be remedied by the practices below (a short sketch follows the list):

  • Default Off in Production: Ensure all new feature flags are disabled by default in production.
  • Comprehensive Testing: Test the feature with the flag ON and OFF. Crucially, test the specific conditions, data inputs, and configurations that exercise the new code paths enabled by the flag.
  • Robust Error Handling: Implement proper error handling within the code controlled by the flag. It should fail gracefully or revert to a safe state if an unexpected issue occurs, not bring down the service.
  • Consider Testing Costs: If testing all combinations becomes prohibitively expensive or complex, it might indicate the feature is too large for a single flag and should be broken down.
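
To make the “default off” and “robust error handling” remedies concrete, here is a minimal Go sketch. It assumes a hypothetical FlagClient interface rather than any particular SDK, and the flag name and policy fields are made up for illustration; the coded default is OFF, and the flagged path validates its input instead of assuming perfect data.

package main

import (
    "errors"
    "fmt"
)

// FlagClient is a hypothetical interface over whatever flag SDK you use.
type FlagClient interface {
    BoolVariation(key string, defaultValue bool) (bool, error)
}

// isNewPathEnabled wraps the SDK call so callers always get a safe answer:
// if the flag service errors out, we fall back to the well-tested old path.
func isNewPathEnabled(client FlagClient) bool {
    enabled, err := client.BoolVariation("new-quota-check", false)
    if err != nil {
        fmt.Println("flag lookup failed, using safe default (off):", err)
        return false
    }
    return enabled
}

// applyQuotaPolicy shows error handling inside the flagged path itself:
// a blank field is rejected explicitly instead of crashing the service.
func applyQuotaPolicy(client FlagClient, policy map[string]string) error {
    if !isNewPathEnabled(client) {
        return nil // old, known-good behavior
    }
    if limit, ok := policy["limit"]; !ok || limit == "" {
        return errors.New("quota policy missing 'limit'; skipping new check")
    }
    // ... perform the new quota check ...
    return nil
}

func main() {}

The point of the sketch is that neither a flag-service outage nor a malformed policy can crash the service; both degrade to the known-good path.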

Anti-Pattern 2: Inadequate Peer Review

This anti-pattern manifests when feature flag changes occur without a proper review process. It’s like making direct database changes in production without a change request. Google’s trigger was a policy metadata change rather than a direct flag toggle, but because that metadata replicated globally within seconds, it was analogous to flipping a critical global flag without due diligence. If that policy metadata change had been managed like a code change (e.g., via GitOps or Config as Code with a canary rollout), the issue might have been caught earlier. This can be remedied with:

  • GitOps/Config-as-Code: Manage feature flag configurations as code within your Git repository. This enforces PRs, peer reviews, and provides an auditable history.
  • Test Flag Rollback: As part of your process, ensure you can easily and reliably roll back a feature flag configuration change, just like you would with code.
  • Detect Configuration Drift: Ensure that the actual state in production does not drift from what’s expected or version-controlled.

Anti-Pattern 3: Inadequate Authorization and Auditing

This means not protecting the ability to enable or disable feature flags with proper permissions. Internally, if anyone can flip a production flag via a UI without a PR or a second pair of eyes, we’re exposed. Also, if there’s no clear record of who changed it, when, and why, incident response becomes a frantic scramble. Remedies include:

  • Strict Access Control: Implement strong Role-Based Access Control (or Relationship-Based Access Control) to limit who can modify flag states or configurations in production.
  • Comprehensive Auditing: Ensure your feature flagging system provides detailed audit logs for every change: who made the change, what was changed, and when.

Anti-Pattern 4: No Monitoring

Deploying a feature behind a flag and then flipping it on for everyone without closely monitoring its impact is like walking into a dark room and hoping you don’t trip. This can be remedied by actively monitoring feature flags and collecting metrics on your observability platform. This includes tracking not just the flag’s state (on/off) but also its real-time impact on key system metrics (error rates, latency, resource consumption) and relevant business KPIs.
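
As a rough illustration of what this can look like in code, here is a small Go sketch using the Prometheus client library. The metric names, the new-checkout flag, and the query-parameter stand-in for a real flag SDK lookup are all assumptions for this example; the idea is simply to label evaluations and latency by flag state so dashboards can correlate a flag flip with changes in error rates or latency.

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Count flag evaluations by flag name and resolved state so dashboards can
// correlate a flag flip with error-rate or latency changes.
var flagEvaluations = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "feature_flag_evaluations_total",
    Help: "Feature flag evaluations by flag and state.",
}, []string{"flag", "state"})

// Request latency for the guarded code path, broken down by flag state.
var requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
    Name: "checkout_request_duration_seconds",
    Help: "Checkout latency by feature flag state.",
}, []string{"flag_state"})

func handleCheckout(w http.ResponseWriter, r *http.Request) {
    // Stand-in for a real flag SDK lookup with a safe default.
    state := "off"
    if r.URL.Query().Get("new_checkout") == "on" {
        state = "on"
    }
    flagEvaluations.WithLabelValues("new-checkout", state).Inc()

    start := time.Now()
    // ... run the old or new checkout path depending on the flag state ...
    requestDuration.WithLabelValues(state).Observe(time.Since(start).Seconds())
    fmt.Fprintln(w, "ok")
}

func main() {
    http.HandleFunc("/checkout", handleCheckout)
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}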

Anti-Pattern 5: No Phased Rollout or Kill Switch

This means turning a new, complex feature on for 100% of users simultaneously with a flag. For example, during Google’s incident, major changes to quota management settings were propagated globally and immediately, causing a global outage. The “red-button” to disable the problematic serving path was crucial for their recovery. Remedies for this anti-pattern include:

  • Canary Releases & Phased Rollouts: Don’t enable features for everyone at once. Perform canary releases: enable for internal users, then for a small percentage of production users while monitoring metrics (see the bucketing sketch after this list).
  • “Red Button” Control: Have a clear “kill switch” or “red button” mechanism for quickly and globally disabling any problematic feature flag if issues arise.
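
The percentage part of a phased rollout is often implemented by hashing each user into a stable bucket so the same user always falls on the same side of the cutoff. Here is a minimal Go sketch; the flag name and percentages are illustrative.

package main

import (
    "fmt"
    "hash/fnv"
)

// rolloutBucket maps a user ID to a stable bucket in [0, 100), so the same
// user always lands in the same bucket for a given flag.
func rolloutBucket(flag, userID string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(flag + ":" + userID))
    return h.Sum32() % 100
}

// isEnabledFor returns true if the user falls inside the current rollout
// percentage for the flag (for example 1%, then 10%, then 100%).
func isEnabledFor(flag, userID string, percent uint32) bool {
    return rolloutBucket(flag, userID) < percent
}

func main() {
    for _, user := range []string{"alice", "bob", "carol"} {
        fmt.Println(user, isEnabledFor("new-checkout", user, 10))
    }
}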

Anti-Pattern 6: Thundering Herd

Enabling a feature flag can change traffic patterns for incoming requests. For example, Google didn’t implement randomized exponential backoff in Service Control, which caused a “thundering herd” on the underlying infrastructure. To prevent such issues, implement exponential backoff with jitter for request retries, combined with comprehensive monitoring.
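
Here is a minimal Go sketch of retries with exponential backoff and full jitter. The base delay, cap, and attempt count are arbitrary values you would tune for your own service.

package main

import (
    "errors"
    "fmt"
    "math/rand"
    "time"
)

// retryWithBackoff retries op with exponential backoff plus full jitter, which
// spreads retries out so a recovering service is not hit by a synchronized
// "thundering herd" of clients.
func retryWithBackoff(op func() error, maxAttempts int) error {
    base := 100 * time.Millisecond
    maxBackoff := 10 * time.Second
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = op(); err == nil {
            return nil
        }
        // Exponential backoff: base * 2^attempt, capped at maxBackoff.
        backoff := base << attempt
        if backoff > maxBackoff {
            backoff = maxBackoff
        }
        // Full jitter: sleep a random duration in [0, backoff).
        time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
    }
    return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
    calls := 0
    err := retryWithBackoff(func() error {
        calls++
        if calls < 3 {
            return errors.New("transient failure")
        }
        return nil
    }, 5)
    fmt.Println("attempts:", calls, "error:", err)
}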

Anti-Pattern 7: Misusing Flags for Config or Entitlements

This anti-pattern means using feature flags as a general-purpose configuration management system or to manage complex user entitlements (e.g., free vs. premium tiers). For example, I’ve seen teams use feature flags to store API endpoints, timeout values, or rules about which customer tier gets which sub-feature. The result is that your feature flag system becomes a de facto distributed configuration database. This can be remedied with:

  • Purposeful Flags: Use feature flags primarily for controlling the lifecycle of discrete features: progressive rollout, A/B testing, kill switches.
  • Dedicated Systems: Use proper configuration management tools for application settings and robust entitlement systems for user permissions and plans.

Anti-Pattern 8: The “Zombie Flag” Infestation

This anti-pattern is introducing feature flags but never removing them once a feature is fully rolled out or stable. I’ve seen codebases littered with if (isFeatureXEnabled) checks for features that have been live for years or were abandoned. This can be remedied with:

  • Lifecycle Management: Treat flags as having a defined lifespan.
  • Scheduled Cleanup: Regularly audit flags. Once a feature is 100% rolled out and stable (or definitively killed), schedule work to remove the flag and associated dead code.

Anti-Pattern 9: Ignoring Flagging Service Health

This means not considering how your application behaves if the feature flagging service itself experiences an outage or is unreachable. A crucial point in Google’s RCA was that their Cloud Service Health infrastructure was itself down due to the outage, which delayed communication. A colleague once pointed out: what happens if LaunchDarkly is down? This can be remedied with:

  • Safe Defaults in Code: When your code requests a flag’s state from the SDK (e.g., ldClient.variation("my-feature", user, false)), the provided default value is critical. For new or potentially risky features, this default must be the “safe” state (feature OFF).
  • SDK Resilience: Feature-flag SDKs are designed to cache flag values and fall back to them if the flagging service is unreachable. But on a fresh app start, before any cache is populated, your coded defaults are your safety net.

Summary

Feature flags are incredibly valuable for modern software development. They empower teams to move faster and release with more confidence. But as the Google incident and my own experiences show, they require thoughtful implementation and ongoing discipline. By avoiding these anti-patterns – by testing thoroughly, using flags for their intended purpose, managing their lifecycle, governing changes, and planning for system failures – we can ensure feature flags remain a powerful asset.

March 24, 2025

K8 Highlander: Managing Stateful and Singleton Processes in Kubernetes

Filed under: Technology — admin @ 2:22 pm

Introduction

Kubernetes has revolutionized how we deploy, scale, and manage applications in the cloud. I’ve been using Kubernetes for many years to build scalable, resilient, and maintainable services. However, Kubernetes was primarily designed for stateless applications – services that can scale horizontally. While such a shared-nothing architecture is a must-have for most modern microservices, it presents challenges for use cases such as:

  1. Stateful/Singleton Processes: Applications that must run as a single instance across a cluster to avoid conflicts, race conditions, or data corruption. Examples include:
    • Legacy applications not designed for distributed operation
    • Batch processors that need exclusive access to resources
    • Job schedulers that must ensure jobs run exactly once
    • Applications with sequential ID generators
  2. Active/Passive Disaster Recovery: High-availability setups where you need a primary instance running with hot standbys ready to take over instantly if the primary fails.

Traditional Kubernetes primitives like StatefulSets provide stable network identities and ordered deployment but don’t solve the “exactly-one-active” problem. DaemonSets ensure one pod per node, but don’t address the need for a single instance across the entire cluster. This gap led me to develop K8 Highlander – a solution that ensures “there can be only one” active instance of your workloads while maintaining high availability through automatic failover.

Architecture

K8 Highlander implements distributed leader election to ensure only one controller instance is active at any time, with others ready to take over if the leader fails. The name “Highlander” refers to the tagline from the 1980s movie & show: “There can be only one.”

Core Components

K8 Highlander Architecture

The system consists of several key components:

  1. Leader Election: Uses distributed locking (via Redis or a database) to ensure only one controller is active at a time. The leader periodically renews its lock, and if it fails, another controller can acquire the lock and take over.
  2. Workload Manager: Manages different types of workloads in Kubernetes, ensuring they’re running and healthy when this controller is the leader.
  3. Monitoring Server: Provides real-time metrics and status information about the controller and its workloads.
  4. HTTP Server: Serves a dashboard and API endpoints for monitoring and management.

How Leader Election Works

The leader election process follows these steps:

  1. Each controller instance attempts to acquire a distributed lock with a TTL (Time-To-Live)
  2. Only one instance succeeds and becomes the leader
  3. The leader periodically renews its lock to maintain leadership
  4. If the leader fails to renew (due to crash, network issues, etc.), the lock expires
  5. Another instance acquires the lock and becomes the new leader
  6. The new leader starts managing workloads

This approach ensures high availability while preventing split-brain scenarios where multiple instances might be active simultaneously.
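
This is not K8 Highlander’s actual code, but the acquire-and-renew cycle can be sketched in a few lines of Go against Redis. The lock key, instance ID, and TTL below are illustrative, and a production implementation would renew the lock atomically (for example, with a Lua script) rather than with a separate GET and EXPIRE.

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

const (
    lockKey    = "k8-highlander:leader:default" // illustrative key layout
    instanceID = "controller-1"
    lockTTL    = 15 * time.Second
)

// tryAcquire attempts to become leader using SET NX with a TTL.
func tryAcquire(ctx context.Context, rdb *redis.Client) (bool, error) {
    return rdb.SetNX(ctx, lockKey, instanceID, lockTTL).Result()
}

// renew extends the lock only if this instance still owns it.
func renew(ctx context.Context, rdb *redis.Client) (bool, error) {
    owner, err := rdb.Get(ctx, lockKey).Result()
    if err != nil || owner != instanceID {
        return false, err
    }
    return rdb.Expire(ctx, lockKey, lockTTL).Result()
}

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    ticker := time.NewTicker(lockTTL / 3)
    defer ticker.Stop()
    isLeader := false

    for range ticker.C {
        if !isLeader {
            if ok, err := tryAcquire(ctx, rdb); err == nil && ok {
                isLeader = true
                fmt.Println("acquired leadership; starting workloads")
            }
            continue
        }
        if ok, err := renew(ctx, rdb); err != nil || !ok {
            isLeader = false
            fmt.Println("lost leadership; stopping workloads")
        }
    }
}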

Workload Types

K8 Highlander supports four types of workloads:

  1. Process Workloads: Single-instance processes running in pods
  2. CronJob Workloads: Scheduled tasks that run at specific intervals
  3. Service Workloads: Continuously running services using Deployments
  4. Persistent Workloads: Stateful applications with persistent storage using StatefulSets

Each workload type is managed to ensure exactly one instance is running across the cluster, with automatic recreation if terminated unexpectedly.

Deploying and Using K8 Highlander

Let me walk through how to deploy and use K8 Highlander for your singleton workloads.

Prerequisites

  • Kubernetes cluster (v1.16+)
  • Redis server or PostgreSQL database for leader state storage
  • kubectl configured to access your cluster

Installation Using Docker

The simplest way to install K8 Highlander is using the pre-built Docker image:

# Create a namespace for k8-highlander
kubectl create namespace k8-highlander

# Create a ConfigMap with your configuration
kubectl create configmap k8-highlander-config \
  --from-file=config.yaml=./config/config.yaml \
  -n k8-highlander

# Deploy k8-highlander
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8-highlander
  namespace: k8-highlander
spec:
  replicas: 2  # Run multiple instances for HA
  selector:
    matchLabels:
      app: k8-highlander
  template:
    metadata:
      labels:
        app: k8-highlander
    spec:
      containers:
      - name: controller
        image: plexobject/k8-highlander:latest
        env:
        - name: HIGHLANDER_REDIS_ADDR
          value: "redis:6379"
        - name: HIGHLANDER_TENANT
          value: "default"
        - name: HIGHLANDER_NAMESPACE
          value: "default"
        - name: CONFIG_PATH
          value: "/etc/k8-highlander/config.yaml"
        ports:
        - containerPort: 8080
          name: http
        volumeMounts:
        - name: config-volume
          mountPath: /etc/k8-highlander
      volumes:
      - name: config-volume
        configMap:
          name: k8-highlander-config
---
apiVersion: v1
kind: Service
metadata:
  name: k8-highlander
  namespace: k8-highlander
spec:
  selector:
    app: k8-highlander
  ports:
  - port: 8080
    targetPort: 8080
EOF

This deploys K8 Highlander with your configuration, ensuring high availability with multiple replicas while maintaining the singleton behavior for your workloads.

Using K8 Highlander Locally for Testing

You can also run K8 Highlander locally for testing:

docker run -d --name k8-highlander \
  -v $(pwd)/config.yaml:/etc/k8-highlander/config.yaml \
  -e HIGHLANDER_REDIS_ADDR=redis-host:6379 \
  -p 8080:8080 \
  plexobject/k8-highlander:latest

Basic Configuration

K8 Highlander uses a YAML configuration file to define its behavior and workloads. Here’s a simple example:

id: "controller-1"
tenant: "default"
port: 8080
namespace: "default"

# Storage configuration
storageType: "redis"
redis:
  addr: "redis:6379"
  password: ""
  db: 0

# Cluster configuration
cluster:
  name: "primary"
  kubeconfig: ""  # Uses in-cluster config if empty

# Workloads configuration
workloads:
  # Process workload example
  processes:
    - name: "data-processor"
      image: "mycompany/data-processor:latest"
      script:
        commands:
          - "echo 'Starting data processor'"
          - "/app/process-data.sh"
        shell: "/bin/sh"
      env:
        DB_HOST: "postgres.example.com"
      resources:
        cpuRequest: "200m"
        memoryRequest: "256Mi"
      restartPolicy: "OnFailure"

Example Workload Configurations

Let’s look at examples for each workload type:

Process Workload

Use this for single-instance processes that need to run continuously:

processes:
  - name: "sequential-id-generator"
    image: "mycompany/id-generator:latest"
    script:
      commands:
        - "echo 'Starting ID generator'"
        - "/app/run-id-generator.sh"
      shell: "/bin/sh"
    env:
      DB_HOST: "postgres.example.com"
    resources:
      cpuRequest: "200m"
      memoryRequest: "256Mi"
    restartPolicy: "OnFailure"

CronJob Workload

For scheduled tasks that should run exactly once at specified intervals:

cronJobs:
  - name: "daily-report"
    schedule: "0 0 * * *"  # Daily at midnight
    image: "mycompany/report-generator:latest"
    script:
      commands:
        - "echo 'Generating daily report'"
        - "/app/generate-report.sh"
      shell: "/bin/sh"
    env:
      REPORT_TYPE: "daily"
    restartPolicy: "OnFailure"

Service Workload

For continuously running services that need to be singleton but highly available:

services:
  - name: "admin-api"
    image: "mycompany/admin-api:latest"
    replicas: 1
    ports:
      - name: "http"
        containerPort: 8080
        servicePort: 80
    env:
      LOG_LEVEL: "info"
    resources:
      cpuRequest: "100m"
      memoryRequest: "128Mi"

Persistent Workload

For stateful applications with persistent storage:

persistentSets:
  - name: "message-queue"
    image: "ibmcom/mqadvanced-server:latest"
    replicas: 1
    ports:
      - name: "mq"
        containerPort: 1414
        servicePort: 1414
    persistentVolumes:
      - name: "data"
        mountPath: "/var/mqm"
        size: "10Gi"
    env:
      LICENSE: "accept"
      MQ_QMGR_NAME: "QM1"

High Availability Setup

For production environments, run multiple instances of K8 Highlander to ensure high availability:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8-highlander
  namespace: k8-highlander
spec:
  replicas: 3  # Run multiple instances for HA
  selector:
    matchLabels:
      app: k8-highlander
  template:
    metadata:
      labels:
        app: k8-highlander
    spec:
      containers:
      - name: controller
        image: plexobject/k8-highlander:latest
        env:
        - name: HIGHLANDER_REDIS_ADDR
          value: "redis:6379"
        - name: HIGHLANDER_TENANT
          value: "production"

API and Monitoring Capabilities

K8 Highlander provides comprehensive monitoring, metrics, and API endpoints for observability and management.

Dashboard

Access the built-in dashboard at http://<controller-address>:8080/ to see the status of the controller and its workloads in real-time.

Leader Dashboard Screenshot

The dashboard shows:

  • Current leader status
  • Workload health and status
  • Redis/database connectivity
  • Failover history
  • Resource usage

API Endpoints

K8 Highlander exposes several HTTP endpoints for monitoring and integration:

  • GET /status: Returns the current status of the controller
  • GET /api/workloads: Lists all managed workloads and their status
  • GET /api/workloads/{name}: Gets the status of a specific workload
  • GET /healthz: Liveness probe endpoint
  • GET /readyz: Readiness probe endpoint

Example API response:

{
  "status": "success",
  "data": {
    "isLeader": true,
    "leaderSince": "2023-05-01T12:34:56Z",
    "lastLeaderTransition": "2023-05-01T12:34:56Z",
    "uptime": "1h2m3s",
    "leaderID": "controller-1",
    "workloadStatus": {
      "processes": {
        "data-processor": {
          "active": true,
          "namespace": "default"
        }
      }
    }
  }
}
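
If you want to integrate this status into your own tooling, polling the endpoint is straightforward. Here is a minimal Go sketch that decodes only the fields shown in the example response above; the controller address and port are assumed to be the defaults from the earlier deployment.

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// statusResponse mirrors only the fields of the /status payload that we need.
type statusResponse struct {
    Status string `json:"status"`
    Data   struct {
        IsLeader bool   `json:"isLeader"`
        LeaderID string `json:"leaderID"`
        Uptime   string `json:"uptime"`
    } `json:"data"`
}

func main() {
    resp, err := http.Get("http://localhost:8080/status")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var s statusResponse
    if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("leader=%v id=%s uptime=%s\n", s.Data.IsLeader, s.Data.LeaderID, s.Data.Uptime)
}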

Prometheus Metrics

K8 Highlander exposes Prometheus metrics at /metrics for monitoring and alerting:

# HELP k8_highlander_is_leader Indicates if this instance is currently the leader (1) or not (0)
# TYPE k8_highlander_is_leader gauge
k8_highlander_is_leader 1
# HELP k8_highlander_leadership_transitions_total Total number of leadership transitions
# TYPE k8_highlander_leadership_transitions_total counter
k8_highlander_leadership_transitions_total 1
# HELP k8_highlander_workload_status Status of managed workloads (1=active, 0=inactive)
# TYPE k8_highlander_workload_status gauge
k8_highlander_workload_status{name="data-processor",namespace="default",type="process"} 1

Key metrics include:

  • Leadership status and transitions
  • Workload health and status
  • Redis/database operations
  • Failover events and duration
  • System resource usage

Grafana Dashboard

A Grafana dashboard is available for visualizing K8 Highlander metrics. Import the dashboard from the dashboards directory in the repository.

Advanced Features

Multi-Tenant Support

K8 Highlander supports multi-tenant deployments, where different teams or environments can have their own isolated leader election and workload management:

# Tenant A configuration
id: "controller-1"
tenant: "tenant-a"
namespace: "tenant-a"
# Tenant B configuration
id: "controller-2"
tenant: "tenant-b"
namespace: "tenant-b"

Each tenant has its own leader election process, so one controller can be the leader for tenant A while another is the leader for tenant B.

Multi-Cluster Deployment

For disaster recovery scenarios, K8 Highlander can be deployed across multiple Kubernetes clusters with a shared Redis or database:

# Primary cluster
id: "controller-1"
tenant: "production"
cluster:
  name: "primary"
  kubeconfig: "/path/to/primary-kubeconfig"
# Secondary cluster
id: "controller-2"
tenant: "production"
cluster:
  name: "secondary"
  kubeconfig: "/path/to/secondary-kubeconfig"

If the primary cluster fails, a controller in the secondary cluster can become the leader and take over workload management.

Summary

K8 Highlander fills a critical gap in Kubernetes’ capabilities by providing reliable singleton workload management with automatic failover. It’s ideal for:

  • Legacy applications that don’t support horizontal scaling
  • Processes that need exclusive access to resources
  • Scheduled jobs that should run exactly once
  • Active/passive high-availability setups

The solution ensures high availability without sacrificing the “exactly one active” constraint that many applications require. By handling the complexity of leader election and workload management, K8 Highlander allows you to run stateful workloads in Kubernetes with confidence.

Where to Go from Here

K8 Highlander is an open-source project under the MIT license, and contributions are welcome! Feel free to submit issues, feature requests, or pull requests to help improve the project.


March 4, 2025

Code Fossils: My Career Path Through Three Decades of Now-Obsolete Technologies

Filed under: Computing — admin @ 9:45 pm

Introduction

I became seriously interested in computers after learning about microprocessor architecture during a special summer camp program at school that taught us, among other topics, how computer systems work. I didn’t have easy access to computers at my school, but I learned a bit more about programming on an Atari system. This hooked me enough to take private lessons on programming languages, pursue computer science studies in college, and build a career as a software developer spanning three decades.

My professional journey started mostly with mainframe systems, then shifted towards UNIX systems, and later to Linux environments. Along the way, I’ve witnessed entire technological ecosystems rise, thrive, and ultimately vanish: programming languages abandoned despite their elegance, operating systems forgotten despite their robustness, and frameworks discarded despite their innovation. I will dig through my personal experience with some of the archaic technologies that have largely disappeared or diminished in importance. I’ve deliberately omitted technologies I still use regularly to spotlight these digital artifacts. These extinct technologies shaped how we approach computing problems and contain the DNA of our current systems. They remind us that today’s indispensable technologies may someday join them in the digital graveyard.

Programming Languages

BASIC & GW-BASIC

I initially learned BASIC on an Atari system and later learned GW-BASIC, which introduced me to graphics programming on IBM XT computers running early DOS versions. The use of line numbers to organize program flow with GOTO and GOSUB statements seemed strange to me, but its simplicity helped me build programs with sounds and graphics. Eventually, I moved to Microsoft QuickBASIC, which had support for procedures and structured programming. This early taste of programming led me to pursue Computer Science in college. I sometimes worry about today’s beginners facing overwhelming complexity like networking, concurrency, and performance optimization just to build a simple web application. BASIC, on the other hand, was very accessible and rewarding for newcomers despite its limitations.


Pascal & Turbo Pascal

College introduced me to both C and Pascal through Borland’s Turbo compilers. I liked Pascal’s cleaner and more readable syntax compared to C. At the time, C had the best performance, so Pascal couldn’t gain wide adoption, and it has largely disappeared from mainstream development. Interestingly, the career of Turbo Pascal’s author, Anders Hejlsberg, was saved by Microsoft, where he went on to create C# and later TypeScript. This trajectory taught me that technical superiority alone doesn’t ensure survival.


FORTRAN

During a college internship at a physics laboratory, I learned about FORTRAN running on massive DEC VAX/VMS systems, which was very popular for scientific computing at the time. While FORTRAN maintains a niche presence in scientific circles, DEC VAX/VMS systems have vanished entirely from the computing landscape. VMS systems were known for their powerful, reliable, and stable computing environments, but DEC failed to adapt to the industry’s shift toward smaller, more cost-effective systems. The market ultimately embraced UNIX variants that offered comparable capabilities at lower price points with greater flexibility. This transition taught me an early lesson in how economic factors often trump technical superiority.


COBOL, CICS and Assembler

My professional career at a marketing firm began with COBOL, CICS, and Assembler on mainframes. JCL (Job Control Language) was used to submit the mainframe jobs; it had unforgiving syntax where a misplaced comma could derail an entire batch job. I used COBOL for batch processing applications that primarily processed sequential ISAM files or the more advanced VSAM files with their B-Tree indexing for direct data access. These batch jobs often ran for hours or even days, creating long feedback cycles where a single error could cause cascading delays and missed deadlines.

I used CICS for building interactive applications with their distinctive green-screen terminals. I had to use BMS (Basic Mapping Support) for designing the 3270 terminal screen layouts, which was a notoriously finicky language. I built my own tool to convert plain text layouts into proper BMS syntax so that I didn’t have to debug syntax errors. The most challenging language I had to use was mainframe Assembler, which was used for performance-critical system components. These programs were monolithic workhorses: thousands of lines of code in single routines with custom macros simulating higher-level programming constructs. Thanks to the exponential performance improvements in modern hardware, most developers rarely need to descend to this level of programming.


PERL

I first learned PERL in college and embraced it throughout the 1990s as a versatile tool for both system administration and general-purpose programming. Its killer feature—regular expressions—made it indispensable for text processing tasks that would have been painfully complex in other languages. At a large credit verification company, I leveraged PERL’s pattern-matching to automate massive codebase migrations, transforming thousands of lines of code from one library to another. Later, at a major airline, I used similar techniques to upgrade legacy systems to newer WebLogic APIs without manual rewrites.

In the web development arena, I used PERL to build early CGI applications, and it was a key component of the revolutionary LAMP stack (Linux, Apache, MySQL, PERL) before PHP and Python supplanted it. The CPAN repository was another groundbreaking innovation that allowed reusing shared libraries at scale. I used it along with the Mason web templating system at a large online retailer in the mid 2000s and then migrated some of those applications to Java, as the PERL-based systems were difficult to maintain. I had similar experiences with other PERL codebases, and I eventually moved to Python, which offered cleaner object-oriented design patterns and syntax. PERL’s cultural impact—from the camel book to CPAN—influenced an entire generation of programmers, myself included.


4th Generation Languages

Early in my career, Fourth Generation Languages (4GLs) promised a dramatic boost in productivity by providing a simple UI for managing data. On mainframe systems, I used Focus and SAS for data queries and analytics, creating reports and processing data with a few lines of code. For desktop applications, I used a variety of 4GL environments including dBase III/IV, FoxPro, Paradox, and Visual Basic. These tools were remarkable for their time, offering “query by example” interfaces that allowed me to quickly build database applications with minimal coding. However, as data volumes grew, the limitations of these systems became painfully apparent. Eventually, I transitioned to object-oriented languages paired with enterprise relational databases that offered better scalability and maintainability. Nevertheless, these tools represent an important evolutionary step that influenced modern RAD (Rapid Application Development) approaches and the low-code platforms that continue to evolve today.


Operating Systems

Mainframe Legacy

My career began at a marketing company working on IBM 360/390 mainframes running MVS (Multiple Virtual Storage). I used a combination of JCL, COBOL, CICS, and Assembler to build batch applications that processed millions of customer records. Working with JCL (Job Control Language) was particularly challenging due to its incredibly strict syntax where a single misplaced comma could cause an entire batch run to fail. The feedback cycle was painfully slow; submitting a job often meant waiting hours or even overnight for results. We had to use extensive “dry runs” of jobs to test the business logic —a precursor to what we now call unit testing. Despite these precautions, mistakes happened, and I witnessed firsthand how a simple programming error caused the company to mail catalogs to incorrect and duplicate addresses, costing millions in wasted printing and postage.

These systems also had their quirks: they used EBCDIC character encoding rather than the ASCII standard found in most other systems. They also stored data inefficiently—a contributing factor to the infamous Y2K crisis, as programs commonly stored years as two digits to save precious bytes of memory in an era when storage was extraordinarily expensive. Terminal response times were glacial by today’s standards—I often had to wait several seconds to see what I’d typed appear on screen. Yet despite their limitations, these mainframes offered remarkable reliability. While the UNIX systems I later worked with would frequently crash with core dumps (typically from memory errors in C programs), mainframe systems almost never went down. This stability, however, came partly from their simplicity—most applications were essentially glorified loops processing input files into output files without the complexity of modern systems.


UNIX Variants

Throughout my career, I worked extensively with numerous UNIX variants descended from both AT&T’s System V and UC Berkeley’s BSD lineages. At multiple companies, I deployed applications on Sun Microsystems hardware running SunOS (BSD-based) and later Solaris (System V-based). These systems, while expensive, provided the superior graphics capabilities, reliability, and performance needed for mission-critical applications. I used SGI’s IRIX operating system running on impressive graphical workstations when working at a large physics lab. These systems processed massive datasets from physics experiments by leveraging non-uniform memory access (NUMA) and symmetric multi-processing (SMP) architectures. IRIX was among the first mainstream 64-bit operating systems, pushing computational boundaries years before this became standard. These workstations were also used for the visual effects that brought movies like Jurassic Park to life in 1993, which was amazing to watch. I also worked with IBM’s AIX on SP1/SP2 supercomputers at the physics lab, using Message Passing Interface (MPI) APIs to distribute processing across hundreds of nodes. This message-passing approach ultimately proved more scalable than shared-memory architectures, though modern systems incorporate both paradigms—today’s multi-core processors rely on the same NUMA/SMP concepts pioneered in these early UNIX variants.

On the downside, these systems were very expensive, and Moore’s Law enabled commodity PC hardware running Linux to achieve comparable performance at a fraction of the price. I saw a lot of those large systems replaced with farms of low-cost Linux PCs in clusters that reduced infrastructure costs drastically. I was deeply passionate about UNIX and even spent most of my savings in the early ’90s on a high-end PowerPC system, which was the result of a partnership between IBM, Motorola, Apple, and Sun. This machine could run multiple operating systems including Solaris and AIX, though I primarily used it for personal projects and learning.


DOS, OS/2, SCO and BeOS

For personal computing in the 1980s and early 1990s, I primarily used MS-DOS, even developing several shareware applications and games that I sold through bulletin board systems. DOS, with its command-line interface and conventional/expanded memory limitations, taught me valuable lessons about resource optimization that remain relevant even today. I preferred UNIX-like environments whenever possible, so I installed SCO UNIX (based on Microsoft’s Xenix) on my personal computer. SCO was initially respected in the industry before it transformed into a patent troll with controversial lawsuits against Linux distributors. I also liked OS/2, which was a technically superior operating system compared to Windows with its support for true pre-emptive multitasking. But it lost to Windows due to Microsoft’s massive market power, much like other innovative competitors such as Borland, Novell, and Netscape.

Perhaps the most elegant of these alternative systems was BeOS, which I eagerly tested in the mid-1990s when it was released in beta. It featured a microkernel design and pervasive multithreading capabilities, and was a serious contender to become Apple’s next-generation OS. However, Apple ultimately acquired NeXT instead, bringing Steve Jobs back and adopting NeXTSTEP as the foundation—another case where superior technology lost to business considerations and personal relationships.


Storage Media

My first PC had a modest 40MB hard drive, and I relied heavily on floppy disks in both 5.25-inch and later 3.5-inch formats. They took a long time to copy data and made a lot of scratching sounds that served as both progress indicators and early warnings of impending failures. In professional environments, I worked with SCSI drives that had better speed and reliability, and I generally employed RAID configurations to protect against drive failures. For archiving and backup, I used tape drives that were also painfully slow but could store much more data. In the mid-1990s, I switched from floppy disks to Iomega’s Zip drives for personal backups, which could store up to 100MB compared to 1.44MB floppies. Similarly, I used CD-R and later CD-RW drives for storage, which also had slow write speeds initially.


Network Protocols

In my early career, networking was fairly fragmented, and I generally used Novell’s proprietary IPX (Internetwork Packet Exchange) protocol and Novell NetWare networks at work. It provided good support for file sharing and printing services. On mainframe systems, I worked with Token Ring networks that offered more reliable, deterministic performance. As the internet was based on TCP/IP, that protocol eventually took over, along with UNIX and Linux systems. For file sharing across these various systems, I relied on NFS (Network File System) in UNIX environments and later Samba to bridge the gap between UNIX and Windows systems that used the SMB (Server Message Block) protocol. Both solutions suffered performance problems due to file locking, and I spent countless hours troubleshooting “stale file handles” and unexpected disconnections in these early networked file systems.


Databases

My database journey began on mainframe systems with IBM’s VSAM (Virtual Storage Access Method), which wasn’t a true database but provided crucial B-Tree indexing for efficient file access. I also worked with IBM’s IMS, a hierarchical database that organized data in parent-child tree structures. Relational databases were truly revolutionary at the time, and I embraced systems like IBM DB2, Oracle, and Microsoft SQL Server. In college, I took a number of courses on relational database theory and appreciated its strong mathematical foundations. However, most relational databases were commercial and expensive, so I looked at open-source projects like MiniSQL, but it lacked critical enterprise features like transaction support.

In the mid-1990s, object-oriented databases gained popularity along with object-oriented programming, promising to eliminate the “impedance mismatch” between object models and relational tables. I evaluated ObjectStore for some projects and ultimately deployed Versant to manage complex navigation data for traffic mapping systems—predecessors to today’s Google Maps services. These databases elegantly handled complex object relationships and inheritance hierarchies, but introduced their own challenges in querying, scaling, and integration with existing systems. Relational databases later absorbed object-oriented concepts like user-defined types, XML support, and JSON capabilities. Looking back, this taught me that systems built on strong theoretical foundations with incremental adaptation tend to outlast revolutionary approaches.


Security and Authentication

Early in my career, I worked as a UNIX system administrator and relied on /etc/passwd files for authentication; these files were world-readable and contained password hashes generated with the easily crackable crypt algorithm. For multi-system environments, I used NIS (Network Information Service) to centrally manage user accounts across server clusters. We also commonly used .rhosts files to allow password-less authentication between trusted systems. I later used Kerberos authentication to provide stronger single sign-on capabilities for enterprise environments. When working at a large airline, I used Netegrity SiteMinder to implement single sign-on access. While consulting for a manufacturing company, I built SSO implementations using LDAP and Microsoft Active Directory across heterogeneous systems. The Java ecosystem brought its own authentication frameworks, and I worked extensively with JAAS (Java Authentication and Authorization Service) and later Acegi Security before moving to SAML (Security Assertion Markup Language) and OAuth-based authentication standards.


Applications & Development Tools

Desktop Applications (Pre-Web)

My early word processing was done in WordStar with its cryptic Ctrl-key commands, before moving to WordPerfect, which offered better formatting control. For technical documentation, I relied on FrameMaker, which supported sophisticated layouts for complex documents. For spreadsheets, I initially used VisiCalc, the original “killer app” on the Apple II, and later Lotus 1-2-3, which popularized keyboard shortcuts that still exist in Excel today. When working for a marketing company, I used Lotus Notes, a collaboration tool that functioned as an email client, calendar, document management system, and application development platform. On UNIX workstations, I preferred text-based applications like elm and pine for email, and the lynx text browser when accessing remote machines over telnet.


Chat & Communication Tools

On early UNIX systems at work, I used the simple ‘talk’ command to chat with other users on the system. At home during the pre-internet era, I immersed myself in the Bulletin Board System (BBS) culture. I also hosted my own BBS, learning firsthand about the challenges of building and maintaining online communities. I used CompuServe for access to group forums and Internet Relay Chat (IRC) through painfully slow dial-up and later SLIP/PPP connections. My fascination with IRC led me to develop my own client application, PlexIRC, which I distributed as shareware. As graphical interfaces took over, I adopted ICQ and Yahoo Messenger for personal communications. These platforms introduced status indicators, avatars, and file transfers that we now take for granted. While AOL Instant Messenger dominated the American market, I deliberately avoided the AOL ecosystem, preferring more open alternatives. My professional interest gravitated toward Jabber, which later evolved into the XMPP protocol standard with its federated approach to messaging—allowing different servers to communicate like email. I later implemented XMPP-based messaging solutions for several organizations, appreciating its extensible framework and standardized approach.


Development Environments

On UNIX systems, I briefly wrestled with ‘ed’—a line editor so primitive by today’s standards that its error message was simply a question mark. I quickly graduated to Vi, whose keyboard shortcuts became muscle memory that persists to this day through modern incarnations like Vim and NeoVim. In the DOS world, Borland Sidekick revolutionized my workflow as one of the first TSR (Terminate and Stay Resident) applications. With a quick keystroke, Sidekick would pop up a notepad, calculator, or calendar without exiting the primary application. For debugging and system maintenance, I used Norton Utilities, which provided essential tools like disk recovery, defragmentation, and a powerful hex editor that saved countless hours when troubleshooting low-level issues. I learned about the IDE (Integrated Development Environment) through Borland’s groundbreaking products like Turbo Pascal and Turbo C, which combined fast compilers with editing and debugging in a seamless package. These evolved into more sophisticated tools like Borland C++ with its application frameworks. For specialized work, I used Watcom C/C++ for its cross-platform capabilities and optimization features. As Java gained prominence, I adopted tools like JBuilder and Visual Cafe, which pioneered visual development for the platform. Eventually, I moved to Eclipse and later IntelliJ IDEA, alongside Visual Studio. Though I still enable Vi mode in these IDEs for its powerful editing capabilities without needing a mouse.


Web Technologies

I experienced the early internet ecosystem in college—navigating Gopher menus for document retrieval, searching with WAIS, and participating in Usenet newsgroups. Everything changed with the release of NCSA HTTPd server and the Mosaic browser. I used these revolutionary tools on Sun workstations in college and later at a high-energy physics laboratory on UNIX workstations. I left my cushy job to find web related projects and secured a consulting position at a financial institution building web access for credit card customers. I used C/C++ with CGI (Common Gateway Interface) to build dynamic web applications that connected legacy systems to this new interface. These early days of web development were like the Wild West—no established security practices, framework standards, or even consistent browser implementations existed. During a code review when working at a major credit card company, I discovered a shocking vulnerability: their web application stored usernames and passwords directly in cookies in plaintext, essentially exposing customer credentials to anyone with basic technical knowledge. These early web servers used a process-based concurrency model, spawning a new process for each request—inefficient by modern standards but there wasn’t much user traffic at the time. On the client side, I worked with the Netscape browser, while server implementations expanded to include Apache, Netscape Enterprise Server, and Microsoft’s IIS.

I also built my own Model-View-Controller architecture and templating system because there weren’t any established frameworks available. As Java gained traction, I migrated to JSP and the Struts framework, which formalized MVC patterns for web applications. This evolution continued as web servers evolved from process-based to thread-based concurrency models, and eventually to asynchronous I/O implementations in platforms like Nginx, dramatically improving scalability. Having witnessed the entire evolution—from hand-coded HTML to complex JavaScript frameworks—gives me a unique perspective on how rapidly this technology landscape has developed.


Distributed Systems Development

My journey with distributed systems began with Berkeley Sockets—the foundational API that enabled networked communication between applications. After briefly working with Sun’s RPC (Remote Procedure Call) APIs, I embraced Java’s Socket implementation and then its Remote Method Invocation (RMI) framework, which I used to implement remote services when working as a consultant for an enterprise client. RMI offered the revolutionary ability to invoke methods on remote objects as if they were local, handling network communication transparently and even dynamically loading remote classes. At a major travel booking company, I worked with Java’s JINI technology, which was inspired by Linda memory model and TupleSpace that I also studied during my postgraduate research. JINI extended RMI with service discovery and leasing mechanisms, creating a more robust foundation for distributed applications. I later used GigaSpaces, which expanded the JavaSpaces concept into a full in-memory data grid for session storage.

For personal projects, I explored Voyager, a mobile agent platform that simplified remote object interaction with dynamic proxies and mobile object capabilities. Despite its technical elegance, Voyager never achieved widespread adoption—a pattern I would see repeatedly with technically superior but commercially unsuccessful distributed technologies. While contracting for Intelligent Traffic Systems in the Midwest during the late 1990s, I implemented CORBA-based solutions that collected real-time traffic data from roadway sensors and distributed it to news agencies via a publish-subscribe model. CORBA promised language-neutral interoperability through its Interface Definition Language (IDL), but reality fell short—applications typically worked reliably only when using components from the same vendor. I had to implement custom interceptors to add the authentication and authorization capabilities CORBA lacked natively. Nevertheless, CORBA’s explicit interface definitions via IDL influenced later technologies like gRPC that we still use today. The Java Enterprise (J2EE) era brought Enterprise JavaBeans and I implemented these technologies using BEA WebLogic for another state highway system, and continued working with them at various travel, airline, and fintech companies. EJB’s fatal flaw was attempting to abstract away the distinction between local and remote method calls—encouraging developers to treat distributed objects like local ones. This led to catastrophic performance problems as applications made thousands of network calls for operations that should have been local.

I read Rod Johnson’s influential critique of EJB that eventually evolved into the Spring Framework, offering a more practical approach to Java enterprise development. Around the same time, I transitioned to simpler XML-over-HTTP designs before the industry standardized on SOAP and WSDL. The subsequent explosion of WS-* specifications (WS-Security, WS-Addressing, etc.) created such complexity that the diagram of their interdependencies resembled the Death Star. I eventually abandoned SOAP’s complexity for JSON over HTTP, implementing long-polling and Server-Sent Events (SSE) for real-time applications before adopting the REST architectural style that dominates today’s API landscape. Throughout these transitions, I integrated various messaging systems including IBM WebSphere MQ, JMS implementations, TIBCO Rendezvous, and Apache ActiveMQ to provide asynchronous communication capabilities. This journey through distributed systems technologies reflects a recurring pattern: the industry oscillating between complexity and simplicity, between comprehensive frameworks and minimal viable approaches. The technologies that endured longest were those that acknowledged and respected the fundamental challenges of distributed computing—network unreliability, latency, and the fallacies of distributed computing—rather than attempting to hide them behind leaky abstractions.


Client & Mobile Development

Terminal & Desktop GUI

My journey developing client applications began with CICS on mainframe systems—creating those distinctive green-screen interfaces for 3270 terminals once ubiquitous in banking and government environments. The 4th generation tools era introduced me to dBase and Paradox, which I used to build database-driven applications through their “query by example” interfaces, which allowed rapid development of forms and reports without extensive coding. For personal projects, I developed numerous DOS applications, games, and shareware using Borland Turbo C. As Windows gained prominence, I transitioned to building GUI applications using Borland C++ with OWL (Object Windows Library) and later Microsoft Foundation Classes (MFC), which abstracted the complex Windows API into an object-oriented framework. While working for a credit protection company, I developed UNIX-based client applications using OSF/Motif. Motif’s widget system and resource files offered sophisticated UI capabilities, though with considerable implementation complexity.


Web Clients

The web revolution transformed client development fundamentally. I quickly adopted HTML for financial and government projects, creating browser-based interfaces that eliminated client-side installation requirements. For richer interactive experiences, I embedded Flash elements into web applications—creating animations and interactive components beyond HTML’s capabilities at the time. Java’s introduction brought the promise of “write once, run anywhere,” which I embraced through Java applets that you could embed like Flash widgets. Later, Java Web Start offered a bridge between web distribution and desktop application capabilities, allowing applications to be launched from browsers while running outside their security sandbox. Using Java’s AWT and later Swing libraries, I built standalone applications including IRC and email clients. The client-side JavaScript revolution, catalyzed by Google’s demonstration of AJAX techniques, fundamentally changed web application architecture. I experimented with successive generations of JavaScript libraries—Prototype.js for DOM manipulation, Script.aculo.us for animations, YUI for more component sets, etc.


Embedded and Mobile Development

As Java had its roots in embedded/TV systems, Sun introduced a wearable smart Java ring with an embedded microchip, which I used for some personal security applications. Though the Java Ring quickly disappeared from the market, its technological descendants like the iButton continued finding specialized applications in security and authentication systems. The mobile revolution began in earnest with the Palm Pilot—a breakthrough device featuring the innovative Graffiti handwriting recognition system that transformed how we interacted with portable computers. I embraced Palm development, creating applications for this pioneering platform and carrying a Palm device for years. As mobile technologies evolved, I explored the Wireless Application Protocol (WAP), which attempted to bring web content to the limited displays and bandwidth of early mobile phones but failed to gain widespread adoption. When Java introduced J2ME (Java 2 Micro Edition), I invested heavily in mastering this platform, attracted by its promise of cross-device compatibility across various feature phones. I developed applications targeting the constrained CLDC (Connected Limited Device Configuration) and MIDP (Mobile Information Device Profile) specifications.

The entire mobile landscape transformed dramatically when Apple introduced the first iPhone in 2007—a genuine paradigm shift that redefined our expectations for mobile devices. Recognizing this fundamental change, I learned iOS development using Objective-C with its message-passing syntax and manual memory management. This investment paid off when I developed an iOS application for a fintech company that significantly contributed to its acquisition by a larger trading firm. Early mobile development eerily mirrored my experiences with early desktop computing—working within severe hardware constraints that demanded careful resource management. Despite theoretical advances in programming abstractions, I found myself once again meticulously optimizing memory usage, minimizing disk operations, and carefully managing network bandwidth. This return to fundamental computing constraints reinforced my appreciation for efficiency-minded development practices that remain valuable even as hardware capabilities continue expanding.


Development Methodologies

My first corporate experience introduced me to Total Quality Management (TQM), with its focus on continuous improvement and customer satisfaction. This early exposure taught me a crucial lesson: methodology adoption depends more on organizational culture than on the framework itself. Despite new terminology and reorganized org charts, the company largely maintained its existing practices with superficial changes. Later, I worked with organizations implementing the Capability Maturity Model (CMM), which attempted to categorize development processes into five maturity levels. While this framework provided useful structure for improving chaotic environments, its documentation requirements and formal assessments often created bureaucratic overhead that impeded actual development. Similarly, the Rational Unified Process (RUP), which I used at several companies, offered comprehensive guidance but degenerated into a waterfall development model on many projects. The agile revolution emerged as a reaction against these heavyweight methodologies. I applied elements of Feature-Driven Development and Spiral methodologies when working at a major airline, focusing on iterative development and explicit risk management. I explored various agile approaches during this period—Crystal’s focus on team communication, Adaptive Software Development’s emphasis on change tolerance, and particularly Extreme Programming (XP), which introduced practices like test-driven development and pair programming that fundamentally changed how I approached code quality. Eventually, most organizations where I worked settled on customized implementations of Scrum and Kanban—frameworks that continue to dominate agile practice today.


Development Methodologies & Modeling

Earlier in my career, approaches like Rapid Application Development (RAD) and Joint Application Development (JAD) emphasized quick prototyping and intensive stakeholder workshops. These methodologies aligned with Computer-Aided Software Engineering (CASE) tools like Rational Rose and Visual Paradigm, which promised to transform software development through visual modeling and automated code generation. On larger projects, I spent months creating elaborate UML diagrams—use cases, class diagrams, sequence diagrams, and more. Some CASE tools I used could generate code frameworks from these models and even reverse-engineer models from existing code, promising a synchronized relationship between design and implementation. The reality proved disappointing; generated code was often rigid and difficult to maintain, while keeping models and code in sync became an exercise in frustration. The agile movement ultimately eclipsed both heavyweight methodologies and comprehensive CASE tools, emphasizing working software over comprehensive documentation.


DevOps Evolution

Version Control System

My introduction to version control came at a high-energy physics lab, where projects used primitive systems like RCS (Revision Control System) and SCCS (Source Code Control System). These early tools stored delta changes for each file and relied on exclusive locking mechanisms—only one developer could edit a file at a time. As development teams grew, most projects migrated to CVS (Concurrent Versions System), which built upon RCS foundations. CVS supported networked operations, allowing developers to commit changes from remote locations, and replaced exclusive locking with a more flexible concurrent model. However, CVS still operated at the file level rather than treating commits as project-wide transactions, leading to potential inconsistencies when only portions of related changes were committed. I continued using CVS for years until Subversion emerged as its logical successor. Subversion introduced atomic commits, ensuring that either all or none of a change would be committed. It also improved branching operations, directory management, and file metadata handling, addressing many of CVS’s limitations. While working at a travel company, I encountered AccuRev, which introduced the concept of “streams” instead of traditional branches. This approach modeled development as flowing through various stages. AccuRev proved particularly valuable for managing offshore development teams who needed to download large codebases over unreliable networks, and its sophisticated change management reduced bandwidth requirements.

During my time at a large online retailer in the mid-2000s, I worked with Perforce, a system optimized for large-scale development with massive codebases and binary assets. Perforce’s performance with large files and sophisticated security model made it ideal for enterprise environments. I briefly used Mercurial for some projects, appreciating its simplified interface compared to early Git versions, before ultimately transitioning to Git as it became the industry standard. This evolution of version control parallels the increasing complexity of software development itself: from single developers working on isolated files to globally distributed teams collaborating on massive codebases.

Build Systems

I have used Make throughout my career across various platforms and languages. Its declarative approach to defining dependencies and build rules established patterns that influence build tools to this day. After adopting the Java ecosystem, I switched to Apache Ant, which used XML to define build tasks as an explicit sequence of operations. This offered greater flexibility and cross-platform consistency but at the cost of increasingly verbose build files as projects grew more complex. I used Ant extensively during Java’s enterprise ascendancy, customizing its tasks to handle deployment, testing, and reporting. I then adopted Maven, which introduced revolutionary concepts such as a convention-over-configuration philosophy with standardized project structures and dependency management backed by remote repositories that automatically resolve and download required libraries. Despite Maven’s transformative nature, its rigid conventions and complex XML configuration were frustrating, and I later switched to Gradle. Gradle offered Maven’s dependency management with a Groovy-based DSL that provided both the structure of declarative builds and the flexibility of programmatic customization.

The build process expanded beyond compilation when I implemented Continuous Integration using CruiseControl, an early CI server developed by ThoughtWorks. This system automatically triggered builds on code changes, ran tests, and reported results. Later, I worked extensively with Hudson, which offered a more user-friendly interface and plugin architecture for extending CI capabilities. When Oracle acquired Sun and attempted to trademark the Hudson name, the community rallied behind a fork called Jenkins, which rapidly became the dominant CI platform. I used Jenkins for years, creating complex pipelines that automated testing, deployment, and release processes across multiple projects and environments. Eventually, I transitioned to cloud-based CI/CD platforms that integrated more seamlessly with hosted repositories and containerized deployments.


Summary

As I look back across my three decades in technology, these obsolete systems and abandoned platforms aren’t just nostalgic relics—they tell a powerful story about innovation, market forces, and the unpredictable nature of technological evolution. The technologies I’ve described throughout this blog didn’t disappear because they were fundamentally flawed. Pascal offered cleaner syntax than C, BeOS was more elegant than Windows, and CORBA attempted to solve distributed computing problems we still grapple with today. Borland’s superior development tools lost to Microsoft’s ecosystem advantages. Object-oriented databases, despite solving real problems, couldn’t overcome the momentum of relational systems. Yet these extinct technologies left lasting imprints on our industry. Anders Hejlsberg, who created Turbo Pascal, went on to shape C# and TypeScript. The clean design principles of BeOS influenced aspects of modern operating systems. Ideas don’t die—they evolve and find new expressions in subsequent generations of technology.

Perhaps the most valuable lesson is about technological adaptability. Throughout my career, the skills that have remained relevant weren’t tied to specific languages or platforms, but rather to fundamental concepts: understanding data structures, recognizing patterns in system design, and knowing when complexity serves a purpose versus when it becomes a hurdle. The industry’s constant reinvention ensures that many of today’s dominant technologies will eventually face their own extinction event. By understanding the patterns of the past, we gain insight into which current technologies might have staying power. This digital archaeology isn’t just about honoring what came before—it’s about understanding the cyclical nature of our industry and preparing for what comes next.


September 21, 2024

Robust Retry Strategies for Building Resilient Distributed Systems

Filed under: API,Computing,Microservices — admin @ 10:50 am

Introduction

Distributed systems inherently involve multiple components such as services, databases, networks, etc., which are spread across different machines or locations. These systems are prone to partial failures, where one part of the system may fail while others remain operational. A common strategy for building fault-tolerant and resilient systems is to recover from transient failures by retrying failed operations. Here are some common use cases for implementing retries to maintain reliability in such environments:

  • Recover from Transient Failures such as network glitches, dropped packets, or temporary unavailability of services. These failures are often short-lived, and a simple retry may succeed without any changes to the underlying system.
  • Recover from Network Instability caused by packet loss, latency, congestion, or intermittent connectivity that disrupts communication between services.
  • Recover from Load Shedding or Throttling where services may experience momentary overloads and are unable to handle incoming requests.
  • Asynchronous Processing or Eventual Consistency models may take time to converge state across different nodes or services and operations might fail temporarily if the system is in an intermediate state.
  • Fault Isolation in microservices architectures, where services are loosely coupled but depend on one another. Downstream services may fail temporarily due to restarts, deployments, or scaling activities.
  • Service Downtime affects the availability of services, but client applications can use retries to recover from minor faults and maintain availability.
  • Load Balancing and Failover with redundant Zones/Regions, so that a request that fails in one zone or region can be handled by another healthy zone or region.
  • Partial Failures where one part of the system fails while the rest remains functional.
  • Build System Resilience to allow the system to self-heal from minor disruptions.
  • Race Conditions or timing-related issues in concurrent systems can sometimes be resolved by retrying the operation.

Challenges with Retries

Retries help in recovering from transient or partial failures by resending requests, but they can worsen system overloads if not managed carefully. Here are some challenges associated with retries:

  • Retry Storms: A retry storm occurs when multiple clients or services simultaneously retry failed requests to an overloaded or recovering service. This flood of retries can exacerbate the problem and can lead to performance degradation or a self-inflicted Denial of Service (DoS) attack.
  • Idempotency and Data Consistency: Some operations are not idempotent and performing them multiple times can lead to inconsistent or incorrect results (e.g., processing a financial transaction multiple times).
  • Cascading Failures: Retrying can propagate failures upstream or to dependent services. For instance, when a service fails, clients may retry excessively and overwhelm downstream services.
  • Latency Amplification: Retrying failed operations can increase end-to-end latency, as each retry adds a delay before successful resolution.
  • Amplified Resource Consumption: Retried operations consume additional CPU, memory, and bandwidth, potentially depleting resources at a faster rate. Even when services eventually succeed, the increased load from retries can harm the overall system.
  • Retry Loops or Infinite Retries: If a failed operation is retried continuously without ever succeeding, it can potentially lead to system crashes.
  • Thread and Connection Starvation: When a service invokes multiple operations and some fail, it may retry all operations, leading to increased overall request latency. If high timeouts are set, threads and connections remain occupied, blocking new traffic.
  • Unnecessary Retries on Non-Retryable Failures: Retrying certain types of failures, like authorization errors or malformed requests, is unnecessary and wastes system resources.
  • Timeout Mismatch Between Services: If the timeout settings for retries between services are not aligned, a downstream service may still be processing a request while the upstream service retries or times out, which can result in conflicting states.

Considerations for Retries

Here are some key considerations and best practices for implementing more effective and safer retry mechanisms in distributed systems, enhancing resilience while safeguarding system stability during periods of stress or failure:

  • Timeouts: Implement timeouts to prevent clients from waiting indefinitely for a response and reduce resource exhaustion (e.g., memory or threads) caused by prolonged waiting. The challenge lies in selecting the appropriate timeout value: if set too high, resources are wasted; if set too low, it can trigger excessive retries, which increases the risk of outages. It’s recommended to set timeouts that are tightly aligned with performance expectations, ideally less than twice your maximum expected response time, to avoid thread starvation. Additionally, monitor for early warning signs by setting alarms when performance degrades (e.g., when P99 latency approaches 50% of the timeout value).
  • Timeout Budgeting: In complex distributed systems, timeout budgeting ensures that the total time taken by a request across multiple services doesn’t exceed an acceptable limit. Each downstream service gets a portion of the total timeout, so failure in one service doesn’t excessively delay the entire request chain.
  • Exponential Backoff: Implement exponential backoff to spread out retry attempts by gradually increasing the delay between retries, reducing the risk of overwhelming a failing component and allowing time for recovery. It’s important to cap the backoff duration and limit the total number of retries. Without these limits, the system might continue retrying unnecessarily even after the underlying issue has been resolved.
  • Jitter: Adding randomness (jitter) to the backoff process helps prevent synchronized retries that could lead to overload spikes. Jitter spreads out traffic spikes and periodic tasks, avoiding large bursts of traffic at regular intervals and improving system stability (a minimal backoff-with-jitter sketch follows this list).
  • Idempotency: Operations that are retried must be idempotent, meaning they can be safely repeated without causing unintended side effects (e.g., double payments or duplicated data).
  • Retry Limits: Retries should be capped at a certain limit to avoid endlessly retrying a failing operation. Retries should stop beyond a certain number of attempts and the failure should be escalated or reported.
  • Throttling and Rate Limiting: Implement throttling or rate limiting and control the number of requests a service handles within a given time period. Rate limiting can be dynamic, which is adjusted based on current load or error rates, and avoid system overloads during traffic spikes. In addition, low-priority requests can be shed during high load situations.
  • Error Categorization: Not all errors should trigger retries and use an allowlist for known retryable errors and only retry those. For example, 400 Bad Request (indicating a permanent client error) due to invalid input should not be retried, while server-side or network-related errors with a 500 Internal Server Error (a likely transient issue) can benefit from retrying.
  • Targeting Failing Components Only: In a partial failure, not all parts of the system are down, and retries help isolate and recover from the failing components by retrying operations that specifically target the failed resource. For example, if a service depends on multiple microservices for an operation and one of the services fails, the system should retry the failed request without repeating the entire operation.
  • Intelligent and Adaptive Retries: Design retry logic to take the system’s current state into account, such as checking service health or load conditions before retrying. For example, increase retry intervals if multiple components are detected as failing, or retry quickly for timeout errors but back off more for connection errors. This prevents retries when the system is already known to be overloaded.
  • Retrying at Different Levels: Retries can be implemented at various levels to handle partial failures such as application level, middleware/proxy (load-balancer or API gateway), transport level (network). For example, a distributed system using a load balancer can detect if a specific instance of a service is failing and reroute traffic to a healthy instance that triggers retries only for the requests that target the failing instance.
  • Retry Amplification: In multi-tiered architectures, if retries are implemented at each level of nested service calls, it can lead to increased latency and exponentially higher traffic. To mitigate this, implement retries only at critical points in the call chain, and ensure that each service has a clear retry policy with limits. Use short timeouts to prevent thread starvation when calls to downstream services take too long. If too many threads hang, new traffic will be blocked.
  • Retry Budget: Implementing a global limit on the number of retries across all operations helps prevent system overload. For example, using an algorithm like Leaky Bucket can regulate the number of retries within a specified time period. This ensures that retries are distributed evenly and don’t exceed system capacity, preventing resource exhaustion during high failure rates.
  • Retries with Circuit Breakers: The circuit breaker pattern can be combined with retries to avoid overwhelming a failing component. When a service starts failing, the circuit breaker opens, temporarily halting requests to that service until it is healthy again. Retries can be configured to happen only after the circuit breaker transitions to a half-open state, which allows a limited number of retries to test if the service has recovered.
  • Retries with Failover Mechanisms: Retries can be designed with failover strategies where the system switches to a backup service, region, or replica in case of partial failure. If a service in one region fails then the retries can redirect requests to a different region or zone for ensuring availability.
  • Latency Sensitivity: Services with strict latency requirements might not tolerate long backoff periods or extended retries so they should minimize number of retries and cap backoff times.
  • Sync Calls: For synchronous calls, retry once immediately to handle temporary network issues; avoid multiple retries and excessive sleeping between attempts, both of which can lead to thread starvation. Also, a Circuit Breaker can be used to prevent retrying if a high percentage of calls fail.
  • Async Calls: Use exponential backoff with jitter for asynchronous operations and use Circuit Breakers to stop retries when failure rates are high. Asynchronous APIs can queue requests for later retries, but should incorporate health checks to ensure that retry attempts don’t add excessive load to downstream services during recovery periods.
  • Retrying on Overload Responses: Recognize overload indicators (e.g., HTTP 503 responses) and avoid retries when the response indicates overload.
  • Fail-Fast: Detect issues early and fail quickly rather than continuing to process requests or operations that are unlikely to succeed.
  • Graceful Degradation: Provide an alternative method of handling requests when a service fails. For example, if a primary service is down, a cached result or a simpler backup service can be used instead.
  • Downstream Bugs: Rather than implementing retry-based workarounds, prioritize having downstream service owners address and resolve the underlying issues.
  • Monitor and Analyze Retry Patterns: Implement monitoring for retry attempts and success rates, and analyze the data to gain insights into system behavior during failures. Use these insights to optimize retry strategies, such as adjusting backoff intervals and fine-tuning timeouts for improved system performance.
  • SLAs with Downstream Services: Establish clear service-level agreements (SLAs) with downstream services about call frequency, failure rates, and latency expectations.
  • Availability Over Consistency: Prioritize service availability over consistency where possible, especially during retries or failure handling. In such cases, retries might return stale data or cause inconsistency issues, so it’s crucial to align retry policies with system design.
  • Chaos Engineering: Chaos engineering involves intentionally injecting failures, such as server crashes or network disruptions, into a system to test its resilience under adverse conditions. By simulating real-world failures, teams can identify weaknesses and ensure that the retry policies are working as expected.
  • Bulkhead Pattern: The bulkhead pattern isolates different parts of a system to prevent a failure in one part from affecting the rest of the system. The bulkheads can be implemented by limiting the number of resources (threads, memory, connections) allocated to each service or subsystem so that if one service becomes overloaded or fails, it won’t exhaust resources that other services need.
  • System Design: It’s essential to design APIs to minimize unnecessary communication with the server. For instance, in an event-driven architecture, if an event is missing a required attribute, the application might need to make additional requests to retrieve that data, increasing system load. To avoid this, ensure that events are fully populated with all necessary information upfront.
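
To make a few of these practices concrete, here is a minimal sketch (in Python) of a retry helper that combines a retry limit, capped exponential backoff with full jitter, and an allowlist of retryable errors. The exception class, delays, and attempt counts are illustrative assumptions rather than a prescription:

    import random
    import time

    class RetryableError(Exception):
        """Transient failure worth retrying (e.g., timeout, 5xx, throttling)."""

    def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0):
        """Invoke operation(), retrying only allowlisted errors with capped
        exponential backoff plus full jitter."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except RetryableError:
                if attempt == max_attempts:
                    raise  # escalate once the retry budget is exhausted
                # Capped exponential backoff: base * 2^(attempt - 1), never above max_delay.
                backoff = min(max_delay, base_delay * (2 ** (attempt - 1)))
                # Full jitter spreads retries out so clients do not synchronize.
                time.sleep(random.uniform(0, backoff))
            # Any other exception (e.g., a permanent 400-style error) propagates
            # immediately and is never retried.

Non-retryable failures simply propagate on the first attempt, which keeps the helper aligned with the error-categorization and retry-limit guidance above.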

Summary

Retries are an essential mechanism for building fault-tolerant distributed systems and to recover from transient failures such as network issues, service unavailability, and partial system outages. A well-implemented retry strategy improves system resilience by ensuring that temporary failures don’t lead to full-blown outages. Techniques such as exponential backoff with jitter, idempotency, token buckets to limit retries locally, and circuit breakers help manage retries effectively, preventing issues like retry storms, resource exhaustion, and latency amplification.

However, retries need careful management because without proper limits, retries can overwhelm services that are already struggling or exacerbate issues like cascading failures and thread starvation. Incorporating timeouts, retry limits, and adaptive retry mechanisms based on system health can prevent these negative side effects. By analyzing retry patterns and adopting error-specific handling strategies, distributed systems can strike a balance between availability and resource efficiency, ensuring robust performance even in the face of partial failures.

August 28, 2024

From Code to Production: A Checklist for Reliable, Scalable, and Secure Deployments

Filed under: API,Software Release,Testing — admin @ 9:19 pm

Building and maintaining distributed systems is challenging due to complex intricacies of production environments, configuration differences, data and traffic scaling, dependencies on third-party services, and unpredictable usage patterns. These factors can lead to outages, security breaches, performance degradation, data inconsistencies, and other operational issues that may negatively impact customers [See Architecture Patterns and Well-Architected Framework]. These risks can be mitigated through phased rollouts with canary releases, feature flags for controlled feature activation, and comprehensive observability through monitoring, logging, and tracing. Additionally, rigorous scalability testing, including load and chaos testing, and proactive security testing are necessary to identify and address potential vulnerabilities. The use of blue/green deployments and the ability to quickly roll back changes further enhance the resilience of your system. Beyond these strategies, fostering a DevOps culture that emphasizes collaboration between development, operations, and security teams is vital. The following checklist serves as a guide to verify critical areas that may go awry when deploying code to production, helping teams navigate the inherent challenges of distributed systems.

Build Pipelines

  • Separate Pipelines: Create distinct CI/CD pipelines for each microservice, including infrastructure changes managed through IaC (Infrastructure as Code). Also, set up a separate pipeline for config changes such as throttling limits or access policies.
  • Securing and Managing Dependencies: Identify and address deprecated and vulnerable dependencies during the build process and ensure third party dependencies are vetted and hosted internally.
  • Build Failures: Verify build pipelines with a comprehensive suite of unit and integration tests, and promptly resolve any flaky tests caused by concurrency, networking, or other issues.
  • Automatic Rollback: Automatically roll back changes if sanity tests or alarm metrics fail during the build process.
  • Phased Deployments: Deploy new changes in phases gradually across multiple data centers using canary testing with adequate baking period to validate functional and non-functional behavior. Immediately roll back and halt further deployments if error rates exceed acceptable thresholds [See Mitigate Production Risks with Phased Deployment].
  • Avoid Risky Deployments: Deploy changes during regular office hours to ensure any issues can be promptly addressed. Avoid deploying code during outages or availability issues, when 20% or more of hosts are unhealthy, or during special calendar days like holidays or peak traffic periods.

Code Analysis and Verification

API Testing and Analysis

Security Testing

Recommended practices for security testing [See Security Challenges in Microservice Architecture]:

  • IAM Best Practices: Follow IAM best practices such as using multi-factor authentication (MFA), regularly rotating credentials and encryption keys, and implementing role-based access control (RBAC).
  • Authentication and Authorization: Verify that authentication and authorization policies adhere to the principle of least privilege.
  • Defense in Depth: Implement admission controls at every layer including network, application and data.
  • Vulnerability & Penetration Testing: Conduct security tests targeting vulnerabilities based on the threat model for the service’s functionality.
  • Encryption: Implement encryption at rest and in-transit policies.
  • Security Testing Tools: Use tools like OWASP ZAP, Nessus, Acunetix, Qualys, Snyk and Burp Suite for security testing [See OWASP Top Ten, CWE TOP 25].

Load Testing

  • Test Plan: Ensure the test plan accurately simulates real use cases, including varying data sizes and read/write operations.
  • Scalability Assessment: Conduct load tests to assess the scalability of both your primary service and its dependencies.
  • Testing Strategies: Conduct load tests using both mock dependent services and real services to identify potential bottlenecks.
  • Resource Monitoring: During load testing, monitor for excessive logs, events, and other resources, and assess their impact on latency and potential bottlenecks.
  • Autoscaling Validation: Validate on-demand autoscaling policies by testing them under increased load conditions.

Chaos Testing

Chaos testing involves injecting faults into the system to test its resilience and ensure it can recover gracefully [See Fault Injection Testing and Mocking and Fuzz Testing].

  • Service Unavailability: Test scenarios where the dependent service is unavailable, experiences high latency, or results in a higher number of faults.
  • Monitoring and Alarms: Ensure that monitoring, alarms and on-call procedures for troubleshooting and recovery are functioning as intended.

Canary Testing and Continuous Validation

This strategy involves deploying a new version of a service to a limited subset of users or servers with real-time monitoring and validation before a full deployment.

  • Canary Test Validation: Ensure canary tests are based on real use cases and validate both functional and non-functional behavior of the service. If a canary test fails, it should automatically trigger a rollback and halt further deployments until the underlying issues are resolved.
  • Continuous Validation: Continuously validate API behavior and monitor performance metrics such as latency, error rates, and resource utilization.
  • Edge Case Testing: Canary tests should include common and edge cases such as large request size.

Resilience and Reliability

  • Idle Timeout Configuration: Set your API server’s idle connection timeout slightly longer than the load balancer’s idle timeout.
  • Load Balancer Configuration: Ensure the load balancer evenly distributes requests among servers using a round-robin method and avoids directing traffic to unhealthy hosts. Prefer this approach over the least-connections method.
  • Backward Compatibility: Ensure API changes are backward compatible, verified through contract-based testing, and forward compatible by ignoring unknown properties.
  • Correlation ID Injection: Inject a Correlation ID into incoming requests, allowing it to be propagated through all dependent services for logging and tracing purposes.
  • Graceful Degradation: Implement graceful degradation to operate in a limited capacity even when dependent services are down.
  • Idempotent APIs: Ensure APIs, especially those that create resources, are implemented with idempotent behavior (see the idempotency-key sketch after this list).
  • Request Validation: Validate all request parameters and fail fast any requests that are malformed, improperly sized, or contain malicious data.
  • Single Points of Failure: Eliminate single points of failure, bottlenecks, and dependencies on shared resources to minimize the blast radius.
  • Cold Start Optimization: Ensure that cold service startup time is limited to just a few seconds.
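
As a rough illustration of the idempotent-API item above, the sketch below caches responses by a client-supplied idempotency key so that a retried create request returns the original result instead of creating a duplicate; the in-memory dictionary and function names are hypothetical stand-ins for a durable store and a real handler:

    import uuid

    # In-memory stand-in for a durable idempotency store keyed by client-supplied keys.
    _idempotency_store = {}

    def create_order(idempotency_key, payload):
        """Create an order exactly once per idempotency key.

        A retried request with the same key returns the previously stored
        response rather than creating a duplicate resource.
        """
        if idempotency_key in _idempotency_store:
            return _idempotency_store[idempotency_key]

        order = {"order_id": str(uuid.uuid4()), "items": payload["items"]}
        _idempotency_store[idempotency_key] = order
        return order

    if __name__ == "__main__":
        key = str(uuid.uuid4())
        first = create_order(key, {"items": ["book"]})
        retried = create_order(key, {"items": ["book"]})  # client retry after a timeout
        assert first["order_id"] == retried["order_id"]   # no duplicate order created
        print(first)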

Performance Optimization

  1. Latency Reduction: Identify and optimize parts of the system with high latency, such as database queries, network calls, or computation-heavy tasks.
  2. Pagination: Implement pagination for list operations, ensuring that pagination tokens are account-specific and invalid after the query expiration time.
  3. Thread and Queue Management: Set up the number of threads, connections, and queuing limits. Generally, the queue size should be proportional to the number of threads and kept small.
  4. Resource Optimization: Optimize resource usage (e.g., CPU, memory, disk) by tuning configuration settings and optimizing code paths to reduce unnecessary overhead.
  5. Caching Strategy: Review and optimize caching strategies to reduce load on databases and services, ensuring that cached data is used effectively without becoming stale.
  6. Database Indexing: Regularly review and update database indexing strategies to ensure queries run efficiently and data retrieval is optimized.

Throttling and Rate Limiting

Below are some best practices for throttling and rate limiting; a minimal token-bucket sketch follows the list [See Effective Load Shedding and Throttling Strategies]:

  • Web Application Firewall: Consider integrating a Web application firewall with your services’ load balancers to enhance security and traffic management and to protect against distributed denial-of-service (DDoS) attacks. Confirm WAF settings and assess performance through load and security testing.
  • Testing Throttling Limits: Test throttling and rate limiting policies in the test environment.
  • Granular Limits: Implement tenant-level rate limits at the API endpoint level to prevent the noisy neighbor problem, and ensure that tenant context is passed to downstream services to enforce similar limits.
  • Aggregated Limits: When setting rate limits for both tenant-level and API-levels, ensure that the tenant-level limits exceed the combined total of all API limits.
  • Graceful degradation: Cache throttling and rate limit data to enable graceful degradation with fail-open if datastore retrieval fails.
  • Unauthenticated requests: Minimize processing for unauthenticated requests and safeguard against large payloads and invalid parameters.
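
A minimal token-bucket sketch for per-tenant rate limiting is shown below; the capacities, refill rates, and tenant IDs are illustrative assumptions, and a production limiter would usually live in a shared store rather than process memory:

    import time

    class TokenBucket:
        """Simple token bucket: capacity bounds the burst, refill_rate is tokens/second."""

        def __init__(self, capacity, refill_rate):
            self.capacity = capacity
            self.refill_rate = refill_rate
            self.tokens = float(capacity)
            self.last_refill = time.monotonic()

        def allow(self, cost=1.0):
            now = time.monotonic()
            # Refill proportionally to elapsed time, never above capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
            self.last_refill = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False  # caller should respond with 429 / throttle

    # One bucket per tenant to contain the noisy-neighbor problem.
    _buckets = {}

    def allow_request(tenant_id, capacity=10, refill_rate=5.0):
        bucket = _buckets.setdefault(tenant_id, TokenBucket(capacity, refill_rate))
        return bucket.allow()

    if __name__ == "__main__":
        allowed = sum(allow_request("tenant-a") for _ in range(20))
        print(f"allowed {allowed} of 20 burst requests")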

Dependent Services

  • Timeout and Retry Configuration: Configure connection and request timeouts, implement retries with backoff and circuit breakers, and set up fallback mechanisms for API clients when connecting to dependent services (a circuit-breaker sketch follows this list).
  • Monitoring and Logging: Monitor and log failures and latency of dependent services and infrastructure components such as load balancers, and trigger alarms when they exceed the defined SLOs.
  • Scalability of Dependent Services: Verify that dependent services can cope with increased traffic loads as your traffic scales.
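
The sketch below shows one way to combine request timeouts with a simple count-based circuit breaker and a fallback when calling a dependent service; the thresholds, cooldown, endpoint URL, and use of the requests library are illustrative assumptions rather than a specific framework's API:

    import time

    import requests  # assumed HTTP client; any client that supports timeouts works

    class CircuitOpenError(Exception):
        """Raised when the breaker is open and calls are short-circuited."""

    class CircuitBreaker:
        """Count-based breaker: opens after failure_threshold consecutive failures
        and lets a trial call through after reset_timeout seconds (half-open)."""

        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None

        def call(self, func, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise CircuitOpenError("dependency temporarily short-circuited")
                # Half-open: allow one trial call to probe for recovery.
            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            self.opened_at = None
            return result

    profile_breaker = CircuitBreaker(failure_threshold=3, reset_timeout=10.0)

    def fetch_profile(user_id):
        # Connect/read timeouts keep threads from hanging on a slow dependency.
        response = requests.get(
            f"https://profiles.internal.example/users/{user_id}", timeout=(0.5, 2.0)
        )
        response.raise_for_status()
        return response.json()

    def get_profile_with_breaker(user_id):
        try:
            return profile_breaker.call(fetch_profile, user_id)
        except CircuitOpenError:
            return None  # graceful degradation: fall back to cached/default data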

Compliance and Privacy

Below are some best practices for ensuring compliance:

  • Compliance: Ensure all data handling complies with local regulations such as the California Consumer Privacy Act (CCPA), General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA), and other privacy regulations [See NIST SP 800-122].
  • Privacy: Identify and classify Personal Identifiable Information (PII), and ensure all data access is protected through Identity and Access Management (IAM) and compliance based PII policies [See DHS Guidance].
  • Privacy by design: Incorporate privacy by design principles into every stage of development to reduce the risk of data breaches.
  • Audit Logs: Maintain logs for all administrative actions, access to sensitive data and changes to critical configurations for compliance audit trails.
  • Monitoring: Continuously monitor compliance requirements to ensure ongoing adherence to regulations.

Data Management

  • Data Consistency: Evaluate data consistency requirements, such as strong versus eventual consistency. Ensure data is consistently stored across multiple data stores, and implement a reconciliation process to detect any inconsistencies or lag times, logging them for monitoring and alerting purposes.
  • Schema Compatibility: Ensure data schema changes are both backward and forward compatible by implementing a two-phase release process. First, deploy an intermediate version that can read the new schema format but continues to write in the old format. Once this intermediate version is fully deployed and stable, proceed to roll out the new code that writes data in the new format.
  • Retention Policies: Establish and verify data retention policies across all datasets.
  • Unique Data IDs: Ensure data IDs are unique and do not overflow especially when using 32-bit or smaller integers.
  • Auto-scaling Testing: Test auto-scaling policies triggered by traffic spikes, and confirm proper partitioning/sharding across scaled resources.
  • Data Cleanup: Clean up stale data, logs and other resources that have expired or are no longer needed.
  • Divergence Monitoring: Implement automated processes to identify data-consistency divergence or high synchronization lag when working with multiple data stores.
  • Data Migration Testing: Test data migrations in isolated environments to ensure they can be performed without data loss or corruption.
  • Backup and Recovery: Test backup and recovery processes to confirm they meet defined Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targets.
  • Data Masking: Implement data masking in non-production environments to protect sensitive information.

Caching

Here are some best practices for caching strategies; a minimal TTL and negative-cache sketch follows the list [See When Caching is not a Silver Bullet]:

  • Stale Cache Handling: Handle stale cache data by setting appropriate time-to-live (TTL) values and ensuring cache invalidation is correctly implemented.
  • Cache Preloading: Pre-load cache before significant traffic spikes so that latency can be minimized.
  • Cache Validation: Validate the effectiveness of your cache invalidation and clearing methods.
  • Negative Cache: Implement caching behavior for both positive and negative use cases and monitor the cache hits and misses.
  • Peak Traffic Testing: Assess service performance under peak traffic conditions without caching.
  • Bimodal Behavior: Minimize reliance on caching to reduce the complexity of bimodal logic paths.
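
A minimal sketch of TTL-based caching with negative caching (caching "not found" results with a shorter TTL) is shown below; the TTL values and loader callback are illustrative assumptions:

    import time

    _cache = {}  # key -> (value, expires_at)

    POSITIVE_TTL = 300  # seconds to cache successful lookups
    NEGATIVE_TTL = 30   # shorter TTL for "not found" results to limit staleness

    def cached_lookup(key, loader):
        """Return a cached value if it is still fresh; otherwise call loader(key).

        Both hits and misses from the loader are cached so repeated lookups for
        missing keys do not hammer the backing store (negative caching).
        """
        entry = _cache.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit (possibly a cached None)

        value = loader(key)  # e.g., a database or downstream-service call
        ttl = POSITIVE_TTL if value is not None else NEGATIVE_TTL
        _cache[key] = (value, now + ttl)
        return value

    def invalidate(key):
        """Explicit invalidation for writes, in addition to TTL expiry."""
        _cache.pop(key, None)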

Disaster Recovery

  1. Backup Validation: Regularly test backup and recovery processes to ensure they meet defined Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targets.
  2. Failover Testing: Test failover procedures for critical services to validate that they can seamlessly switch over to backup systems or regions without service disruption.
  3. Chaos Engineering: Incorporate chaos engineering practices to simulate disaster scenarios and validate the resilience of your systems under failure conditions.

Configuration and Feature-Flags

  • Configuration Storage: Prefer storing configuration changes in a source code repository and releasing them gradually through a deployment pipeline including tests for verification.
  • Configuration Validation: Validate configuration changes in a test environment before applying them in production to avoid misconfigurations that could cause outages.
  • Feature Management: Use a centralized feature flag management system to maintain consistency across environments and easily roll back features if necessary (a flag-gating sketch follows this list).
  • Testing Feature Flags: Test every combination of feature flags comprehensively in both test and pre-production environments before the release.
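
As a rough sketch of flag-driven rollout and rollback, the snippet below gates a code path on a centrally loaded flag with a deterministic percentage rollout; the flag names and rollout values are illustrative assumptions:

    import hashlib

    # Stand-in for flags fetched from a centralized flag service or config store.
    FLAGS = {
        "new_checkout_flow": {"enabled": True, "rollout_percent": 10},
    }

    def is_enabled(flag_name, user_id):
        """Deterministically bucket a user into a percentage rollout.

        Hashing keeps a given user consistently in or out of the rollout, and
        setting enabled=False (or rollout_percent=0) acts as an instant rollback.
        """
        flag = FLAGS.get(flag_name)
        if not flag or not flag["enabled"]:
            return False
        bucket = int(hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest(), 16) % 100
        return bucket < flag["rollout_percent"]

    if __name__ == "__main__":
        enabled_users = [u for u in range(1000) if is_enabled("new_checkout_flow", u)]
        print(f"{len(enabled_users)} of 1000 users see the new checkout flow")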

Observability

Observability involves instrumenting systems to collect and analyze logs, metrics, and traces for monitoring system performance and health. Below are some best practices for monitoring, logging, tracing and alarms [See USE and RED methodologies for Systems Performance]:

Monitoring

  1. System Metrics: Monitor key system metrics such as CPU usage, memory usage, disk I/O, network latency, and throughput across all nodes in your distributed system.
  2. Application Metrics: Track application-specific metrics like request latency, error rates, throughput, and the performance of critical application functions.
  3. Server Faults and Client Errors: Monitor metrics for server-side faults (5XX) and client-side errors (4XX) including those from dependent services.
  4. Service Level Objectives (SLOs): Define and monitor SLOs for latency, availability, and error rates. Use these to trigger alerts if the system’s performance deviates from expected levels.
  5. Health Checks: Implement regular health checks to assess the status of services and underlying infrastructure, including database connections and external dependencies.
  6. Dashboards: Use dashboards to display real-time and historical graphs for throughput, P9X latency, faults/errors, data size, and other service metrics, with the ability to filter by tenant ID.

Logging

  1. Structured Logging: Ensure logs are structured and include essential information such as timestamps, correlation IDs, user IDs, and relevant request/response data.
  2. Log API entry and exits: Log the start and completion of API invocations along with correlation IDs for tracing purposes (see the structured-logging sketch after this list).
  3. Log Retention: Define and enforce log retention policies to avoid storage overuse and ensure compliance with data regulations.
  4. Log Aggregation: Use log aggregation tools to centralize logs from different services and nodes, making it easier to search and analyze them in real-time.
  5. Log Levels: Properly categorize logs (e.g., DEBUG, INFO, WARN, ERROR) and ensure sensitive information (such as PII) is not logged.
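
A minimal structured-logging sketch using only the Python standard library is shown below; it emits each record as JSON with a correlation ID attached, and the logger and field names are illustrative assumptions:

    import json
    import logging
    import uuid

    class JsonFormatter(logging.Formatter):
        """Emit each record as a single JSON object for easy aggregation."""

        def format(self, record):
            payload = {
                "timestamp": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "correlation_id": getattr(record, "correlation_id", None),
            }
            return json.dumps(payload)

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("orders-api")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    def handle_request(correlation_id=None):
        # Reuse the caller's correlation ID if present; otherwise mint a new one.
        correlation_id = correlation_id or str(uuid.uuid4())
        extra = {"correlation_id": correlation_id}
        logger.info("request started", extra=extra)    # API entry
        # ... business logic (avoid logging PII here) ...
        logger.info("request completed", extra=extra)  # API exit

    if __name__ == "__main__":
        handle_request()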

Tracing

  1. Distributed Tracing: Implement distributed tracing to capture end-to-end latency and the flow of requests across multiple services. This helps in identifying bottlenecks and understanding dependencies between services.
  2. Trace Sampling: Use trace sampling to manage the volume of tracing data, capturing detailed traces for a subset of requests to balance observability and performance.
  3. Trace Context Propagation: Ensure that trace context (e.g., trace IDs, span IDs) is propagated across all services, allowing complete trace reconstruction (a minimal propagation sketch follows this list).
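
The sketch below uses the OpenTelemetry Python API to propagate trace context across a service boundary; exporter and SDK configuration are omitted, and the span names, service name, and http_get callable are illustrative assumptions:

    # Requires the opentelemetry-api (and, for real exports, opentelemetry-sdk) packages.
    from opentelemetry import trace
    from opentelemetry.propagate import extract, inject

    tracer = trace.get_tracer("checkout-service")

    def call_downstream(http_get, url):
        """Wrap an outbound call in a span and inject trace context into headers.

        http_get is any HTTP client callable (e.g., requests.get) that accepts
        a headers keyword argument.
        """
        with tracer.start_as_current_span("call-inventory"):
            headers = {}
            inject(headers)  # adds W3C traceparent headers for the downstream service
            return http_get(url, headers=headers)

    def handle_incoming(request_headers):
        """Continue the caller's trace by extracting its context from headers."""
        ctx = extract(request_headers)
        with tracer.start_as_current_span("handle-checkout", context=ctx):
            pass  # ... business logic ...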

Alarms

  1. Threshold-Based Alarms: Set up alarms based on predefined thresholds for key metrics such as CPU/memory/disk/network usage, latency, error rates, throughput, starvation of threads and database connections, etc. Ensure that alarms are actionable and not too sensitive to avoid alert fatigue.
  2. Anomaly Detection: Implement anomaly detection to identify unusual patterns in metrics or logs that might indicate potential issues before they lead to outages.
  3. Metrics Isolation: Keep metrics and alarms from continuous canary tests and dependent services separate from those generated by real traffic.
  4. On-Call Rotation: Ensure that alarms trigger appropriate notifications to on-call personnel, and maintain a rotation schedule to distribute the on-call load among team members.
  5. Runbook Integration: Include runbooks with alarms to provide on-call engineers with guidance on how to investigate and resolve issues.

Rollback and Roll Forward

Rolling back involves redeploying a previous version to undo unwanted changes. Rolling forward involves pushing a new commit with the fix and deploying it. Here are some best practices for rollback and roll forward:

  • Immutable infrastructure: Implement immutable infrastructure practices so that switching back to a previous instance is simple.
  • Automated Rollbacks: Ensure rollbacks are automated so that they can be executed quickly and reliably without human intervention.
  • Rollback Testing: Test rollback changes in a test environment to ensure the code and data can be safely reverted.
  • Critical bugs: To prevent customer impact, avoid rolling back if the changes involve critical bug fixes or compliance and security-related updates.
  • Schema changes: If the new code introduced schema changes, confirm that the previous version can still read and update the modified data.
  • Roll Forward: Use rolling forward when rollback isn’t possible.
  • Avoid rushing Roll Forwards: Avoid rolling forward if other changes have been committed that are still being tested.
  • Testing Roll Forwards: Make sure the new changes including configuration updates are thoroughly tested before the roll forward.

Documentation and Knowledge Sharing

  1. Operational Runbooks: Maintain comprehensive runbooks that document operational procedures, troubleshooting steps, and escalation paths for common issues.
  2. Postmortems: Conduct postmortems after incidents to identify root causes, share lessons learned, and implement corrective actions to prevent recurrence.
  3. Knowledge Base: Build and maintain a knowledge base with documentation on system architecture, deployment processes, testing strategies, and best practices for new team members and ongoing reference.
  4. Training and Drills: Regularly train the team on disaster recovery procedures, runbooks, and incident management. Conduct disaster recovery drills to ensure readiness for actual incidents.

Continuous Improvement

  1. Feedback Loops: Establish feedback loops between development, operations, and security teams to continuously improve deployment processes and system reliability.
  2. Metrics Review: Regularly review metrics, logs, and alarms to identify trends, optimize configurations, and enhance system performance.
  3. Automation: Automate repetitive tasks, such as deployments, monitoring setup, and incident response, to reduce human error and increase efficiency.

Conclusion

Releasing software in distributed systems presents unique challenges due to the complexity and scale of production environments, which cannot be fully replicated in testing. By adhering to the practices outlined in this checklist—such as canary releases, feature flags, comprehensive observability, rigorous scalability testing, and well-prepared rollback mechanisms—you can significantly reduce the risks associated with deploying new code. A strong DevOps culture, where development, operations, and security teams work closely together, ensures continuous improvement and adaptability to new challenges. By following this checklist and fostering a culture of collaboration, you can enhance the stability, security, and scalability of each release for your platform.
