Computing « Shahzad Bhatti

August 25, 2025

Beyond Vibe Coding: Using TLA+ and Executable Specifications with Claude

Filed under: Computing,Uncategorized — admin @ 9:45 pm

TL;DR: The Problem and Solution

Problem: AI-assisted coding fails when modifying existing systems because we give AI vague specifications.

Solution: Use TLA+ formal specifications as precise contracts that Claude can implement reliably.

Result: Transform Claude from a code generator into a reliable engineering partner that reasons about complex systems.

After months of using Claude for development, I discovered most AI-assisted coding fails not because the AI isn’t smart enough, but because we’re asking it to work from vague specifications. This post shows you how to move beyond “vibe coding” using executable specifications that turn Claude into a reliable engineering partner.

Here’s what changes when you use TLA+ with Claude:

Before (Vibe Coding):

“Create a task management API”
Claude guesses at requirements
Inconsistent behavior across edge cases
Bugs in corner cases

After (TLA+ Specifications):

Precise mathematical specification
Claude implements exactly what you specified
All edge cases defined upfront
Properties verified before deployment

The Vibe Coding Problem

AI assistants like Claude are primarily trained on greenfield development patterns. They excel at:

Writing new functions from scratch
Implementing well-known algorithms
Creating boilerplate code

But they struggle with:

Understanding implicit behavioral contracts in existing code
Maintaining invariants across system modifications
Reasoning about state transitions and edge cases
Preserving non-functional requirements (performance, security, etc.)

The solution isn’t better prompts – it’s better specifications.

Enter Executable Specifications

An executable specification is a formal description of system behavior that can be:

Verified – Checked for logical consistency
Validated – Tested against real-world scenarios
Executed – Run to generate test cases or even implementations

I’ve tried many approaches to precise specifications over the years:

UML and Model Driven Development (2000s-2010s): I used tools like Rational Rose and Visual Paradigm in early 2000s that promised complete code generation from UML models. The reality was different:

Visual complexity: UML diagrams became unwieldy for anything non-trivial
Tool lock-in: Proprietary formats and expensive tooling
Impedance mismatch: The gap between UML models and real code was huge
Maintenance nightmare: Keeping models and code synchronized was nearly impossible
Limited expressiveness: UML couldn’t capture complex behavioral contracts

BDD and Gherkin (mid-2000s): I used BDD and Gherkin in mid 2000s, which were better than UML for behavioral specifications, but still limited:

Structured natural language: Readable but not truly executable
No logical reasoning: Couldn’t catch design contradictions
Testing focused: Good for acceptance criteria, poor for system design

TLA+ (present): Takes executable specifications to their logical conclusion:

Mathematical precision: Eliminates ambiguity completely
Model checking: Explores all possible execution paths
Tool independence: Plain text specifications, open source tools
Behavioral focus: Designed specifically for concurrent and distributed systems

Why TLA+ with Claude?

The magic happens when you combine TLA+’s precision with Claude’s implementation capabilities:

TLA+ eliminates ambiguity – There’s only one way to interpret a formal specification
Claude can read TLA+ – It understands the formal syntax and can translate it to code
Verification catches design flaws – TLA+ model checking finds edge cases you’d miss
Generated traces become tests – TLA+ execution paths become your test suite

Setting Up Your Claude and TLA+ Environment

Installing Claude Desktop

First, let’s get Claude running on your machine:

# Install via Homebrew (macOS)
brew install --cask claude

# Or download directly from Anthropic
# https://claude.ai/download

Set up project-specific contexts in ~/.claude/
Create TLA+ syntax rules for better code generation
Configure memory settings for specification patterns

Configuring Your Workspace

Once installed, I recommend creating a dedicated workspace structure. Here’s what works for me:

# Create a Claude workspace directory
mkdir -p ~/claude-workspace/{projects,templates,context}

# Add a context file for your coding standards
cat > ~/claude-workspace/context/coding-standards.md << 'EOF'
# My Coding Standards

- Use descriptive variable names
- Functions should do one thing well
- Write tests for all new features
- Handle errors explicitly
- Document complex logic
EOF

Installing TLA+ Tools

Choose based on your workflow

GUI users: TLA+ Toolbox for visual model checking
CLI users: tla2tools.jar for CI integration
Both: VS Code extension for syntax highlighting

# Download TLA+ Tools from https://github.com/tlaplus/tlaplus/releases
# Or use Homebrew on macOS
brew install --cask tla-plus-toolbox

# For command-line usage (recommended for CI)
wget https://github.com/tlaplus/tlaplus/releases/download/v1.8.0/tla2tools.jar

VS Code Extension

Install the TLA+ extension for syntax highlighting and basic validation:

code --install-extension alygin.vscode-tlaplus

Your First TLA+ Specification

Let’s start with a simple example to understand the syntax:

--------------------------- MODULE SimpleCounter ---------------------------
VARIABLE counter

Init == counter = 0

Increment == counter' = counter + 1

Decrement == counter' = counter - 1

Next == Increment \/ Decrement

Spec == Init /\ [][Next]_counter

TypeInvariant == counter \in Int

=============================================================================

This specification defines:

State: A counter variable
Initial condition: Counter starts at 0
Actions: Increment or decrement operations
Next state relation: Either action can occur
Invariant: Counter is always an integer

Real-World Example: Task Management API

Now let’s build something real. We’ll create a task management API using TLA+ specifications that Claude can implement in Go.

Step 1: Define the System State

First, we model what our system looks like (TaskManagement.tla):

--------------------------- MODULE TaskManagement ---------------------------
EXTENDS Integers, Sequences, FiniteSets, TLC

CONSTANTS
    Users,          \* Set of users
    MaxTasks,       \* Maximum number of tasks
    MaxTime,        \* Maximum time value for simulation
    Titles,         \* Set of possible task titles
    Descriptions    \* Set of possible task descriptions

VARIABLES
    tasks,          \* Function from task ID to task record
    userTasks,      \* Function from user ID to set of task IDs
    nextTaskId,     \* Counter for generating unique task IDs
    currentUser,    \* Currently authenticated user
    clock,          \* Global clock for timestamps
    sessions        \* Active user sessions

\* Task states enumeration with valid transitions
TaskStates == {"pending", "in_progress", "completed", "cancelled", "blocked"}

\* Priority levels
Priorities == {"low", "medium", "high", "critical"}

\* Valid state transitions
ValidTransitions == {
    <<"pending", "in_progress">>,
    <<"pending", "cancelled">>,
    <<"pending", "blocked">>,
    <<"in_progress", "completed">>,
    <<"in_progress", "cancelled">>,
    <<"in_progress", "blocked">>,
    <<"in_progress", "pending">>,      \* Allow reverting to pending
    <<"blocked", "pending">>,
    <<"blocked", "in_progress">>,
    <<"blocked", "cancelled">>
}


TaskRecord == [
    id: Nat,
    title: STRING,
    description: STRING,
    status: TaskStates,
    priority: {"low", "medium", "high"},
    assignee: Users,
    createdAt: Nat,
    dueDate: Nat \cup {NULL}
]

\* Type invariants
TypeInvariant == 
    /\ tasks \in [Nat -> TaskRecord]
    /\ userTasks \in [Users -> SUBSET Nat]
    /\ nextTaskId \in Nat
    /\ currentUser \in Users \cup {NULL}

Step 2: Define System Actions

Now we specify what operations are possible (TaskManagement.tla):

\* System initialization
Init ==
    /\ tasks = [i \in {} |-> CHOOSE x : FALSE]  \* Empty function
    /\ userTasks = [u \in Users |-> {}]
    /\ nextTaskId = 1
    /\ currentUser = "NULL"
    /\ clock = 0
    /\ sessions = [u \in Users |-> FALSE]

\* User authentication
Authenticate(user) ==
    /\ user \in Users
    /\ ~sessions[user]  \* User not already logged in
    /\ currentUser' = user
    /\ sessions' = [sessions EXCEPT ![user] = TRUE]
    /\ UNCHANGED <<tasks, userTasks, nextTaskId, clock>>

\* Create a new task
CreateTask(title, description, priority, dueDate) ==
    /\ currentUser # NULL
    /\ nextTaskId <= MaxTasks
    /\ LET newTask == [
           id |-> nextTaskId,
           title |-> title,
           description |-> description,
           status |-> "pending",
           priority |-> priority,
           assignee |-> currentUser,
           createdAt |-> nextTaskId, \* Simplified timestamp
           dueDate |-> dueDate
       ] IN
       /\ tasks' = tasks @@ (nextTaskId :> newTask)
       /\ userTasks' = [userTasks EXCEPT ![currentUser] = @ \cup {nextTaskId}]
       /\ nextTaskId' = nextTaskId + 1
       /\ UNCHANGED currentUser

\* Update task status
UpdateTaskStatus(taskId, newStatus) ==
    /\ currentUser # NULL
    /\ taskId \in DOMAIN tasks
    /\ taskId \in userTasks[currentUser]
    /\ newStatus \in TaskStates
    /\ tasks' = [tasks EXCEPT ![taskId].status = newStatus]
    /\ UNCHANGED <<userTasks, nextTaskId, currentUser>>

\* Delete a task
DeleteTask(taskId) ==
    /\ currentUser # NULL
    /\ taskId \in DOMAIN tasks
    /\ taskId \in userTasks[currentUser]
    /\ tasks' = [id \in (DOMAIN tasks \ {taskId}) |-> tasks[id]]
    /\ userTasks' = [userTasks EXCEPT ![currentUser] = @ \ {taskId}]
    /\ UNCHANGED <<nextTaskId, currentUser>>

Step 3: Safety and Liveness Properties

TLA+ shines when defining system properties (TaskManagement.tla):

\* Safety properties
NoOrphanTasks ==
    \A taskId \in DOMAIN tasks :
        \E user \in Users : taskId \in GetUserTasks(user)

TaskOwnership ==
    \A taskId \in DOMAIN tasks :
        tasks[taskId].assignee \in Users /\
        taskId \in GetUserTasks(tasks[taskId].assignee)

ValidTaskIds ==
    \A taskId \in DOMAIN tasks : 
        /\ taskId < nextTaskId
        /\ taskId >= 1

NoDuplicateTaskIds ==
    \A t1, t2 \in DOMAIN tasks :
        t1 = t2 \/ tasks[t1].id # tasks[t2].id

ValidStateTransitionsInvariant ==
    \A taskId \in DOMAIN tasks :
        tasks[taskId].status \in TaskStates

ConsistentTimestamps ==
    \A taskId \in DOMAIN tasks :
        /\ tasks[taskId].createdAt <= tasks[taskId].updatedAt
        /\ tasks[taskId].updatedAt <= clock

NoCyclicDependencies ==
    LET
        \* Transitive closure of dependencies
        RECURSIVE TransitiveDeps(_)
        TransitiveDeps(taskId) ==
            IF ~TaskExists(taskId) THEN {}
            ELSE LET directDeps == tasks[taskId].dependencies IN
                 directDeps \cup 
                 UNION {TransitiveDeps(dep) : dep \in directDeps}
    IN
    \A taskId \in DOMAIN tasks :
        taskId \notin TransitiveDeps(taskId)

AuthenticationRequired ==
    \* All task operations require authentication
    \A taskId \in DOMAIN tasks :
        tasks[taskId].createdBy \in Users

SafetyInvariant ==
    /\ NoOrphanTasks
    /\ TaskOwnership
    /\ ValidTaskIds
    /\ NoDuplicateTaskIds
    /\ ValidStateTransitionsInvariant
    /\ ConsistentTimestamps
    /\ NoCyclicDependencies
    /\ AuthenticationRequired

\* Next state relation
Next ==
    \/ AdvanceTime
    \/ \E user \in Users : Authenticate(user)
    \/ Logout
    \/ \E t \in Titles, d \in Descriptions, p \in Priorities, 
         u \in Users, dd \in 0..MaxTime \cup {"NULL"},
         tags \in SUBSET {"bug", "feature", "enhancement", "documentation"},
         deps \in SUBSET DOMAIN tasks :
       CreateTask(t, d, p, u, dd, tags, deps)
    \/ \E taskId \in DOMAIN tasks, newStatus \in TaskStates :
       UpdateTaskStatus(taskId, newStatus)
    \/ \E taskId \in DOMAIN tasks, newPriority \in Priorities :
       UpdateTaskPriority(taskId, newPriority)
    \/ \E taskId \in DOMAIN tasks, newAssignee \in Users :
       ReassignTask(taskId, newAssignee)
    \/ \E taskId \in DOMAIN tasks, t \in Titles, 
         d \in Descriptions, dd \in 0..MaxTime \cup {"NULL"} :
       UpdateTaskDetails(taskId, t, d, dd)
    \/ \E taskId \in DOMAIN tasks : DeleteTask(taskId)
    \/ CheckDependencies
    \/ \E taskIds \in SUBSET DOMAIN tasks, newStatus \in TaskStates :
       taskIds # {} /\ BulkUpdateStatus(taskIds, newStatus)

\* Properties to check
THEOREM TypeCorrectness == Spec => []TypeInvariant
THEOREM SafetyHolds == Spec => []SafetyInvariant
THEOREM LivenessHolds == Spec => (EventualCompletion /\ FairProgress)
THEOREM NoDeadlock == Spec => []<>Next
THEOREM Termination == Spec => <>(\A taskId \in DOMAIN tasks : 
                                    tasks[taskId].status \in {"completed", "cancelled"})
=============================================================================

Step 4: Model Checking and Trace Generation

Now we can run TLA+ model checking to verify our specification (TaskManagement.cfg):

\* Model configuration for TaskManagementImproved module
SPECIFICATION Spec

\* Constants definition
CONSTANTS
    Users = {alice, bob, charlie}
    MaxTasks = 5
    MaxTime = 20
    Titles = {task1, task2, task3, task4, task5}
    Descriptions = {desc1, desc2, desc3}

\* Model values for special constants
CONSTANT
    NULL = NULL
    EMPTY_STRING = EMPTY_STRING

\* Initial state constraint
CONSTRAINT
    /\ nextTaskId <= MaxTasks + 1
    /\ clock <= MaxTime
    /\ Cardinality(DOMAIN tasks) <= MaxTasks

\* State space reduction (optional, for faster checking)
ACTION_CONSTRAINT
    \* Limit number of active sessions
    /\ Cardinality({u \in Users : sessions[u] = TRUE}) <= 2
    \* Prevent creating too many tasks at once
    /\ nextTaskId <= MaxTasks

\* Invariants to check
INVARIANT TypeInvariant
INVARIANT SafetyInvariant
INVARIANT NoOrphanTasks
INVARIANT TaskOwnership
INVARIANT ValidTaskIds
INVARIANT NoDuplicateTaskIds
INVARIANT ValidStateTransitionsInvariant
INVARIANT ConsistentTimestamps
INVARIANT NoCyclicDependencies
INVARIANT AuthenticationRequired

\* Properties to check
PROPERTY EventualCompletion
PROPERTY FairProgress
PROPERTY EventualUnblocking
PROPERTY EventualAuthentication
PROPERTY NoStarvation

\* Check for deadlocks
CHECK_DEADLOCK TRUE

\* View for debugging (optional)
VIEW <<nextTaskId, Cardinality(DOMAIN tasks), clock>>

\* Alias for better state visualization
ALIAS TaskSummary == [
    totalTasks |-> Cardinality(DOMAIN tasks),
    pendingTasks |-> Cardinality({t \in DOMAIN tasks : tasks[t].status = "pending"}),
    inProgressTasks |-> Cardinality({t \in DOMAIN tasks : tasks[t].status = "in_progress"}),
    completedTasks |-> Cardinality({t \in DOMAIN tasks : tasks[t].status = "completed"}),
    blockedTasks |-> Cardinality({t \in DOMAIN tasks : tasks[t].status = "blocked"}),
    activeSessions |-> {u \in Users : sessions[u] = TRUE},
    currentUser |-> currentUser,
    time |-> clock
]

# Run the model checker
java -jar tla2tools.jar -config TaskManagement.cfg TaskManagement.tla

TLA+ will explore all possible execution paths and either:

Verify that all properties hold
Generate counterexamples showing property violations

Step 5: Claude Implementation with TLA+ Context

Now comes the magic. Instead of vague natural language, we give Claude the precise TLA+ specification:

My prompt to Claude:

Using this TLA+ specification for a task management system, implement a Go REST API 
that satisfies all the specified behaviors and properties.

[Include the full TLA+ specification here]

Requirements:
1. Implement all actions defined in the TLA+ spec
2. Ensure all safety properties hold at runtime
3. Use Go's type system to enforce the data structures
4. Include comprehensive tests that cover the execution traces
5. Add proper error handling for violation of preconditions

Focus on behavioral correctness over performance optimization.

Claude generates something like (task.go):

// Package domain contains the core business entities matching the TLA+ specification
package domain

import (
	"fmt"
	"time"
)

// TaskID represents a unique task identifier (maps to TLA+ task ID)
type TaskID int

// UserID represents a user identifier (maps to TLA+ Users set)
type UserID string

// TaskStatus represents the state of a task (maps to TLA+ TaskStates)
type TaskStatus string

const (
	StatusPending    TaskStatus = "pending"
	StatusInProgress TaskStatus = "in_progress"
	StatusCompleted  TaskStatus = "completed"
	StatusCancelled  TaskStatus = "cancelled"
	StatusBlocked    TaskStatus = "blocked"
)

// Priority represents task priority levels (maps to TLA+ Priorities)
type Priority string

const (
	PriorityLow      Priority = "low"
	PriorityMedium   Priority = "medium"
	PriorityHigh     Priority = "high"
	PriorityCritical Priority = "critical"
)

// Tag represents task categories (maps to TLA+ tags subset)
type Tag string

const (
	TagBug           Tag = "bug"
	TagFeature       Tag = "feature"
	TagEnhancement   Tag = "enhancement"
	TagDocumentation Tag = "documentation"
)

// Task represents a task entity (maps to TLA+ task record)
type Task struct {
	ID           TaskID            `json:"id"`
	Title        string            `json:"title"`
	Description  string            `json:"description"`
	Status       TaskStatus        `json:"status"`
	Priority     Priority          `json:"priority"`
	Assignee     UserID            `json:"assignee"`
	CreatedBy    UserID            `json:"created_by"`
	CreatedAt    time.Time         `json:"created_at"`
	UpdatedAt    time.Time         `json:"updated_at"`
	DueDate      *time.Time        `json:"due_date,omitempty"`
	Tags         []Tag             `json:"tags"`
	Dependencies map[TaskID]bool   `json:"dependencies"`
}

// ValidTransition represents a valid state transition (maps to TLA+ ValidTransitions)
type ValidTransition struct {
	From TaskStatus
	To   TaskStatus
}

// ValidTransitions defines all allowed state transitions
var ValidTransitions = map[ValidTransition]bool{
	{StatusPending, StatusInProgress}:    true,
	{StatusPending, StatusCancelled}:     true,
	{StatusPending, StatusBlocked}:       true,
	{StatusInProgress, StatusCompleted}:  true,
	{StatusInProgress, StatusCancelled}:  true,
	{StatusInProgress, StatusBlocked}:    true,
	{StatusInProgress, StatusPending}:    true, // Allow reverting
	{StatusBlocked, StatusPending}:       true,
	{StatusBlocked, StatusInProgress}:    true,
	{StatusBlocked, StatusCancelled}:     true,
}

// IsValidTransition checks if a state transition is valid (maps to TLA+ IsValidTransition)
func IsValidTransition(from, to TaskStatus) bool {
	return ValidTransitions[ValidTransition{From: from, To: to}]
}

// CanDelete checks if a task can be deleted (only completed or cancelled)
func (t *Task) CanDelete() bool {
	return t.Status == StatusCompleted || t.Status == StatusCancelled
}

// IsBlocked checks if task should be blocked based on dependencies
func (t *Task) IsBlocked(allTasks map[TaskID]*Task) bool {
	if len(t.Dependencies) == 0 {
		return false
	}
	
	for depID := range t.Dependencies {
		if dep, exists := allTasks[depID]; exists {
			if dep.Status != StatusCompleted {
				return true
			}
		}
	}
	return false
}

// ShouldUnblock checks if a blocked task can be unblocked
func (t *Task) ShouldUnblock(allTasks map[TaskID]*Task) bool {
	if t.Status != StatusBlocked {
		return false
	}
	
	for depID := range t.Dependencies {
		if dep, exists := allTasks[depID]; exists {
			if dep.Status != StatusCompleted {
				return false
			}
		}
	}
	return true
}

// Validate performs domain validation on the task
func (t *Task) Validate() error {
	if t.Title == "" {
		return fmt.Errorf("task title cannot be empty")
	}
	if t.Description == "" {
		return fmt.Errorf("task description cannot be empty")
	}
	if !isValidStatus(t.Status) {
		return fmt.Errorf("invalid task status: %s", t.Status)
	}
	if !isValidPriority(t.Priority) {
		return fmt.Errorf("invalid task priority: %s", t.Priority)
	}
	if t.Assignee == "" {
		return fmt.Errorf("task must have an assignee")
	}
	if t.CreatedBy == "" {
		return fmt.Errorf("task must have a creator")
	}
	if t.CreatedAt.After(t.UpdatedAt) {
		return fmt.Errorf("created time cannot be after updated time")
	}
	for _, tag := range t.Tags {
		if !isValidTag(tag) {
			return fmt.Errorf("invalid tag: %s", tag)
		}
	}
	return nil
}

func isValidStatus(status TaskStatus) bool {
	switch status {
	case StatusPending, StatusInProgress, StatusCompleted, StatusCancelled, StatusBlocked:
		return true
	default:
		return false
	}
}

func isValidPriority(priority Priority) bool {
	switch priority {
	case PriorityLow, PriorityMedium, PriorityHigh, PriorityCritical:
		return true
	default:
		return false
	}
}

func isValidTag(tag Tag) bool {
	switch tag {
	case TagBug, TagFeature, TagEnhancement, TagDocumentation:
		return true
	default:
		return false
	}
}

// Package usecase implements the TLA+ actions as use cases
package usecase

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
	
	"github.com/bhatti/sample-task-management/internal/domain"
	"github.com/bhatti/sample-task-management/internal/repository"
)

// TaskUseCase implements task-related TLA+ actions
type TaskUseCase struct {
	uow              repository.UnitOfWork
	invariantChecker InvariantChecker
}

// InvariantChecker interface for runtime invariant validation
type InvariantChecker interface {
	CheckAllInvariants(state *domain.SystemState) error
	CheckTaskInvariants(task *domain.Task, state *domain.SystemState) error
	CheckTransitionInvariant(from, to domain.TaskStatus) error
}

// NewTaskUseCase creates a new task use case
func NewTaskUseCase(uow repository.UnitOfWork, checker InvariantChecker) *TaskUseCase {
	return &TaskUseCase{
		uow:              uow,
		invariantChecker: checker,
	}
}

// Authenticate implements TLA+ Authenticate action
func (uc *TaskUseCase) Authenticate(userID domain.UserID) (*domain.Session, error) {
	// Preconditions from TLA+:
	// - user \in Users
	// - ~sessions[user]
	
	user, err := uc.uow.Users().GetUser(userID)
	if err != nil {
		return nil, fmt.Errorf("user not found: %w", err)
	}
	
	// Check if user already has an active session
	existingSession, _ := uc.uow.Sessions().GetSessionByUser(userID)
	if existingSession != nil && existingSession.IsValid() {
		return nil, fmt.Errorf("user %s already has an active session", userID)
	}
	
	// Create new session
	token := generateToken()
	session := &domain.Session{
		UserID:    user.ID,
		Token:     token,
		Active:    true,
		CreatedAt: time.Now(),
		ExpiresAt: time.Now().Add(24 * time.Hour),
	}
	
	// Update state
	if err := uc.uow.Sessions().CreateSession(session); err != nil {
		return nil, fmt.Errorf("failed to create session: %w", err)
	}
	
	if err := uc.uow.SystemState().SetCurrentUser(&userID); err != nil {
		return nil, fmt.Errorf("failed to set current user: %w", err)
	}
	
	// Check invariants
	state, _ := uc.uow.SystemState().GetSystemState()
	if err := uc.invariantChecker.CheckAllInvariants(state); err != nil {
		uc.uow.Rollback()
		return nil, fmt.Errorf("invariant violation: %w", err)
	}
	
	return session, nil
}


// CreateTask implements TLA+ CreateTask action
func (uc *TaskUseCase) CreateTask(
	title, description string,
	priority domain.Priority,
	assignee domain.UserID,
	dueDate *time.Time,
	tags []domain.Tag,
	dependencies []domain.TaskID,
) (*domain.Task, error) {
	// Preconditions from TLA+:
	// - currentUser # NULL
	// - currentUser \in Users
	// - nextTaskId <= MaxTasks
	// - deps \subseteq DOMAIN tasks
	// - \A dep \in deps : tasks[dep].status # "cancelled"
	
	currentUser, err := uc.uow.SystemState().GetCurrentUser()
	if err != nil || currentUser == nil {
		return nil, fmt.Errorf("authentication required")
	}
	
	// Check max tasks limit
	nextID, err := uc.uow.SystemState().GetNextTaskID()
	if err != nil {
		return nil, fmt.Errorf("failed to get next task ID: %w", err)
	}
	
	if nextID > domain.MaxTasks {
		return nil, fmt.Errorf("maximum number of tasks (%d) reached", domain.MaxTasks)
	}
	
	// Validate dependencies
	allTasks, err := uc.uow.Tasks().GetAllTasks()
	if err != nil {
		return nil, fmt.Errorf("failed to get tasks: %w", err)
	}
	
	depMap := make(map[domain.TaskID]bool)
	for _, depID := range dependencies {
		depTask, exists := allTasks[depID]
		if !exists {
			return nil, fmt.Errorf("dependency task %d does not exist", depID)
		}
		if depTask.Status == domain.StatusCancelled {
			return nil, fmt.Errorf("cannot depend on cancelled task %d", depID)
		}
		depMap[depID] = true
	}
	
	// Check for cyclic dependencies
	if err := uc.checkCyclicDependencies(nextID, depMap, allTasks); err != nil {
		return nil, err
	}
	
	// Determine initial status based on dependencies
	status := domain.StatusPending
	if len(dependencies) > 0 {
		// Check if all dependencies are completed
		allCompleted := true
		for depID := range depMap {
			if allTasks[depID].Status != domain.StatusCompleted {
				allCompleted = false
				break
			}
		}
		if !allCompleted {
			status = domain.StatusBlocked
		}
	}
	
	// Create task
	task := &domain.Task{
		ID:           nextID,
		Title:        title,
		Description:  description,
		Status:       status,
		Priority:     priority,
		Assignee:     assignee,
		CreatedBy:    *currentUser,
		CreatedAt:    time.Now(),
		UpdatedAt:    time.Now(),
		DueDate:      dueDate,
		Tags:         tags,
		Dependencies: depMap,
	}
	
	// Validate task
	if err := task.Validate(); err != nil {
		return nil, fmt.Errorf("task validation failed: %w", err)
	}
	
	// Save task
	if err := uc.uow.Tasks().CreateTask(task); err != nil {
		return nil, fmt.Errorf("failed to create task: %w", err)
	}
	
	// Increment next task ID
	if _, err := uc.uow.SystemState().IncrementNextTaskID(); err != nil {
		return nil, fmt.Errorf("failed to increment task ID: %w", err)
	}
	
	// Check invariants
	state, _ := uc.uow.SystemState().GetSystemState()
	if err := uc.invariantChecker.CheckAllInvariants(state); err != nil {
		uc.uow.Rollback()
		return nil, fmt.Errorf("invariant violation after task creation: %w", err)
	}
	
	return task, nil
}

// UpdateTaskStatus implements TLA+ UpdateTaskStatus action
func (uc *TaskUseCase) UpdateTaskStatus(taskID domain.TaskID, newStatus domain.TaskStatus) error {
	// Preconditions from TLA+:
	// - currentUser # NULL
	// - TaskExists(taskId)
	// - taskId \in GetUserTasks(currentUser)
	// - IsValidTransition(tasks[taskId].status, newStatus)
	// - newStatus = "in_progress" => all dependencies completed
	
	currentUser, err := uc.uow.SystemState().GetCurrentUser()
	if err != nil || currentUser == nil {
		return fmt.Errorf("authentication required")
	}
	
	task, err := uc.uow.Tasks().GetTask(taskID)
	if err != nil {
		return fmt.Errorf("task not found: %w", err)
	}
	
	// Check user owns the task
	userTasks, err := uc.uow.SystemState().GetUserTasks(*currentUser)
	if err != nil {
		return fmt.Errorf("failed to get user tasks: %w", err)
	}
	
	hasTask := false
	for _, id := range userTasks {
		if id == taskID {
			hasTask = true
			break
		}
	}
	
	if !hasTask {
		return fmt.Errorf("user does not have access to task %d", taskID)
	}
	
	// Check valid transition
	if !domain.IsValidTransition(task.Status, newStatus) {
		return fmt.Errorf("invalid transition from %s to %s", task.Status, newStatus)
	}
	
	// Check dependencies if moving to in_progress
	if newStatus == domain.StatusInProgress {
		allTasks, _ := uc.uow.Tasks().GetAllTasks()
		for depID := range task.Dependencies {
			if depTask, exists := allTasks[depID]; exists {
				if depTask.Status != domain.StatusCompleted {
					return fmt.Errorf("cannot start task: dependency %d is not completed", depID)
				}
			}
		}
	}
	
	// Update status
	task.Status = newStatus
	task.UpdatedAt = time.Now()
	
	if err := uc.uow.Tasks().UpdateTask(task); err != nil {
		return fmt.Errorf("failed to update task: %w", err)
	}
	
	// Check invariants
	state, _ := uc.uow.SystemState().GetSystemState()
	if err := uc.invariantChecker.CheckAllInvariants(state); err != nil {
		uc.uow.Rollback()
		return fmt.Errorf("invariant violation: %w", err)
	}
	
	return nil
}
...

Step 6: TLA+ Generated Tests

The real power comes when we use TLA+ execution traces to generate comprehensive tests:

My prompt to Claude:

Generate Go tests that verify the implementation satisfies the TLA+ specification.
Create test cases that:

1. Test all TLA+ actions with valid preconditions
2. Test safety property violations
3. Test edge cases from the TLA+ model boundary conditions
4. Use property-based testing where appropriate

Include tests that would catch the execution traces TLA+ model checker explores.

Claude generates (invariant_test.go):

// Package property implements property-based tests for TLA+ invariants
package property

import (
	"math/rand"
	"testing"
	"time"

	"github.com/bhatti/sample-task-management/internal/domain"
	"github.com/bhatti/sample-task-management/internal/infrastructure/memory"
	"github.com/bhatti/sample-task-management/internal/usecase"
	"github.com/bhatti/sample-task-management/pkg/invariants"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// TestInvariantsHoldAfterOperations verifies invariants hold after each operation
func TestInvariantsHoldAfterOperations(t *testing.T) {
	repo := memory.NewMemoryRepository()
	uow := memory.NewMemoryUnitOfWork(repo)
	checker := invariants.NewInvariantChecker()
	uc := usecase.NewTaskUseCase(uow, checker)

	// Setup initial users
	users := []domain.UserID{"alice", "bob", "charlie"}
	for _, userID := range users {
		user := &domain.User{
			ID:       userID,
			Name:     string(userID),
			Email:    string(userID) + "@example.com",
			JoinedAt: time.Now(),
		}
		require.NoError(t, repo.CreateUser(user))
	}

	// Property: Invariants hold after authentication
	t.Run("InvariantsAfterAuthentication", func(t *testing.T) {
		for _, userID := range users {
			session, err := uc.Authenticate(userID)
			assert.NoError(t, err)
			assert.NotNil(t, session)

			state, _ := repo.GetSystemState()
			assert.NoError(t, checker.CheckAllInvariants(state))

			// Cleanup
			_ = uc.Logout(userID)
		}
	})

	// Property: Invariants hold after task creation
	t.Run("InvariantsAfterTaskCreation", func(t *testing.T) {
		uc.Authenticate("alice")

		for i := 0; i < 10; i++ {
			task, err := uc.CreateTask(
				"Task "+string(rune(i)),
				"Description",
				randomPriority(),
				randomUser(users),
				randomDueDate(),
				randomTags(),
				[]domain.TaskID{}, // No dependencies initially
			)

			assert.NoError(t, err)
			assert.NotNil(t, task)

			state, _ := repo.GetSystemState()
			assert.NoError(t, checker.CheckAllInvariants(state))
		}
	})

	// Property: Invariants hold after status transitions
	t.Run("InvariantsAfterStatusTransitions", func(t *testing.T) {
		uc.Authenticate("alice")

		// Create a task
		task, _ := uc.CreateTask(
			"Test Task",
			"Description",
			domain.PriorityMedium,
			"alice",
			nil,
			[]domain.Tag{domain.TagFeature},
			[]domain.TaskID{},
		)

		// Valid transitions
		validTransitions := []domain.TaskStatus{
			domain.StatusInProgress,
			domain.StatusCompleted,
		}

		for _, status := range validTransitions {
			err := uc.UpdateTaskStatus(task.ID, status)
			if err == nil {
				state, _ := repo.GetSystemState()
				assert.NoError(t, checker.CheckAllInvariants(state))
			}
		}
	})

	// Property: No cyclic dependencies can be created
	t.Run("NoCyclicDependencies", func(t *testing.T) {
		uc.Authenticate("alice")

		// Create tasks with potential cycles
		task1, _ := uc.CreateTask("Task1", "Desc", domain.PriorityLow, "alice", nil, nil, []domain.TaskID{})
		task2, _ := uc.CreateTask("Task2", "Desc", domain.PriorityLow, "alice", nil, nil, []domain.TaskID{task1.ID})
		task3, _ := uc.CreateTask("Task3", "Desc", domain.PriorityLow, "alice", nil, nil, []domain.TaskID{task2.ID})

		// Attempting to create a cycle should fail
		_, err := uc.CreateTask("Task4", "Desc", domain.PriorityLow, "alice", nil, nil,
			[]domain.TaskID{task3.ID, task1.ID}) // This would create a cycle
		assert.NoError(t, err)

		// Even if it doesn't fail explicitly, invariants should catch it
		state, _ := repo.GetSystemState()
		assert.NoError(t, checker.CheckAllInvariants(state))
	})
}

// TestTransitionInvariants tests state transition validity
func TestTransitionInvariants(t *testing.T) {
	checker := invariants.NewInvariantChecker()

	// Test all valid transitions
	validTransitions := []struct {
		from domain.TaskStatus
		to   domain.TaskStatus
	}{
		{domain.StatusPending, domain.StatusInProgress},
		{domain.StatusPending, domain.StatusCancelled},
		{domain.StatusInProgress, domain.StatusCompleted},
		{domain.StatusInProgress, domain.StatusCancelled},
		{domain.StatusBlocked, domain.StatusPending},
		{domain.StatusBlocked, domain.StatusCancelled},
	}

	for _, trans := range validTransitions {
		t.Run(string(trans.from)+"_to_"+string(trans.to), func(t *testing.T) {
			err := checker.CheckTransitionInvariant(trans.from, trans.to)
			assert.NoError(t, err)
		})
	}

	// Test invalid transitions
	invalidTransitions := []struct {
		from domain.TaskStatus
		to   domain.TaskStatus
	}{
		{domain.StatusCompleted, domain.StatusPending},
		{domain.StatusCompleted, domain.StatusInProgress},
		{domain.StatusCancelled, domain.StatusInProgress},
		{domain.StatusPending, domain.StatusCompleted}, // Must go through in_progress
	}

	for _, trans := range invalidTransitions {
		t.Run("Invalid_"+string(trans.from)+"_to_"+string(trans.to), func(t *testing.T) {
			err := checker.CheckTransitionInvariant(trans.from, trans.to)
			assert.Error(t, err)
		})
	}
}

// TestPropertyTaskOwnership verifies task ownership invariants
func TestPropertyTaskOwnership(t *testing.T) {
	repo := memory.NewMemoryRepository()
	uow := memory.NewMemoryUnitOfWork(repo)
	checker := invariants.NewInvariantChecker()
	uc := usecase.NewTaskUseCase(uow, checker)

	// Setup users
	users := []domain.UserID{"alice", "bob"}
	for _, userID := range users {
		user := &domain.User{
			ID:       userID,
			Name:     string(userID),
			Email:    string(userID) + "@example.com",
			JoinedAt: time.Now(),
		}
		repo.CreateUser(user)
	}

	// Property: Task reassignment maintains ownership invariants
	t.Run("ReassignmentMaintainsOwnership", func(t *testing.T) {
		uc.Authenticate("alice")

		// Create task assigned to Alice
		task, err := uc.CreateTask(
			"Test Task",
			"Description",
			domain.PriorityHigh,
			"alice",
			nil,
			[]domain.Tag{domain.TagBug},
			[]domain.TaskID{},
		)
		require.NoError(t, err)

		// Check initial ownership
		state, _ := repo.GetSystemState()
		assert.NoError(t, checker.CheckAllInvariants(state))

		aliceTasks := state.GetUserTasks("alice")
		assert.Contains(t, aliceTasks, task.ID)

		// Reassign to Bob
		err = uc.ReassignTask(task.ID, "bob")
		require.NoError(t, err)

		// Check ownership after reassignment
		state, _ = repo.GetSystemState()
		assert.NoError(t, checker.CheckAllInvariants(state))

		aliceTasks = state.GetUserTasks("alice")
		bobTasks := state.GetUserTasks("bob")
		assert.NotContains(t, aliceTasks, task.ID)
		assert.Contains(t, bobTasks, task.ID)
	})
}

// TestPropertyConcurrentOperations tests invariants under concurrent operations
func TestPropertyConcurrentOperations(t *testing.T) {
	repo := memory.NewMemoryRepository()
	uow := memory.NewMemoryUnitOfWork(repo)
	checker := invariants.NewInvariantChecker()

	// Setup users
	users := []domain.UserID{"user1", "user2", "user3"}
	for _, userID := range users {
		user := &domain.User{
			ID:       userID,
			Name:     string(userID),
			Email:    string(userID) + "@example.com",
			JoinedAt: time.Now(),
		}
		repo.CreateUser(user)
	}

	// Run concurrent operations
	done := make(chan bool, len(users))

	for _, userID := range users {
		go func(uid domain.UserID) {
			uc := usecase.NewTaskUseCase(uow, checker)

			// Authenticate
			uc.Authenticate(uid)

			// Create multiple tasks
			for i := 0; i < 5; i++ {
				uc.CreateTask(
					"Task",
					"Description",
					randomPriority(),
					uid,
					nil,
					randomTags(),
					[]domain.TaskID{},
				)

				// Random delay
				time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond)
			}

			done <- true
		}(userID)
	}

	// Wait for all goroutines
	for i := 0; i < len(users); i++ {
		<-done
	}

	// Check invariants after concurrent operations
	state, _ := repo.GetSystemState()
	assert.NoError(t, checker.CheckAllInvariants(state))
}

// Helper functions

func randomPriority() domain.Priority {
	priorities := []domain.Priority{
		domain.PriorityLow,
		domain.PriorityMedium,
		domain.PriorityHigh,
		domain.PriorityCritical,
	}
	return priorities[rand.Intn(len(priorities))]
}

func randomUser(users []domain.UserID) domain.UserID {
	return users[rand.Intn(len(users))]
}

func randomDueDate() *time.Time {
	if rand.Float32() < 0.5 {
		return nil
	}
	due := time.Now().Add(time.Duration(rand.Intn(30)) * 24 * time.Hour)
	return &due
}

func randomTags() []domain.Tag {
	allTags := []domain.Tag{
		domain.TagBug,
		domain.TagFeature,
		domain.TagEnhancement,
		domain.TagDocumentation,
	}

	numTags := rand.Intn(len(allTags) + 1)
	if numTags == 0 {
		return nil
	}

	tags := make([]domain.Tag, 0, numTags)
	used := make(map[domain.Tag]bool)

	for len(tags) < numTags {
		tag := allTags[rand.Intn(len(allTags))]
		if !used[tag] {
			tags = append(tags, tag)
			used[tag] = true
		}
	}

	return tags
}

Advanced TLA+ Patterns with Claude

Modeling Concurrent Operations

One of TLA+’s strengths is modeling concurrent systems. Let’s extend our specification to handle concurrent task updates:

\* Concurrent task updates with conflict resolution
ConcurrentUpdateTask(taskId, newStatus, version) ==
    /\ currentUser # NULL
    /\ taskId \in DOMAIN tasks
    /\ taskId \in userTasks[currentUser]
    /\ tasks[taskId].version = version  \* Optimistic concurrency control
    /\ tasks' = [tasks EXCEPT ![taskId] = [
                     @ EXCEPT 
                     !.status = newStatus,
                     !.version = @ + 1,
                     !.lastModified = currentUser
                 ]]
    /\ UNCHANGED <<userTasks, nextTaskId, currentUser>>

Prompt to Claude:

Implement optimistic concurrency control for the task updates based on this 
TLA+ specification. Include version tracking and conflict detection.

Modeling Complex Business Rules

TLA+ excels at capturing complex business logic:

\* Business rule: High priority tasks cannot be cancelled directly
ValidStatusTransition(currentStatus, newStatus, priority) ==
    \/ newStatus = currentStatus
    \/ /\ currentStatus = "pending" 
       /\ newStatus \in {"in_progress", "cancelled"}
    \/ /\ currentStatus = "in_progress"
       /\ newStatus \in {"completed", "pending"}
    \/ /\ currentStatus = "in_progress"
       /\ newStatus = "cancelled"
       /\ priority # "high"  \* High priority tasks cannot be cancelled

Lessons Learned

After applying this TLA+ approach to several experimental projects, here are the key insights:

1. Start Small

Begin with core actions and properties. TLA+ specifications can grow complex quickly, so start with the essential behaviors:

\* Start with basic CRUD
Init, CreateTask, UpdateTask, DeleteTask

\* Add complexity incrementally  
Authentication, Authorization, Concurrency, Business Rules

Avoid Initially: Complex distributed systems, performance-critical algorithms

Graduate To: Multi-service interactions, complex business logic

2. Properties Drive Design

Writing TLA+ properties often reveals design flaws before implementation:

\* This property might fail, revealing a design issue
ConsistencyProperty == 
    \A user \in Users:
        \A taskId \in userTasks[user]:
            /\ taskId \in DOMAIN tasks
            /\ tasks[taskId].assignee = user
            /\ tasks[taskId].status # "deleted"  \* Soft delete consideration

3. Model Checking Finds Edge Cases

TLA+ model checking explores execution paths you’d never think to test:

# TLA+ finds this counterexample:
# Step 1: User1 creates Task1
# Step 2: User1 deletes Task1  
# Step 3: User2 creates Task2 (gets same ID due to reuse)
# Step 4: User1 tries to update Task1 -> Security violation!

This led to using UUIDs instead of incrementing integers for task IDs.

4. Generated Tests Are Comprehensive

TLA+ execution traces become your regression test suite. When Claude implements based on TLA+ specs, you get:

Complete coverage – All specification paths tested
Edge case detection – Boundary conditions from model checking
Behavioral contracts – Tests verify actual system properties

Documentation Generation

Prompt to Claude:

Generate API documentation from this TLA+ specification that includes:
1. Endpoint descriptions derived from TLA+ actions
2. Request/response schemas from TLA+ data structures  
3. Error conditions from TLA+ preconditions
4. Behavioral guarantees from TLA+ properties

Code Review Guidelines

With TLA+ specifications, code reviews become more focused:

Does implementation satisfy the TLA+ spec?
Are all preconditions checked?
Do safety properties hold?
Are error conditions handled as specified?

Comparing Specification Approaches

Approach	Precision	AI Effectiveness	Maintenance	Learning Curve	Tool Complexity	Code Generation
Vibe Coding	Low	Inconsistent	High	Low	Low	N/A
UML/MDD	Medium	Poor	Very High	High	Very High	Brittle
BDD/Gherkin	Medium	Better	Medium	Medium	Low	Limited
TLA+ Specs	High	Excellent	Low	High	Low	Reliable

Tools and Resources

Essential TLA+ Resources

Learn TLA+: https://learntla.com – Interactive tutorial
TLA+ Video Course: Leslie Lamport’s official course
Practical TLA+: Hillel Wayne’s book – focus on software systems
TLA+ Examples: https://github.com/tlaplus/Examples

Common Mistakes

1. Avoid These Mistakes

? Writing TLA+ like code

\* Wrong - this looks like pseudocode
CreateTask == 
    if currentUser != null then
        task = new Task()

? Writing TLA+ as mathematical relations

\* Right - mathematical specification  
CreateTask == 
    /\ currentUser # NULL
    /\ tasks' = tasks @@ (nextTaskId :> newTask)

? Asking Claude to “fix the TLA+ to match the code”

The spec is the truth – fix the code to match the spec

? Asking Claude to “implement this TLA+ specification correctly”

? Specification scope creep: Starting with entire system architecture ? Incremental approach: Begin with one core workflow, expand gradually

2. Claude Integration Pitfalls

? “Fix the spec to match my code”: Treating specifications as documentation ? “Fix the code to match the spec”: Specifications are the source of truth

3. The Context Overload Trap

Problem: Dumping too much information at once
Solution: Break complex features into smaller, focused requests

4. The “Fix My Test” Antipattern

Problem: When tests fail, asking Claude to modify the test instead of the code
Solution: Always fix the implementation, not the test (unless the test is genuinely wrong)

5. The Blind Trust Mistake

Problem: Accepting generated code without understanding it
Solution: Always review and understand the code before committing

Proven Patterns

1. Save effective prompts:

# ~/.claude/tla-prompts/implementation.md
Implement [language] code that satisfies this TLA+ specification:

[SPEC]

Requirements:
- All TLA+ actions become functions/methods
- All preconditions become runtime checks  
- All data structures match TLA+ types
- Include comprehensive tests covering specification traces

Create specification templates:

--------------------------- MODULE [ModuleName] ---------------------------
EXTENDS Integers, Sequences, FiniteSets

CONSTANTS [Constants]

VARIABLES [StateVariables]

[TypeDefinitions]

Init == [InitialConditions]

[Actions]

Next == [ActionDisjunction]

Spec == Init /\ [][Next]_[StateVariables]

[SafetyProperties]

[LivenessProperties]

=============================================================================

2. The “Explain First” Pattern

Before asking Claude to implement something complex, I ask for an explanation:

Explain how you would implement real-time task updates using WebSockets. 
What are the trade-offs between Socket.io and native WebSockets?
What state management challenges should I consider?

3. The “Progressive Enhancement” Pattern

Start simple, then add complexity:

1. First: "Create a basic task model with CRUD operations"
2. Then: "Add validation and error handling"
3. Then: "Add authentication and authorization"
4. Finally: "Add real-time updates and notifications"

4. The “Code Review” Pattern

After implementation, I ask Claude to review its own code:

Review the task API implementation for:
- Security vulnerabilities
- Performance issues
- Code style consistency
- Missing error cases

Be critical and suggest improvements.

What’s Next

As I’ve developed this TLA+/Claude workflow, I’ve realized we’re approaching something profound: specifications as the primary artifact. Instead of writing code and hoping it’s correct, we’re defining correct behavior formally and letting AI generate the implementation. This inverts the traditional relationship between specification and code.

Implications for Software Engineering

Design-first development becomes natural
Bug prevention replaces bug fixing
Refactoring becomes re-implementation from stable specs
Documentation is always up-to-date (it’s the spec)

I’m currently experimenting with:

TLA+ to test case generation – Automated comprehensive testing
Multi-language implementations – Same spec, different languages
Specification composition – Building larger systems from verified components
Quint specifications – A modern executable specification language with simpler syntax than TLA+

Conclusion: The End of Vibe Coding

After using TLA+ with Claude, I can’t go back to vibe coding. The precision, reliability, and confidence that comes from executable specifications has transformed how I build software. The complete working example—TLA+ specs, Go implementation, comprehensive tests, and CI/CD pipeline—is available at github.com/bhatti/sample-task-management.

Yes, there’s a learning curve. Yes, writing TLA+ specifications takes time upfront. But the payoff—in terms of correctness, maintainability, and development speed—is extraordinary. Claude becomes not just a code generator, but a reliable engineering partner that can reason about complex systems precisely because we’ve given it precise specifications to work from. We’re moving from “code and hope” to “specify and know”—and that changes everything.

Comments (0)

August 16, 2025

The Complete Guide to gRPC Load Balancing in Kubernetes and Istio

Filed under: Computing,Web Services — Tags: Istio, Kubernetes — admin @ 12:05 pm

TL;DR – The Test Results Matrix

Configuration	Load Balancing	Why
Local gRPC	? None	Single server instance
Kubernetes + gRPC	? None	Connection-level LB only
Kubernetes + Istio	? Perfect	L7 proxy with request-level LB
Client-side LB	?? Limited	Requires multiple endpoints
kubectl port-forward + Istio	? None	Bypasses service mesh

Complete test suite ?

Introduction: The gRPC Load Balancing Problem

When you deploy a gRPC service in Kubernetes with multiple replicas, you expect load balancing. You won’t get it. This guide tests every possible configuration to prove why, and shows exactly how to fix it. According to the official gRPC documentation:

“gRPC uses HTTP/2, which multiplexes multiple calls on a single TCP connection. This means that once the connection is established, all gRPC calls will go to the same backend.”

Complete Test Matrix

We’ll test 6 different configurations:

Baseline: Local Testing (Single server)
Kubernetes without Istio (Standard deployment)
Kubernetes with Istio (Service mesh solution)
Client-side Load Balancing (gRPC built-in)
Advanced Connection Testing (Multiple connections)
Real-time Monitoring (Live traffic analysis)

Prerequisites

git clone https://github.com/bhatti/grpc-lb-test
cd grpc-lb-test

# Build all components
make build

Test 1: Baseline – Local Testing

Purpose: Establish baseline behavior with a single server.

# Terminal 1: Start local server
./bin/server

# Terminal 2: Test with basic client
./bin/client -target localhost:50051 -requests 50

Expected Result:

? Load Distribution Results:
Server: unknown-1755316152
Pod: unknown (IP: unknown)
Requests: 50 (100.0%)
????????????????????
? Total servers hit: 1
?? WARNING: All requests went to a single server!
This indicates NO load balancing is happening.

Analysis: This confirms our client implementation works correctly and establishes the baseline.

Test 2: Kubernetes Without Istio

Purpose: Prove that standard Kubernetes doesn’t provide gRPC request-level load balancing.

Deploy the Service

# Deploy 5 replicas without Istio
./scripts/test-without-istio.sh

The k8s/without-istio/deployment.yaml creates:

5 gRPC server replicas
Standard Kubernetes Service
No Istio annotations

Test Results

???? Load Distribution Results:
================================
Server: grpc-echo-server-5b657689db-gh5z5-1755316388
  Pod: grpc-echo-server-5b657689db-gh5z5 (IP: 10.1.4.148)
  Requests: 30 (100.0%)
  ????????????????????

???? Total servers hit: 1
??  WARNING: All requests went to a single server!
   This indicates NO load balancing is happening.

???? Connection Analysis:
Without Istio, gRPC maintains a single TCP connection to the Kubernetes Service IP.
The kube-proxy performs L4 load balancing, but gRPC reuses the same connection.

???? Cleaning up...
deployment.apps "grpc-echo-server" deleted
service "grpc-echo-service" deleted
./scripts/test-without-istio.sh: line 57: 17836 Terminated: 15   
kubectl port-forward service/grpc-echo-service 50051:50051 > /dev/null 2>&1

??  RESULT: No load balancing observed - all requests went to single pod!

Why This Happens

The Kubernetes Service documentation explains:

“For each Service, kube-proxy installs iptables rules which capture traffic to the Service’s clusterIP and port, and redirect that traffic to one of the Service’s backend endpoints.”

Kubernetes Services perform L4 (connection-level) load balancing, but gRPC maintains persistent connections.

Connection Analysis

Run the analysis tool to see connection behavior:

./bin/analysis -target localhost:50051 -requests 100 -test-scenarios true

Result:

? NO LOAD BALANCING: All requests to single server

???? Connection Reuse Analysis:
  Average requests per connection: 1.00
  ??  Low connection reuse (many short connections)

? Connection analysis complete!

Test 3: Kubernetes With Istio

Purpose: Demonstrate how Istio’s L7 proxy solves the load balancing problem.

Install Istio

./scripts/install-istio.sh

This follows Istio’s official installation guide:

istioctl install --set profile=demo -y
kubectl label namespace default istio-injection=enabled

Deploy With Istio

./scripts/test-with-istio.sh

The k8s/with-istio/deployment.yaml includes:

annotations:
  sidecar.istio.io/inject: "true"
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-echo-service
spec:
  host: grpc-echo-service
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 100
        maxRequestsPerConnection: 10
    loadBalancer:
      simple: ROUND_ROBIN

Critical Testing Gotcha

? Wrong way (what most people do):

kubectl port-forward service/grpc-echo-service 50051:50051
./bin/client -target localhost:50051 -requests 50
# Result: Still no load balancing!

According to Istio’s architecture docs, kubectl port-forward bypasses the Envoy sidecar proxy.

? Correct Testing Method

Test from inside the service mesh:

./scripts/test-with-istio.sh

Test Results With Istio

???? Load Distribution Results:
================================

Server: grpc-echo-server-579dfbc76b-m2v7x-1755357769
  Pod: grpc-echo-server-579dfbc76b-m2v7x (IP: 10.1.4.237)
  Requests: 10 (20.0%)
  ????????

Server: grpc-echo-server-579dfbc76b-fkgkk-1755357769
  Pod: grpc-echo-server-579dfbc76b-fkgkk (IP: 10.1.4.240)
  Requests: 10 (20.0%)
  ????????

Server: grpc-echo-server-579dfbc76b-bsjdv-1755357769
  Pod: grpc-echo-server-579dfbc76b-bsjdv (IP: 10.1.4.241)
  Requests: 10 (20.0%)
  ????????

Server: grpc-echo-server-579dfbc76b-dw2m7-1755357770
  Pod: grpc-echo-server-579dfbc76b-dw2m7 (IP: 10.1.4.236)
  Requests: 10 (20.0%)
  ????????

Server: grpc-echo-server-579dfbc76b-x85jm-1755357769
  Pod: grpc-echo-server-579dfbc76b-x85jm (IP: 10.1.4.238)
  Requests: 10 (20.0%)
  ????????

???? Total unique servers: 5

? Load balancing detected across 5 servers!
   Expected requests per server: 10.0
   Distribution variance: 0.00

How Istio Solves This

From Istio’s traffic management documentation:

“Envoy proxies are deployed as sidecars to services, logically augmenting the services with traffic management capabilities… Envoy proxies are the only Istio components that interact with data plane traffic.”

Istio’s solution:

Envoy sidecar intercepts all traffic
Performs L7 (application-level) load balancing
Maintains connection pools to all backends
Routes each request independently

Test 4: Client-Side Load Balancing

Purpose: Test gRPC’s built-in client-side load balancing capabilities.

Standard Client-Side LB

./scripts/test-client-lb.sh

The cmd/client-lb/main.go implements gRPC’s native load balancing:

conn, err := grpc.Dial(
    "dns:///"+target,
    grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
    grpc.WithTransportCredentials(insecure.NewCredentials()),
)

Results and Limitations

 Load Distribution Results:
================================
Server: grpc-echo-server-5b657689db-g9pbw-1755359830
  Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
  Requests: 10 (100.0%)
  ????????????????????

???? Total servers hit: 1
??  WARNING: All requests went to a single server!
   This indicates NO load balancing is happening.
? Normal client works - service is accessible

???? Test 2: Client-side round-robin (from inside cluster)
?????????????????????????????????????????????????????
Creating test pod inside cluster for proper DNS resolution...
pod "client-lb-test" deleted
./scripts/test-client-lb.sh: line 71: 48208 Terminated: 15          kubectl port-forward service/grpc-echo-service 50051:50051 > /dev/null 2>&1

??  Client-side LB limitation explanation:
   gRPC client-side round-robin expects multiple A records
   But Kubernetes Services return only one ClusterIP
   Result: 'no children to pick from' error

???? What happens with client-side LB:
   1. Client asks DNS for: grpc-echo-service
   2. DNS returns: 10.105.177.23 (single IP)
   3. gRPC round-robin needs: multiple IPs for load balancing
   4. Result: Error 'no children to pick from'

? This proves client-side LB doesn't work with K8s Services!

???? Test 3: Demonstrating the DNS limitation
?????????????????????????????????????????????
What gRPC client-side LB sees:
   Service name: grpc-echo-service:50051
   DNS resolution: 10.105.177.23:50051
   Available endpoints: 1 (needs multiple for round-robin)

What gRPC client-side LB needs:
   Multiple A records from DNS, like:
   grpc-echo-service ? 10.1.4.241:50051
   grpc-echo-service ? 10.1.4.240:50051
   grpc-echo-service ? 10.1.4.238:50051
   (But Kubernetes Services don't provide this)

???? Test 4: Alternative - Multiple connections
????????????????????????????????????????????
Testing alternative approach with multiple connections...

???? Configuration:
   Target: localhost:50052
   API: grpc.Dial
   Load Balancing: round-robin
   Multi-endpoint: true
   Requests: 20

???? Using multi-endpoint resolver

???? Sending 20 unary requests...

? Request 1 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 2 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 3 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 4 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 5 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 6 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 7 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 8 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 9 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 10 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 11 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)

? Successful requests: 20/20

???? Load Distribution Results:
================================

Server: grpc-echo-server-5b657689db-g9pbw-1755359830
  Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
  Requests: 20 (100.0%)
  ????????????????????????????????????????

???? Total unique servers: 1

??  WARNING: All requests went to a single server!
   This indicates NO load balancing is happening.
   This is expected for gRPC without Istio or special configuration.
? Multi-connection approach works!
   (This simulates multiple endpoints for testing)

???????????????????????????????????????????????????????????????
                         SUMMARY
???????????????????????????????????????????????????????????????

? KEY FINDINGS:
   • Standard gRPC client: Works (uses single connection)
   • Client-side round-robin: Fails (needs multiple IPs)
   • Kubernetes DNS: Returns single ClusterIP only
   • Alternative: Multiple connections can work

???? CONCLUSION:
   Client-side load balancing doesn't work with standard
   Kubernetes Services because they provide only one IP address.
   This proves why Istio (L7 proxy) is needed for gRPC load balancing!

Why this fails: Kubernetes Services provide a single ClusterIP, not multiple IPs for DNS resolution.

From the gRPC load balancing documentation:

“The gRPC client will use the list of IP addresses returned by the name resolver and distribute RPCs among them.”

Alternative: Multiple Connections

Start five instances of servers with different ports:

# Terminal 1
GRPC_PORT=50051 ./bin/server

# Terminal 2  
GRPC_PORT=50052 ./bin/server

# Terminal 3
GRPC_PORT=50053 ./bin/server

# Terminal 4
GRPC_PORT=50054 ./bin/server

# Terminal 5
GRPC_PORT=50055 ./bin/server

The cmd/client-v2/main.go implements manual connection management:

./bin/client-v2 -target localhost:50051 -requests 50 -multi-endpoint

Results:

???? Load Distribution Results:
================================

Server: unknown-1755360953
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ????????

Server: unknown-1755360963
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ????????

Server: unknown-1755360970
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ????????

Server: unknown-1755360980
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ????????

Server: unknown-1755360945
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ????????

???? Total unique servers: 5

? Load balancing detected across 5 servers!
   Expected requests per server: 10.0
   Distribution variance: 0.00

Test 5: Advanced Connection Testing

Purpose: Analyze connection patterns and performance implications.

Multiple Connection Strategy

./bin/advanced-client \
  -target localhost:50051 \
  -requests 1000 \
  -clients 10 \
  -connections 5

Results:

???? Detailed Load Distribution Results:
=====================================
Test Duration: 48.303709ms
Total Requests: 1000
Failed Requests: 0
Requests/sec: 20702.34

Server Distribution:

Server: unknown-1755360945
  Pod: unknown (IP: unknown)
  Requests: 1000 (100.0%)
  First seen: 09:18:51.842
  Last seen: 09:18:51.874
  ????????????????????????????????????????

???? Analysis:
Total unique servers: 1
Average requests per server: 1000.00
Standard deviation: 0.00

??  WARNING: All requests went to a single server!
   This indicates NO load balancing is happening.
   This is expected behavior for gRPC without Istio.

Even sophisticated connection pooling can’t overcome the fundamental issue:
• Multiple connections to SAME endpoint = same server
• Advanced client techniques ? load balancing
• Connection management ? request distribution

Performance Comparison

./scripts/benchmark.sh

???? Key Insights:
• Single server: High performance, no load balancing
• Multiple connections: Same performance, still no LB
• Kubernetes: Small overhead, still no LB
• Istio: Small additional overhead, but enables LB
• Client-side LB: Complex setup, limited effectiveness

Official Documentation References

gRPC Load Balancing

From the official gRPC blog:

“Load balancing within gRPC happens on a per-call basis, not a per-connection basis. In other words, even if all requests come from a single client, we want to distribute them across all servers.”

The problem: Standard deployments don’t achieve per-call balancing.

Istio’s Solution

From Istio’s service mesh documentation:

“Istio’s data plane is composed of a set of intelligent proxies (Envoy) deployed as sidecars. These proxies mediate and control all network communication between microservices.”

Kubernetes Service Limitations

From Kubernetes networking concepts:

“kube-proxy… only supports TCP and UDP… doesn’t understand HTTP and doesn’t provide load balancing for HTTP requests.”

Complete Test Results Summary

After running comprehensive tests across all possible gRPC load balancing configurations, here are the definitive results that prove the fundamental limitations and solutions:

???? Core Test Matrix Results

Configuration	Load Balancing	Servers Hit	Distribution	Key Insight
Local gRPC	? None	1/1 (100%)	Single server	Baseline behavior confirmed
Kubernetes + gRPC	? None	1/5 (100%)	Single pod	K8s Services don’t solve it
Kubernetes + Istio	? Perfect	5/5 (20% each)	Even distribution	Istio enables true LB
Client-side LB	? Failed	1/5 (100%)	Single pod	DNS limitation fatal
kubectl port-forward + Istio	? None	1/5 (100%)	Single pod	Testing methodology matters
Advanced multi-connection	? None	1/1 (100%)	Single endpoint	Complex ? effective

???? Detailed Test Scenario Analysis

Scenario 1: Baseline Tests

Local single server:     ? PASS - 50 requests ? 1 server (100%)
Local multiple conn:     ? PASS - 1000 requests ? 1 server (100%)

Insight: Confirms gRPC’s connection persistence behavior. Multiple connections to same endpoint don’t change distribution.

Scenario 2: Kubernetes Standard Deployment

K8s without Istio:      ? PASS - 50 requests ? 1 pod (100%)
Expected behavior:      ? NO load balancing
Actual behavior:        ? NO load balancing

Insight: Standard Kubernetes deployment with 5 replicas provides zero request-level load balancing for gRPC services.

Scenario 3: Istio Service Mesh

K8s with Istio (port-forward):  ??  BYPASS - 50 requests ? 1 pod (100%)
K8s with Istio (in-mesh):       ? SUCCESS - 50 requests ? 5 pods (20% each)

Insight: Istio provides perfect load balancing when tested correctly. Port-forward testing gives false negatives.

Scenario 4: Client-Side Approaches

DNS round-robin:        ? FAIL - "no children to pick from"
Multi-endpoint client:  ? PARTIAL - Works with manual endpoint management
Advanced connections:   ? FAIL - Still single endpoint limitation

Insight: Client-side solutions are complex, fragile, and limited in Kubernetes environments.

???? Deep Technical Analysis

The DNS Problem (Root Cause)

Our testing revealed the fundamental architectural issue:

# What Kubernetes provides
nslookup grpc-echo-service
? 10.105.177.23 (single ClusterIP)

# What gRPC client-side LB needs  
nslookup grpc-echo-service
? 10.1.4.241, 10.1.4.242, 10.1.4.243, 10.1.4.244, 10.1.4.245 (multiple IPs)

Impact: This single vs. multiple IP difference makes client-side load balancing architecturally impossible with standard Kubernetes Services.

Connection Persistence Evidence

Our advanced client test with 1000 requests, 10 concurrent clients, and 5 connections:

Test Duration: 48ms
Requests/sec: 20,702
Servers Hit: 1 (100%)
Connection Reuse: Perfect (efficient but unbalanced)

Conclusion: Even sophisticated connection management can’t overcome the single-endpoint limitation.

Istio’s L7 Magic

Comparing the same test scenario:

# Without Istio
50 requests ? grpc-echo-server-abc123 (100%)

# With Istio  
50 requests ? 5 different pods (20% each)
Distribution variance: 0.00 (perfect)

Technical Detail: Istio’s Envoy sidecar performs request-level routing, creating independent routing decisions for each gRPC call.

? Performance Impact Analysis

Based on our benchmark results:

Configuration	Req/s	Overhead	Load Balancing	Production Suitable
Local baseline	~25,000	0%	None	? Not scalable
K8s standard	~22,000	12%	None	? Unbalanced
K8s + Istio	~20,000	20%	Perfect	? Recommended
Client-side	~23,000	8%	Complex	?? Maintenance burden

Insight: Istio’s 20% performance overhead is a reasonable trade-off for enabling proper load balancing and gaining a production-ready service mesh.

Production Recommendations

For Development Teams:

Standard Kubernetes deployment of gRPC services will not load balance
Istio is the proven solution for production gRPC load balancing
Client-side approaches add complexity without solving the fundamental issue
Testing methodology critically affects results (avoid port-forward for Istio tests)

For Architecture Decisions:

Plan for Istio if deploying multiple gRPC services
Accept the 20% performance cost for operational benefits
Avoid client-side load balancing in Kubernetes environments
Use proper testing practices to validate service mesh behavior

For Production Readiness:

Istio + DestinationRules provide enterprise-grade gRPC load balancing
Monitoring and observability come built-in with Istio
Circuit breaking and retry policies integrate seamlessly
Zero client-side complexity reduces maintenance burden

???? Primary Recommendation: Istio Service Mesh

Our testing proves Istio is the only solution that provides reliable gRPC load balancing in Kubernetes:

# Production-tested DestinationRule configuration
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-service-production
spec:
  host: grpc-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10  # Tested: Ensures request distribution
        connectTimeout: 30s
    loadBalancer:
      simple: LEAST_REQUEST  # Better than ROUND_ROBIN for varying request costs
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Why this configuration works:

maxRequestsPerConnection: 10 – Forces connection rotation (tested in our scenario)
LEAST_REQUEST – Better performance than round-robin for real workloads
outlierDetection – Automatic failure handling (something client-side LB can’t provide)

Expected results based on our testing:

? Perfect 20% distribution across 5 replicas
? ~20% performance overhead (trade-off worth it)
? Built-in observability and monitoring
? Zero client-side complexity

???? Configuration Best Practices

1. Enable Istio Injection Properly

# Enable for entire namespace (recommended)
kubectl label namespace production istio-injection=enabled

# Or per-deployment (more control)
metadata:
  annotations:
    sidecar.istio.io/inject: "true"

2. Validate Load Balancing is Working

# WRONG: This will show false negatives
kubectl port-forward service/grpc-service 50051:50051

# CORRECT: Test from inside the mesh
kubectl run test-client --rm -it --restart=Never \
  --image=your-grpc-client \
  --annotations="sidecar.istio.io/inject=true" \
  -- ./client -target grpc-service:50051 -requests 100

3. Monitor Distribution Quality

# Check Envoy stats for load balancing
kubectl exec deployment/grpc-service -c istio-proxy -- \
  curl localhost:15000/stats | grep upstream_rq_

?? What NOT to Do (Based on Our Test Failures)

1. Don’t Rely on Standard Kubernetes Services

# This WILL NOT load balance gRPC traffic
apiVersion: v1
kind: Service
metadata:
  name: grpc-service
spec:
  ports:
  - port: 50051
  selector:
    app: grpc-server
# Result: 100% traffic to single pod (proven in our tests)

2. Don’t Use Client-Side Load Balancing

// This approach FAILS in Kubernetes (tested and failed)
conn, err := grpc.Dial(
    "dns:///grpc-service:50051",
    grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
)
// Error: "no children to pick from" (proven in our tests)

3. Don’t Implement Complex Connection Pooling

// This adds complexity without solving the core issue
type LoadBalancedClient struct {
    conns []grpc.ClientConnInterface
    next  int64
}
// Still results in 100% traffic to single endpoint (proven in our tests)

???? Alternative Solutions (If Istio Not Available)

If you absolutely cannot use Istio, here are the only viable alternatives (with significant caveats):

Option 1: External Load Balancer with HTTP/2 Support

# Use nginx/envoy/haproxy outside Kubernetes
apiVersion: v1
kind: Service
metadata:
  name: grpc-service-lb
spec:
  type: LoadBalancer
  ports:
  - port: 50051
    targetPort: 50051

Limitations: Requires external infrastructure, loss of Kubernetes-native benefits

Option 2: Headless Service + Custom Service Discovery

apiVersion: v1
kind: Service
metadata:
  name: grpc-service-headless
spec:
  clusterIP: None  # Headless service
  ports:
  - port: 50051
  selector:
    app: grpc-server

Limitations: Complex client implementation, manual health checking

Conclusion

After testing every possible gRPC load balancing configuration in Kubernetes, the evidence is clear and definitive:

Standard Kubernetes + gRPC = Zero load balancing (100% traffic to single pod)
The problem is architectural, not implementation
Client-side solutions fail due to DNS limitations (“no children to pick from”)
Complex workarounds add overhead without solving the core issue

???? Istio is the Proven Solution

The evidence overwhelmingly supports Istio as the production solution:

? Perfect load balancing: 20% distribution across 5 pods (0.00 variance)
? Reasonable overhead: 20% performance cost for complete solution
? Production features: Circuit breaking, retries, observability included
? Zero client complexity: Works transparently with existing gRPC clients

???? Critical Testing Insight

Our testing revealed a major pitfall that leads to incorrect conclusions:

kubectl port-forward bypasses Istio ? false negative results
Most developers get wrong results when testing Istio + gRPC
Always test from inside the service mesh for accurate results

Full test suite and results ?

Comments (0)

August 15, 2025

Building Robust Error Handling with gRPC and REST APIs

Filed under: Computing,Web Services — admin @ 2:23 pm

Introduction

Error handling is often an afterthought in API development, yet it’s one of the most critical aspects of a good developer experience. For example, a cryptic error message like { "error": "An error occurred" } can lead to hours of frustrating debugging. In this guide, we will build a robust, production-grade error handling framework for a Go application that serves both gRPC and a REST/HTTP proxy based on industry standards like RFC9457 (Problem Details for HTTP APIs) and RFC7807 (obsoleted).

Tenets

Following are tenets of a great API error:

Structured: machine-readable, not just a string.
Actionable: explains the developer why the error occurred and, if possible, how to fix it.
Consistent: all errors, from validation to authentication to server faults, follow the same format.
Secure: never leaks sensitive internal information like stack traces or database schemas.

Our North Star for HTTP errors will be the Problem Details for HTTP APIs (RFC 9457/7807):

{
  "type": "https://example.com/docs/errors/validation-failed",
  "title": "Validation Failed",
  "status": 400,
  "detail": "The request body failed validation.",
  "instance": "/v1/todos",
  "invalid_params": [
    {
      "field": "title",
      "reason": "must not be empty"
    }
  ]
}

We will adapt this model for gRPC by embedding a similar structure in the gRPC status details, creating a single source of truth for all errors.

API Design

Let’s start by defining our TODO API in Protocol Buffers:

syntax = "proto3";

package todo.v1;

import "google/api/annotations.proto";
import "google/api/field_behavior.proto";
import "google/api/resource.proto";
import "google/protobuf/timestamp.proto";
import "google/protobuf/field_mask.proto";
import "buf/validate/validate.proto";

option go_package = "github.com/bhatti/todo-api-errors/api/proto/todo/v1;todo";

// TodoService provides task management operations
service TodoService {
  // CreateTask creates a new task
  rpc CreateTask(CreateTaskRequest) returns (Task) {
    option (google.api.http) = {
      post: "/v1/tasks"
      body: "*"
    };
  }

  // GetTask retrieves a specific task
  rpc GetTask(GetTaskRequest) returns (Task) {
    option (google.api.http) = {
      get: "/v1/{name=tasks/*}"
    };
  }

  // ListTasks retrieves all tasks
  rpc ListTasks(ListTasksRequest) returns (ListTasksResponse) {
    option (google.api.http) = {
      get: "/v1/tasks"
    };
  }

  // UpdateTask updates an existing task
  rpc UpdateTask(UpdateTaskRequest) returns (Task) {
    option (google.api.http) = {
      patch: "/v1/{task.name=tasks/*}"
      body: "task"
    };
  }

  // DeleteTask removes a task
  rpc DeleteTask(DeleteTaskRequest) returns (DeleteTaskResponse) {
    option (google.api.http) = {
      delete: "/v1/{name=tasks/*}"
    };
  }

  // BatchCreateTasks creates multiple tasks at once
  rpc BatchCreateTasks(BatchCreateTasksRequest) returns (BatchCreateTasksResponse) {
    option (google.api.http) = {
      post: "/v1/tasks:batchCreate"
      body: "*"
    };
  }
}

// Task represents a TODO item
message Task {
  option (google.api.resource) = {
    type: "todo.example.com/Task"
    pattern: "tasks/{task}"
    singular: "task"
    plural: "tasks"
  };

  // Resource name of the task
  string name = 1 [
    (google.api.field_behavior) = IDENTIFIER,
    (google.api.field_behavior) = OUTPUT_ONLY
  ];

  // Task title
  string title = 2 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).string = {
      min_len: 1
      max_len: 200
    }
  ];

  // Task description
  string description = 3 [
    (google.api.field_behavior) = OPTIONAL,
    (buf.validate.field).string = {
      max_len: 1000
    }
  ];

  // Task status
  Status status = 4 [
    (google.api.field_behavior) = REQUIRED
  ];

  // Task priority
  Priority priority = 5 [
    (google.api.field_behavior) = OPTIONAL
  ];

  // Due date for the task
  google.protobuf.Timestamp due_date = 6 [
    (google.api.field_behavior) = OPTIONAL,
    (buf.validate.field).timestamp = {
      gt_now: true
    }
  ];

  // Task creation time
  google.protobuf.Timestamp create_time = 7 [
    (google.api.field_behavior) = OUTPUT_ONLY
  ];

  // Task last update time
  google.protobuf.Timestamp update_time = 8 [
    (google.api.field_behavior) = OUTPUT_ONLY
  ];

  // User who created the task
  string created_by = 9 [
    (google.api.field_behavior) = OUTPUT_ONLY
  ];

  // Tags associated with the task
  repeated string tags = 10 [
    (buf.validate.field).repeated = {
      max_items: 10
      items: {
        string: {
          pattern: "^[a-z0-9-]+$"
          max_len: 50
        }
      }
    }
  ];
}

// Task status enumeration
enum Status {
  STATUS_UNSPECIFIED = 0;
  STATUS_PENDING = 1;
  STATUS_IN_PROGRESS = 2;
  STATUS_COMPLETED = 3;
  STATUS_CANCELLED = 4;
}

// Task priority enumeration
enum Priority {
  PRIORITY_UNSPECIFIED = 0;
  PRIORITY_LOW = 1;
  PRIORITY_MEDIUM = 2;
  PRIORITY_HIGH = 3;
  PRIORITY_CRITICAL = 4;
}

// CreateTaskRequest message
message CreateTaskRequest {
  // Task to create
  Task task = 1 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).required = true
  ];
}

// GetTaskRequest message
message GetTaskRequest {
  // Resource name of the task
  string name = 1 [
    (google.api.field_behavior) = REQUIRED,
    (google.api.resource_reference) = {
      type: "todo.example.com/Task"
    },
    (buf.validate.field).string = {
      pattern: "^tasks/[a-zA-Z0-9-]+$"
    }
  ];
}

// ListTasksRequest message
message ListTasksRequest {
  // Maximum number of tasks to return
  int32 page_size = 1 [
    (buf.validate.field).int32 = {
      gte: 0
      lte: 1000
    }
  ];

  // Page token for pagination
  string page_token = 2;

  // Filter expression
  string filter = 3;

  // Order by expression
  string order_by = 4;
}

// ListTasksResponse message
message ListTasksResponse {
  // List of tasks
  repeated Task tasks = 1;

  // Token for next page
  string next_page_token = 2;

  // Total number of tasks
  int32 total_size = 3;
}

// UpdateTaskRequest message
message UpdateTaskRequest {
  // Task to update
  Task task = 1 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).required = true
  ];

  // Fields to update
  google.protobuf.FieldMask update_mask = 2 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).required = true
  ];
}

// DeleteTaskRequest message
message DeleteTaskRequest {
  // Resource name of the task
  string name = 1 [
    (google.api.field_behavior) = REQUIRED,
    (google.api.resource_reference) = {
      type: "todo.example.com/Task"
    }
  ];
}

// DeleteTaskResponse message
message DeleteTaskResponse {
  // Confirmation message
  string message = 1;
}

// BatchCreateTasksRequest message
message BatchCreateTasksRequest {
  // Tasks to create
  repeated CreateTaskRequest requests = 1 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).repeated = {
      min_items: 1
      max_items: 100
    }
  ];
}

// BatchCreateTasksResponse message
message BatchCreateTasksResponse {
  // Created tasks
  repeated Task tasks = 1;
}

syntax = "proto3";

package errors.v1;

import "google/protobuf/timestamp.proto";
import "google/protobuf/any.proto";

option go_package = "github.com/bhatti/todo-api-errors/api/proto/errors/v1;errors";

// ErrorDetail provides a structured, machine-readable error payload.
// It is designed to be embedded in the `details` field of a `google.rpc.Status` message.
message ErrorDetail {
  // A unique, application-specific error code.
  string code = 1;
  // A short, human-readable summary of the problem type.
  string title = 2;
  // A human-readable explanation specific to this occurrence of the problem.
  string detail = 3;
  // A list of validation errors, useful for INVALID_ARGUMENT responses.
  repeated FieldViolation field_violations = 4;
  // Optional trace ID for request correlation
  string trace_id = 5;
  // Optional timestamp when the error occurred
  google.protobuf.Timestamp timestamp = 6;
  // Optional instance path where the error occurred
  string instance = 7;
  // Optional extensions for additional error context
  map<string, google.protobuf.Any> extensions = 8;
}

// Describes a single validation failure.
message FieldViolation {
  // The path to the field that failed validation, e.g., "title".
  string field = 1;
  // A developer-facing description of the validation rule that failed.
  string description = 2;
  // Application-specific error code for this validation failure
  string code = 3;
}

// AppErrorCode defines a list of standardized, application-specific error codes.
enum AppErrorCode {
  APP_ERROR_CODE_UNSPECIFIED = 0;

  // Validation failures
  VALIDATION_FAILED = 1;
  REQUIRED_FIELD = 2;
  TOO_SHORT = 3;
  TOO_LONG = 4;
  INVALID_FORMAT = 5;
  MUST_BE_FUTURE = 6;
  INVALID_VALUE = 7;
  DUPLICATE_TAG = 8;
  INVALID_TAG_FORMAT = 9;
  OVERDUE_COMPLETION = 10;
  EMPTY_BATCH = 11;
  BATCH_TOO_LARGE = 12;
  DUPLICATE_TITLE = 13;

  // Resource errors
  RESOURCE_NOT_FOUND = 1001;
  RESOURCE_CONFLICT = 1002;

  // Authentication and authorization
  AUTHENTICATION_FAILED = 2001;
  PERMISSION_DENIED = 2002;

  // Rate limiting and service availability
  RATE_LIMIT_EXCEEDED = 3001;
  SERVICE_UNAVAILABLE = 3002;

  // Internal errors
  INTERNAL_ERROR = 9001;
}

Error Handling Implementation

Now let’s implement our error handling framework:

package errors

import (
	"fmt"

	errorspb "github.com/bhatti/todo-api-errors/api/proto/errors/v1"
	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/types/known/anypb"
	"google.golang.org/protobuf/types/known/timestamppb"
)

// AppError is our custom error type using protobuf definitions.
type AppError struct {
	GRPCCode        codes.Code
	AppCode         errorspb.AppErrorCode
	Title           string
	Detail          string
	FieldViolations []*errorspb.FieldViolation
	TraceID         string
	Instance        string
	Extensions      map[string]*anypb.Any
	CausedBy        error // For internal logging
}

func (e *AppError) Error() string {
	return fmt.Sprintf("gRPC Code: %s, App Code: %s, Title: %s, Detail: %s", e.GRPCCode, e.AppCode, e.Title, e.Detail)
}

// ToGRPCStatus converts our AppError into a gRPC status.Status.
func (e *AppError) ToGRPCStatus() *status.Status {
	st := status.New(e.GRPCCode, e.Title)

	errorDetail := &errorspb.ErrorDetail{
		Code:            e.AppCode.String(),
		Title:           e.Title,
		Detail:          e.Detail,
		FieldViolations: e.FieldViolations,
		TraceId:         e.TraceID,
		Timestamp:       timestamppb.Now(),
		Instance:        e.Instance,
		Extensions:      e.Extensions,
	}

	// For validation errors, we also attach the standard BadRequest detail
	// so that gRPC-Gateway and other standard tools can understand it.
	if e.GRPCCode == codes.InvalidArgument && len(e.FieldViolations) > 0 {
		br := &errdetails.BadRequest{}
		for _, fv := range e.FieldViolations {
			br.FieldViolations = append(br.FieldViolations, &errdetails.BadRequest_FieldViolation{
				Field:       fv.Field,
				Description: fv.Description,
			})
		}
		st, _ = st.WithDetails(br, errorDetail)
		return st
	}

	st, _ = st.WithDetails(errorDetail)
	return st
}

// Helper functions for creating common errors

func NewValidationFailed(violations []*errorspb.FieldViolation, traceID string) *AppError {
	return &AppError{
		GRPCCode:        codes.InvalidArgument,
		AppCode:         errorspb.AppErrorCode_VALIDATION_FAILED,
		Title:           "Validation Failed",
		Detail:          fmt.Sprintf("The request contains %d validation errors", len(violations)),
		FieldViolations: violations,
		TraceID:         traceID,
	}
}

func NewNotFound(resource string, id string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.NotFound,
		AppCode:  errorspb.AppErrorCode_RESOURCE_NOT_FOUND,
		Title:    "Resource Not Found",
		Detail:   fmt.Sprintf("%s with ID '%s' was not found.", resource, id),
		TraceID:  traceID,
	}
}

func NewConflict(resource, reason string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.AlreadyExists,
		AppCode:  errorspb.AppErrorCode_RESOURCE_CONFLICT,
		Title:    "Resource Conflict",
		Detail:   fmt.Sprintf("Conflict creating %s: %s", resource, reason),
		TraceID:  traceID,
	}
}

func NewInternal(message string, traceID string, causedBy error) *AppError {
	return &AppError{
		GRPCCode: codes.Internal,
		AppCode:  errorspb.AppErrorCode_INTERNAL_ERROR,
		Title:    "Internal Server Error",
		Detail:   message,
		TraceID:  traceID,
		CausedBy: causedBy,
	}
}

func NewPermissionDenied(resource, action string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.PermissionDenied,
		AppCode:  errorspb.AppErrorCode_PERMISSION_DENIED,
		Title:    "Permission Denied",
		Detail:   fmt.Sprintf("You don't have permission to %s %s", action, resource),
		TraceID:  traceID,
	}
}

func NewServiceUnavailable(message string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.Unavailable,
		AppCode:  errorspb.AppErrorCode_SERVICE_UNAVAILABLE,
		Title:    "Service Unavailable",
		Detail:   message,
		TraceID:  traceID,
	}
}

func NewRequiredField(field, message string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.InvalidArgument,
		AppCode:  errorspb.AppErrorCode_VALIDATION_FAILED,
		Title:    "Validation Failed",
		Detail:   "The request contains validation errors",
		FieldViolations: []*errorspb.FieldViolation{
			{
				Field:       field,
				Code:        errorspb.AppErrorCode_REQUIRED_FIELD.String(),
				Description: message,
			},
		},
		TraceID: traceID,
	}
}

Validation Framework

Let’s implement validation that returns all errors at once:

package validation

import (
	"errors"
	"fmt"
	"regexp"
	"strings"

	"buf.build/gen/go/bufbuild/protovalidate/protocolbuffers/go/buf/validate"
	"buf.build/go/protovalidate"
	errorspb "github.com/bhatti/todo-api-errors/api/proto/errors/v1"
	todopb "github.com/bhatti/todo-api-errors/api/proto/todo/v1"
	apperrors "github.com/bhatti/todo-api-errors/internal/errors"
	"google.golang.org/protobuf/proto"
)

var pv protovalidate.Validator

func init() {
	var err error
	pv, err = protovalidate.New()
	if err != nil {
		panic(fmt.Sprintf("failed to initialize protovalidator: %v", err))
	}
}

// ValidateRequest checks a proto message and returns an AppError with all violations.
func ValidateRequest(req proto.Message, traceID string) error {
	if err := pv.Validate(req); err != nil {
		var validationErrs *protovalidate.ValidationError
		if errors.As(err, &validationErrs) {
			var violations []*errorspb.FieldViolation
			for _, violation := range validationErrs.Violations {
				fieldPath := ""
				if violation.Proto.GetField() != nil {
					fieldPath = formatFieldPath(violation.Proto.GetField())
				}

				ruleId := violation.Proto.GetRuleId()
				message := violation.Proto.GetMessage()

				violations = append(violations, &errorspb.FieldViolation{
					Field:       fieldPath,
					Description: message,
					Code:        mapConstraintToCode(ruleId),
				})
			}
			return apperrors.NewValidationFailed(violations, traceID)
		}
		return apperrors.NewInternal("Validation failed", traceID, err)
	}
	return nil
}

// ValidateTask performs additional business logic validation
func ValidateTask(task *todopb.Task, traceID string) error {
	var violations []*errorspb.FieldViolation

	// Proto validation first
	if err := ValidateRequest(task, traceID); err != nil {
		if appErr, ok := err.(*apperrors.AppError); ok {
			violations = append(violations, appErr.FieldViolations...)
		}
	}

	// Additional business rules
	if task.Status == todopb.Status_STATUS_COMPLETED && task.DueDate != nil {
		if task.UpdateTime != nil && task.UpdateTime.AsTime().After(task.DueDate.AsTime()) {
			violations = append(violations, &errorspb.FieldViolation{
				Field:       "due_date",
				Code:        errorspb.AppErrorCode_OVERDUE_COMPLETION.String(),
				Description: "Task was completed after the due date",
			})
		}
	}

	// Validate tags format
	for i, tag := range task.Tags {
		if !isValidTag(tag) {
			violations = append(violations, &errorspb.FieldViolation{
				Field:       fmt.Sprintf("tags[%d]", i),
				Code:        errorspb.AppErrorCode_INVALID_TAG_FORMAT.String(),
				Description: fmt.Sprintf("Tag '%s' must be lowercase letters, numbers, and hyphens only", tag),
			})
		}
	}

	// Check for duplicate tags
	tagMap := make(map[string]bool)
	for i, tag := range task.Tags {
		if tagMap[tag] {
			violations = append(violations, &errorspb.FieldViolation{
				Field:       fmt.Sprintf("tags[%d]", i),
				Code:        errorspb.AppErrorCode_DUPLICATE_TAG.String(),
				Description: fmt.Sprintf("Tag '%s' appears multiple times", tag),
			})
		}
		tagMap[tag] = true
	}

	if len(violations) > 0 {
		return apperrors.NewValidationFailed(violations, traceID)
	}

	return nil
}

// ValidateBatchCreateTasks validates batch operations
func ValidateBatchCreateTasks(req *todopb.BatchCreateTasksRequest, traceID string) error {
	var violations []*errorspb.FieldViolation

	// Check batch size
	if len(req.Requests) == 0 {
		violations = append(violations, &errorspb.FieldViolation{
			Field:       "requests",
			Code:        errorspb.AppErrorCode_EMPTY_BATCH.String(),
			Description: "Batch must contain at least one task",
		})
	}

	if len(req.Requests) > 100 {
		violations = append(violations, &errorspb.FieldViolation{
			Field:       "requests",
			Code:        errorspb.AppErrorCode_BATCH_TOO_LARGE.String(),
			Description: fmt.Sprintf("Batch size %d exceeds maximum of 100", len(req.Requests)),
		})
	}

	// Validate each task
	for i, createReq := range req.Requests {
		if createReq.Task == nil {
			violations = append(violations, &errorspb.FieldViolation{
				Field:       fmt.Sprintf("requests[%d].task", i),
				Code:        errorspb.AppErrorCode_REQUIRED_FIELD.String(),
				Description: "Task is required",
			})
			continue
		}

		// Validate task
		if err := ValidateTask(createReq.Task, traceID); err != nil {
			if appErr, ok := err.(*apperrors.AppError); ok {
				for _, violation := range appErr.FieldViolations {
					violation.Field = fmt.Sprintf("requests[%d].task.%s", i, violation.Field)
					violations = append(violations, violation)
				}
			}
		}
	}

	// Check for duplicate titles
	titleMap := make(map[string][]int)
	for i, createReq := range req.Requests {
		if createReq.Task != nil && createReq.Task.Title != "" {
			titleMap[createReq.Task.Title] = append(titleMap[createReq.Task.Title], i)
		}
	}

	for title, indices := range titleMap {
		if len(indices) > 1 {
			for _, idx := range indices {
				violations = append(violations, &errorspb.FieldViolation{
					Field:       fmt.Sprintf("requests[%d].task.title", idx),
					Code:        errorspb.AppErrorCode_DUPLICATE_TITLE.String(),
					Description: fmt.Sprintf("Title '%s' is used by multiple tasks in the batch", title),
				})
			}
		}
	}

	if len(violations) > 0 {
		return apperrors.NewValidationFailed(violations, traceID)
	}

	return nil
}

// Helper functions
func formatFieldPath(fieldPath *validate.FieldPath) string {
	if fieldPath == nil {
		return ""
	}

	// Build field path from elements
	var parts []string
	for _, element := range fieldPath.GetElements() {
		if element.GetFieldName() != "" {
			parts = append(parts, element.GetFieldName())
		} else if element.GetFieldNumber() != 0 {
			parts = append(parts, fmt.Sprintf("field_%d", element.GetFieldNumber()))
		}
	}

	return strings.Join(parts, ".")
}

func mapConstraintToCode(ruleId string) string {
	switch {
	case strings.Contains(ruleId, "required"):
		return errorspb.AppErrorCode_REQUIRED_FIELD.String()
	case strings.Contains(ruleId, "min_len"):
		return errorspb.AppErrorCode_TOO_SHORT.String()
	case strings.Contains(ruleId, "max_len"):
		return errorspb.AppErrorCode_TOO_LONG.String()
	case strings.Contains(ruleId, "pattern"):
		return errorspb.AppErrorCode_INVALID_FORMAT.String()
	case strings.Contains(ruleId, "gt_now"):
		return errorspb.AppErrorCode_MUST_BE_FUTURE.String()
	case ruleId == "":
		return errorspb.AppErrorCode_VALIDATION_FAILED.String()
	default:
		return errorspb.AppErrorCode_INVALID_VALUE.String()
	}
}

var validTagPattern = regexp.MustCompile(`^[a-z0-9-]+$`)

func isValidTag(tag string) bool {
	return len(tag) <= 50 && validTagPattern.MatchString(tag)
}

Error Handler Middleware

Now let’s create middleware to handle errors consistently:

package middleware

import (
	"context"
	"errors"
	"log"

	apperrors "github.com/bhatti/todo-api-errors/internal/errors"
	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

// UnaryErrorInterceptor translates application errors into gRPC statuses.
func UnaryErrorInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	resp, err := handler(ctx, req)
	if err == nil {
		return resp, nil
	}

	var appErr *apperrors.AppError
	if errors.As(err, &appErr) {
		if appErr.CausedBy != nil {
			log.Printf("ERROR: %s, Original cause: %v", appErr.Title, appErr.CausedBy)
		}
		return nil, appErr.ToGRPCStatus().Err()
	}

	if _, ok := status.FromError(err); ok {
		return nil, err // Already a gRPC status
	}

	log.Printf("UNEXPECTED ERROR: %v", err)
	return nil, apperrors.NewInternal("An unexpected error occurred", "", err).ToGRPCStatus().Err()
}

package middleware

import (
	"context"
	"encoding/json"
	"net/http"
	"runtime/debug"
	"time"

	errorspb "github.com/bhatti/todo-api-errors/api/proto/errors/v1"
	apperrors "github.com/bhatti/todo-api-errors/internal/errors"
	"github.com/google/uuid"
	"github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
	"go.opentelemetry.io/otel/trace"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/encoding/protojson"
)

// HTTPErrorHandler handles errors for HTTP endpoints
func HTTPErrorHandler(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Add trace ID to context
		traceID := r.Header.Get("X-Trace-ID")
		if traceID == "" {
			traceID = uuid.New().String()
		}
		ctx := context.WithValue(r.Context(), "traceID", traceID)
		r = r.WithContext(ctx)

		// Create response wrapper to intercept errors
		wrapped := &responseWriter{
			ResponseWriter: w,
			request:        r,
			traceID:        traceID,
		}

		// Handle panics
		defer func() {
			if err := recover(); err != nil {
				handlePanic(wrapped, err)
			}
		}()

		// Process request
		next.ServeHTTP(wrapped, r)
	})
}

// responseWriter wraps http.ResponseWriter to intercept errors
type responseWriter struct {
	http.ResponseWriter
	request    *http.Request
	traceID    string
	statusCode int
	written    bool
}

func (w *responseWriter) WriteHeader(code int) {
	if !w.written {
		w.statusCode = code
		w.ResponseWriter.WriteHeader(code)
		w.written = true
	}
}

func (w *responseWriter) Write(b []byte) (int, error) {
	if !w.written {
		w.WriteHeader(http.StatusOK)
	}
	return w.ResponseWriter.Write(b)
}

// handlePanic converts panics to proper error responses
func handlePanic(w *responseWriter, recovered interface{}) {
	// Log stack trace
	debug.PrintStack()

	appErr := apperrors.NewInternal("An unexpected error occurred. Please try again later.", w.traceID, nil)
	writeErrorResponse(w, appErr)
}

// CustomHTTPError handles gRPC gateway error responses
func CustomHTTPError(ctx context.Context, mux *runtime.ServeMux,
	marshaler runtime.Marshaler, w http.ResponseWriter, r *http.Request, err error) {

	// Extract trace ID
	traceID := r.Header.Get("X-Trace-ID")
	if traceID == "" {
		if span := trace.SpanFromContext(ctx); span.SpanContext().IsValid() {
			traceID = span.SpanContext().TraceID().String()
		} else {
			traceID = uuid.New().String()
		}
	}

	// Convert gRPC error to HTTP response
	st, _ := status.FromError(err)

	// Check if we have our custom error detail in status details
	for _, detail := range st.Details() {
		if errorDetail, ok := detail.(*errorspb.ErrorDetail); ok {
			// Update the error detail with current request context
			errorDetail.TraceId = traceID
			errorDetail.Instance = r.URL.Path

			// Convert to JSON and write response
			w.Header().Set("Content-Type", "application/problem+json")
			w.WriteHeader(runtime.HTTPStatusFromCode(st.Code()))

			// Create a simplified JSON response that matches RFC 7807
			response := map[string]interface{}{
				"type":      getTypeForCode(errorDetail.Code),
				"title":     errorDetail.Title,
				"status":    runtime.HTTPStatusFromCode(st.Code()),
				"detail":    errorDetail.Detail,
				"instance":  errorDetail.Instance,
				"traceId":   errorDetail.TraceId,
				"timestamp": errorDetail.Timestamp,
			}

			// Add field violations if present
			if len(errorDetail.FieldViolations) > 0 {
				violations := make([]map[string]interface{}, len(errorDetail.FieldViolations))
				for i, fv := range errorDetail.FieldViolations {
					violations[i] = map[string]interface{}{
						"field":   fv.Field,
						"code":    fv.Code,
						"message": fv.Description,
					}
				}
				response["errors"] = violations
			}

			// Add extensions if present
			if len(errorDetail.Extensions) > 0 {
				extensions := make(map[string]interface{})
				for k, v := range errorDetail.Extensions {
					// Convert Any to JSON
					if jsonBytes, err := protojson.Marshal(v); err == nil {
						var jsonData interface{}
						if err := json.Unmarshal(jsonBytes, &jsonData); err == nil {
							extensions[k] = jsonData
						}
					}
				}
				if len(extensions) > 0 {
					response["extensions"] = extensions
				}
			}

			if err := json.NewEncoder(w).Encode(response); err != nil {
				http.Error(w, `{"error": "Failed to encode error response"}`, 500)
			}
			return
		}
	}

	// Fallback: create new error response
	fallbackErr := apperrors.NewInternal(st.Message(), traceID, nil)
	fallbackErr.GRPCCode = st.Code()
	writeAppErrorResponse(w, fallbackErr, r.URL.Path)
}

// Helper functions
func getTypeForCode(code string) string {
	switch code {
	case errorspb.AppErrorCode_VALIDATION_FAILED.String():
		return "https://api.example.com/errors/validation-failed"
	case errorspb.AppErrorCode_RESOURCE_NOT_FOUND.String():
		return "https://api.example.com/errors/resource-not-found"
	case errorspb.AppErrorCode_RESOURCE_CONFLICT.String():
		return "https://api.example.com/errors/resource-conflict"
	case errorspb.AppErrorCode_PERMISSION_DENIED.String():
		return "https://api.example.com/errors/permission-denied"
	case errorspb.AppErrorCode_INTERNAL_ERROR.String():
		return "https://api.example.com/errors/internal-error"
	case errorspb.AppErrorCode_SERVICE_UNAVAILABLE.String():
		return "https://api.example.com/errors/service-unavailable"
	default:
		return "https://api.example.com/errors/unknown"
	}
}

func writeErrorResponse(w http.ResponseWriter, err error) {
	if appErr, ok := err.(*apperrors.AppError); ok {
		writeAppErrorResponse(w, appErr, "")
	} else {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func writeAppErrorResponse(w http.ResponseWriter, appErr *apperrors.AppError, instance string) {
	statusCode := runtime.HTTPStatusFromCode(appErr.GRPCCode)

	response := map[string]interface{}{
		"type":      getTypeForCode(appErr.AppCode.String()),
		"title":     appErr.Title,
		"status":    statusCode,
		"detail":    appErr.Detail,
		"traceId":   appErr.TraceID,
		"timestamp": time.Now(),
	}

	if instance != "" {
		response["instance"] = instance
	}

	if len(appErr.FieldViolations) > 0 {
		violations := make([]map[string]interface{}, len(appErr.FieldViolations))
		for i, fv := range appErr.FieldViolations {
			violations[i] = map[string]interface{}{
				"field":   fv.Field,
				"code":    fv.Code,
				"message": fv.Description,
			}
		}
		response["errors"] = violations
	}

	w.Header().Set("Content-Type", "application/problem+json")
	w.WriteHeader(statusCode)
	json.NewEncoder(w).Encode(response)
}

Service Implementation

Now let’s implement our TODO service with proper error handling:

package service

import (
	"context"
	"fmt"
	todopb "github.com/bhatti/todo-api-errors/api/proto/todo/v1"
	"github.com/bhatti/todo-api-errors/internal/errors"
	"github.com/bhatti/todo-api-errors/internal/repository"
	"github.com/bhatti/todo-api-errors/internal/validation"
	"github.com/google/uuid"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
	"google.golang.org/protobuf/types/known/fieldmaskpb"
	"google.golang.org/protobuf/types/known/timestamppb"
	"strings"
)

var tracer = otel.Tracer("todo-service")

// TodoService implements the TODO API
type TodoService struct {
	todopb.UnimplementedTodoServiceServer
	repo repository.TodoRepository
}

// NewTodoService creates a new TODO service
func NewTodoService(repo repository.TodoRepository) (*TodoService, error) {
	return &TodoService{
		repo: repo,
	}, nil
}

// CreateTask creates a new task
func (s *TodoService) CreateTask(ctx context.Context, req *todopb.CreateTaskRequest) (*todopb.Task, error) {
	ctx, span := tracer.Start(ctx, "CreateTask")
	defer span.End()

	// Get trace ID for error responses
	traceID := span.SpanContext().TraceID().String()

	// Validate request
	if req.Task == nil {
		return nil, errors.NewRequiredField("task", "Task object is required", traceID)
	}

	// Validate task fields using the new validation package
	if err := validation.ValidateTask(req.Task, traceID); err != nil {
		span.SetAttributes(attribute.String("validation.error", err.Error()))
		return nil, err
	}

	// Check for duplicate title
	existing, err := s.repo.GetTaskByTitle(ctx, req.Task.Title)
	if err != nil && !repository.IsNotFound(err) {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	if existing != nil {
		return nil, errors.NewConflict("task", "A task with this title already exists", traceID)
	}

	// Generate task ID
	taskID := uuid.New().String()
	task := &todopb.Task{
		Name:        fmt.Sprintf("tasks/%s", taskID),
		Title:       req.Task.Title,
		Description: req.Task.Description,
		Status:      req.Task.Status,
		Priority:    req.Task.Priority,
		DueDate:     req.Task.DueDate,
		Tags:        req.Task.Tags,
		CreateTime:  timestamppb.Now(),
		UpdateTime:  timestamppb.Now(),
		CreatedBy:   s.getUserFromContext(ctx),
	}

	// Set defaults
	if task.Status == todopb.Status_STATUS_UNSPECIFIED {
		task.Status = todopb.Status_STATUS_PENDING
	}
	if task.Priority == todopb.Priority_PRIORITY_UNSPECIFIED {
		task.Priority = todopb.Priority_PRIORITY_MEDIUM
	}

	// Save to repository
	if err := s.repo.CreateTask(ctx, task); err != nil {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	span.SetAttributes(
		attribute.String("task.id", taskID),
		attribute.String("task.title", task.Title),
	)

	return task, nil
}

// GetTask retrieves a specific task
func (s *TodoService) GetTask(ctx context.Context, req *todopb.GetTaskRequest) (*todopb.Task, error) {
	ctx, span := tracer.Start(ctx, "GetTask")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate request using the new validation package
	if err := validation.ValidateRequest(req, traceID); err != nil {
		return nil, err
	}

	// Extract task ID
	parts := strings.Split(req.Name, "/")
	if len(parts) != 2 || parts[0] != "tasks" {
		return nil, errors.NewRequiredField("name", "Task name must be in format 'tasks/{id}'", traceID)
	}

	taskID := parts[1]
	span.SetAttributes(attribute.String("task.id", taskID))

	// Get from repository
	task, err := s.repo.GetTask(ctx, taskID)
	if err != nil {
		if repository.IsNotFound(err) {
			return nil, errors.NewNotFound("Task", taskID, traceID)
		}
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	// Check permissions
	if !s.canAccessTask(ctx, task) {
		return nil, errors.NewPermissionDenied("task", "read", traceID)
	}

	return task, nil
}

// ListTasks retrieves all tasks
func (s *TodoService) ListTasks(ctx context.Context, req *todopb.ListTasksRequest) (*todopb.ListTasksResponse, error) {
	ctx, span := tracer.Start(ctx, "ListTasks")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate request using the new validation package
	if err := validation.ValidateRequest(req, traceID); err != nil {
		return nil, err
	}

	// Default page size
	pageSize := req.PageSize
	if pageSize == 0 {
		pageSize = 50
	}
	if pageSize > 1000 {
		pageSize = 1000
	}

	span.SetAttributes(
		attribute.Int("page.size", int(pageSize)),
		attribute.String("filter", req.Filter),
	)

	// Parse filter
	filter, err := s.parseFilter(req.Filter)
	if err != nil {
		return nil, errors.NewRequiredField("filter", fmt.Sprintf("Failed to parse filter: %v", err), traceID)
	}

	// Get tasks from repository
	tasks, nextPageToken, err := s.repo.ListTasks(ctx, repository.ListOptions{
		PageSize:  int(pageSize),
		PageToken: req.PageToken,
		Filter:    filter,
		OrderBy:   req.OrderBy,
		UserID:    s.getUserFromContext(ctx),
	})

	if err != nil {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	// Get total count
	totalSize, err := s.repo.CountTasks(ctx, filter, s.getUserFromContext(ctx))
	if err != nil {
		// Log but don't fail the request
		span.RecordError(err)
		totalSize = -1
	}

	return &todopb.ListTasksResponse{
		Tasks:         tasks,
		NextPageToken: nextPageToken,
		TotalSize:     int32(totalSize),
	}, nil
}

// UpdateTask updates an existing task
func (s *TodoService) UpdateTask(ctx context.Context, req *todopb.UpdateTaskRequest) (*todopb.Task, error) {
	ctx, span := tracer.Start(ctx, "UpdateTask")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate request
	if req.Task == nil {
		return nil, errors.NewRequiredField("task", "Task object is required", traceID)
	}

	if req.UpdateMask == nil || len(req.UpdateMask.Paths) == 0 {
		return nil, errors.NewRequiredField("update_mask", "Update mask must specify which fields to update", traceID)
	}

	// Extract task ID
	parts := strings.Split(req.Task.Name, "/")
	if len(parts) != 2 || parts[0] != "tasks" {
		return nil, errors.NewRequiredField("task.name", "Invalid task name format", traceID)
	}

	taskID := parts[1]
	span.SetAttributes(attribute.String("task.id", taskID))

	// Get existing task
	existing, err := s.repo.GetTask(ctx, taskID)
	if err != nil {
		if repository.IsNotFound(err) {
			return nil, errors.NewNotFound("Task", taskID, traceID)
		}
		return nil, s.handleRepositoryError(err, traceID)
	}

	// Check permissions
	if !s.canModifyTask(ctx, existing) {
		return nil, errors.NewPermissionDenied("task", "update", traceID)
	}

	// Apply updates based on field mask
	updated := s.applyFieldMask(existing, req.Task, req.UpdateMask)
	updated.UpdateTime = timestamppb.Now()

	// Validate updated task using the new validation package
	if err := validation.ValidateTask(updated, traceID); err != nil {
		return nil, err
	}

	// Save to repository
	if err := s.repo.UpdateTask(ctx, updated); err != nil {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	return updated, nil
}

// DeleteTask removes a task
func (s *TodoService) DeleteTask(ctx context.Context, req *todopb.DeleteTaskRequest) (*todopb.DeleteTaskResponse, error) {
	ctx, span := tracer.Start(ctx, "DeleteTask")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate request using the new validation package
	if err := validation.ValidateRequest(req, traceID); err != nil {
		return nil, err
	}

	// Extract task ID
	parts := strings.Split(req.Name, "/")
	if len(parts) != 2 || parts[0] != "tasks" {
		return nil, errors.NewRequiredField("name", "Invalid task name format", traceID)
	}

	taskID := parts[1]
	span.SetAttributes(attribute.String("task.id", taskID))

	// Get existing task to check permissions
	existing, err := s.repo.GetTask(ctx, taskID)
	if err != nil {
		if repository.IsNotFound(err) {
			return nil, errors.NewNotFound("Task", taskID, traceID)
		}
		return nil, s.handleRepositoryError(err, traceID)
	}

	// Check permissions
	if !s.canModifyTask(ctx, existing) {
		return nil, errors.NewPermissionDenied("task", "delete", traceID)
	}

	// Delete from repository
	if err := s.repo.DeleteTask(ctx, taskID); err != nil {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	return &todopb.DeleteTaskResponse{
		Message: fmt.Sprintf("Task %s deleted successfully", req.Name),
	}, nil
}

// BatchCreateTasks creates multiple tasks at once
func (s *TodoService) BatchCreateTasks(ctx context.Context, req *todopb.BatchCreateTasksRequest) (*todopb.BatchCreateTasksResponse, error) {
	ctx, span := tracer.Start(ctx, "BatchCreateTasks")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate batch request using the new validation package
	if err := validation.ValidateBatchCreateTasks(req, traceID); err != nil {
		span.SetAttributes(attribute.String("validation.error", err.Error()))
		return nil, err
	}

	// Process each task
	var created []*todopb.Task
	var batchErrors []string

	for i, createReq := range req.Requests {
		task, err := s.CreateTask(ctx, createReq)
		if err != nil {
			// Collect errors for batch response
			batchErrors = append(batchErrors, fmt.Sprintf("Task %d: %s", i, err.Error()))
			continue
		}
		created = append(created, task)
	}

	// If all tasks failed, return error
	if len(created) == 0 && len(batchErrors) > 0 {
		return nil, errors.NewInternal("All batch operations failed", traceID, nil)
	}

	// Return partial success
	response := &todopb.BatchCreateTasksResponse{
		Tasks: created,
	}

	// Add partial errors to response metadata if any
	if len(batchErrors) > 0 {
		span.SetAttributes(
			attribute.Int("batch.total", len(req.Requests)),
			attribute.Int("batch.success", len(created)),
			attribute.Int("batch.failed", len(batchErrors)),
		)
	}

	return response, nil
}

// Helper methods

func (s *TodoService) handleRepositoryError(err error, traceID string) error {
	if repository.IsConnectionError(err) {
		return errors.NewServiceUnavailable("Unable to connect to the database. Please try again later.", traceID)
	}

	// Log internal error details
	span := trace.SpanFromContext(context.Background())
	if span != nil {
		span.RecordError(err)
	}

	return errors.NewInternal("An unexpected error occurred while processing your request", traceID, err)
}

func (s *TodoService) getUserFromContext(ctx context.Context) string {
	// In a real implementation, this would extract user info from auth context
	if user, ok := ctx.Value("user").(string); ok {
		return user
	}
	return "anonymous"
}

func (s *TodoService) canAccessTask(ctx context.Context, task *todopb.Task) bool {
	// In a real implementation, check if user can access this task
	user := s.getUserFromContext(ctx)
	return user == task.CreatedBy || user == "admin"
}

func (s *TodoService) canModifyTask(ctx context.Context, task *todopb.Task) bool {
	// In a real implementation, check if user can modify this task
	user := s.getUserFromContext(ctx)
	return user == task.CreatedBy || user == "admin"
}

func (s *TodoService) parseFilter(filter string) (map[string]interface{}, error) {
	// Simple filter parser - in production, use a proper parser
	parsed := make(map[string]interface{})

	if filter == "" {
		return parsed, nil
	}

	// Example: "status=COMPLETED AND priority=HIGH"
	parts := strings.Split(filter, " AND ")
	for _, part := range parts {
		kv := strings.Split(strings.TrimSpace(part), "=")
		if len(kv) != 2 {
			return nil, fmt.Errorf("invalid filter expression: %s", part)
		}

		key := strings.TrimSpace(kv[0])
		value := strings.Trim(strings.TrimSpace(kv[1]), "'\"")

		// Validate filter keys
		switch key {
		case "status", "priority", "created_by":
			parsed[key] = value
		default:
			return nil, fmt.Errorf("unknown filter field: %s", key)
		}
	}

	return parsed, nil
}

func (s *TodoService) applyFieldMask(existing, update *todopb.Task, mask *fieldmaskpb.FieldMask) *todopb.Task {
	result := *existing

	for _, path := range mask.Paths {
		switch path {
		case "title":
			result.Title = update.Title
		case "description":
			result.Description = update.Description
		case "status":
			result.Status = update.Status
		case "priority":
			result.Priority = update.Priority
		case "due_date":
			result.DueDate = update.DueDate
		case "tags":
			result.Tags = update.Tags
		}
	}
	return &result
}

Server Implementation

Now let’s put it all together in our server:

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	todopb "github.com/bhatti/todo-api-errors/api/proto/todo/v1"
	"github.com/bhatti/todo-api-errors/internal/middleware"
	"github.com/bhatti/todo-api-errors/internal/monitoring"
	"github.com/bhatti/todo-api-errors/internal/repository"
	"github.com/bhatti/todo-api-errors/internal/service"

	"github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/reflection"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/encoding/protojson"
)

func main() {
	// Initialize monitoring
	if err := monitoring.InitOpenTelemetryMetrics(); err != nil {
		log.Printf("Failed to initialize OpenTelemetry metrics: %v", err)
		// Continue without OpenTelemetry - Prometheus will still work
	}

	// Initialize repository
	repo := repository.NewInMemoryRepository()

	// Initialize service
	todoService, err := service.NewTodoService(repo)
	if err != nil {
		log.Fatalf("Failed to create service: %v", err)
	}

	// Start gRPC server
	grpcPort := ":50051"
	go func() {
		if err := startGRPCServer(grpcPort, todoService); err != nil {
			log.Fatalf("Failed to start gRPC server: %v", err)
		}
	}()

	// Start HTTP gateway
	httpPort := ":8080"
	go func() {
		if err := startHTTPGateway(httpPort, grpcPort); err != nil {
			log.Fatalf("Failed to start HTTP gateway: %v", err)
		}
	}()

	// Start metrics server
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		if err := http.ListenAndServe(":9090", nil); err != nil {
			log.Printf("Failed to start metrics server: %v", err)
		}
	}()

	log.Printf("TODO API server started")
	log.Printf("gRPC server listening on %s", grpcPort)
	log.Printf("HTTP gateway listening on %s", httpPort)
	log.Printf("Metrics available at :9090/metrics")

	// Wait for interrupt signal
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
	<-sigCh

	log.Println("Shutting down...")
}

func startGRPCServer(port string, todoService todopb.TodoServiceServer) error {
	lis, err := net.Listen("tcp", port)
	if err != nil {
		return fmt.Errorf("failed to listen: %w", err)
	}

	// Create gRPC server with interceptors - now using the new UnaryErrorInterceptor
	opts := []grpc.ServerOption{
		grpc.ChainUnaryInterceptor(
			middleware.UnaryErrorInterceptor, // Using new protobuf-based error interceptor
			loggingInterceptor(),
			recoveryInterceptor(),
		),
	}

	server := grpc.NewServer(opts...)

	// Register service
	todopb.RegisterTodoServiceServer(server, todoService)

	// Register reflection for debugging
	reflection.Register(server)

	return server.Serve(lis)
}

func startHTTPGateway(httpPort, grpcPort string) error {
	ctx := context.Background()

	// Create gRPC connection
	conn, err := grpc.DialContext(
		ctx,
		"localhost"+grpcPort,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		return fmt.Errorf("failed to dial gRPC server: %w", err)
	}

	// Create gateway mux with custom error handler
	mux := runtime.NewServeMux(
		runtime.WithErrorHandler(middleware.CustomHTTPError), // Using new protobuf-based error handler
		runtime.WithMarshalerOption(runtime.MIMEWildcard, &runtime.JSONPb{
			MarshalOptions: protojson.MarshalOptions{
				UseProtoNames:   true,
				EmitUnpopulated: false,
			},
			UnmarshalOptions: protojson.UnmarshalOptions{
				DiscardUnknown: true,
			},
		}),
	)

	// Register service handler
	if err := todopb.RegisterTodoServiceHandler(ctx, mux, conn); err != nil {
		return fmt.Errorf("failed to register service handler: %w", err)
	}

	// Create HTTP server with middleware
	handler := middleware.HTTPErrorHandler( // Using new protobuf-based HTTP error handler
		corsMiddleware(
			authMiddleware(
				loggingHTTPMiddleware(mux),
			),
		),
	)

	server := &http.Server{
		Addr:         httpPort,
		Handler:      handler,
		ReadTimeout:  10 * time.Second,
		WriteTimeout: 10 * time.Second,
		IdleTimeout:  120 * time.Second,
	}

	return server.ListenAndServe()
}

// Middleware implementations

func loggingInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		start := time.Now()

		// Call handler
		resp, err := handler(ctx, req)

		// Log request
		duration := time.Since(start)
		statusCode := "OK"
		if err != nil {
			statusCode = status.Code(err).String()
		}

		log.Printf("gRPC: %s %s %s %v", info.FullMethod, statusCode, duration, err)

		return resp, err
	}
}

func recoveryInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (resp interface{}, err error) {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("Recovered from panic: %v", r)
				monitoring.RecordPanicRecovery(ctx)
				err = status.Error(codes.Internal, "Internal server error")
			}
		}()

		return handler(ctx, req)
	}
}

func loggingHTTPMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		// Wrap response writer to capture status
		wrapped := &statusResponseWriter{ResponseWriter: w, statusCode: http.StatusOK}

		// Process request
		next.ServeHTTP(wrapped, r)

		// Log request
		duration := time.Since(start)
		log.Printf("HTTP: %s %s %d %v", r.Method, r.URL.Path, wrapped.statusCode, duration)
	})
}

func corsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", "*")
		w.Header().Set("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE, OPTIONS, PATCH")
		w.Header().Set("Access-Control-Allow-Headers", "Content-Type, Authorization, X-Trace-ID")

		if r.Method == "OPTIONS" {
			w.WriteHeader(http.StatusOK)
			return
		}

		next.ServeHTTP(w, r)
	})
}

func authMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Simple auth for demo - in production use proper authentication
		authHeader := r.Header.Get("Authorization")
		if authHeader == "" {
			authHeader = "Bearer anonymous"
		}

		// Extract user from token
		user := "anonymous"
		if len(authHeader) > 7 && authHeader[:7] == "Bearer " {
			user = authHeader[7:]
		}

		// Add user to context
		ctx := context.WithValue(r.Context(), "user", user)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

type statusResponseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (w *statusResponseWriter) WriteHeader(code int) {
	w.statusCode = code
	w.ResponseWriter.WriteHeader(code)
}

Example API Usage

Let’s see our error handling in action with some example requests:

Example 1: Validation Error with Multiple Issues

Request with multiple validation errors

curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"task": {
"title": "",
"description": "This description is wayyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy too long…",
"status": "INVALID_STATUS",
"tags": ["INVALID TAG", "tag-1", "tag-1"]
}
}'

Response

< HTTP/1.1 422 Unprocessable Entity
< Content-Type: application/problem+json
{
  "detail": "The request contains 5 validation errors",
  "errors": [
    {
      "code": "TOO_SHORT",
      "field": "title",
      "message": "value length must be at least 1 characters"
    },
    {
      "code": "TOO_LONG",
      "field": "description",
      "message": "value length must be at most 100 characters"
    },
    {
      "code": "INVALID_FORMAT",
      "field": "tags",
      "message": "value does not match regex pattern `^[a-z0-9-]+$`"
    },
    {
      "code": "INVALID_TAG_FORMAT",
      "field": "tags[0]",
      "message": "Tag 'INVALID TAG' must be lowercase letters, numbers, and hyphens only"
    },
    {
      "code": "DUPLICATE_TAG",
      "field": "tags[2]",
      "message": "Tag 'tag-1' appears multiple times"
    }
  ],
  "instance": "/v1/tasks",
  "status": 400,
  "timestamp": {
    "seconds": 1755288524,
    "nanos": 484865000
  },
  "title": "Validation Failed",
  "traceId": "eb4bfb3f-9397-4547-8618-ce9952a16067",
  "type": "https://api.example.com/errors/validation-failed"
}

Example 2: Not Found Error

Request for non-existent task

curl http://localhost:8080/v1/tasks/non-existent-id

Response

< HTTP/1.1 404 Not Found
< Content-Type: application/problem+json
{
  "detail": "Task with ID 'non-existent-id' was not found.",
  "instance": "/v1/tasks/non-existent-id",
  "status": 404,
  "timestamp": {
    "seconds": 1755288565,
    "nanos": 904607000
  },
  "title": "Resource Not Found",
  "traceId": "6ce00cd8-d0b7-47f1-b6f6-9fc1375c26a4",
  "type": "https://api.example.com/errors/resource-not-found"
}

Example 3: Conflict Error

curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"task": {
"title": "Existing Task Title"
}
}'

curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"task": {
"title": "Existing Task Title"
}
}'

Response

< HTTP/1.1 409 Conflict
< Content-Type: application/problem+json
{
  "detail": "Conflict creating task: A task with this title already exists",
  "instance": "/v1/tasks",
  "status": 409,
  "timestamp": {
    "seconds": 1755288593,
    "nanos": 594458000
  },
  "title": "Resource Conflict",
  "traceId": "ed2e78d2-591d-492a-8d71-6b6843ce86f7",
  "type": "https://api.example.com/errors/resource-conflict"
}

Example 4: Service Unavailable (Transient Error)

When database is down

curl http://localhost:8080/v1/tasks

Response

HTTP/1.1 503 Service Unavailable
Content-Type: application/problem+json
Retry-After: 30
{
  "type": "https://api.example.com/errors/service-unavailable",
  "title": "Service Unavailable",
  "status": 503,
  "detail": "Database connection pool exhausted. Please try again later.",
  "instance": "/v1/tasks",
  "traceId": "db-pool-001",
  "timestamp": "2025-08-15T10:30:00Z",
  "extensions": {
    "retryable": true,
    "retryAfter": "2025-08-15T10:30:30Z",
    "maxRetries": 3,
    "backoffType": "exponential",
    "backoffMs": 1000,
    "errorCategory": "database"
  }
}

Best Practices Summary

Our implementation demonstrates several key best practices:

1. Consistent Error Format

All errors follow RFC 9457 (Problem Details) format, providing:

Machine-readable type URIs
Human-readable titles and details
HTTP status codes
Request tracing
Extensible metadata

2. Comprehensive Validation

All validation errors are returned at once, not one by one
Clear field paths for nested objects
Descriptive error codes and messages
Support for batch operations with partial success

3. Security-Conscious Design

No sensitive information in error messages
Internal errors are logged but not exposed
Generic messages for authentication failures
Request IDs for support without exposing internals

4. Developer Experience

Clear, actionable error messages
Helpful suggestions for fixing issues
Consistent error codes across protocols
Rich metadata for debugging

5. Protocol Compatibility

Seamless translation between gRPC and HTTP
Proper status code mapping
Preservation of error details across protocols

6. Observability

Structured logging with trace IDs
Prometheus metrics for monitoring
OpenTelemetry integration
Error categorization for analysis

Conclusion

This comprehensive guide demonstrates how to build robust error handling for modern APIs. By treating errors as a first-class feature of our API, we’ve achieved several key benefits:

Consistency: All errors, regardless of their source, are presented to clients in a predictable format.
Clarity: Developers consuming our API get clear, actionable feedback, helping them debug and integrate faster.
Developer Ergonomics: Our internal service code is cleaner, as handlers focus on business logic while the middleware handles the boilerplate of error conversion.
Security: We have a clear separation between internal error details (for logging) and public error responses, preventing leaks.

Additional Resources

You can find the full source code for this example in this GitHub repository.

Comments (0)

July 17, 2025

Zero-Downtime Services with Lifecycle Management on Kubernetes and Istio

Filed under: Computing,Web Services — admin @ 3:12 pm

Introduction

In the world of cloud-native applications, service lifecycle management is often an afterthought—until it causes a production outage. Whether you’re running gRPC or REST APIs on Kubernetes with Istio, proper lifecycle management is the difference between smooth deployments and 3 AM incident calls. Consider these scenarios:

Your service takes 45 seconds to warm up its cache, but Kubernetes kills it after 30 seconds of startup wait.
During deployments, clients receive connection errors as pods terminate abruptly.
A hiccup in a database or dependent service causes your entire service mesh to cascade fail.
Your service mesh sidecar shuts down before your application is terminated or drops in-flight requests.
A critical service receives SIGKILL during transaction processing, leaving data in inconsistent states.
After a regional outage, services restart but data drift goes undetected for hours.
Your RTO target is 15 seconds, but services take 30 seconds just to start up properly.

These aren’t edge cases—they’re common problems that proper lifecycle management solves. More critically, unsafe shutdowns can cause data corruption, financial losses, and breach compliance requirements. This guide covers what you need to know about building services that start safely, shut down gracefully, and handle failures intelligently.

The Hidden Complexity of Service Lifecycles

Modern microservices don’t exist in isolation. A typical request might flow through:

Each layer adds complexity to startup and shutdown sequences. Without proper coordination, you’ll experience:

Startup race conditions: Application tries to make network calls before the sidecar proxy is ready
Shutdown race conditions: Sidecar terminates while the application is still processing requests
Premature traffic: Load balancer routes traffic before the application is truly ready
Dropped connections: Abrupt shutdowns leave clients hanging
Data corruption: In-flight transactions get interrupted, leaving databases in inconsistent states
Compliance violations: Financial services may face regulatory penalties for data integrity failures

Core Concepts: The Three Types of Health Checks

Kubernetes provides three distinct probe types, each serving a specific purpose:

1. Liveness Probe: “Is the process alive?”

Detects deadlocks and unrecoverable states
Should be fast and simple (e.g., HTTP GET /healthz)
Failure triggers container restart
Common mistake: Making this check too complex

2. Readiness Probe: “Can the service handle traffic?”

Validates all critical dependencies are available
Prevents routing traffic to pods that aren’t ready
Should perform “deep” checks of dependencies
Common mistake: Using the same check as liveness

3. Startup Probe: “Is the application still initializing?”

Provides grace period for slow-starting containers
Disables liveness/readiness probes until successful
Prevents restart loops during initialization
Common mistake: Not using it for slow-starting apps

The Hidden Dangers of Unsafe Shutdowns

While graceful shutdown is ideal, it’s not always possible. Kubernetes will send SIGKILL after the termination grace period, and infrastructure failures can terminate pods instantly. This creates serious risks:

Data Corruption Scenarios

Financial Transaction Example:

// DANGEROUS: Non-atomic operation
func (s *PaymentService) ProcessPayment(req *PaymentRequest) error {
    // Step 1: Debit source account
    if err := s.debitAccount(req.FromAccount, req.Amount); err != nil {
        return err
    }
    
    // ???? SIGKILL here leaves money debited but not credited
    // Step 2: Credit destination account  
    if err := s.creditAccount(req.ToAccount, req.Amount); err != nil {
        // Money is lost! Source debited but destination not credited
        return err
    }
    
    // Step 3: Record transaction
    return s.recordTransaction(req)
}

E-commerce Inventory Example:

// DANGEROUS: Race condition during shutdown
func (s *InventoryService) ReserveItem(req *ReserveRequest) error {
    // Check availability
    if s.getStock(req.ItemID) < req.Quantity {
        return ErrInsufficientStock
    }
    
    // ???? SIGKILL here can cause double-reservation
    // Another request might see the same stock level
    
    // Reserve the item
    return s.updateStock(req.ItemID, -req.Quantity)
}

RTO/RPO Impact

Recovery Time Objective (RTO): How quickly can we restore service?

Poor lifecycle management increases startup time
Services may need manual intervention to reach consistent state
Cascading failures extend recovery time across the entire system

Recovery Point Objective (RPO): How much data can we afford to lose?

Unsafe shutdowns can corrupt recent transactions
Without idempotency, replay of messages may create duplicates
Data inconsistencies may not be detected until much later

The Anti-Entropy Solution

Since graceful shutdown isn’t always possible, production systems need reconciliation processes to detect and repair inconsistencies:

// Anti-entropy pattern for data consistency
type ReconciliationService struct {
    paymentDB    PaymentDatabase
    accountDB    AccountDatabase
    auditLog     AuditLogger
    alerting     AlertingService
}

func (r *ReconciliationService) ReconcilePayments(ctx context.Context) error {
    // Find payments without matching account entries
    orphanedPayments, err := r.paymentDB.FindOrphanedPayments(ctx)
    if err != nil {
        return err
    }
    
    for _, payment := range orphanedPayments {
        // Check if this was a partial transaction
        sourceDebit, _ := r.accountDB.GetTransaction(payment.FromAccount, payment.ID)
        destCredit, _ := r.accountDB.GetTransaction(payment.ToAccount, payment.ID)
        
        switch {
        case sourceDebit != nil && destCredit == nil:
            // Complete the transaction
            if err := r.creditAccount(payment.ToAccount, payment.Amount); err != nil {
                r.alerting.SendAlert("Failed to complete orphaned payment", payment.ID)
                continue
            }
            r.auditLog.RecordReconciliation("completed_payment", payment.ID)
            
        case sourceDebit == nil && destCredit != nil:
            // Reverse the credit
            if err := r.debitAccount(payment.ToAccount, payment.Amount); err != nil {
                r.alerting.SendAlert("Failed to reverse orphaned credit", payment.ID)
                continue
            }
            r.auditLog.RecordReconciliation("reversed_credit", payment.ID)
            
        default:
            // Both or neither exist - needs investigation
            r.alerting.SendAlert("Ambiguous payment state", payment.ID)
        }
    }
    
    return nil
}

// Run reconciliation periodically
func (r *ReconciliationService) Start(ctx context.Context) {
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()
    
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if err := r.ReconcilePayments(ctx); err != nil {
                log.Printf("Reconciliation failed: %v", err)
            }
        }
    }
}

Building a Resilient Service: Complete Example

Let’s build a production-ready service that demonstrates all best practices. We’ll create two versions: one with anti-patterns (bad-service) and one with best practices (good-service).

Sequence diagram of a typical API with proper Kubernetes and Istio configuration.

The Application Code

//go:generate protoc --go_out=. --go_opt=paths=source_relative --go-grpc_out=. --go-grpc_opt=paths=source_relative api/demo.proto

package main

import (
    "context"
    "flag"
    "fmt"
    "log"
    "net"
    "net/http"
    "os"
    "os/signal"
    "sync/atomic"
    "syscall"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    health "google.golang.org/grpc/health/grpc_health_v1"
    "google.golang.org/grpc/status"
)

// Service represents our application with health state
type Service struct {
    isHealthy         atomic.Bool
    isShuttingDown    atomic.Bool
    activeRequests    atomic.Int64
    dependencyHealthy atomic.Bool
}

// HealthChecker implements the gRPC health checking protocol
type HealthChecker struct {
    svc *Service
}

func (h *HealthChecker) Check(ctx context.Context, req *health.HealthCheckRequest) (*health.HealthCheckResponse, error) {
    service := req.GetService()
    
    // Liveness: Simple check - is the process responsive?
    if service == "" || service == "liveness" {
        if h.svc.isShuttingDown.Load() {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        return &health.HealthCheckResponse{
            Status: health.HealthCheckResponse_SERVING,
        }, nil
    }
    
    // Readiness: Deep check - can we handle traffic?
    if service == "readiness" {
        // Check application health
        if !h.svc.isHealthy.Load() {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        
        // Check critical dependencies
        if !h.svc.dependencyHealthy.Load() {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        
        // Check if shutting down
        if h.svc.isShuttingDown.Load() {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        
        return &health.HealthCheckResponse{
            Status: health.HealthCheckResponse_SERVING,
        }, nil
    }
    
    // Synthetic readiness: Complex business logic check for monitoring
    if service == "synthetic-readiness" {
        // Simulate a complex health check that validates business logic
        // This would make actual API calls, database queries, etc.
        if !h.performSyntheticCheck(ctx) {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        return &health.HealthCheckResponse{
            Status: health.HealthCheckResponse_SERVING,
        }, nil
    }
    
    return nil, status.Errorf(codes.NotFound, "unknown service: %s", service)
}

func (h *HealthChecker) performSyntheticCheck(ctx context.Context) bool {
    // In a real service, this would:
    // 1. Create a test transaction
    // 2. Query the database
    // 3. Call dependent services
    // 4. Validate the complete flow works
    return h.svc.isHealthy.Load() && h.svc.dependencyHealthy.Load()
}

func (h *HealthChecker) Watch(req *health.HealthCheckRequest, server health.Health_WatchServer) error {
    return status.Error(codes.Unimplemented, "watch not implemented")
}

// DemoServiceServer implements your business logic
type DemoServiceServer struct {
    UnimplementedDemoServiceServer
    svc *Service
}

func (s *DemoServiceServer) ProcessRequest(ctx context.Context, req *ProcessRequest) (*ProcessResponse, error) {
    s.svc.activeRequests.Add(1)
    defer s.svc.activeRequests.Add(-1)
    
    // Simulate processing
    select {
    case <-ctx.Done():
        return nil, ctx.Err()
    case <-time.After(100 * time.Millisecond):
        return &ProcessResponse{
            Result: fmt.Sprintf("Processed: %s", req.GetData()),
        }, nil
    }
}

func main() {
    var (
        port         = flag.Int("port", 8080, "gRPC port")
        mgmtPort     = flag.Int("mgmt-port", 8090, "Management port")
        startupDelay = flag.Duration("startup-delay", 10*time.Second, "Startup delay")
    )
    flag.Parse()
    
    svc := &Service{}
    svc.dependencyHealthy.Store(true) // Assume healthy initially
    
    // Management endpoints for testing
    mux := http.NewServeMux()
    mux.HandleFunc("/toggle-health", func(w http.ResponseWriter, r *http.Request) {
        current := svc.dependencyHealthy.Load()
        svc.dependencyHealthy.Store(!current)
        fmt.Fprintf(w, "Dependency health toggled to: %v\n", !current)
    })
    mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "active_requests %d\n", svc.activeRequests.Load())
        fmt.Fprintf(w, "is_healthy %v\n", svc.isHealthy.Load())
        fmt.Fprintf(w, "is_shutting_down %v\n", svc.isShuttingDown.Load())
    })
    
    mgmtServer := &http.Server{
        Addr:    fmt.Sprintf(":%d", *mgmtPort),
        Handler: mux,
    }
    
    // Start management server
    go func() {
        log.Printf("Management server listening on :%d", *mgmtPort)
        if err := mgmtServer.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatalf("Management server failed: %v", err)
        }
    }()
    
    // Simulate slow startup
    log.Printf("Starting application (startup delay: %v)...", *startupDelay)
    time.Sleep(*startupDelay)
    svc.isHealthy.Store(true)
    log.Println("Application initialized and ready")
    
    // Setup gRPC server
    lis, err := net.Listen("tcp", fmt.Sprintf(":%d", *port))
    if err != nil {
        log.Fatalf("Failed to listen: %v", err)
    }
    
    grpcServer := grpc.NewServer()
    RegisterDemoServiceServer(grpcServer, &DemoServiceServer{svc: svc})
    health.RegisterHealthServer(grpcServer, &HealthChecker{svc: svc})
    
    // Start gRPC server
    go func() {
        log.Printf("gRPC server listening on :%d", *port)
        if err := grpcServer.Serve(lis); err != nil {
            log.Fatalf("gRPC server failed: %v", err)
        }
    }()
    
    // Wait for shutdown signal
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
    sig := <-sigCh
    
    log.Printf("Received signal: %v, starting graceful shutdown...", sig)
    
    // Graceful shutdown sequence
    svc.isShuttingDown.Store(true)
    svc.isHealthy.Store(false) // Fail readiness immediately
    
    // Stop accepting new requests
    grpcServer.GracefulStop()
    
    // Wait for active requests to complete
    timeout := time.After(30 * time.Second)
    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()
    
    for {
        select {
        case <-timeout:
            log.Println("Shutdown timeout reached, forcing exit")
            os.Exit(1)
        case <-ticker.C:
            active := svc.activeRequests.Load()
            if active == 0 {
                log.Println("All requests completed")
                goto shutdown
            }
            log.Printf("Waiting for %d active requests to complete...", active)
        }
    }
    
shutdown:
    // Cleanup
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    mgmtServer.Shutdown(ctx)
    
    log.Println("Graceful shutdown complete")
}

Kubernetes Manifests: Anti-Patterns vs Best Practices

Bad Service (Anti-Patterns)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bad-service
  namespace: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: bad-service
  template:
    metadata:
      labels:
        app: bad-service
      # MISSING: Critical Istio annotations!
    spec:
      # DEFAULT: Only 30s grace period
      containers:
      - name: app
        image: myregistry/demo-service:latest
        ports:
        - containerPort: 8080
          name: grpc
        - containerPort: 8090
          name: mgmt
        args: ["--startup-delay=45s"]  # Longer than default probe timeout!
        
        # ANTI-PATTERN: Identical liveness and readiness probes
        livenessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080"]
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 3  # Will fail after 40s total
          
        readinessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080"]  # Same as liveness!
          initialDelaySeconds: 10
          periodSeconds: 10
        
        # MISSING: No startup probe for slow initialization
        # MISSING: No preStop hook for graceful shutdown

Good Service (Best Practices)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: good-service
  namespace: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: good-service
  template:
    metadata:
      labels:
        app: good-service
      annotations:
        # Critical for Istio/Envoy sidecar lifecycle management
        sidecar.istio.io/holdApplicationUntilProxyStarts: "true"
        proxy.istio.io/config: |
          proxyMetadata:
            EXIT_ON_ZERO_ACTIVE_CONNECTIONS: "true"
        sidecar.istio.io/proxyCPU: "100m"
        sidecar.istio.io/proxyMemory: "128Mi"
    spec:
      # Extended grace period: preStop (15s) + app shutdown (30s) + buffer (20s)
      terminationGracePeriodSeconds: 65
      
      containers:
      - name: app
        image: myregistry/demo-service:latest
        ports:
        - containerPort: 8080
          name: grpc
        - containerPort: 8090
          name: mgmt
        args: ["--startup-delay=45s"]
        
        # Resource management for predictable performance
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        
        # Startup probe for slow initialization
        startupProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080", "-service=readiness"]
          initialDelaySeconds: 0
          periodSeconds: 5
          failureThreshold: 24  # 5s * 24 = 120s total startup time
          successThreshold: 1
        
        # Simple liveness check
        livenessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080", "-service=liveness"]
          initialDelaySeconds: 0  # Startup probe handles initialization
          periodSeconds: 10
          failureThreshold: 3
          timeoutSeconds: 5
        
        # Deep readiness check
        readinessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080", "-service=readiness"]
          initialDelaySeconds: 0
          periodSeconds: 5
          failureThreshold: 2
          successThreshold: 1
          timeoutSeconds: 5
        
        # Graceful shutdown coordination
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]  # Allow LB to drain
        
        # Environment variables for cloud provider integration
        env:
        - name: CLOUD_PROVIDER
          value: "auto-detect"  # Works with GCP, AWS, Azure
        - name: ENABLE_PROFILING
          value: "true"

Istio Service Mesh: Beyond Basic Lifecycle Management

While proper health checks and graceful shutdown are foundational, Istio adds critical production-grade capabilities that dramatically improve fault tolerance:

Automatic Retries and Circuit Breaking

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service
  namespace: demo
spec:
  host: payment-service.demo.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 2
    circuitBreaker:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    retryPolicy:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,gateway-error,connect-failure,refused-stream
      retryRemoteLocalities: true

Key Benefits for Production Systems

Automatic Request Retries: If a pod fails or becomes unavailable, Istio automatically retries requests to healthy instances
Circuit Breaking: Prevents cascading failures by temporarily cutting off traffic to unhealthy services
Load Balancing: Distributes traffic intelligently across healthy pods
Mutual TLS: Secures service-to-service communication without code changes
Observability: Provides detailed metrics, tracing, and logging for all inter-service communication
Canary Deployments: Enables safe rollouts with automatic traffic shifting
Rate Limiting: Protects services from being overwhelmed
Timeout Management: Prevents hanging requests with configurable timeouts

Termination Grace Period Calculation

The critical formula for calculating termination grace periods:

terminationGracePeriodSeconds = preStop delay + application shutdown timeout + buffer

Examples:
- Simple service: 10s + 20s + 5s = 35s
- Complex service: 15s + 45s + 5s = 65s
- Batch processor: 30s + 120s + 10s = 160s

Important: Services requiring more than 90-120 seconds to shut down should be re-architected using checkpoint-and-resume patterns.

Advanced Patterns for Production

1. Idempotency: Handling Duplicate Requests

Critical for production: When pods restart or network issues occur, clients may retry requests. Without idempotency, this can cause duplicate transactions, corrupted state, or financial losses. This is mandatory for all state-modifying operations.

package idempotency

import (
    "context"
    "crypto/sha256"
    "encoding/hex"
    "time"
    "sync"
    "errors"
)

var (
    ErrDuplicateRequest = errors.New("duplicate request detected")
    ErrProcessingInProgress = errors.New("request is currently being processed")
)

// IdempotencyStore tracks request execution with persistence
type IdempotencyStore struct {
    mu        sync.RWMutex
    records   map[string]*Record
    persister PersistenceLayer // Database or Redis for durability
}

type Record struct {
    Key         string
    Response    interface{}
    Error       error
    Status      ProcessingStatus
    ExpiresAt   time.Time
    CreatedAt   time.Time
    ProcessedAt *time.Time
}

type ProcessingStatus int

const (
    StatusPending ProcessingStatus = iota
    StatusProcessing
    StatusCompleted
    StatusFailed
)

// ProcessIdempotent ensures exactly-once processing semantics
func (s *IdempotencyStore) ProcessIdempotent(
    ctx context.Context,
    key string,
    ttl time.Duration,
    fn func() (interface{}, error),
) (interface{}, error) {
    // Check if we've seen this request before
    s.mu.RLock()
    record, exists := s.records[key]
    s.mu.RUnlock()
    
    if exists {
        switch record.Status {
        case StatusCompleted:
            if time.Now().Before(record.ExpiresAt) {
                return record.Response, record.Error
            }
        case StatusProcessing:
            return nil, ErrProcessingInProgress
        case StatusFailed:
            if time.Now().Before(record.ExpiresAt) {
                return record.Response, record.Error
            }
        }
    }
    
    // Mark as processing
    record = &Record{
        Key:       key,
        Status:    StatusProcessing,
        ExpiresAt: time.Now().Add(ttl),
        CreatedAt: time.Now(),
    }
    
    s.mu.Lock()
    s.records[key] = record
    s.mu.Unlock()
    
    // Persist the processing state
    if err := s.persister.Save(ctx, record); err != nil {
        return nil, err
    }
    
    // Execute the function
    response, err := fn()
    processedAt := time.Now()
    
    // Update record with result
    s.mu.Lock()
    record.Response = response
    record.Error = err
    record.ProcessedAt = &processedAt
    if err != nil {
        record.Status = StatusFailed
    } else {
        record.Status = StatusCompleted
    }
    s.mu.Unlock()
    
    // Persist the final state
    s.persister.Save(ctx, record)
    
    return response, err
}

// Example: Idempotent payment processing
func (s *PaymentService) ProcessPayment(ctx context.Context, req *PaymentRequest) (*PaymentResponse, error) {
    // Generate idempotency key from request
    key := generateIdempotencyKey(req)
    
    result, err := s.idempotencyStore.ProcessIdempotent(
        ctx,
        key,
        24*time.Hour, // Keep records for 24 hours
        func() (interface{}, error) {
            // Atomic transaction processing
            return s.processPaymentTransaction(ctx, req)
        },
    )
    
    if err != nil {
        return nil, err
    }
    return result.(*PaymentResponse), nil
}

// Atomic transaction processing
func (s *PaymentService) processPaymentTransaction(ctx context.Context, req *PaymentRequest) (*PaymentResponse, error) {
    // Use database transaction for atomicity
    tx, err := s.db.BeginTx(ctx, nil)
    if err != nil {
        return nil, err
    }
    defer tx.Rollback()
    
    // Step 1: Validate accounts
    if err := s.validateAccounts(ctx, tx, req); err != nil {
        return nil, err
    }
    
    // Step 2: Process payment atomically
    paymentID, err := s.executePayment(ctx, tx, req)
    if err != nil {
        return nil, err
    }
    
    // Step 3: Commit transaction
    if err := tx.Commit(); err != nil {
        return nil, err
    }
    
    return &PaymentResponse{
        PaymentID: paymentID,
        Status:    "completed",
        Timestamp: time.Now(),
    }, nil
}

2. Checkpoint and Resume: Long-Running Operations

For operations that may exceed the termination grace period, implement checkpointing:

package checkpoint

import (
    "context"
    "encoding/json"
    "time"
)

type CheckpointStore interface {
    Save(ctx context.Context, id string, state interface{}) error
    Load(ctx context.Context, id string, state interface{}) error
    Delete(ctx context.Context, id string) error
}

type BatchProcessor struct {
    store          CheckpointStore
    checkpointFreq int
}

type BatchState struct {
    JobID      string    `json:"job_id"`
    TotalItems int       `json:"total_items"`
    Processed  int       `json:"processed"`
    LastItem   string    `json:"last_item"`
    StartedAt  time.Time `json:"started_at"`
}

func (p *BatchProcessor) ProcessBatch(ctx context.Context, jobID string, items []string) error {
    // Try to resume from checkpoint
    state := &BatchState{JobID: jobID}
    if err := p.store.Load(ctx, jobID, state); err == nil {
        log.Printf("Resuming job %s from item %d", jobID, state.Processed)
        items = items[state.Processed:]
    } else {
        // New job
        state = &BatchState{
            JobID:      jobID,
            TotalItems: len(items),
            Processed:  0,
            StartedAt:  time.Now(),
        }
    }
    
    // Process items with periodic checkpointing
    for i, item := range items {
        select {
        case <-ctx.Done():
            // Save progress before shutting down
            state.LastItem = item
            return p.store.Save(ctx, jobID, state)
        default:
            // Process item
            if err := p.processItem(ctx, item); err != nil {
                return err
            }
            
            state.Processed++
            state.LastItem = item
            
            // Checkpoint periodically
            if state.Processed%p.checkpointFreq == 0 {
                if err := p.store.Save(ctx, jobID, state); err != nil {
                    log.Printf("Failed to checkpoint: %v", err)
                }
            }
        }
    }
    
    // Job completed, remove checkpoint
    return p.store.Delete(ctx, jobID)
}

3. Circuit Breaker Pattern for Dependencies

Protect your service from cascading failures:

package circuitbreaker

import (
    "context"
    "sync"
    "time"
)

type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    mu              sync.RWMutex
    state           State
    failures        int
    successes       int
    lastFailureTime time.Time
    
    maxFailures      int
    resetTimeout     time.Duration
    halfOpenRequests int
}

func (cb *CircuitBreaker) Call(ctx context.Context, fn func() error) error {
    cb.mu.RLock()
    state := cb.state
    cb.mu.RUnlock()
    
    if state == StateOpen {
        // Check if we should transition to half-open
        cb.mu.Lock()
        if time.Since(cb.lastFailureTime) > cb.resetTimeout {
            cb.state = StateHalfOpen
            cb.successes = 0
            state = StateHalfOpen
        }
        cb.mu.Unlock()
    }
    
    if state == StateOpen {
        return ErrCircuitOpen
    }
    
    err := fn()
    
    cb.mu.Lock()
    defer cb.mu.Unlock()
    
    if err != nil {
        cb.failures++
        cb.lastFailureTime = time.Now()
        
        if cb.failures >= cb.maxFailures {
            cb.state = StateOpen
            log.Printf("Circuit breaker opened after %d failures", cb.failures)
        }
        return err
    }
    
    if state == StateHalfOpen {
        cb.successes++
        if cb.successes >= cb.halfOpenRequests {
            cb.state = StateClosed
            cb.failures = 0
            log.Println("Circuit breaker closed")
        }
    }
    
    return nil
}

Testing Your Implementation

Manual Testing Guide

Test 1: Startup Race Condition

Setup:

# Deploy both services
kubectl apply -f k8s/bad-service.yaml
kubectl apply -f k8s/good-service.yaml

# Watch pods in separate terminal
watch kubectl get pods -n demo

Test the bad service:

# Force restart
kubectl delete pod -l app=bad-service -n demo

# Observe: Pod will enter CrashLoopBackOff due to liveness probe
# killing it before 45s startup completes

Test the good service:

# Force restart
kubectl delete pod -l app=good-service -n demo

# Observe: Pod stays in 0/1 Ready state for ~45s, then becomes ready
# No restarts occur thanks to startup probe

Test 2: Data Consistency Under Failure

Setup:

# Deploy payment service with reconciliation enabled
kubectl apply -f k8s/payment-service.yaml

# Start payment traffic generator
kubectl run payment-generator --image=payment-client:latest \
  --restart=Never --rm -it -- \
  --target=payment-service.demo.svc.cluster.local:8080 \
  --rate=10 --duration=60s

Simulate SIGKILL during transactions:

# In another terminal, kill pods abruptly
while true; do
  kubectl delete pod -l app=payment-service -n demo --force --grace-period=0
  sleep 30
done

Verify reconciliation:

# Check for data inconsistencies
kubectl logs -l app=payment-service -n demo | grep "inconsistency"

# Monitor reconciliation metrics
kubectl port-forward svc/payment-service 8090:8090
curl http://localhost:8090/metrics | grep consistency

Test 3: RTO/RPO Validation

Disaster Recovery Simulation:

# Simulate regional failure
kubectl patch deployment payment-service -n demo \
  --patch '{"spec":{"replicas":0}}'

# Measure RTO - time to restore service
start_time=$(date +%s)
kubectl patch deployment payment-service -n demo \
  --patch '{"spec":{"replicas":3}}'

# Wait for all pods to be ready
kubectl wait --for=condition=ready pod -l app=payment-service -n demo --timeout=900s
end_time=$(date +%s)
rto=$((end_time - start_time))

echo "RTO: ${rto} seconds"
if [ $rto -le 900 ]; then
  echo "? RTO target met (15 minutes)"
else
  echo "? RTO target exceeded"
fi

Test 4: Istio Resilience Features

Automatic Retry Testing:

# Deploy with fault injection
kubectl apply -f istio/fault-injection.yaml

# Generate requests with chaos header
for i in {1..100}; do
  grpcurl -H "x-chaos-test: true" -plaintext \
    payment-service.demo.svc.cluster.local:8080 \
    PaymentService/ProcessPayment \
    -d '{"amount": 100, "currency": "USD"}'
done

# Check Istio metrics for retry behavior
kubectl exec -n istio-system deployment/istiod -- \
  pilot-agent request GET stats/prometheus | grep retry

Monitoring and Observability

RTO/RPO Considerations

Recovery Time Objective (RTO): Target time to restore service after an outage Recovery Point Objective (RPO): Maximum acceptable data loss

Your service lifecycle design directly impacts these critical business metrics:

package monitoring

import (
    "time"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // RTO-related metrics
    ServiceStartupTime = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name: "service_startup_duration_seconds",
        Help: "Time from pod start to service ready",
        Buckets: []float64{1, 5, 10, 30, 60, 120, 300, 600}, // Up to 10 minutes
    }, []string{"service", "version"})
    
    ServiceRecoveryTime = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name: "service_recovery_duration_seconds", 
        Help: "Time to recover from failure state",
        Buckets: []float64{1, 5, 10, 30, 60, 300, 900}, // Up to 15 minutes
    }, []string{"service", "failure_type"})
    
    // RPO-related metrics
    LastCheckpointAge = promauto.NewGaugeVec(prometheus.GaugeOpts{
        Name: "last_checkpoint_age_seconds",
        Help: "Age of last successful checkpoint",
    }, []string{"service", "checkpoint_type"})
    
    DataConsistencyChecks = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "data_consistency_checks_total",
        Help: "Total number of consistency checks performed",
    }, []string{"service", "check_type", "status"})
    
    InconsistencyDetected = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "data_inconsistencies_detected_total",
        Help: "Total number of data inconsistencies detected",
    }, []string{"service", "inconsistency_type", "severity"})
)

Grafana Dashboard

{
  "dashboard": {
    "title": "Service Lifecycle - Business Impact",
    "panels": [
      {
        "title": "RTO Compliance",
        "description": "Percentage of recoveries meeting RTO target (15 minutes)",
        "targets": [{
          "expr": "100 * (histogram_quantile(0.95, service_recovery_duration_seconds_bucket) <= 900)"
        }],
        "thresholds": [
          {"value": 95, "color": "green"},
          {"value": 90, "color": "yellow"},
          {"value": 0, "color": "red"}
        ]
      },
      {
        "title": "RPO Risk Assessment",
        "description": "Data at risk based on checkpoint age",
        "targets": [{
          "expr": "last_checkpoint_age_seconds / 60"
        }],
        "unit": "minutes"
      },
      {
        "title": "Data Consistency Status",
        "targets": [{
          "expr": "rate(data_inconsistencies_detected_total[5m])"
        }]
      }
    ]
  }
}

Production Readiness Checklist

Before deploying to production, ensure your service meets these criteria:

Application Layer

[ ] Implements separate liveness and readiness endpoints
[ ] Readiness checks validate all critical dependencies
[ ] Graceful shutdown drains in-flight requests
[ ] Idempotency for all state-modifying operations
[ ] Anti-entropy/reconciliation processes implemented
[ ] Circuit breakers for external dependencies
[ ] Checkpoint-and-resume for long-running operations
[ ] Structured logging with correlation IDs
[ ] Metrics for startup, shutdown, and health status

Kubernetes Configuration

[ ] Startup probe for slow-initializing services
[ ] Distinct liveness and readiness probes
[ ] Calculated terminationGracePeriodSeconds based on actual shutdown time
[ ] PreStop hooks for load balancer draining
[ ] Resource requests and limits defined
[ ] PodDisruptionBudget for availability
[ ] Anti-affinity rules for high availability

Service Mesh Integration

[ ] Istio sidecar lifecycle annotations (holdApplicationUntilProxyStarts)
[ ] Istio automatic retry policies configured
[ ] Circuit breaker configuration in DestinationRule
[ ] Distributed tracing enabled
[ ] mTLS for service-to-service communication

Data Integrity & Recovery

[ ] RTO/RPO metrics tracked and alerting configured
[ ] Reconciliation processes tested with Game Day exercises
[ ] Chaos engineering tests validate failure scenarios
[ ] Synthetic monitoring for end-to-end business flows
[ ] Backup and restore procedures documented and tested

Common Pitfalls and Solutions

1. My service keeps restarting during deployment:

Symptom: Pods enter CrashLoopBackOff during rollout

Common Causes:

Liveness probe starts before application is ready
Startup time exceeds probe timeout
Missing startup probe

Solution:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30  # 30 * 10s = 5 minutes
  periodSeconds: 10

2. Data corruption during pod restarts:

Symptom: Inconsistent database state after deployments

Common Causes:

Non-atomic operations
Missing idempotency
No reconciliation processes

Solution:

// Implement atomic operations with database transactions
tx, err := db.BeginTx(ctx, nil)
if err != nil {
    return err
}
defer tx.Rollback()

// All operations within transaction
if err := processPayment(tx, req); err != nil {
    return err // Automatic rollback
}

return tx.Commit()

3. Service mesh sidecar issues:

Symptom: ECONNREFUSED errors on startup

Common Causes:

Application starts before sidecar is ready
Sidecar terminates before application

Solution:

annotations:
  sidecar.istio.io/holdApplicationUntilProxyStarts: "true"
  proxy.istio.io/config: |
    proxyMetadata:
      EXIT_ON_ZERO_ACTIVE_CONNECTIONS: "true"

Conclusion

Service lifecycle management is not just about preventing outages—it’s about building systems that are predictable, observable, and resilient to the inevitable failures that occur in distributed systems. This allows:

Zero-downtime deployments: Services gracefully handle rollouts without data loss.
Improved reliability: Proper health checks prevent cascading failures.
Better observability: Clear signals about service state and data consistency.
Faster recovery: Services self-heal from transient failures.
Data integrity: Idempotency and reconciliation prevent corruption.
Compliance readiness: Meet RTO/RPO requirements for disaster recovery.
Financial protection: Prevent duplicate transactions and data corruption that could cost millions.

The difference between a service that “works on my machine” and one that thrives in production lies in these details. Whether you’re running on GKE, EKS, or AKS, these patterns form the foundation of production-ready microservices.

Want to test these patterns yourself? The complete code examples and deployment manifests are available on GitHub.

Comments (0)

July 11, 2025

Building Resilient, Interactive Playbooks with Formicary

Filed under: Computing,Technology — admin @ 8:16 pm

In any complex operational environment, the most challenging processes are often those that can’t be fully automated. A CI/CD pipeline might be 99% automated, but that final push to production requires a sign-off. A disaster recovery plan might be scripted, but you need a human to make the final call to failover. These “human-in-the-loop” scenarios are where rigid automation fails and manual checklists introduce risk.

Formicary is a distributed orchestration engine designed to bridge this gap. It allows you to codify your entire operational playbook—from automated scripts to manual verification steps—into a single, version-controlled workflow. This post will guide you through Formicary‘s core concepts and demonstrate how to build two powerful, real-world playbooks:

A Secure CI/CD Pipeline that builds, scans, and deploys to staging, then pauses for manual approval before promoting to production.
A Semi-Automated Disaster Recovery Playbook that uses mocked Infrastructure as Code (IaC) to provision a new environment and waits for an operator’s go-ahead before failing over.

Formicary Features and Architecture

Formicary combines the robust workflow capabilities with the practical CI/CD features, all in a self-hosted, extensible platform.

Core Features

Declarative Workflows: Define complex jobs as a Directed Acyclic Graph (DAG) in a single, human-readable YAML file. Your entire playbook is version-controlled code.
Versatile Executors: A task is not tied to a specific runtime. Use the method that fits the job: KUBERNETES, DOCKER, SHELL, or even HTTP API calls.
Advanced Flow Control: Go beyond simple linear stages. Use on_exit_code to branch your workflow based on a script’s result, create polling “sensor” tasks, and define robust retry logic.
Manual Approval Gates: Explicitly define MANUAL tasks that pause the workflow and require human intervention to proceed via the UI or API.
Security Built-in: Manage secrets with database-level encryption and automatic log redaction. An RBAC model controls user access.

Architecture in a Nutshell

Formicary operates on a leader-follower model. The Queen server acts as the control plane, while one or more Ant workers form the execution plane.

Queen Server: The central orchestrator. It manages job definitions, schedules pending jobs based on priority, and tracks the state of all workers and executions.
Ant Workers: The workhorses. They register with the Queen, advertising their capabilities (e.g., supported executors and tags like gpu-enabled). They pick up tasks from the message queue and execute them.
Backend: Formicary relies on a database (like Postgres or MySQL) for state, a message queue (like Go Channels, Redis or Pulsar) for communication, and an S3-compatible object store for artifacts.

Getting Started: A Local Formicary Environment

The quickest way to get started is with the provided Docker Compose setup.

Prerequisites

Docker & Docker Compose
A local Kubernetes cluster (like Docker Desktop’s Kubernetes, Minikube, or k3s) with its kubeconfig file correctly set up. The embedded Ant worker will use this to run Kubernetes tasks.

Installation Steps

Clone the Repository: git clone https://github.com/bhatti/formicary.git && cd formicary
Launch the System:
This command starts the Queen server, a local Ant worker, Redis, and MinIO object storage. docker-compose up
Explore the Dashboard:
Once the services are running, open your browser to http://localhost:7777.

Example 1: Secure CI/CD with Manual Production Deploy

Our goal is to build a CI/CD pipeline for a Go application that:

Builds the application binary.
Runs static analysis (gosec) and saves the report.
Deploys automatically to a staging environment.
Pauses for manual verification.
If approved, deploys to production.

Here is the complete playbook definition:

job_type: secure-go-cicd
description: Build, scan, and deploy a Go application with a manual production gate.
tasks:
- task_type: build
  method: KUBERNETES
  container:
    image: golang:1.24-alpine
  script:
    - echo "Building Go binary..."
    - go build -o my-app ./...
  artifacts:
    paths: [ "my-app" ]
  on_completed: security-scan

- task_type: security-scan
  method: KUBERNETES
  container:
    image: securego/gosec:latest
  allow_failure: true # We want the report even if it finds issues
  script:
    - echo "Running SAST scan with gosec..."
    # The -no-fail flag prevents the task from failing the pipeline immediately.
    - gosec -fmt=sarif -out=gosec-report.sarif ./...
  artifacts:
    paths: [ "gosec-report.sarif" ]
  on_completed: deploy-staging

- task_type: deploy-staging
  method: KUBERNETES
  dependencies: [ "build" ]
  container:
    image: alpine:latest
  script:
    - echo "Deploying ./my-app to staging..."
    - sleep 5 # Simulate deployment work
    - echo "Staging deployment complete. Endpoint: http://staging.example.com"
  on_completed: verify-production-deploy

- task_type: verify-production-deploy
  method: MANUAL
  description: "Staging deployment complete. A security scan report is available as an artifact. Please verify the staging environment and the report before promoting to production."
  on_exit_code:
    APPROVED: promote-production
    REJECTED: rollback-staging

- task_type: promote-production
  method: KUBERNETES
  dependencies: [ "build" ]
  container:
    image: alpine:latest
  script:
    - echo "PROMOTING ./my-app TO PRODUCTION! This is a critical, irreversible step."
  on_completed: cleanup

- task_type: rollback-staging
  method: KUBERNETES
  container:
    image: alpine:latest
  script:
    - echo "Deployment was REJECTED. Rolling back staging environment now."
  on_completed: cleanup

- task_type: cleanup
  method: KUBERNETES
  always_run: true
  container:
    image: alpine:latest
  script:
    - echo "Pipeline finished."

Executing the Playbook

Upload the Job Definition: curl -X POST http://localhost:7777/api/jobs/definitions \ -H "Content-Type: application/yaml" \ --data-binary @playbooks/secure-ci-cd.yaml
Submit the Job Request: curl -X POST http://localhost:7777/api/jobs/requests \ -H "Content-Type: application/json" \ -d '{"job_type": "secure-go-cicd"}'
Monitor and Approve:
- Go to the dashboard. You will see the job run through build, security-scan, and deploy-staging.
- The job will then enter the MANUAL_APPROVAL_REQUIRED state.
- On the job’s detail page, you will see an “Approve” button next to the verify-production-deploy task.
- To approve via the API, get the Job Request ID and the Task Execution ID from the UI or API, then run:

Once approved, the playbook will proceed to promote-production and run the final cleanup step.

Example 2: Semi-Automated Disaster Recovery Playbook

Now for a more critical scenario: failing over a service to a secondary region. This playbook uses mocked IaC steps and pauses for the crucial final decision.

job_type: aws-region-failover
description: A playbook to provision and failover to a secondary region.
tasks:
- task_type: check-primary-status
  method: KUBERNETES
  container:
    image: alpine:latest
  script:
    - echo "Pinging primary region endpoint... it's down! Initiating failover procedure."
    - exit 1 # Simulate failure to trigger the 'on_failed' path
  on_completed: no-op # This path is not taken in our simulation
  on_failed: provision-secondary-infra

- task_type: provision-secondary-infra
  method: KUBERNETES
  container:
    image: hashicorp/terraform:light
  script:
    - echo "Simulating 'terraform apply' to provision DR infrastructure in us-west-2..."
    - sleep 10 # Simulate time for infra to come up
    - echo "Terraform apply complete. Outputting simulated state file."
    - echo '{"aws_instance.dr_server": {"id": "i-12345dr"}}' > terraform.tfstate
  artifacts:
    paths: [ "terraform.tfstate" ]
  on_completed: verify-failover

- task_type: verify-failover
  method: MANUAL
  description: "Secondary infrastructure in us-west-2 has been provisioned. The terraform.tfstate file is available as an artifact. Please VERIFY COSTS and readiness. Approve to switch live traffic."
  on_exit_code:
    APPROVED: switch-dns
    REJECTED: teardown-secondary-infra

- task_type: switch-dns
  method: KUBERNETES
  container:
    image: amazon/aws-cli
  script:
    - echo "CRITICAL: Switching production DNS records to the us-west-2 environment..."
    - sleep 5
    - echo "DNS failover complete. Traffic is now routed to the DR region."
  on_completed: notify-completion

- task_type: teardown-secondary-infra
  method: KUBERNETES
  container:
    image: hashicorp/terraform:light
  script:
    - echo "Failover REJECTED. Simulating 'terraform destroy' for secondary infrastructure..."
    - sleep 10
    - echo "Teardown complete."
  on_completed: notify-completion

- task_type: notify-completion
  method: KUBERNETES
  always_run: true
  container:
    image: alpine:latest
  script:
    - echo "Disaster recovery playbook has concluded."

Executing the DR Playbook

The execution flow is similar to the first example. An operator would trigger this job, wait for the provision-secondary-infra task to complete, download and review the terraform.tfstate artifact, and then make the critical “Approve” or “Reject” decision.

Conclusion

Formicary helps you turn your complex operational processes into reliable, trackable workflows that run automatically. It uses containers to execute tasks and includes manual approval checkpoints, so you can automate your work with confidence. This approach reduces human mistakes while making sure people stay in charge of the important decisions.

Comments (0)

June 14, 2025

Feature Flag Anti-Paterns: Learnings from Outages

Filed under: Computing — admin @ 9:54 pm

Feature flags are key components of modern infrastructure for shipping faster, testing in production, and reducing risk. However, they can also be a fast track to complex outages if not handled with discipline. Google’s recent major outage serves as a case study, and I’ve seen similar issues arise from missteps with feature flags. The core of Google’s incident revolved around a new code path in their “Service Control” system that should have been protected with a feature flag but wasn’t. This path, designed for an additional quota policy check, went directly to production without flag protection. When a policy change with unintended blank fields was replicated globally within seconds, it triggered the untested code path, causing a null pointer that crashed binaries globally. This incident perfectly illustrates why feature flags aren’t just nice-to-have—they’re essential guardrails that prevent exactly these kinds of global outages. Google also didn’t implement proper error handling, and the system didn’t use randomized exponential backoff that resulted in “thundering herd” effect that prolonged recovery.

Let’s dive into common anti-patterns I’ve observed and how we can avoid them:

Anti-Pattern 1: Inadequate Testing & Error Handling

This is perhaps the most common and dangerous anti-pattern. It involves deploying code behind a feature flag without comprehensive testing all states of that flag (on, off) and the various condition that interact with the flagged feature. It also includes neglecting robust error handling within the flagged code itself without defaulting flags to “off” in production. For example, Google’s Service Control binary crashed due to a null pointer when a new policy was propagated globally. This didn’t adequately tested the code path with empty input and failed to implement proper error handling. I’ve seen similar issues where teams didn’t test the code path protected with a feature flag in a test environment that only manifest in production. In other cases, the flag was accidentally left ON by default for production, leading to immediate issues upon deployment. The Google incident also mentions the problematic code “did not have appropriate error handling.” If the code within your feature flag assumes perfect conditions, it’s a ticking time bomb. These issues can be remedied by:

Default Off in Production: Ensure all new feature flags are disabled by default in production.
Comprehensive Testing: Test the feature with the flag ON and OFF. Crucially, test the specific conditions, data inputs, and configurations that trigger with the new code paths enabled by the flag.
Robust Error Handling: Implement proper error handling within the code controlled by the flag. It should fail gracefully or revert to a safe state if an unexpected issue occurs, not bring down the service.
Consider Testing Costs: If testing all combinations becomes prohibitively expensive or complex, it might indicate the feature is too large for a single flag and should be broken down.

Anti-Pattern 2: Inadequate Peer Review

This anti-pattern manifests when feature flag changes occur without a proper review process. It’s like making direct database changes in production without a change request. For example, Google’s issue was a policy metadata change rather than a direct flag toggle where metadata replicated globally within seconds. It is analogous to flipping a critical global flag without due diligence. If that policy metadata change had been managed like a code change (e.g., via GitOps or Config as a Code with canary rollout, the issue might have been caught earlier. This can be remedied with:

GitOps/Config-as-Code: Manage feature flag configurations as code within your Git repository. This enforces PRs, peer reviews, and provides an auditable history.
Test Flag Rollback: As part of your process, ensure you can easily and reliably roll back a feature flag configuration change, just like you would with code.
Detect Configuration Drift: Ensure that the actual state in production does not drift from what’s expected or version-controlled.

Anti-Pattern 3: Inadequate Authorization and Auditing

This means not protecting enabling/disabling feature flags with proper permissions. Internally, if anyone can flip a production flag via a UI without a PR or a second pair of eyes, we’re exposed. Also, if there’s no clear record of who changed it, when, and why, incident response becomes a frantic scramble. Remedies include:

Strict Access Control: Implement strong Role-Based Access Control (or Relationship-Based Access Control) to limit who can modify flag states or configurations in production.
Comprehensive Auditing: Ensure your feature flagging system provides detailed audit logs for every change: who made the change, what was changed, and when.

Anti-Pattern 4: No Monitoring

Deploying a feature behind a flag and then flipping it on for everyone without closely monitoring its impact is like walking into a dark room and hoping you don’t trip. This can be remedied by actively monitoring feature flags and collecting metrics on your observability platform. This includes tracking not just the flag’s state (on/off) but also its real-time impact on key system metrics (error rates, latency, resource consumption) and relevant business KPIs.

Anti-Pattern 5: No Phased Rollout or Kill Switch

This means turning a new, complex feature on for 100% of users simultaneously with a flag. For example, during Google’s incident, major changes to quota management settings were propagated immediately causing global outage. The “red-button” to disable the problematic serving path was crucial for their recovery. Remedies for this anti-pattern include:

Canary Releases & Phased Rollouts: Don’t enable features for everyone at once. Perform canary releases: enable for internal users, then a small percentage of production users while monitoring metrics.
“Red Button” Control: Have a clear, a “kill switch” or “red button” mechanism for quickly and globally disabling any problematic feature flag if issues arise.

Anti-Pattern 6: Thundering Herd

Enabling a feature flag can potentially change traffic patterns for incoming requests. For example, Google didn’t implement randomized exponential backoff in Service Control that caused “thundering herd” on underlying infrastructure. To prevent such issues, implement exponential backoff with jitter for request retries, combined with comprehensive monitoring.

Anti-Pattern 7: Misusing Flags for Config or Entitlements

Using feature flags as a general-purpose configuration management system or to manage complex user entitlements (e.g., free vs. premium tiers). For example, I’ve seen teams use feature flags to store API endpoints, timeout values, or rules about which customer tier gets which sub-feature. This means that your feature flag system becomes a de-facto distributed configuration database. This can be remedied with:

Purposeful Flags: Use feature flags primarily for controlling the lifecycle of discrete features: progressive rollout, A/B testing, kill switches.
Dedicated Systems: Use proper configuration management tools for application settings and robust entitlement systems for user permissions and plans.

Anti-Pattern 8: The “Zombie Flag” Infestation

Introducing feature flags but never removing them once a feature is fully rolled out or stable. I’ve seen codebases littered with if (isFeatureXEnabled) checks for features that have been live for years or were abandoned. This can be remedied with:

Lifecycle Management: Treat flags as having a defined lifespan.
Scheduled Cleanup: Regularly audit flags. Once a feature is 100% rolled out and stable (or definitively killed), schedule work to remove the flag and associated dead code.

Anti-Pattern 9: Ignoring Flagging Service Health

This means not considering how your application behaves if the feature flagging service itself experiences an outage or is unreachable. A crucial point in Google’s RCA was that their “Cloud Service Health infrastructure being down due to this outage” delayed communication. A colleague once pointed out: what happens if LaunchDarkly is down? This can be remedied with:

Safe Defaults in Code: When your code requests a flag’s state from the SDK (e.g., ldClient.variation("my-feature", user, **false**)), the provided default value is critical. For new or potentially risky features, this default must be the “safe” state (feature OFF).
SDK Resilience: Feature-Flag SDKs are designed to cache flag values and use them if the service is unreachable (stasis). But on a fresh app start before any cache is populated, your coded defaults are your safety net.

Summary

Feature flags are incredibly valuable for modern software development. They empower teams to move faster and release with more confidence. But as the Google incident and my own experiences show, they require thoughtful implementation and ongoing discipline. By avoiding these anti-patterns – by testing thoroughly, using flags for their intended purpose, managing their lifecycle, governing changes, and planning for system failures – we can ensure feature flags remain a powerful asset.

Comments (0)

March 4, 2025

Code Fossils: My Career Path Through Three Decades of Now-Obsolete Technologies

Filed under: Computing — admin @ 9:45 pm

Introduction

I became seriously interested in computers after learning about Microprocessor architecture during a special summer camp program at school that taught us about how computer systems work among other topics. I didn’t have easy access to computers at my school and I learned a bit more about programming on an Atari system. This hooked me to take some private lessons on programming languages and pursue computer science college studies in college and a career as a software developer spanning three decades.

My professional journey started mostly with mainframe systems and then I shifted more towards UNIX systems and then later to Linux environments. Along the way, I’ve witnessed entire technological ecosystems rise, thrive, and ultimately vanish like programming languages abandoned despite their elegance, operating systems forgotten despite their robustness, and frameworks discarded despite their innovation. I will dig through my personal experience with some of the archaic technologies that have largely disappeared or diminished in importance. I’ve deliberately omitted technologies I still use regularly to spotlight these digital artifacts. These extinct technologies shaped how we approach computing problems and contain the DNA of our current systems. They remind us that today’s indispensable technologies may someday join them in the digital graveyard.

Programming Languages

BASIC & GW-BASIC

I initially learned BASIC on an Atari system and later learned GW-BASIC, which introduced me to graphics programming on IBM XT computers running early DOS versions. The use of line numbers organizing program flow with GOTO and GOSUB statements seemed strange to me but its simplicity helped me to build create programs with sounds and graphics. Eventually, I moved to Microsoft QuickBASIC that had support for procedures and structured programming. This early taste of programming led me to pursue Computer Science in college. I sometimes worry about today’s beginners facing overwhelming complexity like networking, concurrency, and performance optimization just to build a simple web application. BASIC on the other hand was very accessible and rewarding for newcomers despite its limitations.

Pascal & Turbo Pascal

College introduced me to both C and Pascal through Borland’s Turbo compilers. I liked cleaner and more readable syntax of Pascal compared to C. At the time, C had best performance so Pascal couldn’t gain wide adoption and it has largely disappeared from mainstream development. Interestingly, career of Turbo Pascal’s author, Anders Hejlsberg was saved by Microsoft who went on to create C# and later TypeScript. This trajectory taught me that technical superiority alone doesn’t ensure survival.

FORTRAN

During a college internship at a physics laboratory, I learned about FORTRAN running on massive DEC-VAX/VMS systems, which was very popular among scientific computing at the time. While FORTRAN maintains a niche presence in scientific circles but DEC VAX/VMS systems have vanished entirely from the computing landscape. VMS systems were known for powerful, reliable and stable computing environments but DEC failed to adapt to the industry’s shift toward smaller, more cost-effective systems. The market ultimately embraced UNIX variants that offered comparable capabilities at lower price points with greater flexibility. This transition taught me an early lesson in how economic factors often trump technical superiority.

COBOL, CICS and Assembler

My professional career at a marketing firm began with COBOL, CICS, and Assembler on mainframe. JCL (Job Control Language) was used to submit the mainframe jobs that had unforgiving syntax where a misplaced comma could derail an entire batch job. I used COBOL for batch processing applications that primarily processed sequential ISAM files or the more advanced VSAM files with their B-Tree indexing for direct data access. These batch jobs often ran for hours or even days that created long feedback cycles where a single error could cause cascading delays and missed deadlines.

I used CICS for building interactive applications with their distinctive green-screen terminals. I had to use BMS (Basic Mapping Support) for designing the 3270 terminal screen layouts, which was notoriously finicky language. I built my own tool to convert plain text layouts into proper BMS syntax so that I didn’t have to debug syntax errors. The most challenging language that I had to use was mainframe Assembler, which was used for performance-critical system components. These programs were monolithic workhorses —thousands of lines of code in single routines with custom macros simulating higher-level programming constructs. Thanks to the exponential performance improvements in modern hardware, most developers rarely need to descend to this level of programming.

PERL

I first learned PERL in college and embraced it throughout the 1990s as a versatile tool for both system administration and general-purpose programming. Its killer feature—regular expressions—made it indispensable for text processing tasks that would have been painfully complex in other languages. At a large credit verification company, I leveraged PERL’s pattern-matching to automate massive codebase migrations, transforming thousands of lines of code from one library to another. Later, at a major airline, I used similar techniques to upgrade legacy systems to newer WebLogic APIs without manual rewrites.

In the web development arena, I used PERL to build early CGI applications and it was a key component of revolutionary LAMP stack (Linux, Apache, MySQL, PERL) before PHP/Python supplanted it. The CPAN repository was another groundbreaking innovation that allowed reusing shared libraries at scale. I used it along with Mason web templating system at a large online retailer in the mid 2000s and then migrated some of those applications to Java as PERL based systems were difficult to maintain. I found similar experience with other PERL codebases and I eventually moved to Python, which offered cleaner object-oriented design patterns and syntax. Its cultural impact—from the camel book to CPAN—influenced an entire generation of programmers, myself included.

4th Generation Languages

Early in my career, Fourth Generation Languages (4GLs) promised a dramatic boost productivity by providing a simple UI for managing the data. On mainframe systems, I used Focus and SAS for data queries and analytics, creating reports and processing data with a few lines of code. For desktop applications, I used a variety of 4GL environments including dBase III/IV, FoxPro, Paradox, and Visual Basic. These tools were remarkable for their time, offering “query by example” interfaces that allowed quickly build database applications with minimal coding. However, as data volumes grew, the limitations of these systems became painfully apparent. Eventually, I transitioned to object-oriented languages paired with enterprise relational databases that offered better scalability and maintainability. Nevertheless, these tools represent an important evolutionary step that influenced modern RAD (Rapid Application Development) approaches and low-code platforms that continue to evolve today.

Operating Systems

Mainframe Legacy

My career began at a marketing company working on IBM 360/390 mainframes running MVS (Multiple Virtual Storage). I used a combination of JCL, COBOL, CICS, and Assembler to build batch applications that processed millions of customer records. Working with JCL (Job Control Language) was particularly challenging due to its incredibly strict syntax where a single misplaced comma could cause an entire batch run to fail. The feedback cycle was painfully slow; submitting a job often meant waiting hours or even overnight for results. We had to use extensive “dry runs” of jobs to test the business logic —a precursor to what we now call unit testing. Despite these precautions, mistakes happened, and I witnessed firsthand how a simple programming error caused the company to mail catalogs to incorrect and duplicate addresses, costing millions in wasted printing and postage.

These systems also had their quirks: they used EBCDIC character encoding rather than the ASCII standard found in most other systems. They also stored data inefficiently—a contributing factor to the infamous Y2K crisis, as programs commonly stored years as two digits to save precious bytes of memory in an era when storage was extraordinarily expensive. Terminal response times were glacial by today’s standards—I often had to wait several seconds to see what I’d typed appear on screen. Yet despite their limitations, these mainframes offered remarkable reliability. While the UNIX systems I later worked with would frequently crash with core dumps (typically from memory errors in C programs), mainframe systems almost never went down. This stability, however, came partly from their simplicity—most applications were essentially glorified loops processing input files into output files without the complexity of modern systems.

UNIX Variants

Throughout my career, I worked extensively with numerous UNIX variants descended from both AT&T’s System V and UC Berkeley’s BSD lineages. At multiple companies, I deployed applications on Sun Microsystems hardware running SunOS (BSD-based) and later Solaris (System V-based). These systems, while expensive, provided the superior graphics capability, reliability and performance needed for mission-critical applications. I used SGI’s IRIX operating system running on impressive graphical workstations when working at a large physics lab. These systems processed massive datasets from physics experiments by leveraging non-uniform memory access (NUMA) and symmetric multi-processing (SMP) based architecture. IRIX was among the first mainstream 64-bit operating systems, pushing computational boundaries years before this became standard. They were used for visual effects in movies like Jurassic Park to life in 1993, which was amazing to watch. I also worked with IBM’s AIX on SP1/SP2 supercomputers at the physics lab, using Message Passing Interface (MPI) APIs to distribute processing across hundreds of nodes. This message-passing approach ultimately proved more scalable than shared-memory architectures, though modern systems incorporate both paradigms—today’s multi-core processors rely on the same NUMA/SMP concepts pioneered in these early UNIX variants.

On the down side, these systems were very expensive and Moore’s Law enabled commodity PC hardware running Linux to achieve comparable performance at a fraction of the price. I saw a lot of those large systems replaced with a farm of low-cost PCS based on Linux clusters that reduced infrastructure costs drastically. I was deeply passionate about UNIX and even spent most of my savings in the early ’90s on a high-end PowerPC system, which was result of a partnership between IBM, Motorola, Apple, and Sun. This machine could run multiple operating systems including Solaris and AIX, though I primarily used it for personal projects and learning.

DOS, OS/2, SCO and BeOS

For personal computing in the 1980s and early 1990s, I primarily used MS-DOS, even developing several shareware applications and games that I sold through bulletin board systems. DOS, with its command-line interface and conventional/expanded memory limitations taught me valuable lessons about resource optimization that remain relevant even in today. I preferred UNIX-like environments whenever possible so I installed SCO UNIX (based on Microsoft’s Xenix) on my personal computer. SCO was initially respected in the industry before it transformed into a patent troll with controversial patent lawsuits against Linux distributors. I also liked OS/2 and it was a technically superior operating system developed compared to Windows with its support of true pre-emptive multitasking. But it lost to Windows due to massive Microsoft’s market power similar to other innovative competitors like Borland, Novell, and Netscape.

Perhaps the most elegant of these alternative systems was BeOS, which I eagerly tested in the mid-1990s when it released in beta. It supported microkernel design and pervasive multithreading capabilities, and was a serious contender for Apple’s next-generation OS. However, Apple ultimately acquired NeXT instead, bringing Steve Jobs back and adopting NeXTSTEP as the foundation—another case where superior technology lost to business considerations and personal relationships.

Storage Media

My first PC had a modest 40MB hard drive and I relied heavily on floppy disks in both 5.25-inch and later 3.5-inch formats. They took a long time to copy data and made a lot of scratching sounds as both progress indicators and early warning systems for impending failures. In professional environments, I worked with SCSI drives that had better speed and reliability. I generally employed RAID configurations to protect against drive failures. For archiving and backup, I generally used tape drives that were also painfully slow but could store much more data. In mid-1990s, I switched to Iomega’s Zip drives from floppy disks for personal backups that could store up to 100MB compared to 1.44MB floppies. Similarly, I used CD-R and later CD-RW drives for storage that also had slow write speeds initially.

Network Protocols

In my early career, networking was fairly fragmented and I generally used Novell’s proprietary IPX (Internetwork Packet Exchange) protocol and Novell NetWare networks at work. It provided nice support of file sharing and printing service. On mainframe systems, I worked with Token Ring networks that offered more reliable deterministic performance. As the internet was based on TCP/IP, it eventually took over along with UNIX and Linux systems. For file sharing across these various systems, I relied on NFS (Network File System) in UNIX environments and later Samba to bridge the gap between UNIX and Windows systems that used SMB (Server Message Block) protocol. Both solutions were plagued with performance issues due to file locking issues. I spent countless hours troubleshooting “stale file handles” and unexpected disconnections that plagued these early networked file systems.

Databases

My database journey began on mainframe systems with IBM’s VSAM (Virtual Storage Access Method), which wasn’t a true database but provided crucial B-Tree indexing for efficient file access. I also worked with IBM’s IMS, a hierarchical database that organized data in parent-child tree structures. The relational databases were truly revolutionary at the time and I embraced systems like IBM DB2, Oracle, and Microsoft SQL Server. In my college, I took a number of courses in theory of relational databases and appreciated its strong mathematical foundations. However, most of those relational databases were commercial and expensive and I looked at open source projects like MiniSQL but it lacked critical enterprise features like transaction support.

In mid 1990s, I saw object-oriented databases gained popularity along with object-oriented programming that promised to eliminate the “impedance mismatch” between object models and relational tables. I evaluated ObjectStore for some projects and ultimately deployed Versant to manage complex navigation data for traffic mapping systems—predecessors to today’s Google Maps services. These databases elegantly handled complex object relationships and inheritance hierarchies, but introduced their own challenges in querying, scaling, and integration with existing systems. The relational databases later absorbed object-oriented concepts like user-defined types, XML support, and JSON capabilities. Looking back, it taught me that systems built on strong theoretical foundations with incremental adaptation tend to outlast revolutionary approaches.

Security and Authentication

Early in my career, I worked as a UNIX system administrator and relied on /etc/passwd files for authentication that were world-readable, containing password hashes generated with the easily crackable crypt algorithm. For multi-system environments, I used NIS (Network Information Service) to centrally manage user accounts across server clusters. We also commonly used .rhosts files to allow password-less authentication between trusted systems. I later used Kerberos authentication systems to provide stronger single sign-on capabilities for enterprise environments. When working at a large airline, I used Netegrity SiteMinder to implement single sign-on based access. While consulting for a manufacturing company, I built SSO implementations using LDAP and Microsoft Active Directory across heterogeneous systems. The Java ecosystem brought its own authentication frameworks and I worked extensively with JAAS (Java Authentication and Authorization Service) and later Acegi Security before moving to SAML (Security Assertion Markup Language) and OAuth based authentication standards.

Applications & Development Tools

Desktop Applications (Pre-Web)

My early word processing was done in WordStar with its cryptic Ctrl-key commands, before moving to WordPerfect, which offered better formatting control. For technical documentation, I relied on FrameMaker that supported sophisticated layout for complex documents. For spreadsheets, I initially used VisiCalc, which was the original “killer app” on Apple II but later Lotus 1-2-3, which revolutionized common keyboard shortcuts that still exist in Excel today. When working for a marketing company, I used Lotus Notes, a collaboration tool that functioned as an email client, calendar, document management system, and application development platform. On UNIX workstations, I preferred text-based applications like elm and pine for email and lynx text browser when accessing remote machines on telnet.

Chat & Communication Tools

On early UNIX systems at work, I used the simple ‘talk’ command to chat with other users on the system. At home during the pre-internet era, I immersed myself in the Bulletin Board System (BBS) culture. I also hosted my own BBS, learning firsthand about the challenges of building and maintaining online communities. I used CompuServe for access to group forums and Internet Relay Chat (IRC) through painfully slow dial-up and later SLIP/PPP connections. My fascination with IRC led me to develop my own client application, PlexIRC, which I distributed as shareware. As graphical interfaces took over, I adopted ICQ and Yahoo Messenger for personal communications. These platforms introduced status indicators, avatars, and file transfers that we now take for granted. While AOL Instant Messenger dominated the American market, I deliberately avoided the AOL ecosystem, preferring more open alternatives. My professional interest gravitated toward Jabber, which later evolved into the XMPP protocol standard with its federated approach to messaging—allowing different servers to communicate like email. I later implemented XMPP-based messaging solutions for several organizations, appreciating its extensible framework and standardized approach.

Development Environments

On UNIX systems, I briefly wrestled with ‘ed’—a line editor so primitive by today’s standards that its error message was simply a question mark. I quickly graduated to Vi, whose keyboard shortcuts became muscle memory that persists to this day through modern incarnations like Vim and NeoVim. In the DOS world, Borland Sidekick revolutionized my workflow as one of the first TSR (Terminate and Stay Resident) applications. With a quick keystroke, Sidekick would pop up a notepad, calculator, or calendar without exiting the primary application. For debugging and system maintenance, I used Norton Utilities that provided essential tools like disk recovery, defragmentation, and a powerful hex editor that saved countless hours when troubleshooting low-level issues. I learned about the IDE (Integrated Development Environment) through Borland’s groundbreaking products like Turbo Pascal and Turbo C that combined fast compilers with editing and debugging in a seamless package. These evolved into more sophisticated tools like Borland C++ with its application frameworks. For specialized work, I used Watcom C/C++ for its cross-platform capabilities and optimization features. As Java gained prominence, I adopted tools like JBuilder and Visual Cafe, which pioneered visual development for the platform. Eventually, I moved to Eclipse and later IntelliJ IDEA, alongside Visual Studio. Though, I still enable Vi mode on these IDEs due to its powerful editing capabilities without the need of mouse.

Web Technologies

I experienced the early internet ecosystem in college—navigating Gopher menus for document retrieval, searching with WAIS, and participating in Usenet newsgroups. Everything changed with the release of NCSA HTTPd server and the Mosaic browser. I used these revolutionary tools on Sun workstations in college and later at a high-energy physics laboratory on UNIX workstations. I left my cushy job to find web related projects and secured a consulting position at a financial institution building web access for credit card customers. I used C/C++ with CGI (Common Gateway Interface) to build dynamic web applications that connected legacy systems to this new interface. These early days of web development were like the Wild West—no established security practices, framework standards, or even consistent browser implementations existed. During a code review when working at a major credit card company, I discovered a shocking vulnerability: their web application stored usernames and passwords directly in cookies in plaintext, essentially exposing customer credentials to anyone with basic technical knowledge. These early web servers used a process-based concurrency model, spawning a new process for each request—inefficient by modern standards but there wasn’t much user traffic at the time. On the client side, I worked with the Netscape browser, while server implementations expanded to include Apache, Netscape Enterprise Server, and Microsoft’s IIS.

I also built my own Model-View-Controller architecture and templating system because there weren’t any established frameworks available. As Java gained traction, I migrated to JSP and the Struts framework, which formalized MVC patterns for web applications. This evolution continued as web servers evolved from process-based to thread-based concurrency models, and eventually to asynchronous I/O implementations in platforms like Nginx, dramatically improving scalability. Having witnessed the entire evolution—from hand-coded HTML to complex JavaScript frameworks—gives me a unique perspective on how rapidly this technology landscape has developed.

Distributed Systems Development

My journey with distributed systems began with Berkeley Sockets—the foundational API that enabled networked communication between applications. After briefly working with Sun’s RPC (Remote Procedure Call) APIs, I embraced Java’s Socket implementation and then its Remote Method Invocation (RMI) framework, which I used to implement remote services when working as a consultant for an enterprise client. RMI offered the revolutionary ability to invoke methods on remote objects as if they were local, handling network communication transparently and even dynamically loading remote classes. At a major travel booking company, I worked with Java’s JINI technology, which was inspired by Linda memory model and TupleSpace that I also studied during my postgraduate research. JINI extended RMI with service discovery and leasing mechanisms, creating a more robust foundation for distributed applications. I later used GigaSpaces, which expanded the JavaSpaces concept into a full in-memory data grid for session storage.

For personal projects, I explored Voyager, a mobile agent platform that simplified remote object interaction with dynamic proxies and mobile object capabilities. Despite its technical elegance, Voyager never achieved widespread adoption—a pattern I would see repeatedly with technically superior but commercially unsuccessful distributed technologies. While contracting for Intelligent Traffic Systems in the Midwest during the late 1990s, I implemented CORBA-based solutions that collected real-time traffic data from roadway sensors and distributed it to news agencies via a publish-subscribe model. CORBA promised language-neutral interoperability through its Interface Definition Language (IDL), but reality fell short—applications typically worked reliably only when using components from the same vendor. I had to implement custom interceptors to add the authentication and authorization capabilities CORBA lacked natively. Nevertheless, CORBA’s explicit interface definitions via IDL influenced later technologies like gRPC that we still use today. The Java Enterprise (J2EE) era brought Enterprise JavaBeans and I implemented these technologies using BEA WebLogic for another state highway system, and continued working with them at various travel, airline, and fintech companies. EJB’s fatal flaw was attempting to abstract away the distinction between local and remote method calls—encouraging developers to treat distributed objects like local ones. This led to catastrophic performance problems as applications made thousands of network calls for operations that should have been local.

I read Rod Johnson’s influential critique of EJB that eventually evolved into the Spring Framework, offering a more practical approach to Java enterprise development. Around the same time, I transitioned to simpler XML-over-HTTP designs before the industry standardized on SOAP and WSDL. The subsequent explosion of WS-* specifications (WS-Security, WS-Addressing, etc.) created such complexity that the diagram of their interdependencies resembled the Death Star. I eventually abandoned SOAP’s complexity for JSON over HTTP, implementing long-polling and Server-Sent Events (SSE) for real-time applications before adopting the REST architectural style that dominates today’s API landscape. Throughout these transitions, I integrated various messaging systems including IBM WebSphere MQ, JMS implementations, TIBCO Rendezvous, and Apache ActiveMQ to provide asynchronous communication capabilities. This journey through distributed systems technologies reflects a recurring pattern: the industry oscillating between complexity and simplicity, between comprehensive frameworks and minimal viable approaches. The technologies that endured longest were those that acknowledged and respected the fundamental challenges of distributed computing—network unreliability, latency, and the fallacies of distributed computing—rather than attempting to hide them behind leaky abstractions.

Client & Mobile Development

Terminal & Desktop GUI

My journey developing client applications began with CICS on mainframe systems—creating those distinctive green-screen interfaces for 3270 terminals once ubiquitous in banking and government environments. The 4th generation tools era introduced me to dBase and Paradox, which I used to build database-driven applications through their “query by example” interfaces, which allowed rapid development of forms and reports without extensive coding. For personal projects, I developed numerous DOS applications, games, and shareware using Borland Turbo C. As Windows gained prominence, I transitioned to building GUI applications using Borland C++ with OWL (Object Windows Library) and later Microsoft Foundation Classes (MFC), which abstracted the complex Windows API into an object-oriented framework. While working for a credit protection company, I developed UNIX-based client applications using OSF/Motif. Motif’s widget system and resource files offered sophisticated UI capabilities, though with considerable implementation complexity.

Web Clients

The web revolution transformed client development fundamentally. I quickly adopted HTML for financial and government projects, creating browser-based interfaces that eliminated client-side installation requirements. For richer interactive experiences, I embedded Flash elements into web applications—creating animations and interactive components beyond HTML’s capabilities at the time. Java’s introduction brought the promise of “write once, run anywhere,” which I embraced through Java applets that you could embed like Flash widgets. Later, Java Web Start offered a bridge between web distribution and desktop application capabilities, allowing applications to be launched from browsers while running outside their security sandbox. Using Java’s AWT and later Swing libraries, I built standalone applications including IRC and email clients. The client-side JavaScript revolution, catalyzed by Google’s demonstration of AJAX techniques, fundamentally changed web application architecture. I experimented with successive generations of JavaScript libraries—Prototype.js for DOM manipulation, Script.aculo.us for animations, YUI for more component sets, etc.

Embedded and Mobile Development

As Java had its roots in embedded/TV systems, it introduced a wearable smart Java ring with an embedded microchip that I used for some personal security applications. Though the Java Ring quickly disappeared from the market, its technological descendants like the iButton continued finding specialized applications in security and authentication systems. The mobile revolution began in earnest with the Palm Pilot—a breakthrough device featuring the innovative Graffiti handwriting recognition system that transformed how we interacted with portable computers. I embraced Palm development, creating applications for this pioneering platform and carrying a Palm device for years. As mobile technologies evolved, I explored the Wireless Application Protocol (WAP), which attempted to bring web content to the limited displays and bandwidth of early mobile phones but failed to gain widespread adoption. When Java introduced J2ME (Java 2 Micro Edition), I invested heavily in mastering this platform, attracted by its promise of cross-device compatibility across various feature phones. I developed applications targeting the constrained CLDC (Connected Limited Device Configuration) and MIDP (Mobile Information Device Profile) specifications.

The entire mobile landscape transformed dramatically when Apple introduced the first iPhone in 2007—a genuine paradigm shift that redefined our expectations for mobile devices. Recognizing this fundamental change, I learned iOS development using Objective-C with its message-passing syntax and manual memory management. This investment paid off when I developed an iOS application for a fintech company that significantly contributed to its acquisition by a larger trading firm. Early mobile development eerily mirrored my experiences with early desktop computing—working within severe hardware constraints that demanded careful resource management. Despite theoretical advances in programming abstractions, I found myself once again meticulously optimizing memory usage, minimizing disk operations, and carefully managing network bandwidth. This return to fundamental computing constraints reinforced my appreciation for efficiency-minded development practices that remain valuable even as hardware capabilities continue expanding.

Development Methodologies

My first corporate experience introduced me to Total Quality Management (TQM), with its focus on continuous improvement and customer satisfaction. This early exposure taught me a crucial lesson: methodology adoption depends more on organizational culture than on the framework itself. Despite new terminology and reorganized org charts, the company largely maintained its existing practices with superficial changes. Later, I worked with organizations implementing the Capability Maturity Model (CMM), which attempted to categorize development processes into five maturity levels. While this framework provided useful structure for improving chaotic environments, its documentation requirements and formal assessments often created bureaucratic overhead that impeded actual development. Similarly, the Rational Unified Process (RUP), which I used at several companies, offered comprehensive guidance but it turned into waterfall development model in many projects. The agile revolution emerged as a reaction against these heavyweight methodologies. I applied elements of Feature-Driven Development and Spiral methodologies when working at a major airline, focusing on iterative development and explicit risk management. I explored various agile approaches during this period—Crystal’s focus on team communication, Adaptive Software Development’s emphasis on change tolerance, and particularly Extreme Programming (XP), which introduced practices like test-driven development and pair programming that fundamentally changed how I approached code quality. Eventually, most organizations where I worked settled on customized implementations of Scrum and Kanban—frameworks that continue to dominate agile practice today.

Development Methodologies & Modeling

Earlier in my career, approaches like Rapid Application Development (RAD) and Joint Application Development (JAD) emphasized quick prototyping and intensive stakeholder workshops. These methodologies aligned with Computer-Aided Software Engineering (CASE) tools like Rational Rose and Visual Paradigm, which promised to transform software development through visual modeling and automated code generation. On larger projects, I spent months creating elaborate UML diagrams—use cases, class diagrams, sequence diagrams, and more. Some CASE tools I used could generate code frameworks from these models and even reverse-engineer models from existing code, promising a synchronized relationship between design and implementation. The reality proved disappointing; generated code was often rigid and difficult to maintain, while keeping models and code in sync became an exercise in frustration. The agile movement ultimately eclipsed both heavyweight methodologies and comprehensive CASE tools, emphasizing working software over comprehensive documentation.

DevOps Evolution

Version Control System

My introduction to version control came at a high-energy physics lab, where projects used primitive systems like RCS (Revision Control System) and SCCS (Source Code Control System). These early tools stored delta changes for each file and relied on exclusive locking mechanisms—only one developer could edit a file at a time. As development teams grew, most projects migrated to CVS (Concurrent Versions System), which built upon RCS foundations. CVS supported networked operations, allowing developers to commit changes from remote locations, and replaced exclusive locking with a more flexible concurrent model. However, CVS still operated at the file level rather than treating commits as project-wide transactions, leading to potential inconsistencies when only portions of related changes were committed. I continued using CVS for years until Subversion emerged as its logical successor. Subversion’s introduction of atomic commits to ensure that either all or none of a change would be committed. It also improved branching operations, directory management, and file metadata handling, addressing many of CVS’s limitations. While working at a travel company, I encountered AccuRev, which introduced the concept of “streams” instead of traditional branches. This approach modeled development as flowing through various stages. AccuRev proved particularly valuable for managing offshore development teams who needed to download large codebases over unreliable networks and its sophisticated change management reduced bandwidth requirements.

During my time at a large online retailer in the mid-2000s, I worked with Perforce, a system optimized for large-scale development with massive codebases and binary assets. Perforce’s performance with large files and sophisticated security model made it ideal for enterprise environments. I briefly used Mercurial for some projects, appreciating its simplified interface compared to early Git versions, before ultimately transitioning to Git as it became the industry standard. This evolution of version control parallels the increasing complexity of software development itself: from single developers working on isolated files to globally distributed teams collaborating on massive codebases.

Build Systems

I have been using Make probably throughout my career across various platforms and languages. Its declarative approach to defining dependencies and build rules established patterns that influence build tools to this day. After adopting Java ecosystem, I switched to Apache Ant, which used XML to define build tasks as an explicit sequence of operations. This offered greater flexibility and cross-platform consistency but at the cost of increasingly verbose build files as projects grew more complex. I used Ant extensively during Java’s enterprise ascendancy, customizing its tasks to handle deployment, testing, and reporting. I then adopted Maven that introduced revolutionary concepts such as convention-over-configuration philosophy with standardized project structures, dependency management capabilities connected to remote repositories to automatically resolve and download required libraries. Despite Maven’s transformative nature, its rigid conventions and complex XML configuration was a bit frustrating and I later switched to Gradle. Gradle offered Maven’s dependency management with a Groovy-based DSL that provided both the structure of declarative builds and the flexibility of programmatic customization.

The build process expanded beyond compilation when I implemented Continuous Integration using CruiseControl, an early CI server developed by ThoughtWorks. This system automatically triggered builds on code changes, ran tests, and reported results. Later, I worked extensively with Hudson, which offered a more user-friendly interface and plugin architecture for extending CI capabilities. When Oracle acquired Sun and attempted to trademark the Hudson name, the community rallied behind a fork called Jenkins, which rapidly became the dominant CI platform. I used Jenkins for years, creating complex pipelines that automated testing, deployment, and release processes across multiple projects and environments. Eventually, I transitioned to cloud-based CI/CD platforms that integrated more seamlessly with hosted repositories and containerized deployments.

Summary

As I look back across my three decades in technology, these obsolete systems and abandoned platforms aren’t just nostalgic relics—they tell a powerful story about innovation, market forces, and the unpredictable nature of technological evolution. The technologies I’ve described throughout this blog didn’t disappear because they were fundamentally flawed. Pascal offered cleaner syntax than C, BeOS was more elegant than Windows, and CORBA attempted to solve distributed computing problems we still grapple with today. Borland’s superior development tools lost to Microsoft’s ecosystem advantages. Object-oriented databases, despite solving real problems, couldn’t overcome the momentum of relational systems. Yet these extinct technologies left lasting imprints on our industry. Anders Hejlsberg, who created Turbo Pascal, went on to shape C# and TypeScript. The clean design principles of BeOS influenced aspects of modern operating systems. Ideas don’t die—they evolve and find new expressions in subsequent generations of technology.

Perhaps the most valuable lesson is about technological adaptability. Throughout my career, the skills that have remained relevant weren’t tied to specific languages or platforms, but rather to fundamental concepts: understanding data structures, recognizing patterns in system design, and knowing when complexity serves a purpose versus when it becomes a hurdle. The industry’s constant reinvention ensures that many of today’s dominant technologies will eventually face their own extinction event. By understanding the patterns of the past, we gain insight into which current technologies might have staying power. This digital archaeology isn’t just about honoring what came before—it’s about understanding the cyclical nature of our industry and preparing for what comes next.

Comments (0)

September 21, 2024

Robust Retry Strategies for Building Resilient Distributed Systems

Filed under: API,Computing,Microservices — admin @ 10:50 am

Introduction

Distributed systems inherently involve multiple components such as services, databases, networks, etc., which are spread across different machines or locations. These systems are prone to partial failures, where one part of the system may fail while others remain operational. A common strategy for building fault-tolerant and resilient systems is to recover from transient failures by retrying failed operations. Here are some common use cases for implementing retries to maintain reliability in such environments:

Recover from Transient Failures such as network glitches, dropped packets, or temporary unavailability of services. These failures are often short-lived, and a simple retry may succeed without any changes to the underlying system.
Recover from Network Instability due to packet loss, latency, congestion, or intermittent connectivity can disrupt communication between services.
Recover from Load Shedding or Throttling where services may experience momentary overloads and are unable to handle incoming requests.
Asynchronous Processing or Eventual Consistency models may take time to converge state across different nodes or services and operations might fail temporarily if the system is in an intermediate state.
Fault Isolation in microservices architectures, where services are loosely coupled but depend on one another. The downstream services may fail temporarily due to a service restart, deployment or scaling activities.
Service Downtime affects availability of services but client application can use retries to recover from minor faults and maintain availability.
Load Balancing and Failover with redundant Zones/Regions so that when a request to one zone/region fails but can be handled by another healthy region or zone.
Partial Failures where one part of the system fails while the rest remains functional (partial failures).
Build System Resilience to allow the system to self-heal from minor disruptions.
Race Conditions or timing-related issues in concurrent systems can be resolved with retries.

Challenges with Retries

Retries help in recovering from transient or partial failures by resending requests, but they can worsen system overloads if not managed carefully. Here are some challenges associated with retries:

Retry Storms: A retry storm occurs when multiple clients or services simultaneously retry failed requests to an overloaded or recovering service. This flood of retries can exacerbate the problem and can lead to performance degradation or a self-inflicted Denial of Service (DoS) attack.
Idempotency and Data Consistency: Some operations are not idempotent and performing them multiple times can lead to inconsistent or incorrect results (e.g., processing a financial transaction multiple times).
Cascading Failures: Retrying can propagate failures upstream or to dependent services. For instance, when a service fails and clients retry excessively, which can overwhelm downstream services.
Latency Amplification: Retrying failed operations can increase end-to-end latency, as each retry adds a delay before successful resolution.
Amplified Resource Consumption: Retried operations consume additional CPU, memory, and bandwidth, potentially depleting resources at a faster rate. Even when services eventually succeed, the increased load from retries can harm the overall system.
Retry Loops or Infinite Retries: If a failed operation is retried continuously without ever succeeding, it can potentially lead to system crashes.
Threads and connections starvation: When a service invokes multiple operations and some fail, it may retry all operations, leading to increased overall request latency. If high timeouts are set, threads and connections remain occupied, blocking new traffic.
Unnecessary Retries on Non-Retryable Failures: Retrying certain types of failures, like authorization errors or malformed requests is unnecessary and wastes system resources.
Timeout Mismatch Between Services: If the timeout settings for retries between services are not aligned, a downstream service may still be processing a request while the upstream service retries or times out that can result in conflicting states.

Considerations for Retries

Here are some key considerations and best practices for implementing more effective and safer retry mechanisms in distributed systems, enhancing resilience while safeguarding system stability during periods of stress or failure:

Timeouts: Implement timeouts to prevent clients from waiting indefinitely for a response and reduce resource exhaustion (e.g., memory or threads) caused by prolonged waiting. The challenge lies in selecting the appropriate timeout value: if set too high, resources are wasted; if set too low, it can trigger excessive retries, which increases the risk of outages. It’s recommended to set timeouts that are tightly aligned with performance expectations, ideally less than 2-times your maximum response time to avoid thread starvation. Additionally, monitor for early warning signs by setting alarms when performance degrades (e.g., when P99 latency approaches 50% of the timeout value).
Timeout Budgeting: In complex distributed systems, timeout budgeting ensures that the total time taken by a request across multiple services doesn’t exceed an acceptable limit. Each downstream service gets a portion of the total timeout, so failure in one service doesn’t excessively delay the entire request chain.
Exponential Backoff: Implement exponential backoff to spread out retry attempts by gradually increasing the delay between retries, reducing the risk of overwhelming a failing component and allowing time for recovery. It’s important to cap the backoff duration and limit the total number of retries. Without these limits, the system might continue retrying unnecessarily even after the underlying issue has been resolved.
Jitter: Adding randomness (jitter) to the backoff process helps prevent synchronized retries that could lead to overload spikes. Jitter is useful for spreading out traffic spikes and periodic tasks to avoid large bursts of traffic at regular intervals for improving system stability.
Idempotency: Operations that are retried must be idempotent, meaning they can be safely repeated without causing unintended side effects (e.g., double payments or duplicated data).
Retry Limits: Retries should be capped at a certain limit to avoid endlessly retrying a failing operation. Retries should stop beyond a certain number of attempts and the failure should be escalated or reported.
Throttling and Rate Limiting: Implement throttling or rate limiting and control the number of requests a service handles within a given time period. Rate limiting can be dynamic, which is adjusted based on current load or error rates, and avoid system overloads during traffic spikes. In addition, low-priority requests can be shed during high load situations.
Error Categorization: Not all errors should trigger retries and use an allowlist for known retryable errors and only retry those. For example, 400 Bad Request (indicating a permanent client error) due to invalid input should not be retried, while server-side or network-related errors with a 500 Internal Server Error (a likely transient issue) can benefit from retrying.
Targeting Failing Components Only: In a partial failure, not all parts of the system are down and retries help isolate and recover from the failing components by retrying operations specifically targeting the failed resource. For example, if a service depends on multiple microservices for an operation and one of the service fails, the system should retry the failed request without repeating the entire operation.
Intelligent and Adaptive Retries: Design retry logic to take the system’s current state into account, such as checking service health or load conditions before retrying. For example, increase retry intervals if multiple components are detected as failing or retry quickly for timeout errors but back off more for connection errors.. This prevents retries when the system is already known to be overloaded.

Retrying at Different Levels: Retries can be implemented at various levels to handle partial failures such as application level, middleware/proxy (load-balancer or API gateway), transport level (network). For example, a distributed system using a load balancer can detect if a specific instance of a service is failing and reroute traffic to a healthy instance that triggers retries only for the requests that target the failing instance.
Retry Amplification: In multi-tiered architectures, if retries are implemented at each level of nested service calls, it can lead to increased latency and exponentially higher traffic. To mitigate this, implement retries only at critical points in the call chain, and ensure that each service has a clear retry policy with limits. Use short timeouts to prevent thread starvation when calls to downstream services take too long. If too many threads hang, new traffic will be blocked.
Retry Budget: Implementing a global limit on the number of retries across all operations helps prevent system overload. For example, using an algorithm like Leaky Bucket can regulate the number of retries within a specified time period. This ensures that retries are distributed evenly and don’t exceed system capacity, preventing resource exhaustion during high failure rates.
Retries with Circuit Breakers: The circuit breaker pattern can be combined with retries to avoid overwhelming a failing component. When a service starts failing, the circuit breaker opens, temporarily halting requests to that service until it is healthy again. Retries can be configured to happen only after the circuit breaker transitions to a half-open state, which allows a limited number of retries to test if the service has recovered.
Retries with Failover Mechanisms: Retries can be designed with failover strategies where the system switches to a backup service, region, or replica in case of partial failure. If a service in one region fails then the retries can redirect requests to a different region or zone for ensuring availability.
Latency Sensitivity: Services with strict latency requirements might not tolerate long backoff periods or extended retries so they should minimize number of retries and cap backoff times.
Sync Calls: For synchronous calls, retry once immediately to handle temporary network issues and avoid multiple retries that could lead to thread starvation. Avoid excessive sleeping of threads between retries, which can lead to thread starvation. Also, a Circuit Breaker can be used to prevent retrying if a high percentage of calls fail.
Async Calls: Use exponential backoff with jitter for asynchronous operations and use Circuit Breakers to stop retries when failure rates are high. Asynchronous APIs can queue requests for later retries, but should incorporate health checks to ensure that retry attempts don’t add excessive load to downstream services during recovery periods.
Retrying on Overload Responses: Recognize overload indicators (e.g., HTTP 503 responses) and avoid retries when the response indicates overload.
Fail-Fast: Detect issues early and fails quickly rather than continuing to process failing requests or operations to avoid wasting time on requests that are unlikely to succeed.
Graceful Degradation: Provide an alternative method of handling requests when a service fails. For example, if a primary service is down, a cached result or a simpler backup service can be used instead.
Downstream Bugs: Rather than implementing retry-based workarounds, prioritize having downstream service owners address and resolve the underlying issues.
Monitor and Analyze Retry Patterns: Implement monitoring for retry attempts and success rates, and analyze the data to gain insights into system behavior during failures. Use these insights to optimize retry strategies, such as adjusting backoff intervals and fine-tuning timeouts for improved system performance.
SLAs with Downstream Services: Establish clear service-level agreements (SLAs) with downstream services about call frequency, failure rates, and latency expectations.
Availability Over Consistency: Prioritize service availability over consistency where possible, especially during retries or failure handling. In such cases, retries might return stale data or cause inconsistency issues, so it’s crucial to align retry policies with system design.
Chaos Engineering: Chaos engineering involves intentionally injecting failures, such as server crashes or network disruptions, into a system to test its resilience under adverse conditions. By simulating real-world failures, teams can identify weaknesses and ensure that the retry policies are working as expected.
Bulkhead Pattern: The bulkhead pattern isolates different parts of a system to prevent a failure in one part from affecting the rest of the system. The bulkheads can be implemented by limiting the number of resources (threads, memory, connections) allocated to each service or subsystem so that if one service becomes overloaded or fails, it won’t exhaust resources that other services need.
System Design: It’s essential to design APIs to minimize unnecessary communication with the server. For instance, in an event-driven architecture, if an event is missing a required attribute, the application might need to make additional requests to retrieve that data, increasing system load. To avoid this, ensure that events are fully populated with all necessary information upfront.

Summary

Retries are an essential mechanism for building fault-tolerant distributed systems and to recover from transient failures such as network issues, service unavailability, and partial system outages. A well-implemented retry strategy improves system resilience by ensuring that temporary failures don’t lead to full-blown outages. Techniques such as exponential backoff with jitter, idempotency, token buckets to limit retries locally, and circuit breakers help manage retries effectively, preventing issues like retry storms, resource exhaustion, and latency amplification.

However, retries need careful management because without proper limits, retries can overwhelm services that are already struggling or exacerbate issues like cascading failures and thread starvation. Incorporating timeouts, retry limits, and adaptive retry mechanisms based on system health can prevent these negative side effects. By analyzing retry patterns and adopting error-specific handling strategies, distributed systems can strike a balance between availability and resource efficiency, and ensures robust performance even in the face of partial failures.

Comments (0)

July 22, 2024

Security Challenges in Microservice Architecture

Filed under: Security,Web Services — admin @ 3:58 pm

Abstract

Microservice architectures have gained wide adoption due to their ability to deliver scalability, agility, and resilience. However, the distributed nature of microservices also introduces new security challenges that must be addressed proactively. Security in distributed systems revolves around three fundamental principles: confidentiality, integrity, and availability (CIA). Ensuring the CIA triad is maintained is crucial for protecting sensitive data and ensuring system reliability for MicroServices.

Confidentiality ensures that sensitive information is accessible only to authorized users using encryption, access controls, and strong authentication mechanisms [NIST SP 800-53 Rev. 5].
Integrity guarantees that data remains accurate and unaltered during storage or transmission using cryptographic hash functions, digital signatures, and data validation processes [Software and Data Integrity Failures].
Availability ensures that systems and data are accessible to authorized users when needed. This involves implementing redundancy, failover mechanisms, and regular maintenance [ISO/IEC 27001:2022].

Below, we delve into these principles and the practices essential for building secure distributed systems. We then explore the potential security risks and failures associated with microservices and offer guidance on mitigating them.

Security Practices

The following key practices help establish a strong security posture:

Strong Identity Management: Implement robust identity and access management (IAM) systems to ensure that only authenticated and authorized users can access system resources. [AWS IAM Best Practices].
Fail Safe: Maintain confidentiality, integrity and availability when an error condition is detected.
Defense in Depth: Employ multiple layers of security controls to protect data and systems. This includes network segmentation, firewalls, intrusion detection systems (IDS), and secure coding practices [Microsoft’s Defense in Depth].
Least Privilege: A person or process is given only the minimum level of access rights (privileges) that is necessary for that person or process to complete an assigned operation.
Separation of Duties: This principle, also known as separation of privilege, requires multiple conditions to be met for task completion, ensuring no single entity has complete control, thereby enhancing security by distributing responsibilities.
Zero Trust Security: Not trust any entity by default, regardless of its location, and verification is required from everyone trying to access resources. [NIST Zero Trust Architecture].
Auditing and Monitoring: Implement comprehensive logging, monitoring, and auditing practices to detect and respond to security incidents [Center for Internet Security (CIS) Controls].
Protecting Data in Motion and at Rest: Use encryption to protect data during transmission (data in motion) and when stored (data at rest). [NIST’s Guide to Storage Encryption Technologies for End User Devices].

Security Methodologies and Frameworks

Following practices ensure that security is integrated throughout the development lifecycle and that potential threats are systematically addressed.

DevSecOps: Integrate security practices into the DevOps process to shift security left, addressing issues early in the software development lifecycle.
Security by Design (SbD): Incorporate security by design process to ensure robust and secure systems [OWASP Secure Product Design]. The key principles of security by design encompass Memory safe programming languages, Static and dynamic application security testing, Defense-in-Depth, Single sign-on, Secure Logging, Data classification, Secure random number generators, Limit the scope of credentials and access, Address Space Location Randomization(ASLR) and Kernel ASLR (KASLR), Encrypt data at rest optionally with customer managed keys, Encrypt data in transit, Data isolation with multi-tenancy support, Strong secrets management, Principle of Least Privilege and Separation of Duties, and Principle of Security-in-the-Open.
Threat Modeling Techniques: Threat modeling involves identifying potential threats to your system, understanding what can go wrong, and determining how to mitigate these risks. Following threat model techniques can be used for identifying and categorizing potential security threats such as
- STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) categorizes different types of threats.
- PASTA (Process for Attack Simulation and Threat Analysis) a risk-centric threat modeling methodology [SEI Threat Modeling].
- VAST (Visual, Agile, and Simple Threat) scales and integrates with Agile development processes.
CAPEC (Common Attack Pattern Enumeration and Classification): A comprehensive dictionary of known attack patterns, providing detailed information about common threats and mitigation techniques.
OCTAVE (Operationally Critical Threat, Asset, and Vulnerability Evaluation): A risk-based strategic assessment and planning technique for cybersecurity by the Carnegie Mellon University’s Software Engineering Institute.
OWASP Application Security Verification Standard (ASVS): A standard for designing, building, and testing secure applications.
OWASP Top Ten: Top Web Application Security Risks such as:
- A01: Broken Access Control.
- A02: Cryptographic Failures.
- A03: Injection including Cross-site Scripting.
- A04: Insecure Design.
- A05: Security Misconfiguration.
- A06: Vulnerable and Outdated Components.
- A07: Identification and Broken Authentication Failures.
- A08: Software and Data Integrity Failures including Insecure Deserialization.
- A09: Security Logging and Monitoring Failures including various failures impacting visibility and incident response.
- A10: Server-Side Request Forgery (SSRF).
CWE TOP 25 Most Dangerous Software Errors:
- CWE-787: Out-of-bounds Write
- CWE-79: Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’)
- CWE-89: Improper Neutralization of Special Elements used in an SQL Command (‘SQL Injection’)
- CWE-416: Use After Free
- CWE-78: Improper Neutralization of Special Elements used in an OS Command (‘OS Command Injection’)
- CWE-20: Improper Input Validation
- CWE-125: Out-of-bounds Read
- CWE-22: Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’)
- CWE-352: Cross-Site Request Forgery (CSRF)
- CWE-434: Unrestricted Upload of File with Dangerous Type
- CWE-862: Missing Authorization
- CWE-476: NULL Pointer Dereference
- CWE-287: Improper Authentication
- CWE-190: Integer Overflow or Wraparound
- CWE-502: Deserialization of Untrusted Data
- CWE-77: Improper Neutralization of Special Elements used in a Command (‘Command Injection’)
- CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer
- CWE-798: Use of Hard-coded Credentials
- CWE-918: Server-Side Request Forgery (SSRF)
- CWE-306: Missing Authentication for Critical Function
- CWE-362: Concurrent Execution using Shared Resource with Improper Synchronization (‘Race Condition’)
- CWE-269: Improper Privilege Management
- CWE-94: Improper Control of Generation of Code (‘Code Injection’)
- CWE-863: Incorrect Authorization
- CWE-276: Incorrect Default Permissions
SAMM (Software Assurance Maturity Model) A framework for analyzing software security practices.
CERT (Computer Emergency Response Team) Coding Standards: Standards for secure coding guidelines developed by the Software Engineering Institute (SEI) at Carnegie Mellon University.
SANS Secure Coding Practices: provides secure coding resources.
NIST (National Institute of Standards and Technology) Cybersecurity Framework: A comprehensive cybersecurity framework that includes guidelines for secure software development and risk management.
ISO/IEC 27034 (Application Security): An international standard provides guidance on information security for application services across their entire lifecycle, including design, development, testing, and maintenance.
BSIMM (Building Security In Maturity Model): A framework that helps organizations plan, execute, and measure their software security initiatives.
SAFECode (Software Assurance Forum for Excellence in Code): A non-profit organization for secure software development.
DREAD(Damage potential, Reproducibility, Exploitability, Affected users, and Discoverability: A model to rate and prioritize risks.

Security Incidents

Security incidents often result from inadequate security measures and oversight, highlighting the importance of rigorous security practices across various aspects of system management and software development.

Static Analysis of Source Code: The Heartbleed vulnerability in the OpenSSL cryptographic library allowed attackers to read sensitive memory contents and went undetected for over two years due to lack of static analysis and code reviews.
Scope of Privileges and Credentials: In Capital One Data Breach (2019), a former AWS employee exploited misconfigured web application firewall (WAF) credentials to gain unauthorized access to Capital One’s AWS environment, leading to the theft of over 100 million customer records that cost Capital One $270m.
Random numbers: Lack of secure random number generation, encryption algorithms and encryption configuration have caused numerous security breaches such as security deficiencies in IoT devices due to bad random number generation, factory-default passwords with security cameras, and attacking SSL with with RC4.
Data Classification: Improperly classifying and handling data based on its sensitivity and criticality have been source of security incidents like Equifax Data Breach (2017), which exposed personal information of over 143 million consumers and McDonald’s Data Leak (2017) that leaked personal information about 2.2 million users.
Secure Logging: Failure to implement secure logging led incidents like Apache Log4j Vulnerability (2021) that affected numerous applications and systems. Similarly, the lack of logging made it difficult to detect and investigate SolarWinds Supply Chain Attack (2020) that compromised numerous government agencies and companies.
Unauthorized Access to Production Data: Failing to implement appropriate controls and policies for governing production data use has led to significant data breaches. For example:
- Uber Data Breach (2016) when an attacker gained access to production environments and stole sensitive data of over 57 million customers and drivers.
- Facebook Data Leak ((2019) leaked personal information of 530 million people due to misconfigured Amazon S3 buckets.
- Capital One Data Breach (2019) exposed personal information of over 100 million customers due to misconfigured WAF credentials.
Filesystem Security: Failing to properly configure filesystem security led to critical issues such as Dirty Cow Vulnerability (2016) that caused privilege escalation vulnerability, and Shellshock Vulnerability (2014) that allowed remote code execution by exploiting vulnerabilities.
Memory protection with ASLR and KASLR: Failing to implement Address Space Layout Randomization (ASLR) and Kernel Address Space Layout Randomization (KASLR) led to the Linux Kernel Flaw (CVE-2024-0646), which exposed systems to privilege escalation attacks.
Code Signing: Failing to properly securely signing code with cryptographic digital signatures led to incidents like ASUS Live Update Hack (2019) that affected about 1 million users and Stuxnet Worm (2010) that targeted programmable logic controllers (PLCs).
Data Integrity: Failure to implement data integrity verification with cryptographic hashing, digital signatures, or checksums can lead to incidents like:
- Petya Ransomware Attack (2017): The Petya ransomware, specifically the “NotPetya” variant employed advanced propagation methods, including leveraging legitimate Windows tools and exploiting known vulnerabilities like EternalBlue and EternalRomance.
- Bangladesh Bank Cyber Heist (2016): Hackers compromised the bank’s systems and initiated fraudulent SWIFT transactions due to the lack of appropriate data integrity controls.
Data Privacy: implementing controls to protect data privacy using data minimization, anonymization, encryption, access controls, and compliance with GDPR/CCPA regulations can prevent incidents like:
- Facebook Cambridge Analytica Scandal (2018): Facebook’s lax privacy controls and data sharing practices led to the exposure of 87 million Facebook profiles to third-party companies like Cambridge Analytica.
- Marriott International Data Breach (2018): A data breach at Starwood, acquired by Marriott, exposed personal information of up to 500 million guests due to inadequate privacy and security measures.
Customer Workloads in Multi-tenant environments: Failing to implement proper security controls and isolation mechanisms when executing customer-provided workloads in a multi-tenant environment can lead to incidents like:
- Azure Functions Vulnerability: Researchers discovered a vulnerability in Azure Functions that allows privilege escalation bug to potentially permitting an attacker to “plant a backdoor which would have run in every Function invocation”.
- Docker Container Escape Vulnerability (2019): a vulnerability in runC was reported by its maintainers that affects Docker containers to gain root-level access on the host.
Certificate Revocation Validation: Verifying that the digital certificates used for authentication and encryption have not been revoked or compromised using a certificate revocation list (CRL) or using the Online Certificate Status Protocol (OCSP) can prevent incidents like:
- DigiNotar Certificate Authority Breach (2011): DigiNotar’s certificate authority was compromised, allowing attackers to issue fraudulent certificates for various domains.
- WoSign and StartCom Certificate Authority Incident (2016-2017): These certificate authorities were found to have improperly issued certificates, leading to their removal from trusted root stores.
Encryption and Key Rotation/Lengths: Failures in encryption and key management have led to significant security breaches:
- Marriott Data Breach (2018) that exposed data of up to 500 million guests due to improper handed encryption keys.
- Adobe Data Breach (2013) that impacted At Least 38 Million Users by using reversible encryption for password storage with weak algorithms like 3DES.
- Dropbox Data Breach (2012) impacted 68 Million user passwords due to improper encryption policies.
Secure Configuration: Failure to implement secure configurations and changes and change management can lead to incidents like AWS S3 Bucket Misconfiguration (2017) where sensitive data from various organizations was exposed due to misconfigured AWS S3 bucket permissions.
Secure communication protocols: Failure to implement secure communication protocols, such as TLS/SSL, to protect data in transit and mitigate man- in-the-middle attacks can lead to incidents like:
- POODLE Attack (2014) exploited a vulnerability in the way SSL 3.0 handled padding, allowing attackers to decrypt encrypted connections.
- FREAK Attack (2015) exploited a vulnerability in legacy export-grade encryption to allow attackers to perform man-in-the-middle attacks.
Secure Authentication: Failure to implement secure authentication mechanisms, such as multi-factor authentication (MFA) and strong password policies can lead to unauthorized access like:
- Dropbox Employee Credentials Theft (2016): Hackers leaked 68M user credentials when they stole a Dropbox employee’s credentials.
- Yahoo Data Breach (2013-2014): Multiple data breaches at Yahoo between 2013 and 2014 exposed billions of user accounts and passwords.
- SolarWinds Password Spraying Attack (2020): The SolarWinds supply chain attack also involved a password spraying attack against SolarWinds employees.
Secure Backup and Disaster Recovery: Failure to implement secure procedures for data backup and recovery, including encryption, access controls, and offsite storage, can lead to incidents such as:
- Code Spaces Data Loss (2014): The Code Spaces was forced to shut down after a catastrophic data loss incident due to a lack of secure backup and recovery procedures.
- Garmin Ransomware Attack (2020): Garmin was hit by a ransomware attack that disrupted its services and operations, highlighting the importance of secure data backup and recovery procedures.
Secure Caching: Implementing proper authentication, access controls, and encryption prevent data leaks or unauthorized access like:
- Cloudflare Data Leak (2017): A vulnerability in Cloudflare’s cache servers resulted in sensitive data leaking across different customer websites, exposing sensitive information.
- Memcached DDoS Attacks (2018): Misconfigured Memcached servers were exploited by attackers to launch massive distributed denial-of-service (DDoS) attacks.
Privilege Escalation (Least Privilege): Improper privilege management caused Edward Snowden Data Leaks (2013) which allowed Snowden to copy and exfiltrate sensitive data from classified systems. In Capital One Data Breach (2019) breach, an overly permissive IAM policy granted broader access than necessary, violating the principle of least privilege. In addition, a contingent authorization can be granted for temporary or limited access to resources or systems based on specific conditions or events.
SPF, DKIM, DMARC: implement the email authentication such as SPF (Sender Policy Framework), DKIM (DomainKeys Identified Mail) and DMARC (Domain-based Message Authentication, Reporting, and Conformance), and anti-spoofing mechanisms for all domains.
Multitenancy: Implement secure and isolated processing of service requests in a multi-tenant environment to prevent unauthorized access or data leakage between different tenants like:
- Microsoft Azure Azure Cosmos DB Vulnerability (2021): A flaw in Microsoft’s Azure Cosmos DB database product left more than 3,300 Azure customers open to complete unrestricted access by attackers.
- Salesforce Community Cloud Incident (2019): A misconfiguration in Salesforce’s Community Cloud allowed unauthorized users to access and modify data belonging to other tenants.
- 2018 Google data breach: A bug in Google+ exposed the private data of approximately 500,000 Google+ users to the public.
Identity Management in Mobile applications: Insecure authentication, authorization, and user management mechanisms can lead to incidents like:
- Starbucks App Vulnerability (2014): A vulnerability in Starbucks’ mobile app endangers user information by storing their usernames, email addresses and passwords in plain text.
- Venmo Mobile App Vulnerability (2016): The SMS-based feature in Venmo app allowed users to authorize payments by replying to a text message, which enabled attackers to steal money from the user’s account.
Secure Default Configuration: The systems and applications should be designed and configured to be secure by default to prevent incidents like:
- MongoDB Ransomware Attacks (2016-2017): 23k MongoDB databases with default configurations were targeted by ransomware attacks due to the default configuration exposing them to the internet.
- Elasticsearch Ransomware Attacks (2019): Misconfigured Elasticsearch clusters were targeted by ransomware attacks due to the default configuration allowing remote access.
- CouchDB Vulnerabilities (2018): Unsecured CouchDB instances were targeted by attackers due to the default configuration exposing them to the internet.
Server-side Template Injection (SSTI): A vulnerability that occurs when user-supplied input is improperly interpreted as part of a server-side template engine, leading to the potential execution of arbitrary code.
- SSTI in Apache Freemarker (2022): A SSTI vulnerability in the Apache Freemarker templating engine allowed remote code execution in various applications.
- SSTI in Jinja2: Illustrates how Server-Side Template Injection (SSTI) payloads for Jinja2 can enable remote code execution.
Reverse Tabnabbing: A security vulnerability that occurs when a website you trust opens a link in a new tab and an attacker manipulates the website contents with malicious contents.
- WordPress Plugin Vulnerabilities (2016): Multiple WordPress plugins were found to be vulnerable to reverse tabnabbing attacks.
- Outlook Web Access (OWA) Attack (2018): Attackers exploited a reverse tabnabbing vulnerability in Outlook Web Access.
Regions and Partitions Isolation: Isolating the security and controls for each region and partition helps prevent security vulnerabilitiessuch as:
- AWS US-East-1 Outage (2017): An operational issue in AWS’s US-East-1 region caused widespread service disruptions, affecting numerous customers and services hosted in that region.
- Google Cloud Engine Outage (2016): A software bug in Google’s central data center caused cascading failures and service disruptions across multiple regions.
External Dependencies: Regularly reviewing and assessing external (Software-Defined Object) dependencies for potential security vulnerabilities can mitigate supply chain attacks and security breaches like:
- Equifax Data Breach (2017): The breach was caused by the failure to patch a vulnerability in the Apache Struts open-source framework used by Equifax.
- Log4Shell Vulnerability (2021): A critical vulnerability in the Log4j library, used for logging in Java applications, allowed attackers to execute arbitrary code on affected systems.
Circular Dependencies: Avoiding circular dependencies in software design can prevent incidents like:
- Node.js Event-Stream Incident (2018): A malicious actor gained control of the popular “event-stream” package and injected malicious code.
- Left-Pad Incident (2016): Although not a direct security breach, the removal of the “left-pad” npm package broke thousands of projects due to its circular dependencies.
- Windows DLL Hijacking: Complex dependency management can lead to DLL hijacking that can execute malicious code.
Confused Deputy: The “Confused Deputy” problem, which occurs when a program inadvertently performs privileged operations on behalf of another entity, leading to security breaches:
- Google Docs Phishing Attack (2017): Attackers exploited a feature in Google Docs to trick users into granting permission to a malicious app disguised as Google Docs.
- Android Toast Overlay Attack (2017): A vulnerability in the Android operating system allowed malicious apps to display overlay Toast messages that could intercept user input or perform actions without user consent.
Validation Before Deserialization: Failure to validate the deserialized data can lead to security vulnerabilities, such as code execution or data tampering attacks like:
- Apache Commons Collections Deserialization Vulnerability (2015): A vulnerability in the Apache Commons Collections library allowed remote code execution by exploiting insecure deserialization.
- WebLogic Deserialization Vulnerability (2015): A deserialization vulnerability in Oracle’s WebLogic Server allowed remote code execution and complete server takeover.
Generic Error Messages: implement proper error handling and return generic error messages rather than exposing sensitive information or implementation details.
- Apache Struts Error Message Vulnerability (2017): A vulnerability in the Apache Struts framework allowed attackers to gain sensitive information through detailed error messages.
- Microsoft Exchange Server Error Disclosure Vulnerability (2021): A vulnerability in Microsoft Exchange Server allowed attackers to gain sensitive information through detailed error messages.
Monitoring: Failure to implement proper logging and monitoring mechanisms can make it difficult to detect and respond to security incidents.
- Uber’s Data Breach (2016): Uber failed to properly monitor and respond to security alerts, resulting in a delayed discovery of the data breach that exposed data of 57 million users and drivers.
- Target Data Breach (2013): Inadequate logging and monitoring allowed attackers to remain undetected in Target’s systems for several weeks, resulting in the theft of millions of credit card records.
Secure Web Design: Implement input validation, secure session management, cross-site scripting (XSS) prevention, cross-site request forgery (CSRF) protection, and industry best practices to prevent incidents like:
- SQL Injection Attacks on Sony (2011): Sony’s PlayStation Network was compromised due to SQL injection vulnerabilities.
- Heartland Payment Systems Breach (2008): Poor input validation allowed attackers to inject malicious SQL queries, resulting in the theft of credit card data from over 100 million payment card transactions.
- Panera Bread Data Leak (2018): Poor session management practices and the exposure of session tokens allowed attackers to access user data through exposed session cookies.
- Yahoo Email XSS Vulnerability (2016): An XSS flaw in Yahoo’s email service allowed attackers to steal cookies and gain unauthorized access to user accounts.
- Gmail CSRF Attack (2007): A vulnerability in Gmail allowed attackers to change users’ settings by tricking them into visiting malicious websites due to a lack of CSRF protection.
- CSP Bypass Vulnerability in Google (2018): A vulnerability in Google’s Content Security Policy implementation allowed attackers to bypass XSS protections and execute malicious scripts.
- Zoom’s Insecure Design Vulnerabilities (2020): Zoom’s rapid growth during the pandemic exposed several design flaws, including lack of end-to-end encryption and vulnerabilities that allowed unauthorized access to meetings.

Summary

Microservice architectures offer scalability, agility, and resilience but also present unique security challenges. Addressing these challenges requires adhering to the principles of confidentiality, integrity, and availability (CIA). Key security practices include strong identity management, defense in depth, principle of least privilege, zero trust security, comprehensive auditing and monitoring, and protecting data in motion and at rest. Security methodologies and frameworks like DevSecOps, Security by Design (SbD), and threat modeling techniques (e.g., STRIDE, PASTA) ensure robust security integration throughout the development lifecycle. Real-world incidents highlight the consequences of inadequate security measures. Implementing secure communication protocols, authentication mechanisms, and data backup procedures are crucial. Overall, a proactive and comprehensive approach to security, incorporating established practices and frameworks, is vital for safeguarding microservice architectures and distributed systems.

Comments (0)

April 19, 2024

Effective Load Shedding and Throttling Strategies for Managing Traffic Spikes and DDoS Attacks

Filed under: Design — Tags: architecture — admin @ 9:48 pm

Online services experiencing rapid growth often encounter abrupt surges in traffic and may become targets of Distributed Denial of Service (DDoS) attacks orchestrated by malicious actors or inadvertently due to self-induced bugs. Mitigating these challenges to ensure high availability requires meticulous architectural practices, including implementing caching mechanisms, leveraging Content Delivery Networks (CDNs), Web Application Firewalls (WAFs), deploying queuing systems, employing load balancing strategies, implementing robust monitoring and alerting systems, and incorporating autoscaling capabilities. However, in this context, we will focus specifically on techniques related to load shedding and throttling to manage various traffic shapes effectively.

1. Traffic Patterns and Shapes

Traffic patterns refer to the manner in which user requests or tasks interact with your online service throughout a given period. These requests or tasks can vary in characteristics, including the rate of requests (TPS), concurrency, and the patterns of request flow, such as bursts of traffic. These patterns must be analyzed for scaling your service effectively and providing high availability.

Here’s a breakdown of some common traffic shapes:

Normal Traffic: defines baseline level of traffic pattern that a service receives most of the time based on regular user activity.
Peak Traffic: defines recurring period of high traffic based on daily or weekly user activity patterns. Auto-scaling rules can be set up to automatically allocate pre-provisioned additional resources in response to anticipated peaks in traffic.
Off-Peak Traffic: refers to periods of low or minimal traffic, such as during late-night hours or weekends. Auto-scaling rules can be set to scale down or consolidating resources during periods of low demand help minimize operational costs while maintaining adequate performance levels.
Burst Traffic: defines sudden, short-lived spikes in traffic that might be caused by viral contents or promotional campaigns. Auto-scaling rules can be configured to allocate extra resources in reaction to burst traffic. However, scaling resources might not happen swiftly enough to match the duration of the burst traffic. Therefore, it’s typically recommended to maintain surplus capacity to effectively handle burst traffic situations.
Seasonal Traffic: defines traffic patterns based on specific seasons, holidays or events such as Black Friday or back-to-school periods. This requires strategies similar to peak traffic for allocating pre-provisioned additional resources.
Steady Growth: defines gradual and consistent increase in traffic over time based on organic growth or marketing campaigns. This requires proactive monitoring to ensure resources keep pace with demand.

Classifying Requests

Incoming requests or tasks can be identified and categorized based on various contextual factors, such as the identity of the requester, the specific operation being requested, or other relevant parameters. This classification enables the implementation of appropriate measures, such as throttling or load shedding policies, to manage the flow of requests effectively.

Additional Considerations:

Traffic Patterns Can Combine: Real-world traffic patterns are often a combination of these shapes, requiring flexible and adaptable scaling strategies.
Monitoring and Alerting: Continuously monitor traffic patterns to identify trends early and proactively adjust your scaling strategy. Set up alerts and notifications to inform about sudden traffic surges or potential DDoS attacks so you can take timely action.
Incident Response Plan: Develop a well-defined incident response plan that outlines the steps for communication protocols, mitigation strategies, engaging stakeholders, and recovery procedures.
Cost-Effectiveness: Balance scaling needs with cost optimization to avoid over-provisioning resources during low traffic periods.

2. Throttling and Rate Limiting

Throttling controls the rate of traffic flow or resource consumption within a system to prevent overload or degradation of service. Throttling enforces quota limits and protects system overload by limiting the amount of resources (CPU, memory, network bandwidth) a single user or client can consume within a specific time frame. Throttling ensures efficient resource utilization, allowing the service to handle more users in a predictable manner. This ensures better fairness and stability while preventing a noisy neighbor problem where unpredictable spikes or slowdowns caused by heavy users. Throttling can be implemented by API Rate Limiting on the number of API requests a client can make with a given time window; by limiting maximum bandwidth allowed for various network traffic; by limiting rate of read/write; or by limiting the number of concurrent connections for a server to prevent overload.

These throttling and rate limiting measures can be applied to both anonymous and authenticated requests as follows:

Anonymous Requests:
- Rate limiting: Implement rate limiting based on client IP addresses or other identifiers within a specific time window, preventing clients from overwhelming the system.
- Concurrency limits: Set limits on the maximum number of concurrent connections or requests that can be processed simultaneously.
- Server-side throttling: Apply throttling mechanisms at the server level, such as queue-based rate limiting or token bucket algorithms, to control the overall throughput of incoming requests.
Authenticated Requests:
- User-based rate limiting: Implement rate limiting based on user identities or API keys, ensuring that authenticated users cannot exceed specified request limits.
- Prioritized throttling: Apply different throttling rules or limits based on user roles, subscription tiers, or other criteria, allowing higher priority requests to be processed first during peak loads.
- Circuit breakers: Implement circuit breakers to temporarily disable or throttle load from specific services or components that are experiencing high latency or failures, preventing cascading failures.

2.1 Error Response and Headers

When a request exceeds the rate limit, the server typically returns a 429 HTTP status code indicating that the request has been throttled or rate-limited due to Too Many Requests. The server may also return HTTP headers such as Retry-After, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Reset, and X-RateLimit-Resource.

3. Load Shedding

Load shedding is used to prioritize and manage system resources during periods of high demand or overload. It may discard or defer non-critical tasks or requests to ensure the continued operation of essential functions. Load shedding helps maintain system stability and prevents cascading failures by reallocating resources to handle the most critical tasks first. Common causes of unexpected events that require shedding to prevent overloading system resources include:

Traffic Spikes: sudden and significant increases in the volume of incoming traffic due to various reasons, such as viral content, marketing campaigns, sudden popularity, or events.
DDoS (Distributed Denial of Service): deliberate attempts to disrupt the normal functioning of a targeted server, service, or network by overwhelming it with a flood of traffic. A DDoS attack can be orchestrated by an attacker who commands a vast botnet comprising thousands of compromised devices, including computers, IoT devices, or servers. Additionally, misconfigurations, software bugs, or unforeseen interactions among system components such as excessive retries without exponential delays that can also lead to accidental DDoS attacks.

Here is how excessive load for anonymous and authenticated requests can be shed:

Anonymous Requests: Drop requests during extreme load conditions or when server capacity is reached, drop a percentage of incoming requests to protect the system from overload. This can be done randomly or based on specific criteria such as request types, and headers. Alternatively, service can degrade non-critical features or functionalities temporarily to reduce the overall system load and prioritize essential services.
Authenticated Requests: Apply load shedding rules based on user roles, subscription tiers, or other criteria, prioritizing requests from high-value users or critical services.

3.1 Error Response

When a request exceeds the rate limit, the server typically returns a 503 HTTP status code indicating that the request has been throttled or rate-limited due to Too Many Requests. The server may also return HTTP headers such as Retry-After, other headers specifically employed for throttling are less prevalent in the context of load shedding. Unlike throttling errors, which fall under user-errors with 4XX error codes, load shedding is categorized as a server error with 5XX error codes. Consequently, load shedding requires more aggressive monitoring and alerting compared to throttling errors. Throttling errors, on the other hand, can be considered expected behavior as a means to address noisy neighbor problems and maintain high availability.

4. Additional Techniques for Throttling and Load Shedding

Throttling, rate-limiting and load shedding measures described above can be used to handle high traffic and to prevent resource exhaustion in distributed systems. Here are common techniques that can be used to implement these measures:

Admission Control: Set up thresholds for maximum concurrent requests or resource utilization.
Request Classification and Prioritization: Classify requests based on priority, user type, or criticality and then dropping low-priority requests when the thresholds for capacity are exceeded.
Backpressure and Queue Management: Use a fixed-length queues to buffer incoming requests during high loads and applying back-pressure by rejecting requests when queues reach their limits.
Fault Isolation and Containment: Partition the system into isolated components or cells to limit the blast radius of failures.
Redundancy and Failover: Build redundancy into your infrastructure and implement failover mechanisms to ensure that your services remain available even if parts of your infrastructure are overwhelmed.
Simplicity and Modularity: Design systems with simple, modular components that can be easily understood, maintained, and replaced. Avoid complex dependencies and tight coupling between components.
Circuit Breaker: Monitor the health and performance of downstream services or components and stop forwarding requests if a service is overloaded or unresponsive. Periodically attempt to re-establish the connection (close the circuit breaker).
Noisy Neighbors: Throttle and apply rate limits to customer traffic to prevent them from consuming resources excessively, thereby ensuring fair access for all customers.
Capacity Planning and Scaling: Continuously monitor resource utilization and plan for capacity growth. Implement auto-scaling mechanisms to dynamically adjust resources based on demand.
Communication Optimization: Employ communication optimization techniques like compression, quantization to minimize network traffic and bandwidth requirements.
Privacy and Security Considerations: Incorporate privacy-preserving mechanisms like secure aggregation, differential privacy, and secure multi-party computation to ensure data privacy and model confidentiality.
Graceful Degradation: Identify and disable non-critical features or functionality during high loads.
Monitoring and Alerting: Monitor system metrics (CPU, memory, request rates, latency, etc.) to detect overload scenarios and sending alerts when thresholds are exceeded.
Defense in Depth: Implement multi-layered defense strategy to detect, mitigate, and protect customer workloads from malicious attacks, like blacklisting IP addresses or employing Geo-location filters, at the Edge Layer using CDN, Load Balancer, or API Gateway. Constrain network bandwidth and requests per second (RPS) for individual tenants at the Network Layer. Applying resource quota, prioritization and admission control at the Application Layer based on account information, request attributes and system metrics. Isolating tenants’ data in separate partitions at the Storage Layer. Each dependent service may use similar multi-layered defense to throttle based on the usage patterns and resource constraints.
Adaptive Scaling: Automatically scale resources up or down based on demand and multi-tenant fairness policies. Employ predictive auto-scaling or load-based scaling.
Fault Tolerance and Checkpointing: Incorporate fault tolerance mechanisms, redundant computation and checkpointing to ensure reliable and resilient task processing in the face of potential resource failures. The fault tolerance mechanisms can be used to handle potential failures or stragglers (slow or unresponsive devices).
Web Application Firewall (WAF): Inspects incoming traffic and blocks malicious requests, including DDoS attacks, based on predefined rules and patterns.
Load Balancing: By distributing incoming traffic across multiple servers or instances, load balancing helps prevent any single server from becoming overwhelmed.
Content Delivery Network (CDN): Distribute your content across multiple geographic locations, reducing the strain on your origin servers.
Cost-Aware Scaling: Implements a cost-aware scaling strategy like like cost modeling and performance prediction that considers the cost of different resource types.
Security Mechanisms: Incorporate various security mechanisms such as secure communication channels, code integrity verification, and runtime security monitoring to protect against potential vulnerabilities and attacks in multi-tenant environments.
SOPs and Run books: Develop well-defined procedures that outlines the steps for detecting traffic spikes, pinpointing source of malicious attack, analyzing the logs and monitoring metrics, mitigation strategies, engaging stakeholders, and recovery procedures.

5. Pitfalls with Use of Throttling and Load Shedding

Here are some potential challenges to consider when implementing throttling and load shedding:

Autoscaling Failures: If your throttling policies are too aggressive, they may prevent your application from generating enough load to trigger autoscaling policies. This can lead to under-provisioning of resources and performance degradation. Conversely, if your throttling policies are too lenient, your application may scale up unnecessarily, leading to overspending.
Load Balancer Health Checks: Some load balancers use synthetic health checks to determine the health of backend instances. If your throttling policies block these health checks, it can cause instances to be marked as unhealthy and removed from the load balancer, even though they are still capable of serving traffic.
Unhealthy Instance Overload: When instances are marked as unhealthy by a load balancer, the remaining healthy instances may become overloaded if throttling policies are not properly configured. This can lead to a cascading failure scenario where more and more instances are marked as unhealthy due to the increased load.
Sticky Sessions: If your application uses sticky sessions (session affinity) for user sessions, and your throttling policies are not consistently applied across all instances, it can lead to inconsistent user experiences or session loss.
Cache Invalidation: Aggressive throttling or load shedding policies can lead to more frequent cache invalidations, which can impact performance and increase the load on your backend systems.
Upstream Service Overload: If your application relies on upstream services or APIs, and your throttling policies are not properly coordinated with those services, you may end up overloading those systems and causing cascading failures.
Insufficient capacity of the Failover: The failover servers must possess adequate capacity to manage the entire expected traffic load from the primary servers.
Monitoring Challenges: Throttling and load shedding policies can make it more difficult to monitor and troubleshoot performance issues, as the metrics you’re observing may be skewed by the throttling mechanisms.
Delays in Updating Throttling Policies: The policy adjustments for throttling and load shedding should be capable of updating at runtime swiftly to adapt to various traffic patterns..
Balancing Load based on number of connections: When directing incoming traffic based on the host with the least number of connections, there’s a risk of unhealthy hosts will have fewer connections due to their quick error responses. Consequently, the load balancer may direct more traffic towards these hosts, resulting in a majority of requests failing. To counteract this, it’s essential to employ robust Layer 7 health checks that comprehensively assess the application’s functionality and dependencies. Layer 4 health checks, which are susceptible to false positives, should be avoided. The unhealthy host should be removed from the available pool as quickly as possible. Additionally, ensuring that error responses from the service have similar latency to successful responses can serve as another effective mitigation strategy.

To mitigate these issues, it’s essential to carefully coordinate your throttling and load shedding policies with the autoscaling, load balancing, caching, and monitoring strategies. This may involve tuning thresholds, implementing consistent policies across all components, and closely monitoring the interaction between these systems. Additionally, it’s crucial to thoroughly test your configurations under various load conditions to identify and address potential issues before they impact your production environment.

6. Monitoring Metrics and Notifications

Here are some common metrics and alarms to consider for throttling and load shedding:

6.1 Network Traffic Metrics:

Incoming/Outgoing Bandwidth: Monitor the total network bandwidth to detect abnormal traffic patterns.
Packets per Second (PPS): Track the number of packets processed per second to identify potential DDoS attacks or traffic bursts.
Connections per Second: Monitor the rate of new connections being established to detect potential connection exhaustion or DDoS attacks.

6.2 Application Metrics:

Request Rate: Track the number of requests per second to identify traffic spikes or bursts.
Error Rate: Monitor the rate of errors or failed requests, which can indicate overloading or application issues.
Response Time: Measure the application’s response time to detect performance degradation or latency issues.
Queue Saturation: Monitor the lengths of queues or buffers to identify potential bottlenecks or resource exhaustion.

6.3 System Metrics:

CPU Utilization: Monitor CPU usage to detect resource contention or overloading.
Memory Utilization: Track memory usage to identify potential memory leaks or resource exhaustion.
Disk I/O: Monitor disk read/write operations to detect storage bottlenecks or performance issues.

6.4 Load Balancer Metrics:

Active Connections: Monitor the number of active connections to the load balancer to identify potential connection exhaustion.
Unhealthy Hosts: Track the number of unhealthy or unresponsive hosts to ensure load balancing efficiency.
Request/Response Errors: Monitor errors related to requests or responses to identify issues with backend services.

6.5 Alarms and Notifications:

Set up alarms for critical metrics, such as high CPU utilization, memory exhaustion or excessive error rates. For example, send alarms when error rate > 5% or response code of 5XX for consecutive 5 seconds or data points.
Set up alarms for high latency, e.g., P90 latency exceeds 50ms for more than 30 seconds.
Establish fine-grained alarms for detecting breaches in customer service level agreements (SLAs). Configure the alarm thresholds to trigger below the customer SLAs and ensure they can identify the affected customers.

6.6 Autoscaling Policies:

CPU Utilization-based Scaling: Scale out or in based on CPU usage thresholds to handle traffic bursts or DDoS attacks.
Memory Utilization-based Scaling: Scale resources based on memory usage to prevent memory exhaustion.
Network Traffic-based Scaling: Scale resources based on incoming or outgoing network traffic patterns to handle traffic spikes.
Request Rate-based Scaling: Scale resources based on the rate of incoming requests to maintain optimal performance.

6.7 Throttling / Load Shedding Overhead:

Monitor the processing time for throttling and load shedding, accounting for any communication overhead if the target host is unhealthy. Keep track of the time to ascertain priority, identify delays in processing, and ensure that high delays only impact denied requests.
Monitor the system’s utilization and identify when it reaches its capacity.
Monitor the observed target throughput at the time of the request.
Monitor the time taken to determine if load shedding is necessary and track when the percentage of denied traffic exceeds X% of incoming traffic.

It’s essential to tailor these metrics and alarms to your specific application, infrastructure, and traffic patterns.

7. Summary

Throttling and Load Shedding offer effective means for managing traffic for online services to maintain high availability. Traffic patterns may vary in characteristics like rate of requests, concurrency, and flow patterns. Understanding these shapes, including normal, peak, off-peak, burst, and seasonal traffic, is crucial for scaling and ensuring high availability. Requests can be classified based on contextual factors, enabling appropriate measures such as throttling or load shedding.

Throttling manages traffic flow or resource usage to avoid overload, whereas load shedding prioritizes tasks during periods of high demand. These methods can complement other strategies such as admission control, request classification, backpressure management, and redundancy. However, their implementation requires careful monitoring, notification, and thorough testing to ensure effectiveness.

8. References

Comments (0)

« Newer Posts — Older Posts »

Shahzad Bhatti Welcome to my ramblings and rants!

August 25, 2025

Beyond Vibe Coding: Using TLA+ and Executable Specifications with Claude

TL;DR: The Problem and Solution

The Vibe Coding Problem

Enter Executable Specifications

Why TLA+ with Claude?

Setting Up Your Claude and TLA+ Environment

Installing Claude Desktop

Configuring Your Workspace

Installing TLA+ Tools

VS Code Extension

Your First TLA+ Specification

Real-World Example: Task Management API

Step 1: Define the System State

Step 2: Define System Actions

Step 3: Safety and Liveness Properties

Step 4: Model Checking and Trace Generation

Step 5: Claude Implementation with TLA+ Context

Step 6: TLA+ Generated Tests

Advanced TLA+ Patterns with Claude

Modeling Concurrent Operations

Modeling Complex Business Rules

Lessons Learned

1. Start Small

2. Properties Drive Design

3. Model Checking Finds Edge Cases

4. Generated Tests Are Comprehensive

Documentation Generation

Code Review Guidelines

Comparing Specification Approaches

Tools and Resources

Essential TLA+ Resources

Common Mistakes

1. Avoid These Mistakes

2. Claude Integration Pitfalls

3. The Context Overload Trap

4. The “Fix My Test” Antipattern

5. The Blind Trust Mistake

Proven Patterns

1. Save effective prompts:

2. The “Explain First” Pattern

3. The “Progressive Enhancement” Pattern

4. The “Code Review” Pattern

What’s Next

Implications for Software Engineering

Conclusion: The End of Vibe Coding

August 16, 2025

The Complete Guide to gRPC Load Balancing in Kubernetes and Istio

TL;DR – The Test Results Matrix

Introduction: The gRPC Load Balancing Problem

Complete Test Matrix

Prerequisites

Test 1: Baseline – Local Testing

Test 2: Kubernetes Without Istio

Deploy the Service

Test Results

Why This Happens

Connection Analysis

Test 3: Kubernetes With Istio

Install Istio

Deploy With Istio

Critical Testing Gotcha

? Correct Testing Method

Test Results With Istio

How Istio Solves This

Test 4: Client-Side Load Balancing

Standard Client-Side LB

Results and Limitations

Alternative: Multiple Connections

Test 5: Advanced Connection Testing

Multiple Connection Strategy

Performance Comparison

Official Documentation References

gRPC Load Balancing

Istio’s Solution

Kubernetes Service Limitations

Complete Test Results Summary

???? Core Test Matrix Results

???? Detailed Test Scenario Analysis