Modern microservices architectures often require supporting both HTTP REST APIs and gRPC services simultaneously. While Google’s gRPC-Gateway provides HTTP and gRPC transcoding capabilities, the challenge of bidirectional header mapping between these protocols remains a common source of inconsistency, bugs, and maintenance overhead across services. This article explores the technical challenges of HTTP-gRPC header mapping, examines current approaches and their limitations, and presents a standardized middleware solution that addresses these issues.
Understanding gRPC AIP and HTTP/gRPC Transcoding
Google’s API Improvement Proposals (AIPs) define how to build consistent, intuitive APIs. For example, AIP-127 (HTTP and gRPC Transcoding) enables a single service implementation to serve both HTTP REST and gRPC traffic through protocol transcoding.
How gRPC-Gateway Transcoding Works
The gRPC-Gateway acts as a reverse proxy that translates HTTP requests into gRPC calls:
HTTP Client → gRPC-Gateway → gRPC Server
     ↓              ↓              ↓
REST Request   Proto Message   gRPC Service
The transcoding process performs four mappings:
URL Path to RPC Method: HTTP paths map to gRPC service methods
HTTP Body to Proto Message: JSON payloads become protobuf messages
Query Parameters to Fields: URL parameters populate message fields
HTTP Headers to gRPC Metadata: Headers become gRPC metadata key-value pairs
The Header Mapping Challenge
While gRPC-Gateway handles most transcoding automatically, header mapping requires explicit configuration. Without proper configuration, headers are lost, inconsistently mapped, or require custom code in each service.
Current Problems and Anti-Patterns
Problem 1: Fragmented Header Mapping Solutions
Most services implement header mapping ad-hoc:
// Service A approach
func (s *ServiceA) CreateUser(ctx context.Context, req *pb.CreateUserRequest) (*pb.User, error) {
	md, _ := metadata.FromIncomingContext(ctx)
	authHeader := md.Get("authorization") // returns []string of values
	userID := md.Get("x-user-id")
	// ... custom mapping logic using authHeader and userID
	return s.createUser(ctx, req, authHeader, userID) // hypothetical helper
}
// Service B approach
func (s *ServiceB) GetOrder(ctx context.Context, req *pb.GetOrderRequest) (*pb.Order, error) {
	// Different header names, different extraction logic
	md, _ := metadata.FromIncomingContext(ctx)
	auth := md.Get("auth")            // Different from Service A!
	requestID := md.Get("request_id") // Different format!
	return s.getOrder(ctx, req, auth, requestID) // hypothetical helper
}
This leads to:
Inconsistent header naming across services
Duplicated mapping logic in every service
Maintenance burden when headers change
Testing complexity due to custom implementations
Problem 2: Context Abuse and Memory Issues
I have often observed Go’s context being misused to store large amounts of data, putting the service at risk of being OOM-killed:
// ANTI-PATTERN: Storing large objects in context
type UserContext struct {
User *User // Large user object
Permissions []Permission // Array of permissions
Preferences *UserPrefs // User preferences
AuditLog []AuditEntry // Historical data
}
func StoreUserInContext(ctx context.Context, user *UserContext) context.Context {
return context.WithValue(ctx, "user", user) // BAD: Large object in context
}
Why This Causes Problems:
Memory Leaks: Contexts are passed through the entire request chain and may not be garbage collected promptly
Performance Degradation: Large context objects increase allocation pressure
Goroutine Overhead: Each concurrent request carries this memory burden
Service Instability: Under load, memory usage can spike and cause OOM kills
Proper Pattern:
// GOOD: Store only identifiers in context, under an unexported typed key
type ctxKey struct{}

var userIDKey ctxKey

func StoreUserIDInContext(ctx context.Context, userID string) context.Context {
	return context.WithValue(ctx, userIDKey, userID) // Small string only
}

// Fetch data when needed from database/cache
func GetUserFromContext(ctx context.Context) (*User, error) {
	userID, ok := ctx.Value(userIDKey).(string)
	if !ok {
		return nil, errors.New("no user ID in context")
	}
	return userService.GetUser(userID) // Fetch from datastore
}
Problem 3: Inconsistent Response Header Handling
Setting response headers requires different approaches across the stack:
// gRPC: Set headers via metadata
grpc.SendHeader(ctx, metadata.New(map[string]string{
"x-server-version": "v1.2.0",
}))
// HTTP: Set headers on ResponseWriter
w.Header().Set("X-Server-Version", "v1.2.0")
// gRPC-Gateway: the metadata set above reaches HTTP clients with a
// "Grpc-Metadata-" prefix (e.g. Grpc-Metadata-X-Server-Version)
// unless a custom outgoing header matcher is configured
grpc.SetHeader(ctx, metadata.New(map[string]string{
	"x-server-version": "v1.2.0",
}))
This complexity leads to missing response headers and inconsistent client experiences.
Solution: Standardized Header Mapping Middleware
The solution is a dedicated middleware that handles bidirectional header mapping declaratively, allowing services to focus on business logic while ensuring consistent header handling across the entire API surface.
In large microservices architectures, inconsistent header handling creates operational overhead:
Debugging becomes difficult when services use different header names
Client libraries must handle different header formats per service
Security policies cannot be uniformly enforced
Observability suffers from inconsistent request correlation
Standardized header mapping addresses these issues by ensuring consistency across the entire service mesh.
Developer Productivity
Developers spend significant time on infrastructure concerns rather than business logic. This middleware eliminates:
Boilerplate code for header extraction and response setting
Testing complexity around header handling edge cases
Documentation overhead for service-specific header requirements
Bug investigation related to missing or malformed headers
Operational Excellence
Standard header mapping enables:
Automated monitoring with consistent request correlation
Security scanning with predictable header formats
Performance analysis across service boundaries
Compliance auditing with standardized access logging
Conclusion
HTTP and gRPC transcoding is a powerful pattern for modern APIs, but header mapping complexity has been a persistent challenge. The gRPC Header Mapper middleware presented in this article provides a solution that enables true bidirectional header mapping between HTTP and gRPC protocols.
Eliminate inconsistencies across services with bidirectional header mapping
Reduce maintenance burden through centralized configuration
Improve reliability by avoiding context misuse and memory leaks
Enhance developer productivity by removing boilerplate code
Support complex transformations with built-in and custom transformation functions
The middleware’s bidirectional mapping capability means that headers flow seamlessly in both directions – HTTP requests to gRPC metadata for service processing, and gRPC metadata back to HTTP response headers for client consumption. This eliminates the common problem where request headers are available to services but response headers are lost or inconsistently handled.
Problem: AI-assisted coding fails when modifying existing systems because we give AI vague specifications.
Solution: Use TLA+ formal specifications as precise contracts that Claude can implement reliably.
Result: Transform Claude from a code generator into a reliable engineering partner that reasons about complex systems.
After months of using Claude for development, I discovered most AI-assisted coding fails not because the AI isn’t smart enough, but because we’re asking it to work from vague specifications. This post shows you how to move beyond “vibe coding” using executable specifications that turn Claude into a reliable engineering partner.
Here’s what changes when you use TLA+ with Claude:
Before (Vibe Coding):
“Create a task management API”
Claude guesses at requirements
Inconsistent behavior across edge cases
Bugs in corner cases
After (TLA+ Specifications):
Precise mathematical specification
Claude implements exactly what you specified
All edge cases defined upfront
Properties verified before deployment
The Vibe Coding Problem
AI assistants like Claude are primarily trained on greenfield development patterns. They excel at:
Writing new functions from scratch
Implementing well-known algorithms
Creating boilerplate code
But they struggle with:
Understanding implicit behavioral contracts in existing code
Maintaining invariants across system modifications
The solution isn’t better prompts – it’s better specifications.
Enter Executable Specifications
An executable specification is a formal description of system behavior that can be:
Verified – Checked for logical consistency
Validated – Tested against real-world scenarios
Executed – Run to generate test cases or even implementations
I’ve tried many approaches to precise specifications over the years:
UML and Model-Driven Development (2000s–2010s): I used tools like Rational Rose and Visual Paradigm in the early 2000s that promised complete code generation from UML models. The reality was different:
Visual complexity: UML diagrams became unwieldy for anything non-trivial
Tool lock-in: Proprietary formats and expensive tooling
Impedance mismatch: The gap between UML models and real code was huge
Maintenance nightmare: Keeping models and code synchronized was nearly impossible
TLA+, by contrast, offers:
Model checking: Explores all possible execution paths
Tool independence: Plain-text specifications, open-source tools
Behavioral focus: Designed specifically for concurrent and distributed systems
Why TLA+ with Claude?
The magic happens when you combine TLA+’s precision with Claude’s implementation capabilities:
TLA+ eliminates ambiguity – There’s only one way to interpret a formal specification
Claude can read TLA+ – It understands the formal syntax and can translate it to code
Verification catches design flaws – TLA+ model checking finds edge cases you’d miss
Generated traces become tests – TLA+ execution paths become your test suite
Setting Up Your Claude and TLA+ Environment
Installing Claude Desktop
First, let’s get Claude running on your machine:
# Install via Homebrew (macOS)
brew install --cask claude
# Or download directly from Anthropic
# https://claude.ai/download
Set up project-specific contexts in ~/.claude/
Create TLA+ syntax rules for better code generation
Configure memory settings for specification patterns
Configuring Your Workspace
Once installed, I recommend creating a dedicated workspace structure. Here’s what works for me:
# Create a Claude workspace directory
mkdir -p ~/claude-workspace/{projects,templates,context}
# Add a context file for your coding standards
cat > ~/claude-workspace/context/coding-standards.md << 'EOF'
# My Coding Standards
- Use descriptive variable names
- Functions should do one thing well
- Write tests for all new features
- Handle errors explicitly
- Document complex logic
EOF
Installing TLA+ Tools
Choose based on your workflow:
GUI users: TLA+ Toolbox for visual model checking
CLI users: tla2tools.jar for CI integration
Both: VS Code extension for syntax highlighting
# Download TLA+ Tools from https://github.com/tlaplus/tlaplus/releases
# Or use Homebrew on macOS
brew install --cask tla-plus-toolbox
# For command-line usage (recommended for CI)
wget https://github.com/tlaplus/tlaplus/releases/download/v1.8.0/tla2tools.jar
VS Code Extension
Install the TLA+ extension for syntax highlighting and basic validation:
code --install-extension alygin.vscode-tlaplus
Your First TLA+ Specification
Let’s start with a simple example to understand the syntax:
--------------------------- MODULE TaskManagement ---------------------------
EXTENDS Integers, Sequences, FiniteSets, TLC
CONSTANTS
Users, \* Set of users
MaxTasks, \* Maximum number of tasks
MaxTime, \* Maximum time value for simulation
Titles, \* Set of possible task titles
Descriptions \* Set of possible task descriptions
VARIABLES
tasks, \* Function from task ID to task record
userTasks, \* Function from user ID to set of task IDs
nextTaskId, \* Counter for generating unique task IDs
currentUser, \* Currently authenticated user
clock, \* Global clock for timestamps
sessions \* Active user sessions
\* Task states enumeration with valid transitions
TaskStates == {"pending", "in_progress", "completed", "cancelled", "blocked"}
\* Priority levels
Priorities == {"low", "medium", "high", "critical"}
\* Valid state transitions
ValidTransitions == {
<<"pending", "in_progress">>,
<<"pending", "cancelled">>,
<<"pending", "blocked">>,
<<"in_progress", "completed">>,
<<"in_progress", "cancelled">>,
<<"in_progress", "blocked">>,
<<"in_progress", "pending">>, \* Allow reverting to pending
<<"blocked", "pending">>,
<<"blocked", "in_progress">>,
<<"blocked", "cancelled">>
}
TaskRecord == [
id: Nat,
title: STRING,
description: STRING,
status: TaskStates,
priority: Priorities,
assignee: Users,
createdAt: Nat,
dueDate: Nat \cup {NULL}
]
\* Type invariants
TypeInvariant ==
/\ tasks \in [Nat -> TaskRecord]
/\ userTasks \in [Users -> SUBSET Nat]
/\ nextTaskId \in Nat
/\ currentUser \in Users \cup {NULL}
Now comes the magic. Instead of vague natural language, we give Claude the precise TLA+ specification:
My prompt to Claude:
Using this TLA+ specification for a task management system, implement a Go REST API
that satisfies all the specified behaviors and properties.
[Include the full TLA+ specification here]
Requirements:
1. Implement all actions defined in the TLA+ spec
2. Ensure all safety properties hold at runtime
3. Use Go's type system to enforce the data structures
4. Include comprehensive tests that cover the execution traces
5. Add proper error handling for violation of preconditions
Focus on behavioral correctness over performance optimization.
// Package domain contains the core business entities matching the TLA+ specification
package domain
import (
"fmt"
"time"
)
// TaskID represents a unique task identifier (maps to TLA+ task ID)
type TaskID int
// UserID represents a user identifier (maps to TLA+ Users set)
type UserID string
// TaskStatus represents the state of a task (maps to TLA+ TaskStates)
type TaskStatus string
const (
StatusPending TaskStatus = "pending"
StatusInProgress TaskStatus = "in_progress"
StatusCompleted TaskStatus = "completed"
StatusCancelled TaskStatus = "cancelled"
StatusBlocked TaskStatus = "blocked"
)
// Priority represents task priority levels (maps to TLA+ Priorities)
type Priority string
const (
PriorityLow Priority = "low"
PriorityMedium Priority = "medium"
PriorityHigh Priority = "high"
PriorityCritical Priority = "critical"
)
// Tag represents task categories (maps to TLA+ tags subset)
type Tag string
const (
TagBug Tag = "bug"
TagFeature Tag = "feature"
TagEnhancement Tag = "enhancement"
TagDocumentation Tag = "documentation"
)
// Task represents a task entity (maps to TLA+ task record)
type Task struct {
ID TaskID `json:"id"`
Title string `json:"title"`
Description string `json:"description"`
Status TaskStatus `json:"status"`
Priority Priority `json:"priority"`
Assignee UserID `json:"assignee"`
CreatedBy UserID `json:"created_by"`
CreatedAt time.Time `json:"created_at"`
UpdatedAt time.Time `json:"updated_at"`
DueDate *time.Time `json:"due_date,omitempty"`
Tags []Tag `json:"tags"`
Dependencies map[TaskID]bool `json:"dependencies"`
}
// ValidTransition represents a valid state transition (maps to TLA+ ValidTransitions)
type ValidTransition struct {
From TaskStatus
To TaskStatus
}
// ValidTransitions defines all allowed state transitions
var ValidTransitions = map[ValidTransition]bool{
{StatusPending, StatusInProgress}: true,
{StatusPending, StatusCancelled}: true,
{StatusPending, StatusBlocked}: true,
{StatusInProgress, StatusCompleted}: true,
{StatusInProgress, StatusCancelled}: true,
{StatusInProgress, StatusBlocked}: true,
{StatusInProgress, StatusPending}: true, // Allow reverting
{StatusBlocked, StatusPending}: true,
{StatusBlocked, StatusInProgress}: true,
{StatusBlocked, StatusCancelled}: true,
}
// IsValidTransition checks if a state transition is valid (maps to TLA+ IsValidTransition)
func IsValidTransition(from, to TaskStatus) bool {
return ValidTransitions[ValidTransition{From: from, To: to}]
}
// CanDelete checks if a task can be deleted (only completed or cancelled)
func (t *Task) CanDelete() bool {
return t.Status == StatusCompleted || t.Status == StatusCancelled
}
// IsBlocked checks if task should be blocked based on dependencies
func (t *Task) IsBlocked(allTasks map[TaskID]*Task) bool {
if len(t.Dependencies) == 0 {
return false
}
for depID := range t.Dependencies {
if dep, exists := allTasks[depID]; exists {
if dep.Status != StatusCompleted {
return true
}
}
}
return false
}
// ShouldUnblock checks if a blocked task can be unblocked
func (t *Task) ShouldUnblock(allTasks map[TaskID]*Task) bool {
if t.Status != StatusBlocked {
return false
}
for depID := range t.Dependencies {
if dep, exists := allTasks[depID]; exists {
if dep.Status != StatusCompleted {
return false
}
}
}
return true
}
// Validate performs domain validation on the task
func (t *Task) Validate() error {
if t.Title == "" {
return fmt.Errorf("task title cannot be empty")
}
if t.Description == "" {
return fmt.Errorf("task description cannot be empty")
}
if !isValidStatus(t.Status) {
return fmt.Errorf("invalid task status: %s", t.Status)
}
if !isValidPriority(t.Priority) {
return fmt.Errorf("invalid task priority: %s", t.Priority)
}
if t.Assignee == "" {
return fmt.Errorf("task must have an assignee")
}
if t.CreatedBy == "" {
return fmt.Errorf("task must have a creator")
}
if t.CreatedAt.After(t.UpdatedAt) {
return fmt.Errorf("created time cannot be after updated time")
}
for _, tag := range t.Tags {
if !isValidTag(tag) {
return fmt.Errorf("invalid tag: %s", tag)
}
}
return nil
}
func isValidStatus(status TaskStatus) bool {
switch status {
case StatusPending, StatusInProgress, StatusCompleted, StatusCancelled, StatusBlocked:
return true
default:
return false
}
}
func isValidPriority(priority Priority) bool {
switch priority {
case PriorityLow, PriorityMedium, PriorityHigh, PriorityCritical:
return true
default:
return false
}
}
func isValidTag(tag Tag) bool {
switch tag {
case TagBug, TagFeature, TagEnhancement, TagDocumentation:
return true
default:
return false
}
}
// Package usecase implements the TLA+ actions as use cases
package usecase
import (
"crypto/rand"
"encoding/hex"
"fmt"
"time"
"github.com/bhatti/sample-task-management/internal/domain"
"github.com/bhatti/sample-task-management/internal/repository"
)
// TaskUseCase implements task-related TLA+ actions
type TaskUseCase struct {
uow repository.UnitOfWork
invariantChecker InvariantChecker
}
// InvariantChecker interface for runtime invariant validation
type InvariantChecker interface {
CheckAllInvariants(state *domain.SystemState) error
CheckTaskInvariants(task *domain.Task, state *domain.SystemState) error
CheckTransitionInvariant(from, to domain.TaskStatus) error
}
// NewTaskUseCase creates a new task use case
func NewTaskUseCase(uow repository.UnitOfWork, checker InvariantChecker) *TaskUseCase {
return &TaskUseCase{
uow: uow,
invariantChecker: checker,
}
}
// Authenticate implements TLA+ Authenticate action
func (uc *TaskUseCase) Authenticate(userID domain.UserID) (*domain.Session, error) {
// Preconditions from TLA+:
// - user \in Users
// - ~sessions[user]
user, err := uc.uow.Users().GetUser(userID)
if err != nil {
return nil, fmt.Errorf("user not found: %w", err)
}
// Check if user already has an active session
existingSession, _ := uc.uow.Sessions().GetSessionByUser(userID)
if existingSession != nil && existingSession.IsValid() {
return nil, fmt.Errorf("user %s already has an active session", userID)
}
// Create new session
token := generateToken()
session := &domain.Session{
UserID: user.ID,
Token: token,
Active: true,
CreatedAt: time.Now(),
ExpiresAt: time.Now().Add(24 * time.Hour),
}
// Update state
if err := uc.uow.Sessions().CreateSession(session); err != nil {
return nil, fmt.Errorf("failed to create session: %w", err)
}
if err := uc.uow.SystemState().SetCurrentUser(&userID); err != nil {
return nil, fmt.Errorf("failed to set current user: %w", err)
}
// Check invariants
state, _ := uc.uow.SystemState().GetSystemState()
if err := uc.invariantChecker.CheckAllInvariants(state); err != nil {
uc.uow.Rollback()
return nil, fmt.Errorf("invariant violation: %w", err)
}
return session, nil
}
// CreateTask implements TLA+ CreateTask action
func (uc *TaskUseCase) CreateTask(
title, description string,
priority domain.Priority,
assignee domain.UserID,
dueDate *time.Time,
tags []domain.Tag,
dependencies []domain.TaskID,
) (*domain.Task, error) {
// Preconditions from TLA+:
// - currentUser # NULL
// - currentUser \in Users
// - nextTaskId <= MaxTasks
// - deps \subseteq DOMAIN tasks
// - \A dep \in deps : tasks[dep].status # "cancelled"
currentUser, err := uc.uow.SystemState().GetCurrentUser()
if err != nil || currentUser == nil {
return nil, fmt.Errorf("authentication required")
}
// Check max tasks limit
nextID, err := uc.uow.SystemState().GetNextTaskID()
if err != nil {
return nil, fmt.Errorf("failed to get next task ID: %w", err)
}
if nextID > domain.MaxTasks {
return nil, fmt.Errorf("maximum number of tasks (%d) reached", domain.MaxTasks)
}
// Validate dependencies
allTasks, err := uc.uow.Tasks().GetAllTasks()
if err != nil {
return nil, fmt.Errorf("failed to get tasks: %w", err)
}
depMap := make(map[domain.TaskID]bool)
for _, depID := range dependencies {
depTask, exists := allTasks[depID]
if !exists {
return nil, fmt.Errorf("dependency task %d does not exist", depID)
}
if depTask.Status == domain.StatusCancelled {
return nil, fmt.Errorf("cannot depend on cancelled task %d", depID)
}
depMap[depID] = true
}
// Check for cyclic dependencies
if err := uc.checkCyclicDependencies(nextID, depMap, allTasks); err != nil {
return nil, err
}
// Determine initial status based on dependencies
status := domain.StatusPending
if len(dependencies) > 0 {
// Check if all dependencies are completed
allCompleted := true
for depID := range depMap {
if allTasks[depID].Status != domain.StatusCompleted {
allCompleted = false
break
}
}
if !allCompleted {
status = domain.StatusBlocked
}
}
// Create task
task := &domain.Task{
ID: nextID,
Title: title,
Description: description,
Status: status,
Priority: priority,
Assignee: assignee,
CreatedBy: *currentUser,
CreatedAt: time.Now(),
UpdatedAt: time.Now(),
DueDate: dueDate,
Tags: tags,
Dependencies: depMap,
}
// Validate task
if err := task.Validate(); err != nil {
return nil, fmt.Errorf("task validation failed: %w", err)
}
// Save task
if err := uc.uow.Tasks().CreateTask(task); err != nil {
return nil, fmt.Errorf("failed to create task: %w", err)
}
// Increment next task ID
if _, err := uc.uow.SystemState().IncrementNextTaskID(); err != nil {
return nil, fmt.Errorf("failed to increment task ID: %w", err)
}
// Check invariants
state, _ := uc.uow.SystemState().GetSystemState()
if err := uc.invariantChecker.CheckAllInvariants(state); err != nil {
uc.uow.Rollback()
return nil, fmt.Errorf("invariant violation after task creation: %w", err)
}
return task, nil
}
// UpdateTaskStatus implements TLA+ UpdateTaskStatus action
func (uc *TaskUseCase) UpdateTaskStatus(taskID domain.TaskID, newStatus domain.TaskStatus) error {
// Preconditions from TLA+:
// - currentUser # NULL
// - TaskExists(taskId)
// - taskId \in GetUserTasks(currentUser)
// - IsValidTransition(tasks[taskId].status, newStatus)
// - newStatus = "in_progress" => all dependencies completed
currentUser, err := uc.uow.SystemState().GetCurrentUser()
if err != nil || currentUser == nil {
return fmt.Errorf("authentication required")
}
task, err := uc.uow.Tasks().GetTask(taskID)
if err != nil {
return fmt.Errorf("task not found: %w", err)
}
// Check user owns the task
userTasks, err := uc.uow.SystemState().GetUserTasks(*currentUser)
if err != nil {
return fmt.Errorf("failed to get user tasks: %w", err)
}
hasTask := false
for _, id := range userTasks {
if id == taskID {
hasTask = true
break
}
}
if !hasTask {
return fmt.Errorf("user does not have access to task %d", taskID)
}
// Check valid transition
if !domain.IsValidTransition(task.Status, newStatus) {
return fmt.Errorf("invalid transition from %s to %s", task.Status, newStatus)
}
// Check dependencies if moving to in_progress
if newStatus == domain.StatusInProgress {
allTasks, _ := uc.uow.Tasks().GetAllTasks()
for depID := range task.Dependencies {
if depTask, exists := allTasks[depID]; exists {
if depTask.Status != domain.StatusCompleted {
return fmt.Errorf("cannot start task: dependency %d is not completed", depID)
}
}
}
}
// Update status
task.Status = newStatus
task.UpdatedAt = time.Now()
if err := uc.uow.Tasks().UpdateTask(task); err != nil {
return fmt.Errorf("failed to update task: %w", err)
}
// Check invariants
state, _ := uc.uow.SystemState().GetSystemState()
if err := uc.invariantChecker.CheckAllInvariants(state); err != nil {
uc.uow.Rollback()
return fmt.Errorf("invariant violation: %w", err)
}
return nil
}
...
Step 6: TLA+ Generated Tests
The real power comes when we use TLA+ execution traces to generate comprehensive tests:
My prompt to Claude:
Generate Go tests that verify the implementation satisfies the TLA+ specification.
Create test cases that:
1. Test all TLA+ actions with valid preconditions
2. Test safety property violations
3. Test edge cases from the TLA+ model boundary conditions
4. Use property-based testing where appropriate
Include tests that would catch the execution traces TLA+ model checker explores.
Graduate To: Multi-service interactions, complex business logic
2. Properties Drive Design
Writing TLA+ properties often reveals design flaws before implementation:
\* This property might fail, revealing a design issue
ConsistencyProperty ==
\A user \in Users:
\A taskId \in userTasks[user]:
/\ taskId \in DOMAIN tasks
/\ tasks[taskId].assignee = user
/\ tasks[taskId].status # "deleted" \* Soft delete consideration
3. Model Checking Finds Edge Cases
TLA+ model checking explores execution paths you’d never think to test:
# TLA+ finds this counterexample:
# Step 1: User1 creates Task1
# Step 2: User1 deletes Task1
# Step 3: User2 creates Task2 (gets same ID due to reuse)
# Step 4: User1 tries to update Task1 -> Security violation!
This led to using UUIDs instead of incrementing integers for task IDs.
4. Generated Tests Are Comprehensive
TLA+ execution traces become your regression test suite. When Claude implements based on TLA+ specs, you get:
Complete coverage – All specification paths tested
Edge case detection – Boundary conditions from model checking
Behavioral contracts – Tests verify actual system properties
Documentation Generation
Prompt to Claude:
Generate API documentation from this TLA+ specification that includes:
1. Endpoint descriptions derived from TLA+ actions
2. Request/response schemas from TLA+ data structures
3. Error conditions from TLA+ preconditions
4. Behavioral guarantees from TLA+ properties
Code Review Guidelines
With TLA+ specifications, code reviews become more focused:
❌ Asking Claude to “fix the TLA+ to match the code”
The spec is the truth – fix the code to match the spec
✅ Asking Claude to “implement this TLA+ specification correctly”
❌ Specification scope creep: Starting with the entire system architecture
✅ Incremental approach: Begin with one core workflow, expand gradually
2. Claude Integration Pitfalls
❌ “Fix the spec to match my code”: Treating specifications as documentation
✅ “Fix the code to match the spec”: Specifications are the source of truth
3. The Context Overload Trap
Problem: Dumping too much information at once
Solution: Break complex features into smaller, focused requests
4. The “Fix My Test” Antipattern
Problem: When tests fail, asking Claude to modify the test instead of the code
Solution: Always fix the implementation, not the test (unless the test is genuinely wrong)
5. The Blind Trust Mistake
Problem: Accepting generated code without understanding it
Solution: Always review and understand the code before committing
Proven Patterns
1. Save effective prompts:
# ~/.claude/tla-prompts/implementation.md
Implement [language] code that satisfies this TLA+ specification:
[SPEC]
Requirements:
- All TLA+ actions become functions/methods
- All preconditions become runtime checks
- All data structures match TLA+ types
- Include comprehensive tests covering specification traces
2. The “Explain First” Pattern
Before asking Claude to implement something complex, I ask for an explanation:
Explain how you would implement real-time task updates using WebSockets.
What are the trade-offs between Socket.io and native WebSockets?
What state management challenges should I consider?
3. The “Progressive Enhancement” Pattern
Start simple, then add complexity:
1. First: "Create a basic task model with CRUD operations"
2. Then: "Add validation and error handling"
3. Then: "Add authentication and authorization"
4. Finally: "Add real-time updates and notifications"
4. The “Code Review” Pattern
After implementation, I ask Claude to review its own code:
Review the task API implementation for:
- Security vulnerabilities
- Performance issues
- Code style consistency
- Missing error cases
Be critical and suggest improvements.
What’s Next
As I’ve developed this TLA+/Claude workflow, I’ve realized we’re approaching something profound: specifications as the primary artifact. Instead of writing code and hoping it’s correct, we’re defining correct behavior formally and letting AI generate the implementation. This inverts the traditional relationship between specification and code.
Implications for Software Engineering
Design-first development becomes natural
Bug prevention replaces bug fixing
Refactoring becomes re-implementation from stable specs
Documentation is always up-to-date (it’s the spec)
I’m currently experimenting with:
TLA+ to test case generation – Automated comprehensive testing
Multi-language implementations – Same spec, different languages
Specification composition – Building larger systems from verified components
Quint specifications – A modern executable specification language with simpler syntax than TLA+
Conclusion: The End of Vibe Coding
After using TLA+ with Claude, I can’t go back to vibe coding. The precision, reliability, and confidence that comes from executable specifications has transformed how I build software. The complete working example—TLA+ specs, Go implementation, comprehensive tests, and CI/CD pipeline—is available at github.com/bhatti/sample-task-management.
Yes, there’s a learning curve. Yes, writing TLA+ specifications takes time upfront. But the payoff—in terms of correctness, maintainability, and development speed—is extraordinary. Claude becomes not just a code generator, but a reliable engineering partner that can reason about complex systems precisely because we’ve given it precise specifications to work from. We’re moving from “code and hope” to “specify and know”—and that changes everything.
When you deploy a gRPC service in Kubernetes with multiple replicas, you expect load balancing. You won’t get it. This guide tests every possible configuration to prove why, and shows exactly how to fix it. According to the official gRPC documentation:
“gRPC uses HTTP/2, which multiplexes multiple calls on a single TCP connection. This means that once the connection is established, all gRPC calls will go to the same backend.”
git clone https://github.com/bhatti/grpc-lb-test
cd grpc-lb-test
# Build all components
make build
Test 1: Baseline – Local Testing
Purpose: Establish baseline behavior with a single server.
# Terminal 1: Start local server
./bin/server
# Terminal 2: Test with basic client
./bin/client -target localhost:50051 -requests 50
Expected Result:
Load Distribution Results:
Server: unknown-1755316152
Pod: unknown (IP: unknown)
Requests: 50 (100.0%)
████████████████████
Total servers hit: 1
WARNING: All requests went to a single server!
This indicates NO load balancing is happening.
Analysis: This confirms our client implementation works correctly and establishes the baseline.
Test 2: Kubernetes Without Istio
Purpose: Prove that standard Kubernetes doesn’t provide gRPC request-level load balancing.
Deploy the Service
# Deploy 5 replicas without Istio
./scripts/test-without-istio.sh
Load Distribution Results:
================================
Server: grpc-echo-server-5b657689db-gh5z5-1755316388
  Pod: grpc-echo-server-5b657689db-gh5z5 (IP: 10.1.4.148)
  Requests: 30 (100.0%)
  ████████████████████
Total servers hit: 1
WARNING: All requests went to a single server!
This indicates NO load balancing is happening.
Connection Analysis:
Without Istio, gRPC maintains a single TCP connection to the Kubernetes Service IP.
The kube-proxy performs L4 load balancing, but gRPC reuses the same connection.
Cleaning up...
deployment.apps "grpc-echo-server" deleted
service "grpc-echo-service" deleted
./scripts/test-without-istio.sh: line 57: 17836 Terminated: 15
kubectl port-forward service/grpc-echo-service 50051:50051 > /dev/null 2>&1
RESULT: No load balancing observed - all requests went to single pod!
“For each Service, kube-proxy installs iptables rules which capture traffic to the Service’s clusterIP and port, and redirect that traffic to one of the Service’s backend endpoints.”
NO LOAD BALANCING: All requests to single server
Connection Reuse Analysis:
  Average requests per connection: 1.00
  Low connection reuse (many short connections)
Connection analysis complete!
Test 3: Kubernetes With Istio
Purpose: Demonstrate how Istio’s L7 proxy solves the load balancing problem.
“Envoy proxies are deployed as sidecars to services, logically augmenting the services with traffic management capabilities… Envoy proxies are the only Istio components that interact with data plane traffic.”
Istio’s solution:
Envoy sidecar intercepts all traffic
Performs L7 (application-level) load balancing
Maintains connection pools to all backends
Routes each request independently
Test 4: Client-Side Load Balancing
Purpose: Test gRPC’s built-in client-side load balancing capabilities.
Load Distribution Results:
================================
Server: grpc-echo-server-5b657689db-g9pbw-1755359830
  Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
  Requests: 10 (100.0%)
  ████████████████████
Total servers hit: 1
WARNING: All requests went to a single server!
This indicates NO load balancing is happening.
Normal client works - service is accessible

Test 2: Client-side round-robin (from inside cluster)
-----------------------------------------------------
Creating test pod inside cluster for proper DNS resolution...
pod "client-lb-test" deleted
./scripts/test-client-lb.sh: line 71: 48208 Terminated: 15 kubectl port-forward service/grpc-echo-service 50051:50051 > /dev/null 2>&1
Client-side LB limitation explanation:
  gRPC client-side round-robin expects multiple A records
  But Kubernetes Services return only one ClusterIP
  Result: 'no children to pick from' error
What happens with client-side LB:
  1. Client asks DNS for: grpc-echo-service
  2. DNS returns: 10.105.177.23 (single IP)
  3. gRPC round-robin needs: multiple IPs for load balancing
  4. Result: Error 'no children to pick from'
This proves client-side LB doesn't work with K8s Services!

Test 3: Demonstrating the DNS limitation
----------------------------------------
What gRPC client-side LB sees:
  Service name: grpc-echo-service:50051
  DNS resolution: 10.105.177.23:50051
  Available endpoints: 1 (needs multiple for round-robin)
What gRPC client-side LB needs:
  Multiple A records from DNS, like:
  grpc-echo-service → 10.1.4.241:50051
  grpc-echo-service → 10.1.4.240:50051
  grpc-echo-service → 10.1.4.238:50051
  (But Kubernetes Services don't provide this)

Test 4: Alternative - Multiple connections
------------------------------------------
Testing alternative approach with multiple connections...
Configuration:
  Target: localhost:50052
  API: grpc.Dial
  Load Balancing: round-robin
  Multi-endpoint: true
  Requests: 20
Using multi-endpoint resolver
Sending 20 unary requests...
Request 1 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
Request 2 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
Request 3 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
Request 4 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
Request 5 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
Request 6 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
Request 7 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
Request 8 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
Request 9 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
Request 10 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
Request 11 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
Successful requests: 20/20
Load Distribution Results:
================================
Server: grpc-echo-server-5b657689db-g9pbw-1755359830
  Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
  Requests: 20 (100.0%)
  ████████████████████████████████████████
Total unique servers: 1
WARNING: All requests went to a single server!
This indicates NO load balancing is happening.
This is expected for gRPC without Istio or special configuration.
Multi-connection approach works!
(This simulates multiple endpoints for testing)
===============================================================
SUMMARY
===============================================================
KEY FINDINGS:
  • Standard gRPC client: Works (uses single connection)
  • Client-side round-robin: Fails (needs multiple IPs)
  • Kubernetes DNS: Returns single ClusterIP only
  • Alternative: Multiple connections can work
CONCLUSION:
  Client-side load balancing doesn't work with standard
  Kubernetes Services because they provide only one IP address.
  This proves why Istio (L7 proxy) is needed for gRPC load balancing!
Why this fails: Kubernetes Services provide a single ClusterIP, not multiple IPs for DNS resolution.
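Stripped to its essence, per-call balancing is just a picker cycling over more than one address; with a single ClusterIP the list has one entry, so every pick lands on the same backend. The following is a minimal, stdlib-only sketch of that idea (the IPs and the `rrPicker` type are hypothetical, not gRPC's internal balancer API):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rrPicker round-robins over a fixed endpoint list. This is the essence of
// what a multi-endpoint resolver gives gRPC: a per-call choice among
// addresses. With a single ClusterIP the list has one entry, so every
// "pick" returns the same backend -- exactly the failure mode shown above.
type rrPicker struct {
	addrs []string
	next  uint64
}

// pick returns the next address in round-robin order.
func (p *rrPicker) pick() string {
	n := atomic.AddUint64(&p.next, 1)
	return p.addrs[(n-1)%uint64(len(p.addrs))]
}

func main() {
	// Hypothetical pod IPs; a Kubernetes Service hides these behind one VIP.
	p := &rrPicker{addrs: []string{
		"10.1.4.241:50051",
		"10.1.4.240:50051",
		"10.1.4.238:50051",
	}}
	for i := 0; i < 6; i++ {
		fmt.Println(p.pick()) // cycles through all three addresses
	}
}
```

With three addresses the picker distributes evenly; collapse the slice to one ClusterIP and the distribution collapses with it.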
Detailed Load Distribution Results:
=====================================
Test Duration: 48.303709ms
Total Requests: 1000
Failed Requests: 0
Requests/sec: 20702.34
Server Distribution:
Server: unknown-1755360945
  Pod: unknown (IP: unknown)
  Requests: 1000 (100.0%)
  First seen: 09:18:51.842
  Last seen: 09:18:51.874
  ████████████████████████████████████████
Analysis:
  Total unique servers: 1
  Average requests per server: 1000.00
  Standard deviation: 0.00
WARNING: All requests went to a single server!
This indicates NO load balancing is happening.
This is expected behavior for gRPC without Istio.
Even sophisticated connection pooling can’t overcome the fundamental issue:
• Multiple connections to SAME endpoint = same server
• Advanced client techniques ≠ load balancing
• Connection management ≠ request distribution
Performance Comparison
./scripts/benchmark.sh
Key Insights:
• Single server: High performance, no load balancing
• Multiple connections: Same performance, still no LB
• Kubernetes: Small overhead, still no LB
• Istio: Small additional overhead, but enables LB
• Client-side LB: Complex setup, limited effectiveness
“Load balancing within gRPC happens on a per-call basis, not a per-connection basis. In other words, even if all requests come from a single client, we want to distribute them across all servers.”
The problem: Standard deployments don’t achieve per-call balancing.
“Istio’s data plane is composed of a set of intelligent proxies (Envoy) deployed as sidecars. These proxies mediate and control all network communication between microservices.”
“kube-proxy… only supports TCP and UDP… doesn’t understand HTTP and doesn’t provide load balancing for HTTP requests.”
Complete Test Results Summary
After running comprehensive tests across all possible gRPC load balancing configurations, here are the definitive results that prove the fundamental limitations and solutions:
Core Test Matrix Results

| Configuration | Load Balancing | Servers Hit | Distribution | Key Insight |
| --- | --- | --- | --- | --- |
| Local gRPC | ❌ None | 1/1 (100%) | Single server | Baseline behavior confirmed |
| Kubernetes + gRPC | ❌ None | 1/5 (100%) | Single pod | K8s Services don’t solve it |
| Kubernetes + Istio | ✅ Perfect | 5/5 (20% each) | Even distribution | Istio enables true LB |
| Client-side LB | ❌ Failed | 1/5 (100%) | Single pod | DNS limitation fatal |
| kubectl port-forward + Istio | ❌ None | 1/5 (100%) | Single pod | Testing methodology matters |
| Advanced multi-connection | ❌ None | 1/1 (100%) | Single endpoint | Complex ≠ effective |
Detailed Test Scenario Analysis
Scenario 1: Baseline Tests
Local single server: ✅ PASS - 50 requests → 1 server (100%)
Local multiple conn: ✅ PASS - 1000 requests → 1 server (100%)
Insight: Confirms gRPC’s connection persistence behavior. Multiple connections to the same endpoint don’t change distribution.
Scenario 2: Kubernetes Standard Deployment
K8s without Istio: ✅ PASS - 50 requests → 1 pod (100%)
Expected behavior: ❌ NO load balancing
Actual behavior: ❌ NO load balancing
Insight: A standard Kubernetes deployment with 5 replicas provides zero request-level load balancing for gRPC services.
Scenario 3: Istio Service Mesh
K8s with Istio (port-forward): ⚠️ BYPASS - 50 requests → 1 pod (100%)
K8s with Istio (in-mesh): ✅ SUCCESS - 50 requests → 5 pods (20% each)
DNS round-robin: ❌ FAIL - "no children to pick from"
Multi-endpoint client: ⚠️ PARTIAL - Works with manual endpoint management
Advanced connections: ❌ FAIL - Still single endpoint limitation
Insight: Client-side solutions are complex, fragile, and limited in Kubernetes environments.
Deep Technical Analysis
The DNS Problem (Root Cause)
Our testing revealed the fundamental architectural issue: Kubernetes Services resolve to a single ClusterIP, so a gRPC client’s DNS resolver only ever sees one address to balance across.
1. Enable Istio Sidecar Injection
# Enable for entire namespace (recommended)
kubectl label namespace production istio-injection=enabled
# Or per-deployment (more control)
metadata:
annotations:
sidecar.istio.io/inject: "true"
2. Validate Load Balancing is Working
# WRONG: This will show false negatives
kubectl port-forward service/grpc-service 50051:50051
# CORRECT: Test from inside the mesh
kubectl run test-client --rm -it --restart=Never \
--image=your-grpc-client \
--annotations="sidecar.istio.io/inject=true" \
-- ./client -target grpc-service:50051 -requests 100
1. Don’t Rely on Kubernetes Services Alone
# This WILL NOT load balance gRPC traffic
apiVersion: v1
kind: Service
metadata:
name: grpc-service
spec:
ports:
- port: 50051
selector:
app: grpc-server
# Result: 100% traffic to single pod (proven in our tests)
2. Don’t Use Client-Side Load Balancing
// This approach FAILS in Kubernetes (tested and failed)
conn, err := grpc.Dial(
"dns:///grpc-service:50051",
grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
)
// Error: "no children to pick from" (proven in our tests)
3. Don’t Implement Complex Connection Pooling
// This adds complexity without solving the core issue
type LoadBalancedClient struct {
conns []grpc.ClientConnInterface
next int64
}
// Still results in 100% traffic to single endpoint (proven in our tests)
Alternative Solutions (If Istio Not Available)
If you absolutely cannot use Istio, here are the only viable alternatives (with significant caveats):
Option 1: External Load Balancer with HTTP/2 Support
Error handling is often an afterthought in API development, yet it’s one of the most critical aspects of a good developer experience. A cryptic error message like { "error": "An error occurred" } can lead to hours of frustrating debugging. In this guide, we will build a robust, production-grade error handling framework for a Go application that serves both gRPC and a REST/HTTP proxy, based on RFC 9457 (Problem Details for HTTP APIs), which obsoletes RFC 7807.
Tenets
A great API error is:
Structured: machine-readable, not just a string.
Actionable: explains to the developer why the error occurred and, if possible, how to fix it.
Consistent: all errors, from validation to authentication to server faults, follow the same format.
Secure: never leaks sensitive internal information like stack traces or database schemas.
Our North Star for HTTP errors will be the Problem Details for HTTP APIs format (RFC 9457):
{
"type": "https://example.com/docs/errors/validation-failed",
"title": "Validation Failed",
"status": 400,
"detail": "The request body failed validation.",
"instance": "/v1/todos",
"invalid_params": [
{
"field": "title",
"reason": "must not be empty"
}
]
}
We will adapt this model for gRPC by embedding a similar structure in the gRPC status details, creating a single source of truth for all errors.
API Design
Let’s start by defining our TODO API in Protocol Buffers:
Our implementation demonstrates several key best practices:
1. Consistent Error Format
All errors follow RFC 9457 (Problem Details) format, providing:
Machine-readable type URIs
Human-readable titles and details
HTTP status codes
Request tracing
Extensible metadata
2. Comprehensive Validation
All validation errors are returned at once, not one by one
Clear field paths for nested objects
Descriptive error codes and messages
Support for batch operations with partial success
3. Security-Conscious Design
No sensitive information in error messages
Internal errors are logged but not exposed
Generic messages for authentication failures
Request IDs for support without exposing internals
4. Developer Experience
Clear, actionable error messages
Helpful suggestions for fixing issues
Consistent error codes across protocols
Rich metadata for debugging
5. Protocol Compatibility
Seamless translation between gRPC and HTTP
Proper status code mapping
Preservation of error details across protocols
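The status code mapping boils down to a small translation table. The sketch below uses the canonical numeric gRPC codes (0 = OK, 3 = InvalidArgument, and so on) and follows the defaults grpc-gateway applies; a real service might tune individual entries:

```go
package main

import (
	"fmt"
	"net/http"
)

// httpStatusFromGRPC maps numeric gRPC status codes to HTTP status codes,
// following the default table grpc-gateway uses. The case values are the
// canonical gRPC code numbers.
func httpStatusFromGRPC(code uint32) int {
	switch code {
	case 0: // OK
		return http.StatusOK
	case 3: // InvalidArgument
		return http.StatusBadRequest
	case 4: // DeadlineExceeded
		return http.StatusGatewayTimeout
	case 5: // NotFound
		return http.StatusNotFound
	case 6: // AlreadyExists
		return http.StatusConflict
	case 7: // PermissionDenied
		return http.StatusForbidden
	case 8: // ResourceExhausted
		return http.StatusTooManyRequests
	case 14: // Unavailable
		return http.StatusServiceUnavailable
	case 16: // Unauthenticated
		return http.StatusUnauthorized
	default: // Internal, Unknown, DataLoss, ...
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(httpStatusFromGRPC(5)) // NotFound maps to 404
}
```

Keeping this table in one middleware, rather than per handler, is what makes the cross-protocol mapping consistent.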
6. Observability
Structured logging with trace IDs
Prometheus metrics for monitoring
OpenTelemetry integration
Error categorization for analysis
Conclusion
This comprehensive guide demonstrates how to build robust error handling for modern APIs. By treating errors as a first-class feature of our API, we’ve achieved several key benefits:
Consistency: All errors, regardless of their source, are presented to clients in a predictable format.
Clarity: Developers consuming our API get clear, actionable feedback, helping them debug and integrate faster.
Developer Ergonomics: Our internal service code is cleaner, as handlers focus on business logic while the middleware handles the boilerplate of error conversion.
Security: We have a clear separation between internal error details (for logging) and public error responses, preventing leaks.
In the world of cloud-native applications, service lifecycle management is often an afterthought—until it causes a production outage. Whether you’re running gRPC or REST APIs on Kubernetes with Istio, proper lifecycle management is the difference between smooth deployments and 3 AM incident calls. Consider these scenarios:
Your service takes 45 seconds to warm up its cache, but Kubernetes kills it after 30 seconds of startup wait.
During deployments, clients receive connection errors as pods terminate abruptly.
A hiccup in a database or dependent service triggers cascading failures across your entire service mesh.
Your service mesh sidecar shuts down before your application finishes terminating, dropping in-flight requests.
A critical service receives SIGKILL during transaction processing, leaving data in inconsistent states.
After a regional outage, services restart but data drift goes undetected for hours.
Your RTO target is 15 seconds, but services take 30 seconds just to start up properly.
These aren’t edge cases—they’re common problems that proper lifecycle management solves. More critically, unsafe shutdowns can cause data corruption, financial losses, and breach compliance requirements. This guide covers what you need to know about building services that start safely, shut down gracefully, and handle failures intelligently.
The Hidden Complexity of Service Lifecycles
Modern microservices don’t exist in isolation. A typical request might flow through:
Each layer adds complexity to startup and shutdown sequences. Without proper coordination, you’ll experience:
Startup race conditions: Application tries to make network calls before the sidecar proxy is ready
Shutdown race conditions: Sidecar terminates while the application is still processing requests
Premature traffic: Load balancer routes traffic before the application is truly ready
Data corruption: In-flight transactions get interrupted, leaving databases in inconsistent states
Compliance violations: Financial services may face regulatory penalties for data integrity failures
Core Concepts: The Three Types of Health Checks
Kubernetes provides three distinct probe types, each serving a specific purpose:
1. Liveness Probe: “Is the process alive?”
Detects deadlocks and unrecoverable states
Should be fast and simple (e.g., HTTP GET /healthz)
Failure triggers container restart
Common mistake: Making this check too complex
2. Readiness Probe: “Can the service handle traffic?”
Validates all critical dependencies are available
Prevents routing traffic to pods that aren’t ready
Should perform “deep” checks of dependencies
Common mistake: Using the same check as liveness
3. Startup Probe: “Is the application still initializing?”
Provides grace period for slow-starting containers
Disables liveness/readiness probes until successful
Prevents restart loops during initialization
Common mistake: Not using it for slow-starting apps
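To make the liveness/readiness split concrete, here is a stdlib-only sketch with separate endpoints (the probeState type and the /healthz and /readyz paths are illustrative names, not required by Kubernetes):

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// probeState holds the flags the probes consult. Liveness stays cheap and
// dependency-free; readiness performs the deeper checks.
type probeState struct {
	started      atomic.Bool // flipped once initialization completes
	depsHealthy  atomic.Bool // refreshed by a background dependency checker
	shuttingDown atomic.Bool // set when SIGTERM arrives
}

// ready reports whether the service should receive traffic.
func (s *probeState) ready() bool {
	return s.started.Load() && s.depsHealthy.Load() && !s.shuttingDown.Load()
}

// livez answers the liveness (and startup) probe: process responsive, nothing more.
func (s *probeState) livez(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readyz answers the readiness probe with the deep check.
func (s *probeState) readyz(w http.ResponseWriter, _ *http.Request) {
	if !s.ready() {
		http.Error(w, "not ready", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	s := &probeState{}
	s.started.Store(true)
	s.depsHealthy.Store(true)
	http.HandleFunc("/healthz", s.livez)
	http.HandleFunc("/readyz", s.readyz)
	fmt.Println("probe endpoints registered, ready:", s.ready())
	// http.ListenAndServe(":8090", nil) // left commented so the sketch terminates
}
```

Because the two handlers share state but not logic, you can fail readiness during startup or shutdown without tripping liveness and triggering a restart loop.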
The Hidden Dangers of Unsafe Shutdowns
While graceful shutdown is ideal, it’s not always possible. Kubernetes will send SIGKILL after the termination grace period, and infrastructure failures can terminate pods instantly. This creates serious risks:
Data Corruption Scenarios
Financial Transaction Example:
// DANGEROUS: Non-atomic operation
func (s *PaymentService) ProcessPayment(req *PaymentRequest) error {
// Step 1: Debit source account
if err := s.debitAccount(req.FromAccount, req.Amount); err != nil {
return err
}
// DANGER: SIGKILL here leaves money debited but not credited
// Step 2: Credit destination account
if err := s.creditAccount(req.ToAccount, req.Amount); err != nil {
// Money is lost! Source debited but destination not credited
return err
}
// Step 3: Record transaction
return s.recordTransaction(req)
}
E-commerce Inventory Example:
// DANGEROUS: Race condition during shutdown
func (s *InventoryService) ReserveItem(req *ReserveRequest) error {
// Check availability
if s.getStock(req.ItemID) < req.Quantity {
return ErrInsufficientStock
}
// DANGER: SIGKILL here can cause double-reservation
// Another request might see the same stock level
// Reserve the item
return s.updateStock(req.ItemID, -req.Quantity)
}
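One mitigation is to collapse the availability check and the reservation into a single atomic step, so there is no window for a concurrent request (or a mid-operation kill) to observe stale stock. An in-memory sketch follows; the Inventory type is hypothetical, and a real service would get the same effect from a conditional database update rather than a mutex:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrInsufficientStock = errors.New("insufficient stock")

// Inventory makes check-and-reserve one atomic step. In a database, the
// equivalent is a conditional update, e.g.
//   UPDATE items SET stock = stock - ? WHERE id = ? AND stock >= ?
type Inventory struct {
	mu    sync.Mutex
	stock map[string]int
}

// Reserve checks and decrements stock under the same lock, so no second
// request can observe the pre-decrement stock level in between.
func (inv *Inventory) Reserve(itemID string, qty int) error {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	if inv.stock[itemID] < qty {
		return ErrInsufficientStock
	}
	inv.stock[itemID] -= qty
	return nil
}

func main() {
	inv := &Inventory{stock: map[string]int{"sku-1": 3}}
	fmt.Println(inv.Reserve("sku-1", 2)) // succeeds, stock drops to 1
	fmt.Println(inv.Reserve("sku-1", 2)) // fails: insufficient stock
}
```

The same principle applies to the payment example: wrap debit and credit in one database transaction so a kill leaves either both applied or neither.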
RTO/RPO Impact
Recovery Time Objective (RTO): How quickly can we restore service?
Poor lifecycle management increases startup time
Services may need manual intervention to reach consistent state
Cascading failures extend recovery time across the entire system
Recovery Point Objective (RPO): How much data can we afford to lose?
Unsafe shutdowns can corrupt recent transactions
Without idempotency, replay of messages may create duplicates
Data inconsistencies may not be detected until much later
The Anti-Entropy Solution
Since graceful shutdown isn’t always possible, production systems need reconciliation processes to detect and repair inconsistencies:
// Anti-entropy pattern for data consistency
type ReconciliationService struct {
paymentDB PaymentDatabase
accountDB AccountDatabase
auditLog AuditLogger
alerting AlertingService
}
func (r *ReconciliationService) ReconcilePayments(ctx context.Context) error {
// Find payments without matching account entries
orphanedPayments, err := r.paymentDB.FindOrphanedPayments(ctx)
if err != nil {
return err
}
for _, payment := range orphanedPayments {
// Check if this was a partial transaction
sourceDebit, _ := r.accountDB.GetTransaction(payment.FromAccount, payment.ID)
destCredit, _ := r.accountDB.GetTransaction(payment.ToAccount, payment.ID)
switch {
case sourceDebit != nil && destCredit == nil:
// Complete the transaction
if err := r.creditAccount(payment.ToAccount, payment.Amount); err != nil {
r.alerting.SendAlert("Failed to complete orphaned payment", payment.ID)
continue
}
r.auditLog.RecordReconciliation("completed_payment", payment.ID)
case sourceDebit == nil && destCredit != nil:
// Reverse the credit
if err := r.debitAccount(payment.ToAccount, payment.Amount); err != nil {
r.alerting.SendAlert("Failed to reverse orphaned credit", payment.ID)
continue
}
r.auditLog.RecordReconciliation("reversed_credit", payment.ID)
default:
// Both or neither exist - needs investigation
r.alerting.SendAlert("Ambiguous payment state", payment.ID)
}
}
return nil
}
// Run reconciliation periodically
func (r *ReconciliationService) Start(ctx context.Context) {
ticker := time.NewTicker(5 * time.Minute)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
if err := r.ReconcilePayments(ctx); err != nil {
log.Printf("Reconciliation failed: %v", err)
}
}
}
}
Building a Resilient Service: Complete Example
Let’s build a production-ready service that demonstrates all best practices. We’ll create two versions: one with anti-patterns (bad-service) and one with best practices (good-service).
The Application Code
//go:generate protoc --go_out=. --go_opt=paths=source_relative --go-grpc_out=. --go-grpc_opt=paths=source_relative api/demo.proto
package main
import (
"context"
"flag"
"fmt"
"log"
"net"
"net/http"
"os"
"os/signal"
"sync/atomic"
"syscall"
"time"
"google.golang.org/grpc"
"google.golang.org/grpc/codes"
health "google.golang.org/grpc/health/grpc_health_v1"
"google.golang.org/grpc/status"
)
// Service represents our application with health state
type Service struct {
isHealthy atomic.Bool
isShuttingDown atomic.Bool
activeRequests atomic.Int64
dependencyHealthy atomic.Bool
}
// HealthChecker implements the gRPC health checking protocol
type HealthChecker struct {
svc *Service
}
func (h *HealthChecker) Check(ctx context.Context, req *health.HealthCheckRequest) (*health.HealthCheckResponse, error) {
service := req.GetService()
// Liveness: Simple check - is the process responsive?
if service == "" || service == "liveness" {
if h.svc.isShuttingDown.Load() {
return &health.HealthCheckResponse{
Status: health.HealthCheckResponse_NOT_SERVING,
}, nil
}
return &health.HealthCheckResponse{
Status: health.HealthCheckResponse_SERVING,
}, nil
}
// Readiness: Deep check - can we handle traffic?
if service == "readiness" {
// Check application health
if !h.svc.isHealthy.Load() {
return &health.HealthCheckResponse{
Status: health.HealthCheckResponse_NOT_SERVING,
}, nil
}
// Check critical dependencies
if !h.svc.dependencyHealthy.Load() {
return &health.HealthCheckResponse{
Status: health.HealthCheckResponse_NOT_SERVING,
}, nil
}
// Check if shutting down
if h.svc.isShuttingDown.Load() {
return &health.HealthCheckResponse{
Status: health.HealthCheckResponse_NOT_SERVING,
}, nil
}
return &health.HealthCheckResponse{
Status: health.HealthCheckResponse_SERVING,
}, nil
}
// Synthetic readiness: Complex business logic check for monitoring
if service == "synthetic-readiness" {
// Simulate a complex health check that validates business logic
// This would make actual API calls, database queries, etc.
if !h.performSyntheticCheck(ctx) {
return &health.HealthCheckResponse{
Status: health.HealthCheckResponse_NOT_SERVING,
}, nil
}
return &health.HealthCheckResponse{
Status: health.HealthCheckResponse_SERVING,
}, nil
}
return nil, status.Errorf(codes.NotFound, "unknown service: %s", service)
}
func (h *HealthChecker) performSyntheticCheck(ctx context.Context) bool {
// In a real service, this would:
// 1. Create a test transaction
// 2. Query the database
// 3. Call dependent services
// 4. Validate the complete flow works
return h.svc.isHealthy.Load() && h.svc.dependencyHealthy.Load()
}
func (h *HealthChecker) Watch(req *health.HealthCheckRequest, server health.Health_WatchServer) error {
return status.Error(codes.Unimplemented, "watch not implemented")
}
// DemoServiceServer implements your business logic
type DemoServiceServer struct {
UnimplementedDemoServiceServer
svc *Service
}
func (s *DemoServiceServer) ProcessRequest(ctx context.Context, req *ProcessRequest) (*ProcessResponse, error) {
s.svc.activeRequests.Add(1)
defer s.svc.activeRequests.Add(-1)
// Simulate processing
select {
case <-ctx.Done():
return nil, ctx.Err()
case <-time.After(100 * time.Millisecond):
return &ProcessResponse{
Result: fmt.Sprintf("Processed: %s", req.GetData()),
}, nil
}
}
func main() {
var (
port = flag.Int("port", 8080, "gRPC port")
mgmtPort = flag.Int("mgmt-port", 8090, "Management port")
startupDelay = flag.Duration("startup-delay", 10*time.Second, "Startup delay")
)
flag.Parse()
svc := &Service{}
svc.dependencyHealthy.Store(true) // Assume healthy initially
// Management endpoints for testing
mux := http.NewServeMux()
mux.HandleFunc("/toggle-health", func(w http.ResponseWriter, r *http.Request) {
current := svc.dependencyHealthy.Load()
svc.dependencyHealthy.Store(!current)
fmt.Fprintf(w, "Dependency health toggled to: %v\n", !current)
})
mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
fmt.Fprintf(w, "active_requests %d\n", svc.activeRequests.Load())
fmt.Fprintf(w, "is_healthy %v\n", svc.isHealthy.Load())
fmt.Fprintf(w, "is_shutting_down %v\n", svc.isShuttingDown.Load())
})
mgmtServer := &http.Server{
Addr: fmt.Sprintf(":%d", *mgmtPort),
Handler: mux,
}
// Start management server
go func() {
log.Printf("Management server listening on :%d", *mgmtPort)
if err := mgmtServer.ListenAndServe(); err != http.ErrServerClosed {
log.Fatalf("Management server failed: %v", err)
}
}()
// Simulate slow startup
log.Printf("Starting application (startup delay: %v)...", *startupDelay)
time.Sleep(*startupDelay)
svc.isHealthy.Store(true)
log.Println("Application initialized and ready")
// Setup gRPC server
lis, err := net.Listen("tcp", fmt.Sprintf(":%d", *port))
if err != nil {
log.Fatalf("Failed to listen: %v", err)
}
grpcServer := grpc.NewServer()
RegisterDemoServiceServer(grpcServer, &DemoServiceServer{svc: svc})
health.RegisterHealthServer(grpcServer, &HealthChecker{svc: svc})
// Start gRPC server
go func() {
log.Printf("gRPC server listening on :%d", *port)
if err := grpcServer.Serve(lis); err != nil {
log.Fatalf("gRPC server failed: %v", err)
}
}()
// Wait for shutdown signal
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
sig := <-sigCh
log.Printf("Received signal: %v, starting graceful shutdown...", sig)
// Graceful shutdown sequence
svc.isShuttingDown.Store(true)
svc.isHealthy.Store(false) // Fail readiness immediately
// Stop accepting new requests
grpcServer.GracefulStop()
// Wait for active requests to complete
timeout := time.After(30 * time.Second)
ticker := time.NewTicker(100 * time.Millisecond)
defer ticker.Stop()
for {
select {
case <-timeout:
log.Println("Shutdown timeout reached, forcing exit")
os.Exit(1)
case <-ticker.C:
active := svc.activeRequests.Load()
if active == 0 {
log.Println("All requests completed")
goto shutdown
}
log.Printf("Waiting for %d active requests to complete...", active)
}
}
shutdown:
// Cleanup
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
mgmtServer.Shutdown(ctx)
log.Println("Graceful shutdown complete")
}
Kubernetes Manifests: Anti-Patterns vs Best Practices
Bad Service (Anti-Patterns)
apiVersion: apps/v1
kind: Deployment
metadata:
name: bad-service
namespace: demo
spec:
replicas: 2
selector:
matchLabels:
app: bad-service
template:
metadata:
labels:
app: bad-service
# MISSING: Critical Istio annotations!
spec:
# DEFAULT: Only 30s grace period
containers:
- name: app
image: myregistry/demo-service:latest
ports:
- containerPort: 8080
name: grpc
- containerPort: 8090
name: mgmt
args: ["--startup-delay=45s"] # Longer than default probe timeout!
# ANTI-PATTERN: Identical liveness and readiness probes
livenessProbe:
exec:
command: ["/bin/grpc_health_probe", "-addr=:8080"]
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 3 # Will fail after 40s total
readinessProbe:
exec:
command: ["/bin/grpc_health_probe", "-addr=:8080"] # Same as liveness!
initialDelaySeconds: 10
periodSeconds: 10
# MISSING: No startup probe for slow initialization
# MISSING: No preStop hook for graceful shutdown
Istio Service Mesh: Beyond Basic Lifecycle Management
While proper health checks and graceful shutdown are foundational, Istio adds critical production-grade capabilities that dramatically improve fault tolerance:
Important: Services requiring more than 90-120 seconds to shut down should be re-architected using checkpoint-and-resume patterns.
Advanced Patterns for Production
1. Idempotency: Handling Duplicate Requests
Critical for production: When pods restart or network issues occur, clients may retry requests. Without idempotency, this can cause duplicate transactions, corrupted state, or financial losses. This is mandatory for all state-modifying operations.
package idempotency
import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"sync"
	"time"
)
var (
ErrDuplicateRequest = errors.New("duplicate request detected")
ErrProcessingInProgress = errors.New("request is currently being processed")
)
// IdempotencyStore tracks request execution with persistence
type IdempotencyStore struct {
mu sync.RWMutex
records map[string]*Record
persister PersistenceLayer // Database or Redis for durability
}
type Record struct {
Key string
Response interface{}
Error error
Status ProcessingStatus
ExpiresAt time.Time
CreatedAt time.Time
ProcessedAt *time.Time
}
type ProcessingStatus int
const (
StatusPending ProcessingStatus = iota
StatusProcessing
StatusCompleted
StatusFailed
)
// ProcessIdempotent ensures exactly-once processing semantics
func (s *IdempotencyStore) ProcessIdempotent(
ctx context.Context,
key string,
ttl time.Duration,
fn func() (interface{}, error),
) (interface{}, error) {
// Check if we've seen this request before
s.mu.RLock()
record, exists := s.records[key]
s.mu.RUnlock()
if exists {
switch record.Status {
case StatusCompleted:
if time.Now().Before(record.ExpiresAt) {
return record.Response, record.Error
}
case StatusProcessing:
return nil, ErrProcessingInProgress
case StatusFailed:
if time.Now().Before(record.ExpiresAt) {
return record.Response, record.Error
}
}
}
// Mark as processing
record = &Record{
Key: key,
Status: StatusProcessing,
ExpiresAt: time.Now().Add(ttl),
CreatedAt: time.Now(),
}
s.mu.Lock()
s.records[key] = record
s.mu.Unlock()
// Persist the processing state
if err := s.persister.Save(ctx, record); err != nil {
return nil, err
}
// Execute the function
response, err := fn()
processedAt := time.Now()
// Update record with result
s.mu.Lock()
record.Response = response
record.Error = err
record.ProcessedAt = &processedAt
if err != nil {
record.Status = StatusFailed
} else {
record.Status = StatusCompleted
}
s.mu.Unlock()
// Persist the final state
s.persister.Save(ctx, record)
return response, err
}
// Example: Idempotent payment processing
func (s *PaymentService) ProcessPayment(ctx context.Context, req *PaymentRequest) (*PaymentResponse, error) {
// Generate idempotency key from request
key := generateIdempotencyKey(req)
result, err := s.idempotencyStore.ProcessIdempotent(
ctx,
key,
24*time.Hour, // Keep records for 24 hours
func() (interface{}, error) {
// Atomic transaction processing
return s.processPaymentTransaction(ctx, req)
},
)
if err != nil {
return nil, err
}
return result.(*PaymentResponse), nil
}
// Atomic transaction processing
func (s *PaymentService) processPaymentTransaction(ctx context.Context, req *PaymentRequest) (*PaymentResponse, error) {
    // Use database transaction for atomicity
    tx, err := s.db.BeginTx(ctx, nil)
    if err != nil {
        return nil, err
    }
    defer tx.Rollback() // no-op once Commit succeeds

    // Step 1: Validate accounts
    if err := s.validateAccounts(ctx, tx, req); err != nil {
        return nil, err
    }

    // Step 2: Process payment atomically
    paymentID, err := s.executePayment(ctx, tx, req)
    if err != nil {
        return nil, err
    }

    // Step 3: Commit transaction
    if err := tx.Commit(); err != nil {
        return nil, err
    }

    return &PaymentResponse{
        PaymentID: paymentID,
        Status:    "completed",
        Timestamp: time.Now(),
    }, nil
}
2. Checkpoint and Resume: Long-Running Operations
For operations that may exceed the termination grace period, implement checkpointing:
package checkpoint

import (
    "context"
    "log"
    "time"
)

type CheckpointStore interface {
    Save(ctx context.Context, id string, state interface{}) error
    Load(ctx context.Context, id string, state interface{}) error
    Delete(ctx context.Context, id string) error
}

type BatchProcessor struct {
    store          CheckpointStore
    checkpointFreq int
}

type BatchState struct {
    JobID      string    `json:"job_id"`
    TotalItems int       `json:"total_items"`
    Processed  int       `json:"processed"`
    LastItem   string    `json:"last_item"`
    StartedAt  time.Time `json:"started_at"`
}

func (p *BatchProcessor) ProcessBatch(ctx context.Context, jobID string, items []string) error {
    // Try to resume from checkpoint
    state := &BatchState{JobID: jobID}
    if err := p.store.Load(ctx, jobID, state); err == nil {
        log.Printf("Resuming job %s from item %d", jobID, state.Processed)
        items = items[state.Processed:]
    } else {
        // New job
        state = &BatchState{
            JobID:      jobID,
            TotalItems: len(items),
            Processed:  0,
            StartedAt:  time.Now(),
        }
    }

    // Process items with periodic checkpointing
    for _, item := range items {
        select {
        case <-ctx.Done():
            // Save progress before shutting down; ctx is already canceled,
            // so use a short detached context for the final write
            saveCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
            defer cancel()
            state.LastItem = item
            return p.store.Save(saveCtx, jobID, state)
        default:
            // Process item
            if err := p.processItem(ctx, item); err != nil {
                return err
            }
            state.Processed++
            state.LastItem = item

            // Checkpoint periodically
            if state.Processed%p.checkpointFreq == 0 {
                if err := p.store.Save(ctx, jobID, state); err != nil {
                    log.Printf("Failed to checkpoint: %v", err)
                }
            }
        }
    }

    // Job completed, remove checkpoint
    return p.store.Delete(ctx, jobID)
}
3. Circuit Breaker Pattern for Dependencies
Protect your service from cascading failures:
package circuitbreaker

import (
    "context"
    "errors"
    "log"
    "sync"
    "time"
)

// ErrCircuitOpen is returned while the breaker is rejecting calls.
var ErrCircuitOpen = errors.New("circuit breaker is open")

type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    mu               sync.RWMutex
    state            State
    failures         int
    successes        int
    lastFailureTime  time.Time
    maxFailures      int
    resetTimeout     time.Duration
    halfOpenRequests int
}

func (cb *CircuitBreaker) Call(ctx context.Context, fn func() error) error {
    cb.mu.RLock()
    state := cb.state
    cb.mu.RUnlock()

    if state == StateOpen {
        // Check if we should transition to half-open
        cb.mu.Lock()
        if time.Since(cb.lastFailureTime) > cb.resetTimeout {
            cb.state = StateHalfOpen
            cb.successes = 0
            state = StateHalfOpen
        }
        cb.mu.Unlock()
    }
    if state == StateOpen {
        return ErrCircuitOpen
    }

    err := fn()

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.failures++
        cb.lastFailureTime = time.Now()
        if state == StateHalfOpen || cb.failures >= cb.maxFailures {
            // A failed half-open probe re-opens the breaker immediately
            cb.state = StateOpen
            log.Printf("Circuit breaker opened after %d failures", cb.failures)
        }
        return err
    }

    if state == StateHalfOpen {
        cb.successes++
        if cb.successes >= cb.halfOpenRequests {
            cb.state = StateClosed
            cb.failures = 0
            log.Println("Circuit breaker closed")
        }
    } else {
        // A success in the closed state resets the consecutive-failure count
        cb.failures = 0
    }
    return nil
}
Testing Your Implementation
Manual Testing Guide
Test 1: Startup Race Condition
Setup:
# Deploy both services
kubectl apply -f k8s/bad-service.yaml
kubectl apply -f k8s/good-service.yaml
# Watch pods in separate terminal
watch kubectl get pods -n demo
Test the bad service:
# Force restart
kubectl delete pod -l app=bad-service -n demo
# Observe: Pod will enter CrashLoopBackOff due to liveness probe
# killing it before 45s startup completes
Test the good service:
# Force restart
kubectl delete pod -l app=good-service -n demo
# Observe: Pod stays in 0/1 Ready state for ~45s, then becomes ready
# No restarts occur thanks to startup probe
Test 2: Data Consistency Under Failure
Setup:
# Deploy payment service with reconciliation enabled
kubectl apply -f k8s/payment-service.yaml
# Start payment traffic generator
kubectl run payment-generator --image=payment-client:latest \
--restart=Never --rm -it -- \
--target=payment-service.demo.svc.cluster.local:8080 \
--rate=10 --duration=60s
Simulate SIGKILL during transactions:
# In another terminal, kill pods abruptly
while true; do
kubectl delete pod -l app=payment-service -n demo --force --grace-period=0
sleep 30
done
Service lifecycle management is not just about preventing outages—it’s about building systems that are predictable, observable, and resilient to the inevitable failures that occur in distributed systems. This allows:
Zero-downtime deployments: Services gracefully handle rollouts without data loss.
Improved reliability: Proper health checks prevent cascading failures.
Better observability: Clear signals about service state and data consistency.
Faster recovery: Services self-heal from transient failures.
Data integrity: Idempotency and reconciliation prevent corruption.
Compliance readiness: Meet RTO/RPO requirements for disaster recovery.
Financial protection: Prevent duplicate transactions and data corruption that could cost millions.
The difference between a service that “works on my machine” and one that thrives in production lies in these details. Whether you’re running on GKE, EKS, or AKS, these patterns form the foundation of production-ready microservices.
Want to test these patterns yourself? The complete code examples and deployment manifests are available on GitHub.
In any complex operational environment, the most challenging processes are often those that can’t be fully automated. A CI/CD pipeline might be 99% automated, but that final push to production requires a sign-off. A disaster recovery plan might be scripted, but you need a human to make the final call to failover. These “human-in-the-loop” scenarios are where rigid automation fails and manual checklists introduce risk.
Formicary is a distributed orchestration engine designed to bridge this gap. It allows you to codify your entire operational playbook—from automated scripts to manual verification steps—into a single, version-controlled workflow. This post will guide you through Formicary’s core concepts and demonstrate how to build two powerful, real-world playbooks:
A Secure CI/CD Pipeline that builds, scans, and deploys to staging, then pauses for manual approval before promoting to production.
A Semi-Automated Disaster Recovery Playbook that uses mocked Infrastructure as Code (IaC) to provision a new environment and waits for an operator’s go-ahead before failing over.
Formicary Features and Architecture
Formicary combines robust workflow capabilities with practical CI/CD features, all in a self-hosted, extensible platform.
Core Features
Declarative Workflows: Define complex jobs as a Directed Acyclic Graph (DAG) in a single, human-readable YAML file. Your entire playbook is version-controlled code.
Versatile Executors: A task is not tied to a specific runtime. Use the method that fits the job: KUBERNETES, DOCKER, SHELL, or even HTTP API calls.
Advanced Flow Control: Go beyond simple linear stages. Use on_exit_code to branch your workflow based on a script’s result, create polling “sensor” tasks, and define robust retry logic.
Manual Approval Gates: Explicitly define MANUAL tasks that pause the workflow and require human intervention to proceed via the UI or API.
Security Built-in: Manage secrets with database-level encryption and automatic log redaction. An RBAC model controls user access.
Architecture in a Nutshell
Formicary operates on a leader-follower model. The Queen server acts as the control plane, while one or more Ant workers form the execution plane.
Queen Server: The central orchestrator. It manages job definitions, schedules pending jobs based on priority, and tracks the state of all workers and executions.
Ant Workers: The workhorses. They register with the Queen, advertising their capabilities (e.g., supported executors and tags like gpu-enabled). They pick up tasks from the message queue and execute them.
Backend: Formicary relies on a database (like Postgres or MySQL) for state, a message queue (like Go Channels, Redis or Pulsar) for communication, and an S3-compatible object store for artifacts.
Getting Started: A Local Formicary Environment
The quickest way to get started is with the provided Docker Compose setup.
Prerequisites
Docker & Docker Compose
A local Kubernetes cluster (like Docker Desktop’s Kubernetes, Minikube, or k3s) with its kubeconfig file correctly set up. The embedded Ant worker will use this to run Kubernetes tasks.
Installation Steps
Clone the Repository: git clone https://github.com/bhatti/formicary.git && cd formicary
Launch the System: Run docker-compose up. This command starts the Queen server, a local Ant worker, Redis, and MinIO object storage.
Explore the Dashboard: Once the services are running, open your browser to http://localhost:7777.
Example 1: Secure CI/CD with Manual Production Deploy
Our goal is to build a CI/CD pipeline for a Go application that:
Builds the application binary.
Runs static analysis (gosec) and saves the report.
Deploys automatically to a staging environment.
Pauses for manual verification.
If approved, deploys to production.
Here is the complete playbook definition:
job_type: secure-go-cicd
description: Build, scan, and deploy a Go application with a manual production gate.
tasks:
  - task_type: build
    method: KUBERNETES
    container:
      image: golang:1.24-alpine
    script:
      - echo "Building Go binary..."
      - go build -o my-app ./...
    artifacts:
      paths: [ "my-app" ]
    on_completed: security-scan
  - task_type: security-scan
    method: KUBERNETES
    container:
      image: securego/gosec:latest
    allow_failure: true # We want the report even if it finds issues
    script:
      - echo "Running SAST scan with gosec..."
      # The -no-fail flag prevents the task from failing the pipeline immediately.
      - gosec -no-fail -fmt=sarif -out=gosec-report.sarif ./...
    artifacts:
      paths: [ "gosec-report.sarif" ]
    on_completed: deploy-staging
  - task_type: deploy-staging
    method: KUBERNETES
    dependencies: [ "build" ]
    container:
      image: alpine:latest
    script:
      - echo "Deploying ./my-app to staging..."
      - sleep 5 # Simulate deployment work
      - echo "Staging deployment complete. Endpoint: http://staging.example.com"
    on_completed: verify-production-deploy
  - task_type: verify-production-deploy
    method: MANUAL
    description: "Staging deployment complete. A security scan report is available as an artifact. Please verify the staging environment and the report before promoting to production."
    on_exit_code:
      APPROVED: promote-production
      REJECTED: rollback-staging
  - task_type: promote-production
    method: KUBERNETES
    dependencies: [ "build" ]
    container:
      image: alpine:latest
    script:
      - echo "PROMOTING ./my-app TO PRODUCTION! This is a critical, irreversible step."
    on_completed: cleanup
  - task_type: rollback-staging
    method: KUBERNETES
    container:
      image: alpine:latest
    script:
      - echo "Deployment was REJECTED. Rolling back staging environment now."
    on_completed: cleanup
  - task_type: cleanup
    method: KUBERNETES
    always_run: true
    container:
      image: alpine:latest
    script:
      - echo "Pipeline finished."
Executing the Playbook
Upload the Job Definition:
curl -X POST http://localhost:7777/api/jobs/definitions \
  -H "Content-Type: application/yaml" \
  --data-binary @playbooks/secure-ci-cd.yaml
Submit the Job Request:
curl -X POST http://localhost:7777/api/jobs/requests \
  -H "Content-Type: application/json" \
  -d '{"job_type": "secure-go-cicd"}'
Monitor and Approve:
Go to the dashboard. You will see the job run through build, security-scan, and deploy-staging.
The job will then enter the MANUAL_APPROVAL_REQUIRED state.
On the job’s detail page, you will see an “Approve” button next to the verify-production-deploy task.
To approve via the API, get the Job Request ID and the Task Execution ID from the UI or API, then call the task approval endpoint with those IDs.
Once approved, the playbook will proceed to promote-production and run the final cleanup step.
Example 2: Semi-Automated Disaster Recovery Playbook
Now for a more critical scenario: failing over a service to a secondary region. This playbook uses mocked IaC steps and pauses for the crucial final decision.
job_type: aws-region-failover
description: A playbook to provision and failover to a secondary region.
tasks:
  - task_type: check-primary-status
    method: KUBERNETES
    container:
      image: alpine:latest
    script:
      - echo "Pinging primary region endpoint... it's down! Initiating failover procedure."
      - exit 1 # Simulate failure to trigger the 'on_failed' path
    on_completed: no-op # This path is not taken in our simulation
    on_failed: provision-secondary-infra
  - task_type: provision-secondary-infra
    method: KUBERNETES
    container:
      image: hashicorp/terraform:light
    script:
      - echo "Simulating 'terraform apply' to provision DR infrastructure in us-west-2..."
      - sleep 10 # Simulate time for infra to come up
      - echo "Terraform apply complete. Outputting simulated state file."
      - echo '{"aws_instance.dr_server": {"id": "i-12345dr"}}' > terraform.tfstate
    artifacts:
      paths: [ "terraform.tfstate" ]
    on_completed: verify-failover
  - task_type: verify-failover
    method: MANUAL
    description: "Secondary infrastructure in us-west-2 has been provisioned. The terraform.tfstate file is available as an artifact. Please VERIFY COSTS and readiness. Approve to switch live traffic."
    on_exit_code:
      APPROVED: switch-dns
      REJECTED: teardown-secondary-infra
  - task_type: switch-dns
    method: KUBERNETES
    container:
      image: amazon/aws-cli
    script:
      - echo "CRITICAL: Switching production DNS records to the us-west-2 environment..."
      - sleep 5
      - echo "DNS failover complete. Traffic is now routed to the DR region."
    on_completed: notify-completion
  - task_type: teardown-secondary-infra
    method: KUBERNETES
    container:
      image: hashicorp/terraform:light
    script:
      - echo "Failover REJECTED. Simulating 'terraform destroy' for secondary infrastructure..."
      - sleep 10
      - echo "Teardown complete."
    on_completed: notify-completion
  - task_type: notify-completion
    method: KUBERNETES
    always_run: true
    container:
      image: alpine:latest
    script:
      - echo "Disaster recovery playbook has concluded."
Executing the DR Playbook
The execution flow is similar to the first example. An operator would trigger this job, wait for the provision-secondary-infra task to complete, download and review the terraform.tfstate artifact, and then make the critical “Approve” or “Reject” decision.
Conclusion
Formicary helps you turn your complex operational processes into reliable, trackable workflows that run automatically. It uses containers to execute tasks and includes manual approval checkpoints, so you can automate your work with confidence. This approach reduces human mistakes while making sure people stay in charge of the important decisions.
Feature flags are key components of modern infrastructure for shipping faster, testing in production, and reducing risk. However, they can also be a fast track to complex outages if not handled with discipline. Google’s recent major outage serves as a case study, and I’ve seen similar issues arise from missteps with feature flags. The core of Google’s incident revolved around a new code path in their “Service Control” system that should have been protected with a feature flag but wasn’t. This path, designed for an additional quota policy check, went directly to production without flag protection. When a policy change with unintended blank fields was replicated globally within seconds, it triggered the untested code path, causing a null-pointer crash that brought down binaries globally. This incident perfectly illustrates why feature flags aren’t just nice-to-have—they’re essential guardrails that prevent exactly these kinds of global outages. Google also didn’t implement proper error handling, and the system didn’t use randomized exponential backoff, which produced a “thundering herd” effect that prolonged recovery.
Let’s dive into common anti-patterns I’ve observed and how we can avoid them:
Anti-Pattern 1: Untested Code Paths and Unsafe Defaults
This is perhaps the most common and dangerous anti-pattern. It involves deploying code behind a feature flag without comprehensively testing all states of that flag (on, off) and the various conditions that interact with the flagged feature. It also includes neglecting robust error handling within the flagged code itself and failing to default flags to “off” in production. For example, Google’s Service Control binary crashed due to a null pointer when a new policy was propagated globally. The team hadn’t adequately tested the code path with empty input and failed to implement proper error handling. I’ve seen similar issues where teams didn’t exercise the flag-protected code path in a test environment, so bugs only manifested in production. In other cases, the flag was accidentally left ON by default for production, leading to immediate issues upon deployment. The Google incident report also notes the problematic code “did not have appropriate error handling.” If the code within your feature flag assumes perfect conditions, it’s a ticking time bomb. These issues can be remedied by:
Default Off in Production: Ensure all new feature flags are disabled by default in production.
Comprehensive Testing: Test the feature with the flag ON and OFF. Crucially, test the specific conditions, data inputs, and configurations that trigger with the new code paths enabled by the flag.
Robust Error Handling: Implement proper error handling within the code controlled by the flag. It should fail gracefully or revert to a safe state if an unexpected issue occurs, not bring down the service.
Consider Testing Costs: If testing all combinations becomes prohibitively expensive or complex, it might indicate the feature is too large for a single flag and should be broken down.
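To make the “default off” and error-handling advice concrete, here is a hedged sketch of a flag lookup that degrades to a coded default when the flag backend fails. The FlagClient interface and names below are hypothetical, not any specific vendor SDK:

```go
package main

import (
    "errors"
    "fmt"
)

// FlagClient is a hypothetical flag-backend interface; real SDKs differ in
// shape, but all raise the same question: what do you return when the
// backend is unavailable?
type FlagClient interface {
    BoolFlag(key string) (bool, error)
}

// SafeFlags wraps a client so every lookup degrades to a coded default.
type SafeFlags struct {
    client FlagClient
}

// Bool returns the flag value, or the caller-supplied default if the
// backend errors. For new or risky features the default should be false.
func (s *SafeFlags) Bool(key string, def bool) bool {
    v, err := s.client.BoolFlag(key)
    if err != nil {
        return def // fail safe: never guess "on" when the backend is down
    }
    return v
}

// downClient simulates an unreachable flag service.
type downClient struct{}

func (downClient) BoolFlag(string) (bool, error) {
    return false, errors.New("flag service unreachable")
}

func main() {
    flags := &SafeFlags{client: downClient{}}
    // With the backend down, the coded default decides: keep new code OFF.
    fmt.Println(flags.Bool("new-quota-check", false)) // false
}
```

The important design choice is that the default lives at the call site, next to the code it guards, so each feature can pick its own safe state.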
Anti-Pattern 2: Inadequate Peer Review
This anti-pattern manifests when feature flag changes occur without a proper review process. It’s like making direct database changes in production without a change request. For example, Google’s issue was a policy metadata change rather than a direct flag toggle, where metadata replicated globally within seconds. It is analogous to flipping a critical global flag without due diligence. If that policy metadata change had been managed like a code change (e.g., via GitOps or Config-as-Code with a canary rollout), the issue might have been caught earlier. This can be remedied with:
GitOps/Config-as-Code: Manage feature flag configurations as code within your Git repository. This enforces PRs, peer reviews, and provides an auditable history.
Test Flag Rollback: As part of your process, ensure you can easily and reliably roll back a feature flag configuration change, just like you would with code.
Detect Configuration Drift: Ensure that the actual state in production does not drift from what’s expected or version-controlled.
Anti-Pattern 3: Inadequate Authorization and Auditing
This means not protecting enabling/disabling feature flags with proper permissions. Internally, if anyone can flip a production flag via a UI without a PR or a second pair of eyes, we’re exposed. Also, if there’s no clear record of who changed it, when, and why, incident response becomes a frantic scramble. Remedies include:
Strict Access Control: Implement strong Role-Based Access Control (or Relationship-Based Access Control) to limit who can modify flag states or configurations in production.
Comprehensive Auditing: Ensure your feature flagging system provides detailed audit logs for every change: who made the change, what was changed, and when.
Anti-Pattern 4: No Monitoring
Deploying a feature behind a flag and then flipping it on for everyone without closely monitoring its impact is like walking into a dark room and hoping you don’t trip. This can be remedied by actively monitoring feature flags and collecting metrics on your observability platform. This includes tracking not just the flag’s state (on/off) but also its real-time impact on key system metrics (error rates, latency, resource consumption) and relevant business KPIs.
Anti-Pattern 5: No Phased Rollout or Kill Switch
This means turning a new, complex feature on for 100% of users simultaneously with a flag. For example, during Google’s incident, major changes to quota management settings were propagated immediately, causing a global outage. The “red button” to disable the problematic serving path was crucial for their recovery. Remedies for this anti-pattern include:
Canary Releases & Phased Rollouts: Don’t enable features for everyone at once. Perform canary releases: enable for internal users, then a small percentage of production users while monitoring metrics.
“Red Button” Control: Have a clear “kill switch” or “red button” mechanism for quickly and globally disabling any problematic feature flag if issues arise.
Anti-Pattern 6: Thundering Herd
Enabling a feature flag can change traffic patterns for incoming requests. For example, Google didn’t implement randomized exponential backoff in Service Control, which caused a “thundering herd” on the underlying infrastructure. To prevent such issues, implement exponential backoff with jitter for request retries, combined with comprehensive monitoring.
Anti-Pattern 7: Misusing Flags for Config or Entitlements
Using feature flags as a general-purpose configuration management system or to manage complex user entitlements (e.g., free vs. premium tiers). For example, I’ve seen teams use feature flags to store API endpoints, timeout values, or rules about which customer tier gets which sub-feature. This means that your feature flag system becomes a de-facto distributed configuration database. This can be remedied with:
Purposeful Flags: Use feature flags primarily for controlling the lifecycle of discrete features: progressive rollout, A/B testing, kill switches.
Dedicated Systems: Use proper configuration management tools for application settings and robust entitlement systems for user permissions and plans.
Anti-Pattern 8: The “Zombie Flag” Infestation
Introducing feature flags but never removing them once a feature is fully rolled out or stable. I’ve seen codebases littered with if (isFeatureXEnabled) checks for features that have been live for years or were abandoned. This can be remedied with:
Lifecycle Management: Treat flags as having a defined lifespan.
Scheduled Cleanup: Regularly audit flags. Once a feature is 100% rolled out and stable (or definitively killed), schedule work to remove the flag and associated dead code.
Anti-Pattern 9: Ignoring Flagging Service Health
This means not considering how your application behaves if the feature flagging service itself experiences an outage or is unreachable. A crucial point in Google’s RCA was that their “Cloud Service Health infrastructure being down due to this outage” delayed communication. A colleague once pointed out: what happens if LaunchDarkly is down? This can be remedied with:
Safe Defaults in Code: When your code requests a flag’s state from the SDK (e.g., ldClient.variation("my-feature", user, false)), the provided default value is critical. For new or potentially risky features, this default must be the “safe” state (feature OFF).
SDK Resilience: Feature-flag SDKs are designed to cache flag values and fall back to them if the service is unreachable. But on a fresh app start, before any cache is populated, your coded defaults are your safety net.
Summary
Feature flags are incredibly valuable for modern software development. They empower teams to move faster and release with more confidence. But as the Google incident and my own experiences show, they require thoughtful implementation and ongoing discipline. By avoiding these anti-patterns – by testing thoroughly, using flags for their intended purpose, managing their lifecycle, governing changes, and planning for system failures – we can ensure feature flags remain a powerful asset.
Kubernetes has revolutionized how we deploy, scale, and manage applications in the cloud. I’ve been using Kubernetes for many years to build scalable, resilient, and maintainable services. However, Kubernetes was primarily designed for stateless applications – services that can scale horizontally. While such a shared-nothing architecture is a must-have for most modern microservices, it presents challenges for use cases such as:
Stateful/Singleton Processes: Applications that must run as a single instance across a cluster to avoid conflicts, race conditions, or data corruption. Examples include:
Legacy applications not designed for distributed operation
Batch processors that need exclusive access to resources
Job schedulers that must ensure jobs run exactly once
Applications with sequential ID generators
Active/Passive Disaster Recovery: High-availability setups where you need a primary instance running with hot standbys ready to take over instantly if the primary fails.
Traditional Kubernetes primitives like StatefulSets provide stable network identities and ordered deployment but don’t solve the “exactly-one-active” problem. DaemonSets ensure one pod per node, but don’t address the need for a single instance across the entire cluster. This gap led me to develop K8 Highlander – a solution that ensures “there can be only one” active instance of your workloads while maintaining high availability through automatic failover.
Architecture
K8 Highlander implements distributed leader election to ensure only one controller instance is active at any time, with others ready to take over if the leader fails. The name “Highlander” refers to the tagline from the 1980s movie & show: “There can be only one.”
Core Components
The system consists of several key components:
Leader Election: Uses distributed locking (via Redis or a database) to ensure only one controller is active at a time. The leader periodically renews its lock, and if it fails, another controller can acquire the lock and take over.
Workload Manager: Manages different types of workloads in Kubernetes, ensuring they’re running and healthy when this controller is the leader.
Monitoring Server: Provides real-time metrics and status information about the controller and its workloads.
HTTP Server: Serves a dashboard and API endpoints for monitoring and management.
How Leader Election Works
The leader election process follows these steps:
Each controller instance attempts to acquire a distributed lock with a TTL (Time-To-Live)
Only one instance succeeds and becomes the leader
The leader periodically renews its lock to maintain leadership
If the leader fails to renew (due to crash, network issues, etc.), the lock expires
Another instance acquires the lock and becomes the new leader
The new leader starts managing workloads
This approach ensures high availability while preventing split-brain scenarios where multiple instances might be active simultaneously.
Workload Types
K8 Highlander supports four types of workloads:
Process Workloads: Single-instance processes running in pods
CronJob Workloads: Scheduled tasks that run at specific intervals
Service Workloads: Continuously running services using Deployments
Persistent Workloads: Stateful applications with persistent storage using StatefulSets
Each workload type is managed to ensure exactly one instance is running across the cluster, with automatic recreation if terminated unexpectedly.
Deploying and Using K8 Highlander
Let me walk through how to deploy and use K8 Highlander for your singleton workloads.
Prerequisites
Kubernetes cluster (v1.16+)
Redis server or PostgreSQL database for leader state storage
kubectl configured to access your cluster
Installation Using Docker
The simplest way to install K8 Highlander is using the pre-built Docker image together with the deployment manifest from the repository. This deploys K8 Highlander with your configuration, ensuring high availability with multiple replicas while maintaining the singleton behavior for your workloads.
Using K8 Highlander Locally for Testing
You can also run K8 Highlander locally for testing.
Monitoring
K8 Highlander exposes Prometheus metrics at /metrics for monitoring and alerting:
# HELP k8_highlander_is_leader Indicates if this instance is currently the leader (1) or not (0)
# TYPE k8_highlander_is_leader gauge
k8_highlander_is_leader 1
# HELP k8_highlander_leadership_transitions_total Total number of leadership transitions
# TYPE k8_highlander_leadership_transitions_total counter
k8_highlander_leadership_transitions_total 1
# HELP k8_highlander_workload_status Status of managed workloads (1=active, 0=inactive)
# TYPE k8_highlander_workload_status gauge
k8_highlander_workload_status{name="data-processor",namespace="default",type="process"} 1
Key metrics include:
Leadership status and transitions
Workload health and status
Redis/database operations
Failover events and duration
System resource usage
Grafana Dashboard
A Grafana dashboard is available for visualizing K8 Highlander metrics. Import the dashboard from the dashboards directory in the repository.
Advanced Features
Multi-Tenant Support
K8 Highlander supports multi-tenant deployments, where different teams or environments can have their own isolated leader election and workload management:
# Tenant A configuration
id: "controller-1"
tenant: "tenant-a"
namespace: "tenant-a"
# Tenant B configuration
id: "controller-2"
tenant: "tenant-b"
namespace: "tenant-b"
Each tenant has its own leader election process, so one controller can be the leader for tenant A while another is the leader for tenant B.
Multi-Cluster Deployment
For disaster recovery scenarios, K8 Highlander can be deployed across multiple Kubernetes clusters with a shared Redis or database:
If the primary cluster fails, a controller in the secondary cluster can become the leader and take over workload management.
Summary
K8 Highlander fills a critical gap in Kubernetes’ capabilities by providing reliable singleton workload management with automatic failover. It’s ideal for:
Legacy applications that don’t support horizontal scaling
Processes that need exclusive access to resources
Scheduled jobs that should run exactly once
Active/passive high-availability setups
The solution ensures high availability without sacrificing the “exactly one active” constraint that many applications require. By handling the complexity of leader election and workload management, K8 Highlander allows you to run stateful workloads in Kubernetes with confidence.
Where to Go from Here
Check out the GitHub repository for the latest code and documentation
Read the API Reference for detailed endpoint information
K8 Highlander is an open-source project with MIT license, and contributions are welcome! Feel free to submit issues, feature requests, or pull requests to help improve the project.
I became seriously interested in computers after learning about microprocessor architecture during a special summer camp program at school that taught us, among other topics, how computer systems work. I didn’t have easy access to computers at my school, so I learned a bit more about programming on an Atari system. This hooked me into taking private lessons on programming languages, pursuing computer science in college, and building a career as a software developer spanning three decades.
My professional journey started mostly with mainframe systems and then I shifted more towards UNIX systems and then later to Linux environments. Along the way, I’ve witnessed entire technological ecosystems rise, thrive, and ultimately vanish like programming languages abandoned despite their elegance, operating systems forgotten despite their robustness, and frameworks discarded despite their innovation. I will dig through my personal experience with some of the archaic technologies that have largely disappeared or diminished in importance. I’ve deliberately omitted technologies I still use regularly to spotlight these digital artifacts. These extinct technologies shaped how we approach computing problems and contain the DNA of our current systems. They remind us that today’s indispensable technologies may someday join them in the digital graveyard.
Programming Languages
BASIC & GW-BASIC
I initially learned BASIC on an Atari system and later learned GW-BASIC, which introduced me to graphics programming on IBM XT computers running early DOS versions. The use of line numbers to organize program flow with GOTO and GOSUB statements seemed strange to me, but its simplicity helped me build programs with sounds and graphics. Eventually, I moved to Microsoft QuickBASIC, which added support for procedures and structured programming. This early taste of programming led me to pursue Computer Science in college. I sometimes worry about today’s beginners facing overwhelming complexity like networking, concurrency, and performance optimization just to build a simple web application. BASIC, on the other hand, was very accessible and rewarding for newcomers despite its limitations.
Pascal & Turbo Pascal
College introduced me to both C and Pascal through Borland’s Turbo compilers. I liked Pascal’s cleaner and more readable syntax compared to C. At the time, C offered the best performance, so Pascal never gained wide adoption and has largely disappeared from mainstream development. Interestingly, Turbo Pascal’s author, Anders Hejlsberg, went on to create C# and later TypeScript at Microsoft. This trajectory taught me that technical superiority alone doesn’t ensure survival.
FORTRAN
During a college internship at a physics laboratory, I learned FORTRAN running on massive DEC VAX/VMS systems, which were very popular in scientific computing at the time. While FORTRAN maintains a niche presence in scientific circles, DEC VAX/VMS systems have vanished entirely from the computing landscape. VMS systems were known as powerful, reliable, and stable computing environments, but DEC failed to adapt to the industry’s shift toward smaller, more cost-effective systems. The market ultimately embraced UNIX variants that offered comparable capabilities at lower price points with greater flexibility. This transition taught me an early lesson in how economic factors often trump technical superiority.
COBOL, CICS and Assembler
My professional career at a marketing firm began with COBOL, CICS, and Assembler on mainframes. JCL (Job Control Language) was used to submit the mainframe jobs and had unforgiving syntax where a misplaced comma could derail an entire batch job. I used COBOL for batch applications that primarily processed sequential ISAM files or the more advanced VSAM files with their B-Tree indexing for direct data access. These batch jobs often ran for hours or even days, creating long feedback cycles where a single error could cause cascading delays and missed deadlines.
I used CICS for building interactive applications with their distinctive green-screen terminals. I had to use BMS (Basic Mapping Support), a notoriously finicky language, for designing the 3270 terminal screen layouts. I built my own tool to convert plain text layouts into proper BMS syntax so that I didn’t have to debug syntax errors. The most challenging language I had to use was mainframe Assembler, reserved for performance-critical system components. These programs were monolithic workhorses: thousands of lines of code in a single routine, with custom macros simulating higher-level programming constructs. Thanks to the exponential performance improvements in modern hardware, most developers rarely need to descend to this level of programming.
Perl
I first learned Perl in college and embraced it throughout the 1990s as a versatile tool for both system administration and general-purpose programming. Its killer feature, regular expressions, made it indispensable for text processing tasks that would have been painfully complex in other languages. At a large credit verification company, I leveraged Perl’s pattern-matching to automate massive codebase migrations, transforming thousands of lines of code from one library to another. Later, at a major airline, I used similar techniques to upgrade legacy systems to newer WebLogic APIs without manual rewrites.
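The migration technique itself was simple: a set of regex rewrite rules applied mechanically across every source file. The original work used Perl; the sketch below shows the same idea in Python, with hypothetical old and new API names standing in for the real libraries:

```python
import re

# Hypothetical rewrite rules mapping an old logging API to a new one.
# Each rule pairs a compiled pattern with its replacement text.
RULES = [
    (re.compile(r"\bold_log\s*\("), "new_logger.info("),
    (re.compile(r"\bold_warn\s*\("), "new_logger.warning("),
]

def migrate(source: str) -> str:
    """Apply every rewrite rule to one file's source text."""
    for pattern, replacement in RULES:
        source = pattern.sub(replacement, source)
    return source

print(migrate('old_log("starting"); old_warn("low disk")'))
```

Run over thousands of files, a script like this replaces weeks of error-prone manual editing, which is exactly what made Perl’s pattern-matching so valuable at the time.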
In the web development arena, I used Perl to build early CGI applications, and it was a key component of the revolutionary LAMP stack (Linux, Apache, MySQL, Perl) before PHP and Python supplanted it. The CPAN repository was another groundbreaking innovation that allowed reusing shared libraries at scale. I used it along with the Mason web templating system at a large online retailer in the mid-2000s and then migrated some of those applications to Java, as the Perl-based systems were difficult to maintain. I had similar experiences with other Perl codebases and eventually moved to Python, which offered cleaner object-oriented design patterns and syntax. Perl’s cultural impact—from the camel book to CPAN—influenced an entire generation of programmers, myself included.
4th Generation Languages
Early in my career, Fourth Generation Languages (4GLs) promised a dramatic boost in productivity by providing a simple UI for managing data. On mainframe systems, I used Focus and SAS for data queries and analytics, creating reports and processing data with a few lines of code. For desktop applications, I used a variety of 4GL environments including dBase III/IV, FoxPro, Paradox, and Visual Basic. These tools were remarkable for their time, offering “query by example” interfaces that allowed you to quickly build database applications with minimal coding. However, as data volumes grew, the limitations of these systems became painfully apparent. Eventually, I transitioned to object-oriented languages paired with enterprise relational databases that offered better scalability and maintainability. Nevertheless, these tools represent an important evolutionary step that influenced modern RAD (Rapid Application Development) approaches and the low-code platforms that continue to evolve today.
Operating Systems
Mainframe Legacy
My career began at a marketing company working on IBM 360/390 mainframes running MVS (Multiple Virtual Storage). I used a combination of JCL, COBOL, CICS, and Assembler to build batch applications that processed millions of customer records. Working with JCL (Job Control Language) was particularly challenging due to its incredibly strict syntax where a single misplaced comma could cause an entire batch run to fail. The feedback cycle was painfully slow; submitting a job often meant waiting hours or even overnight for results. We had to use extensive “dry runs” of jobs to test the business logic —a precursor to what we now call unit testing. Despite these precautions, mistakes happened, and I witnessed firsthand how a simple programming error caused the company to mail catalogs to incorrect and duplicate addresses, costing millions in wasted printing and postage.
These systems also had their quirks: they used EBCDIC character encoding rather than the ASCII standard found in most other systems. They also stored data inefficiently—a contributing factor to the infamous Y2K crisis, as programs commonly stored years as two digits to save precious bytes of memory in an era when storage was extraordinarily expensive. Terminal response times were glacial by today’s standards—I often had to wait several seconds to see what I’d typed appear on screen. Yet despite their limitations, these mainframes offered remarkable reliability. While the UNIX systems I later worked with would frequently crash with core dumps (typically from memory errors in C programs), mainframe systems almost never went down. This stability, however, came partly from their simplicity—most applications were essentially glorified loops processing input files into output files without the complexity of modern systems.
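The two-digit-year problem was typically remediated with a “windowing” heuristic: expand stored years against a pivot rather than rewriting every record. A minimal sketch, with an illustrative pivot value (real systems chose the pivot per application):

```python
def expand_two_digit_year(yy: int, pivot: int = 70) -> int:
    """Expand a two-digit year using the windowing heuristic common
    in Y2K remediation: years below the pivot map to 20xx, the rest
    to 19xx. The pivot of 70 here is illustrative, not universal."""
    return 2000 + yy if yy < pivot else 1900 + yy

print(expand_two_digit_year(99))  # 1999
print(expand_two_digit_year(5))   # 2005
```

The heuristic only postpones ambiguity, of course: a record from 1965 and one from 2065 are indistinguishable once reduced to “65”, which is why windowing was a stopgap rather than a fix.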
UNIX Variants
Throughout my career, I worked extensively with numerous UNIX variants descended from both AT&T’s System V and UC Berkeley’s BSD lineages. At multiple companies, I deployed applications on Sun Microsystems hardware running SunOS (BSD-based) and later Solaris (System V-based). These systems, while expensive, provided the superior graphics capability, reliability, and performance needed for mission-critical applications. I used SGI’s IRIX operating system running on impressive graphical workstations when working at a large physics lab. These systems processed massive datasets from physics experiments by leveraging non-uniform memory access (NUMA) and symmetric multi-processing (SMP) architectures. IRIX was among the first mainstream 64-bit operating systems, pushing computational boundaries years before this became standard. SGI workstations also brought the visual effects of movies like Jurassic Park to life in 1993, which was amazing to watch. I also worked with IBM’s AIX on SP1/SP2 supercomputers at the physics lab, using Message Passing Interface (MPI) APIs to distribute processing across hundreds of nodes. This message-passing approach ultimately proved more scalable than shared-memory architectures, though modern systems incorporate both paradigms—today’s multi-core processors rely on the same NUMA/SMP concepts pioneered in these early UNIX variants.
On the downside, these systems were very expensive, and Moore’s Law enabled commodity PC hardware running Linux to achieve comparable performance at a fraction of the price. I saw many of those large systems replaced with farms of low-cost Linux-based PC clusters that reduced infrastructure costs drastically. I was deeply passionate about UNIX and even spent most of my savings in the early ’90s on a high-end PowerPC system, which was the result of a partnership between IBM, Motorola, Apple, and Sun. This machine could run multiple operating systems including Solaris and AIX, though I primarily used it for personal projects and learning.
DOS, OS/2, SCO and BeOS
For personal computing in the 1980s and early 1990s, I primarily used MS-DOS, even developing several shareware applications and games that I sold through bulletin board systems. DOS, with its command-line interface and conventional/expanded memory limitations, taught me valuable lessons about resource optimization that remain relevant even today. I preferred UNIX-like environments whenever possible, so I installed SCO UNIX (based on Microsoft’s Xenix) on my personal computer. SCO was initially respected in the industry before it transformed into a patent troll with controversial lawsuits against Linux distributors. I also liked OS/2, a technically superior operating system compared to Windows, with its support for true pre-emptive multitasking. But it lost to Windows due to Microsoft’s massive market power, much like other innovative competitors such as Borland, Novell, and Netscape.
Perhaps the most elegant of these alternative systems was BeOS, which I eagerly tested in the mid-1990s when it was released in beta. It featured a microkernel design and pervasive multithreading, and was a serious contender for Apple’s next-generation OS. However, Apple ultimately acquired NeXT instead, bringing Steve Jobs back and adopting NeXTSTEP as the foundation—another case where superior technology lost to business considerations and personal relationships.
Storage Media
My first PC had a modest 40MB hard drive, and I relied heavily on floppy disks in both 5.25-inch and later 3.5-inch formats. They took a long time to copy data, and their scratching sounds served as both progress indicators and early warning systems for impending failures. In professional environments, I worked with SCSI drives that offered better speed and reliability, generally in RAID configurations to protect against drive failures. For archiving and backup, I used tape drives that were also painfully slow but could store much more data. In the mid-1990s, I switched from floppy disks to Iomega’s Zip drives for personal backups, which could store up to 100MB compared to 1.44MB floppies. Similarly, I used CD-R and later CD-RW drives for storage, though they also had slow write speeds initially.
Network Protocols
In my early career, networking was fairly fragmented, and at work I generally used Novell’s proprietary IPX (Internetwork Packet Exchange) protocol on Novell NetWare networks. It provided solid support for file sharing and printing services. On mainframe systems, I worked with Token Ring networks that offered more reliable, deterministic performance. As the internet was based on TCP/IP, it eventually took over along with UNIX and Linux systems. For file sharing across these various systems, I relied on NFS (Network File System) in UNIX environments and later Samba to bridge the gap between UNIX and Windows systems that used the SMB (Server Message Block) protocol. Both solutions suffered performance problems rooted in file locking, and I spent countless hours troubleshooting the “stale file handles” and unexpected disconnections that plagued these early networked file systems.
Databases
My database journey began on mainframe systems with IBM’s VSAM (Virtual Storage Access Method), which wasn’t a true database but provided crucial B-Tree indexing for efficient file access. I also worked with IBM’s IMS, a hierarchical database that organized data in parent-child tree structures. Relational databases were truly revolutionary at the time, and I embraced systems like IBM DB2, Oracle, and Microsoft SQL Server. In college, I took a number of courses on relational database theory and appreciated its strong mathematical foundations. However, most relational databases were commercial and expensive, so I looked at open source projects like MiniSQL, but it lacked critical enterprise features like transaction support.
In the mid-1990s, I saw object-oriented databases gain popularity along with object-oriented programming; they promised to eliminate the “impedance mismatch” between object models and relational tables. I evaluated ObjectStore for some projects and ultimately deployed Versant to manage complex navigation data for traffic mapping systems—predecessors to today’s Google Maps services. These databases elegantly handled complex object relationships and inheritance hierarchies, but introduced their own challenges in querying, scaling, and integration with existing systems. Relational databases later absorbed object-oriented concepts like user-defined types, XML support, and JSON capabilities. Looking back, this taught me that systems built on strong theoretical foundations with incremental adaptation tend to outlast revolutionary approaches.
Security and Authentication
Early in my career, I worked as a UNIX system administrator and relied on world-readable /etc/passwd files for authentication, containing password hashes generated with the easily crackable crypt algorithm. For multi-system environments, I used NIS (Network Information Service) to centrally manage user accounts across server clusters. We also commonly used .rhosts files to allow password-less authentication between trusted systems. I later used Kerberos to provide stronger single sign-on capabilities for enterprise environments. When working at a large airline, I used Netegrity SiteMinder to implement single sign-on based access. While consulting for a manufacturing company, I built SSO implementations using LDAP and Microsoft Active Directory across heterogeneous systems. The Java ecosystem brought its own authentication frameworks, and I worked extensively with JAAS (Java Authentication and Authorization Service) and later Acegi Security before moving to SAML (Security Assertion Markup Language) and OAuth-based authentication standards.
Applications & Development Tools
Desktop Applications (Pre-Web)
My early word processing was done in WordStar with its cryptic Ctrl-key commands, before moving to WordPerfect, which offered better formatting control. For technical documentation, I relied on FrameMaker, which supported sophisticated layouts for complex documents. For spreadsheets, I initially used VisiCalc, the original “killer app” on the Apple II, and later Lotus 1-2-3, which popularized keyboard shortcuts that still exist in Excel today. When working for a marketing company, I used Lotus Notes, a collaboration tool that functioned as an email client, calendar, document management system, and application development platform. On UNIX workstations, I preferred text-based applications like elm and pine for email and the lynx text browser when accessing remote machines over telnet.
Chat & Communication Tools
On early UNIX systems at work, I used the simple ‘talk’ command to chat with other users on the system. At home during the pre-internet era, I immersed myself in the Bulletin Board System (BBS) culture. I also hosted my own BBS, learning firsthand about the challenges of building and maintaining online communities. I used CompuServe for access to group forums and Internet Relay Chat (IRC) through painfully slow dial-up and later SLIP/PPP connections. My fascination with IRC led me to develop my own client application, PlexIRC, which I distributed as shareware. As graphical interfaces took over, I adopted ICQ and Yahoo Messenger for personal communications. These platforms introduced status indicators, avatars, and file transfers that we now take for granted. While AOL Instant Messenger dominated the American market, I deliberately avoided the AOL ecosystem, preferring more open alternatives. My professional interest gravitated toward Jabber, which later evolved into the XMPP protocol standard with its federated approach to messaging—allowing different servers to communicate like email. I later implemented XMPP-based messaging solutions for several organizations, appreciating its extensible framework and standardized approach.
Development Environments
On UNIX systems, I briefly wrestled with ‘ed’—a line editor so primitive by today’s standards that its error message was simply a question mark. I quickly graduated to Vi, whose keyboard shortcuts became muscle memory that persists to this day through modern incarnations like Vim and NeoVim. In the DOS world, Borland Sidekick revolutionized my workflow as one of the first TSR (Terminate and Stay Resident) applications. With a quick keystroke, Sidekick would pop up a notepad, calculator, or calendar without exiting the primary application. For debugging and system maintenance, I used Norton Utilities, which provided essential tools like disk recovery, defragmentation, and a powerful hex editor that saved countless hours when troubleshooting low-level issues. I learned about the IDE (Integrated Development Environment) through Borland’s groundbreaking products like Turbo Pascal and Turbo C, which combined fast compilers with editing and debugging in a seamless package. These evolved into more sophisticated tools like Borland C++ with its application frameworks. For specialized work, I used Watcom C/C++ for its cross-platform capabilities and optimization features. As Java gained prominence, I adopted tools like JBuilder and Visual Cafe, which pioneered visual development for the platform. Eventually, I moved to Eclipse and later IntelliJ IDEA, alongside Visual Studio. Though I still enable Vi mode in these IDEs for its powerful editing capabilities without needing a mouse.
Web Technologies
I experienced the early internet ecosystem in college—navigating Gopher menus for document retrieval, searching with WAIS, and participating in Usenet newsgroups. Everything changed with the release of NCSA HTTPd server and the Mosaic browser. I used these revolutionary tools on Sun workstations in college and later at a high-energy physics laboratory on UNIX workstations. I left my cushy job to find web related projects and secured a consulting position at a financial institution building web access for credit card customers. I used C/C++ with CGI (Common Gateway Interface) to build dynamic web applications that connected legacy systems to this new interface. These early days of web development were like the Wild West—no established security practices, framework standards, or even consistent browser implementations existed. During a code review when working at a major credit card company, I discovered a shocking vulnerability: their web application stored usernames and passwords directly in cookies in plaintext, essentially exposing customer credentials to anyone with basic technical knowledge. These early web servers used a process-based concurrency model, spawning a new process for each request—inefficient by modern standards but there wasn’t much user traffic at the time. On the client side, I worked with the Netscape browser, while server implementations expanded to include Apache, Netscape Enterprise Server, and Microsoft’s IIS.
I also built my own Model-View-Controller architecture and templating system because there weren’t any established frameworks available. As Java gained traction, I migrated to JSP and the Struts framework, which formalized MVC patterns for web applications. This evolution continued as web servers evolved from process-based to thread-based concurrency models, and eventually to asynchronous I/O implementations in platforms like Nginx, dramatically improving scalability. Having witnessed the entire evolution—from hand-coded HTML to complex JavaScript frameworks—gives me a unique perspective on how rapidly this technology landscape has developed.
Distributed Systems Development
My journey with distributed systems began with Berkeley Sockets—the foundational API that enabled networked communication between applications. After briefly working with Sun’s RPC (Remote Procedure Call) APIs, I embraced Java’s Socket implementation and then its Remote Method Invocation (RMI) framework, which I used to implement remote services when working as a consultant for an enterprise client. RMI offered the revolutionary ability to invoke methods on remote objects as if they were local, handling network communication transparently and even dynamically loading remote classes. At a major travel booking company, I worked with Java’s JINI technology, which was inspired by the Linda coordination model and TupleSpaces that I also studied during my postgraduate research. JINI extended RMI with service discovery and leasing mechanisms, creating a more robust foundation for distributed applications. I later used GigaSpaces, which expanded the JavaSpaces concept into a full in-memory data grid for session storage.
For personal projects, I explored Voyager, a mobile agent platform that simplified remote object interaction with dynamic proxies and mobile object capabilities. Despite its technical elegance, Voyager never achieved widespread adoption—a pattern I would see repeatedly with technically superior but commercially unsuccessful distributed technologies. While contracting for Intelligent Traffic Systems in the Midwest during the late 1990s, I implemented CORBA-based solutions that collected real-time traffic data from roadway sensors and distributed it to news agencies via a publish-subscribe model. CORBA promised language-neutral interoperability through its Interface Definition Language (IDL), but reality fell short—applications typically worked reliably only when using components from the same vendor. I had to implement custom interceptors to add the authentication and authorization capabilities CORBA lacked natively. Nevertheless, CORBA’s explicit interface definitions via IDL influenced later technologies like gRPC that we still use today. The Java Enterprise (J2EE) era brought Enterprise JavaBeans and I implemented these technologies using BEA WebLogic for another state highway system, and continued working with them at various travel, airline, and fintech companies. EJB’s fatal flaw was attempting to abstract away the distinction between local and remote method calls—encouraging developers to treat distributed objects like local ones. This led to catastrophic performance problems as applications made thousands of network calls for operations that should have been local.
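The EJB performance trap described above can be illustrated with a toy proxy that counts network round trips; the class and method names below are invented for illustration, not an actual EJB API:

```python
class RemoteProxy:
    """Toy stand-in for a remote stub: in the real system, every
    fine-grained accessor call crossed the network."""

    def __init__(self):
        self.round_trips = 0

    def get_field(self, name):
        self.round_trips += 1  # one network round trip per field
        return f"value-of-{name}"

    def get_all_fields(self, names):
        self.round_trips += 1  # one bulk call returns everything
        return {n: f"value-of-{n}" for n in names}

# Treating the remote object like a local one: N round trips.
chatty = RemoteProxy()
for field in ["name", "email", "phone"]:
    chatty.get_field(field)

# A coarse-grained call fetches the same data in one round trip.
bulk = RemoteProxy()
bulk.get_all_fields(["name", "email", "phone"])

print(chatty.round_trips, bulk.round_trips)  # 3 1
```

With per-call latencies of tens of milliseconds, the chatty pattern multiplied across thousands of objects is exactly what produced the catastrophic slowdowns; coarse-grained interfaces (and later, data transfer objects) were the standard remedy.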
I read Rod Johnson’s influential critique of EJB that eventually evolved into the Spring Framework, offering a more practical approach to Java enterprise development. Around the same time, I transitioned to simpler XML-over-HTTP designs before the industry standardized on SOAP and WSDL. The subsequent explosion of WS-* specifications (WS-Security, WS-Addressing, etc.) created such complexity that the diagram of their interdependencies resembled the Death Star. I eventually abandoned SOAP’s complexity for JSON over HTTP, implementing long-polling and Server-Sent Events (SSE) for real-time applications before adopting the REST architectural style that dominates today’s API landscape. Throughout these transitions, I integrated various messaging systems including IBM WebSphere MQ, JMS implementations, TIBCO Rendezvous, and Apache ActiveMQ to provide asynchronous communication capabilities. This journey through distributed systems technologies reflects a recurring pattern: the industry oscillating between complexity and simplicity, between comprehensive frameworks and minimal viable approaches. The technologies that endured longest were those that acknowledged and respected the fundamental challenges of distributed computing—network unreliability, latency, and the fallacies of distributed computing—rather than attempting to hide them behind leaky abstractions.
Client & Mobile Development
Terminal & Desktop GUI
My journey developing client applications began with CICS on mainframe systems—creating those distinctive green-screen interfaces for 3270 terminals once ubiquitous in banking and government environments. The 4th generation tools era introduced me to dBase and Paradox, which I used to build database-driven applications through their “query by example” interfaces, which allowed rapid development of forms and reports without extensive coding. For personal projects, I developed numerous DOS applications, games, and shareware using Borland Turbo C. As Windows gained prominence, I transitioned to building GUI applications using Borland C++ with OWL (Object Windows Library) and later Microsoft Foundation Classes (MFC), which abstracted the complex Windows API into an object-oriented framework. While working for a credit protection company, I developed UNIX-based client applications using OSF/Motif. Motif’s widget system and resource files offered sophisticated UI capabilities, though with considerable implementation complexity.
Web Clients
The web revolution transformed client development fundamentally. I quickly adopted HTML for financial and government projects, creating browser-based interfaces that eliminated client-side installation requirements. For richer interactive experiences, I embedded Flash elements into web applications—creating animations and interactive components beyond HTML’s capabilities at the time. Java’s introduction brought the promise of “write once, run anywhere,” which I embraced through Java applets that you could embed like Flash widgets. Later, Java Web Start offered a bridge between web distribution and desktop application capabilities, allowing applications to be launched from browsers while running outside their security sandbox. Using Java’s AWT and later Swing libraries, I built standalone applications including IRC and email clients. The client-side JavaScript revolution, catalyzed by Google’s demonstration of AJAX techniques, fundamentally changed web application architecture. I experimented with successive generations of JavaScript libraries—Prototype.js for DOM manipulation, Script.aculo.us for animations, YUI for more component sets, etc.
Embedded and Mobile Development
As Java had its roots in embedded/TV systems, Sun introduced the Java Ring, a wearable device with an embedded microchip, which I used for some personal security applications. Though the Java Ring quickly disappeared from the market, its technological descendants like the iButton continued finding specialized applications in security and authentication systems. The mobile revolution began in earnest with the Palm Pilot—a breakthrough device featuring the innovative Graffiti handwriting recognition system that transformed how we interacted with portable computers. I embraced Palm development, creating applications for this pioneering platform and carrying a Palm device for years. As mobile technologies evolved, I explored the Wireless Application Protocol (WAP), which attempted to bring web content to the limited displays and bandwidth of early mobile phones but failed to gain widespread adoption. When Java introduced J2ME (Java 2 Micro Edition), I invested heavily in mastering this platform, attracted by its promise of cross-device compatibility across various feature phones. I developed applications targeting the constrained CLDC (Connected Limited Device Configuration) and MIDP (Mobile Information Device Profile) specifications.
The entire mobile landscape transformed dramatically when Apple introduced the first iPhone in 2007—a genuine paradigm shift that redefined our expectations for mobile devices. Recognizing this fundamental change, I learned iOS development using Objective-C with its message-passing syntax and manual memory management. This investment paid off when I developed an iOS application for a fintech company that significantly contributed to its acquisition by a larger trading firm. Early mobile development eerily mirrored my experiences with early desktop computing—working within severe hardware constraints that demanded careful resource management. Despite theoretical advances in programming abstractions, I found myself once again meticulously optimizing memory usage, minimizing disk operations, and carefully managing network bandwidth. This return to fundamental computing constraints reinforced my appreciation for efficiency-minded development practices that remain valuable even as hardware capabilities continue expanding.
Development Methodologies
My first corporate experience introduced me to Total Quality Management (TQM), with its focus on continuous improvement and customer satisfaction. This early exposure taught me a crucial lesson: methodology adoption depends more on organizational culture than on the framework itself. Despite new terminology and reorganized org charts, the company largely maintained its existing practices with superficial changes. Later, I worked with organizations implementing the Capability Maturity Model (CMM), which attempted to categorize development processes into five maturity levels. While this framework provided useful structure for improving chaotic environments, its documentation requirements and formal assessments often created bureaucratic overhead that impeded actual development. Similarly, the Rational Unified Process (RUP), which I used at several companies, offered comprehensive guidance but devolved into a waterfall model in many projects. The agile revolution emerged as a reaction against these heavyweight methodologies. I applied elements of Feature-Driven Development and Spiral methodologies when working at a major airline, focusing on iterative development and explicit risk management. I explored various agile approaches during this period—Crystal’s focus on team communication, Adaptive Software Development’s emphasis on change tolerance, and particularly Extreme Programming (XP), which introduced practices like test-driven development and pair programming that fundamentally changed how I approached code quality. Eventually, most organizations where I worked settled on customized implementations of Scrum and Kanban—frameworks that continue to dominate agile practice today.
Development Methodologies & Modeling
Earlier in my career, approaches like Rapid Application Development (RAD) and Joint Application Development (JAD) emphasized quick prototyping and intensive stakeholder workshops. These methodologies aligned with Computer-Aided Software Engineering (CASE) tools like Rational Rose and Visual Paradigm, which promised to transform software development through visual modeling and automated code generation. On larger projects, I spent months creating elaborate UML diagrams—use cases, class diagrams, sequence diagrams, and more. Some CASE tools I used could generate code frameworks from these models and even reverse-engineer models from existing code, promising a synchronized relationship between design and implementation. The reality proved disappointing; generated code was often rigid and difficult to maintain, while keeping models and code in sync became an exercise in frustration. The agile movement ultimately eclipsed both heavyweight methodologies and comprehensive CASE tools, emphasizing working software over comprehensive documentation.
DevOps Evolution
Version Control System
My introduction to version control came at a high-energy physics lab, where projects used primitive systems like RCS (Revision Control System) and SCCS (Source Code Control System). These early tools stored delta changes for each file and relied on exclusive locking mechanisms—only one developer could edit a file at a time. As development teams grew, most projects migrated to CVS (Concurrent Versions System), which built upon RCS foundations. CVS supported networked operations, allowing developers to commit changes from remote locations, and replaced exclusive locking with a more flexible concurrent model. However, CVS still operated at the file level rather than treating commits as project-wide transactions, leading to potential inconsistencies when only portions of related changes were committed. I continued using CVS for years until Subversion emerged as its logical successor. Subversion introduced atomic commits, ensuring that either all or none of a change would be committed. It also improved branching operations, directory management, and file metadata handling, addressing many of CVS’s limitations. While working at a travel company, I encountered AccuRev, which introduced the concept of “streams” instead of traditional branches. This approach modeled development as flowing through various stages. AccuRev proved particularly valuable for managing offshore development teams who needed to download large codebases over unreliable networks, as its sophisticated change management reduced bandwidth requirements.
During my time at a large online retailer in the mid-2000s, I worked with Perforce, a system optimized for large-scale development with massive codebases and binary assets. Perforce’s performance with large files and sophisticated security model made it ideal for enterprise environments. I briefly used Mercurial for some projects, appreciating its simplified interface compared to early Git versions, before ultimately transitioning to Git as it became the industry standard. This evolution of version control parallels the increasing complexity of software development itself: from single developers working on isolated files to globally distributed teams collaborating on massive codebases.
Build Systems
I have used Make throughout virtually my entire career across various platforms and languages. Its declarative approach to defining dependencies and build rules established patterns that influence build tools to this day. After adopting the Java ecosystem, I switched to Apache Ant, which used XML to define build tasks as an explicit sequence of operations. This offered greater flexibility and cross-platform consistency but at the cost of increasingly verbose build files as projects grew more complex. I used Ant extensively during Java’s enterprise ascendancy, customizing its tasks to handle deployment, testing, and reporting. I then adopted Maven, which introduced revolutionary concepts: a convention-over-configuration philosophy with standardized project structures, and dependency management connected to remote repositories that automatically resolved and downloaded required libraries. Despite Maven’s transformative nature, its rigid conventions and complex XML configuration were frustrating, and I later switched to Gradle. Gradle offered Maven’s dependency management with a Groovy-based DSL that provided both the structure of declarative builds and the flexibility of programmatic customization.
The build process expanded beyond compilation when I implemented Continuous Integration using CruiseControl, an early CI server developed by ThoughtWorks. This system automatically triggered builds on code changes, ran tests, and reported results. Later, I worked extensively with Hudson, which offered a more user-friendly interface and plugin architecture for extending CI capabilities. When Oracle acquired Sun and attempted to trademark the Hudson name, the community rallied behind a fork called Jenkins, which rapidly became the dominant CI platform. I used Jenkins for years, creating complex pipelines that automated testing, deployment, and release processes across multiple projects and environments. Eventually, I transitioned to cloud-based CI/CD platforms that integrated more seamlessly with hosted repositories and containerized deployments.
Summary
As I look back across my three decades in technology, these obsolete systems and abandoned platforms aren’t just nostalgic relics—they tell a powerful story about innovation, market forces, and the unpredictable nature of technological evolution. The technologies I’ve described throughout this blog didn’t disappear because they were fundamentally flawed. Pascal offered cleaner syntax than C, BeOS was more elegant than Windows, and CORBA attempted to solve distributed computing problems we still grapple with today. Borland’s superior development tools lost to Microsoft’s ecosystem advantages. Object-oriented databases, despite solving real problems, couldn’t overcome the momentum of relational systems. Yet these extinct technologies left lasting imprints on our industry. Anders Hejlsberg, who created Turbo Pascal, went on to shape C# and TypeScript. The clean design principles of BeOS influenced aspects of modern operating systems. Ideas don’t die—they evolve and find new expressions in subsequent generations of technology.
Perhaps the most valuable lesson is about technological adaptability. Throughout my career, the skills that have remained relevant weren’t tied to specific languages or platforms, but rather to fundamental concepts: understanding data structures, recognizing patterns in system design, and knowing when complexity serves a purpose versus when it becomes a hurdle. The industry’s constant reinvention ensures that many of today’s dominant technologies will eventually face their own extinction event. By understanding the patterns of the past, we gain insight into which current technologies might have staying power. This digital archaeology isn’t just about honoring what came before—it’s about understanding the cyclical nature of our industry and preparing for what comes next.
Distributed systems inherently involve multiple components such as services, databases, networks, etc., which are spread across different machines or locations. These systems are prone to partial failures, where one part of the system may fail while others remain operational. A common strategy for building fault-tolerant and resilient systems is to recover from transient failures by retrying failed operations. Here are some common use cases for implementing retries to maintain reliability in such environments:
Recover from Transient Failures such as network glitches, dropped packets, or temporary unavailability of services. These failures are often short-lived, and a simple retry may succeed without any changes to the underlying system.
Recover from Network Instability: packet loss, latency, congestion, or intermittent connectivity can disrupt communication between services.
Recover from Load Shedding or Throttling where services may experience momentary overloads and are unable to handle incoming requests.
Asynchronous Processing or Eventual Consistency models may take time to converge state across different nodes or services and operations might fail temporarily if the system is in an intermediate state.
Fault Isolation in microservices architectures, where services are loosely coupled but depend on one another. The downstream services may fail temporarily due to a service restart, deployment or scaling activities.
Service Downtime affects the availability of services, but client applications can use retries to recover from minor faults and maintain availability.
Load Balancing and Failover with redundant Zones/Regions, so that a request that fails in one zone or region can be handled by another healthy one.
Partial Failures where one part of the system fails while the rest remains functional.
Build System Resilience to allow the system to self-heal from minor disruptions.
Race Conditions or timing-related issues in concurrent systems can sometimes be resolved by retrying.
Challenges with Retries
Retries help in recovering from transient or partial failures by resending requests, but they can worsen system overloads if not managed carefully. Here are some challenges associated with retries:
Retry Storms: A retry storm occurs when multiple clients or services simultaneously retry failed requests to an overloaded or recovering service. This flood of retries can exacerbate the problem and can lead to performance degradation or a self-inflicted Denial of Service (DoS) attack.
Idempotency and Data Consistency: Some operations are not idempotent and performing them multiple times can lead to inconsistent or incorrect results (e.g., processing a financial transaction multiple times).
Cascading Failures: Retrying can propagate failures upstream or to dependent services. For instance, when a service fails, clients may retry excessively and overwhelm downstream services.
Latency Amplification: Retrying failed operations can increase end-to-end latency, as each retry adds a delay before successful resolution.
Amplified Resource Consumption: Retried operations consume additional CPU, memory, and bandwidth, potentially depleting resources at a faster rate. Even when services eventually succeed, the increased load from retries can harm the overall system.
Retry Loops or Infinite Retries: If a failed operation is retried continuously without ever succeeding, it can potentially lead to system crashes.
Thread and Connection Starvation: When a service invokes multiple operations and some fail, it may retry all operations, leading to increased overall request latency. If high timeouts are set, threads and connections remain occupied, blocking new traffic.
Unnecessary Retries on Non-Retryable Failures: Retrying certain types of failures, such as authorization errors or malformed requests, is unnecessary and wastes system resources.
Timeout Mismatch Between Services: If the timeout settings for retries between services are not aligned, a downstream service may still be processing a request while the upstream service retries or times out, which can result in conflicting states.
Considerations for Retries
Here are some key considerations and best practices for implementing more effective and safer retry mechanisms in distributed systems, enhancing resilience while safeguarding system stability during periods of stress or failure:
Timeouts: Implement timeouts to prevent clients from waiting indefinitely for a response and reduce resource exhaustion (e.g., memory or threads) caused by prolonged waiting. The challenge lies in selecting the appropriate timeout value: if set too high, resources are wasted; if set too low, it can trigger excessive retries, which increases the risk of outages. It’s recommended to set timeouts that are tightly aligned with performance expectations, ideally less than twice your maximum response time, to avoid thread starvation. Additionally, monitor for early warning signs by setting alarms when performance degrades (e.g., when P99 latency approaches 50% of the timeout value).
Timeout Budgeting: In complex distributed systems, timeout budgeting ensures that the total time taken by a request across multiple services doesn’t exceed an acceptable limit. Each downstream service gets a portion of the total timeout, so failure in one service doesn’t excessively delay the entire request chain.
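Timeout budgeting can be sketched by deriving each downstream call’s timeout from the time remaining in the overall request budget. This is a minimal Python sketch under stated assumptions: `call_with_budget` and its `timeout_s` parameter are illustrative names, not from any particular framework.

```python
import time

def call_with_budget(operations, total_budget_s):
    """Invoke each operation with the time remaining in the overall budget.

    `operations` is a list of callables that accept a `timeout_s` keyword;
    each call receives only the budget left, so no single slow dependency
    can consume more than the whole request is allowed to take.
    """
    deadline = time.monotonic() + total_budget_s
    results = []
    for op in operations:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Fail fast instead of starting a call we cannot afford.
            raise TimeoutError("request budget exhausted before calling " + op.__name__)
        results.append(op(timeout_s=remaining))
    return results
```

In a real service the `timeout_s` value would be passed down as the RPC deadline, so retries in one tier cannot blow past the caller’s overall limit.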
Exponential Backoff: Implement exponential backoff to spread out retry attempts by gradually increasing the delay between retries, reducing the risk of overwhelming a failing component and allowing time for recovery. It’s important to cap the backoff duration and limit the total number of retries. Without these limits, the system might continue retrying unnecessarily even after the underlying issue has been resolved.
Jitter: Adding randomness (jitter) to the backoff process helps prevent synchronized retries that could lead to overload spikes. Jitter spreads out traffic spikes and periodic tasks, avoiding large bursts of traffic at regular intervals and improving system stability.
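Exponential backoff with jitter fits in a few lines. The Python sketch below uses the common “full jitter” variant, where the delay is drawn uniformly between zero and the capped exponential bound; the `base` and `cap` values are illustrative defaults, not a recommendation for any specific system.

```python
import random

def backoff_with_full_jitter(attempt, base=0.1, cap=30.0):
    """Delay in seconds before retry number `attempt` (0-based).

    The exponential bound base * 2^attempt is capped at `cap`, and the
    actual delay is sampled uniformly from [0, bound] so that many
    clients retrying at once do not wake up in lockstep.
    """
    bound = min(cap, base * (2 ** attempt))
    return random.uniform(0, bound)
```

The cap matters as much as the jitter: without it, late attempts would sleep for minutes even after the underlying issue has been resolved.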
Idempotency: Operations that are retried must be idempotent, meaning they can be safely repeated without causing unintended side effects (e.g., double payments or duplicated data).
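A common way to make a non-idempotent operation safely retryable is an idempotency key: the client attaches a unique key to each logical request, and the server replays the stored result for a repeated key instead of re-executing the side effect. The following is a minimal in-memory Python sketch; the `PaymentService` class and its fields are hypothetical names for illustration only.

```python
class PaymentService:
    """Sketch of idempotency-key deduplication for a charge operation."""

    def __init__(self):
        self._results = {}  # idempotency key -> previously computed result
        self.charges = 0    # counts real side effects, for demonstration

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # A retry of a request we already processed: replay the result.
            return self._results[idempotency_key]
        self.charges += 1  # the real side effect happens exactly once per key
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result
```

A production version would persist the key-to-result mapping with an expiry, so replays survive restarts without the table growing forever.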
Retry Limits: Retries should be capped at a certain limit to avoid endlessly retrying a failing operation. Retries should stop beyond a certain number of attempts and the failure should be escalated or reported.
Throttling and Rate Limiting: Implement throttling or rate limiting to control the number of requests a service handles within a given time period. Rate limiting can be dynamic, adjusted based on current load or error rates, to avoid system overloads during traffic spikes. In addition, low-priority requests can be shed during high-load situations.
Error Categorization: Not all errors should trigger retries; use an allowlist of known retryable errors and retry only those. For example, a 400 Bad Request (indicating a permanent client error) due to invalid input should not be retried, while server-side or network-related errors such as a 500 Internal Server Error (a likely transient issue) can benefit from retrying.
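The allowlist approach, combined with a hard cap on attempts, can be sketched as below. This is an illustrative Python sketch: the status codes in `RETRYABLE_STATUS` and the `HTTPError` type are assumptions chosen for the example, not a prescribed policy.

```python
import time

RETRYABLE_STATUS = {429, 500, 502, 503, 504}  # illustrative transient statuses

class HTTPError(Exception):
    def __init__(self, status):
        super().__init__("HTTP %d" % status)
        self.status = status

def call_with_retries(op, max_attempts=3, delay_s=0.0):
    """Retry `op` only on allowlisted statuses, up to `max_attempts` tries.

    Non-retryable errors (e.g., 400) and exhausted attempts are escalated
    to the caller instead of being retried forever.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except HTTPError as e:
            if e.status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
                raise
            time.sleep(delay_s)  # a real client would use backoff with jitter
```

In practice `delay_s` would be replaced by an exponential backoff with jitter rather than a fixed pause.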
Targeting Failing Components Only: In a partial failure, not all parts of the system are down, and retries help isolate and recover from the failing components by retrying operations that specifically target the failed resource. For example, if a service depends on multiple microservices for an operation and one of the services fails, the system should retry the failed request without repeating the entire operation.
Intelligent and Adaptive Retries: Design retry logic to take the system’s current state into account, such as checking service health or load conditions before retrying. For example, increase retry intervals if multiple components are detected as failing, or retry quickly for timeout errors but back off more for connection errors. This prevents retries when the system is already known to be overloaded.
Retrying at Different Levels: Retries can be implemented at various levels to handle partial failures, such as the application level, middleware/proxy (load balancer or API gateway), and transport level (network). For example, a distributed system using a load balancer can detect that a specific instance of a service is failing and reroute traffic to a healthy instance, triggering retries only for the requests that targeted the failing instance.
Retry Amplification: In multi-tiered architectures, if retries are implemented at each level of nested service calls, it can lead to increased latency and exponentially higher traffic. To mitigate this, implement retries only at critical points in the call chain, and ensure that each service has a clear retry policy with limits. Use short timeouts to prevent thread starvation when calls to downstream services take too long. If too many threads hang, new traffic will be blocked.
Retry Budget: Implementing a global limit on the number of retries across all operations helps prevent system overload. For example, using an algorithm like Leaky Bucket can regulate the number of retries within a specified time period. This ensures that retries are distributed evenly and don’t exceed system capacity, preventing resource exhaustion during high failure rates.
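The text above mentions a Leaky Bucket; the closely related token-bucket variant is sketched below in Python. Each retry spends a token, tokens refill at a fixed rate up to a cap, and when the bucket is empty the failure surfaces immediately instead of being retried. Class and parameter names are illustrative.

```python
import time

class RetryBudget:
    """Token-bucket sketch of a process-wide retry budget."""

    def __init__(self, capacity=10, refill_per_s=1.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow_retry(self):
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # budget exhausted: skip the retry, surface the error
```

Because the budget is shared across all operations in the process, a burst of failures in one dependency cannot translate into an unbounded burst of retry traffic.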
Retries with Circuit Breakers: The circuit breaker pattern can be combined with retries to avoid overwhelming a failing component. When a service starts failing, the circuit breaker opens, temporarily halting requests to that service until it is healthy again. Retries can be configured to happen only after the circuit breaker transitions to a half-open state, which allows a limited number of retries to test if the service has recovered.
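The circuit-breaker interaction described above can be sketched as a small state machine: closed (normal), open (fail fast), and half-open (one trial call decides). This is a minimal Python sketch with illustrative thresholds, not a production implementation; real breakers typically use failure rates over a window rather than consecutive counts.

```python
import time

class CircuitBreaker:
    """Minimal closed/open/half-open circuit breaker sketch."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, op):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout_s:
                self.state = "half-open"  # allow one trial call through
            else:
                raise RuntimeError("circuit open; failing fast")
        try:
            result = op()
        except Exception:
            self.failures += 1
            # A half-open trial failure, or too many consecutive failures,
            # (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"
        return result
```

Retries would then be attempted only when `call` does not fail fast, which is exactly the half-open probing behavior described above.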
Retries with Failover Mechanisms: Retries can be designed with failover strategies where the system switches to a backup service, region, or replica in case of partial failure. If a service in one region fails, retries can redirect requests to a different region or zone to ensure availability.
Latency Sensitivity: Services with strict latency requirements might not tolerate long backoff periods or extended retries, so they should minimize the number of retries and cap backoff times.
Sync Calls: For synchronous calls, retry once immediately to handle temporary network issues, and avoid multiple retries or long sleeps between retries, both of which can lead to thread starvation. Also, a Circuit Breaker can be used to prevent retrying when a high percentage of calls fail.
Async Calls: Use exponential backoff with jitter for asynchronous operations and use Circuit Breakers to stop retries when failure rates are high. Asynchronous APIs can queue requests for later retries, but should incorporate health checks to ensure that retry attempts don’t add excessive load to downstream services during recovery periods.
Retrying on Overload Responses: Recognize overload indicators (e.g., HTTP 503 responses) and avoid retries when the response indicates overload.
Fail-Fast: Detect issues early and fail quickly rather than continuing to process failing requests or operations, to avoid wasting time on requests that are unlikely to succeed.
Graceful Degradation: Provide an alternative method of handling requests when a service fails. For example, if a primary service is down, a cached result or a simpler backup service can be used instead.
Downstream Bugs: Rather than implementing retry-based workarounds, prioritize having downstream service owners address and resolve the underlying issues.
Monitor and Analyze Retry Patterns: Implement monitoring for retry attempts and success rates, and analyze the data to gain insights into system behavior during failures. Use these insights to optimize retry strategies, such as adjusting backoff intervals and fine-tuning timeouts for improved system performance.
SLAs with Downstream Services: Establish clear service-level agreements (SLAs) with downstream services about call frequency, failure rates, and latency expectations.
Availability Over Consistency: Prioritize service availability over consistency where possible, especially during retries or failure handling. In such cases, retries might return stale data or cause inconsistency issues, so it’s crucial to align retry policies with system design.
Chaos Engineering: Chaos engineering involves intentionally injecting failures, such as server crashes or network disruptions, into a system to test its resilience under adverse conditions. By simulating real-world failures, teams can identify weaknesses and ensure that the retry policies are working as expected.
Bulkhead Pattern: The bulkhead pattern isolates different parts of a system to prevent a failure in one part from affecting the rest of the system. The bulkheads can be implemented by limiting the number of resources (threads, memory, connections) allocated to each service or subsystem so that if one service becomes overloaded or fails, it won’t exhaust resources that other services need.
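A thread-based bulkhead can be sketched with a counting semaphore: each dependency gets its own bounded pool of slots, so a slow or failing dependency cannot exhaust resources needed elsewhere. The Python sketch below rejects immediately when the bulkhead is full; the class name and rejection behavior are illustrative choices (queueing with a timeout is another common option).

```python
import threading

class Bulkhead:
    """Cap concurrent calls into one dependency with a semaphore."""

    def __init__(self, max_concurrent=5):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, op):
        # Non-blocking acquire: if all slots are busy, reject instead of
        # queueing, so callers fail fast rather than pile up.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full; rejecting call")
        try:
            return op()
        finally:
            self._sem.release()
```

Each downstream dependency would get its own `Bulkhead` instance, keeping their failure domains isolated.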
System Design: It’s essential to design APIs to minimize unnecessary communication with the server. For instance, in an event-driven architecture, if an event is missing a required attribute, the application might need to make additional requests to retrieve that data, increasing system load. To avoid this, ensure that events are fully populated with all necessary information upfront.
Summary
Retries are an essential mechanism for building fault-tolerant distributed systems and to recover from transient failures such as network issues, service unavailability, and partial system outages. A well-implemented retry strategy improves system resilience by ensuring that temporary failures don’t lead to full-blown outages. Techniques such as exponential backoff with jitter, idempotency, token buckets to limit retries locally, and circuit breakers help manage retries effectively, preventing issues like retry storms, resource exhaustion, and latency amplification.
However, retries need careful management because without proper limits, retries can overwhelm services that are already struggling or exacerbate issues like cascading failures and thread starvation. Incorporating timeouts, retry limits, and adaptive retry mechanisms based on system health can prevent these negative side effects. By analyzing retry patterns and adopting error-specific handling strategies, distributed systems can strike a balance between availability and resource efficiency, ensuring robust performance even in the face of partial failures.