Shahzad Bhatti Welcome to my ramblings and rants!

September 15, 2025

How Duplicate Detection Became the Dangerous Impostor of True Idempotency

Filed under: Computing,Design — admin @ 1:21 pm

A few years ago, I transferred funds from my bank to one of largest cryptocurrency exchange in US but I noticed my bank account was charged twice. The exchange’s support team was… unhelpful. “Our system shows one transaction,” they insisted. After a week of back-and-forth calls and escalations, they quietly reversed the duplicate charge. This wasn’t an isolated incident, I occasionally see duplicate charges on my credit cards activities. They usually get “fixed automatically in a few days,” but that’s not the point. These aren’t edge cases—they’re symptoms of a fundamental misunderstanding about what idempotency actually means.

Most companies don’t write post-mortems about idempotency failures because they rarely cause outages. Instead, they cause something worse: data corruption, duplicate transactions, and the slow erosion of customer trust. At one trading company, we had duplicate orders execute for millions of dollars during a trading session. We manually caught it and reversed the duplicates with weeks of cleanup, but it was a wake-up call about how dangerous these silent failures can be. The same system also had an aggressive request fingerprinting to prevent duplicates and occasionally it would reject legitimate rapid trades during volatile markets. At another investment firm, a cloud outage forced us to replay thousands of failed order messages. The replay worked as designed for the immediate failure, but it created duplicate entries in downstream systems. What should have been automatic recovery turned into days of manual data cleanup.

The problem isn’t that idempotency is hard to implement. It’s that most engineers fundamentally misunderstand what it means, conflating it with basic duplicate detection and implementing dangerous “check-then-act” patterns that create race conditions.

The Idempotency Illusion

Ask ten engineers to implement idempotency, and you’ll get eleven different implementations. True idempotency means that performing an operation multiple times has the same effect as performing it once, returning the exact same response every time. Not “detecting duplicates.” Not “rejecting retries.” The same effect with the same response—including status codes, headers, and body. Here’s the brutal truth: if your API returns 201 Created on the first call and 409 Conflict on the retry, it’s not idempotent. If it returns different response bodies for the same request, it’s not idempotent. And if two concurrent requests with the same idempotency key can both succeed, you don’t have idempotency—you have a race condition wearing a disguise.

The Twelve Deadly Anti-Patterns

Anti-Pattern 1: Server-Generated Idempotency Keys

This might be the most insidious anti-pattern because it seems logical at first glance:

# THIS IS FUNDAMENTALLY BROKEN - DON'T DO THIS
def create_order(request):
    # Generate key from request parameters
    idempotency_key = hash(f"{request.user_id}:{request.symbol}:{request.quantity}:{datetime.now().date()}")
    
    if cache.exists(idempotency_key):
        return cache.get(idempotency_key)
    
    # Process order...

This prevents legitimate duplicate business operations. A trader trying to buy 100 shares of AAPL twice in the same day gets blocked. At the trading firm I mentioned, they implemented time-windowed keys for “duplicate detection,” using a small time windows. During volatile markets, traders executing rapid legitimate trades were blocked because the system thought they were duplicates.

The fundamental issue: server-generated keys conflate “retry” with “duplicate business operation.” Only the client knows the difference.

The Fix: Idempotency keys MUST be client-generated, period.

// CORRECT: Client generates unique key per logical operation
const idempotencyKey = uuidv4();

// First attempt
await api.createOrder({ symbol: 'AAPL', qty: 100 }, { 
    headers: { 'Idempotency-Key': idempotencyKey }
});

// Network timeout - safe retry with SAME key
await api.createOrder({ symbol: 'AAPL', qty: 100 }, { 
    headers: { 'Idempotency-Key': idempotencyKey }
});

// New order - generates NEW key
const newKey = uuidv4();
await api.createOrder({ symbol: 'AAPL', qty: 100 }, { 
    headers: { 'Idempotency-Key': newKey }
});

Anti-Pattern 2: The “Check-Then-Act” Race Condition

This is the most common pattern I see in production codebases:

# THIS HAS A CRITICAL RACE CONDITION
def create_payment(request, idempotency_key):
    # Check if we've seen this key before
    existing = db.query("SELECT * FROM payments WHERE idempotency_key = ?", idempotency_key)
    if existing:
        return existing
    
    # RACE CONDITION: Another request can execute between check and insert!
    payment = process_payment(request)
    payment.idempotency_key = idempotency_key
    db.save(payment)
    return payment

Here’s exactly what happens in the race condition window:

10:00:01.100 - Request A checks: key not found ?
10:00:01.150 - Request B checks: key not found ?  
10:00:01.200 - Request A processes payment: $1000 charged
10:00:01.250 - Request B processes payment: $1000 charged AGAIN
10:00:01.300 - Request A saves key
10:00:01.350 - Request B saves key (overwrites A)

Customer sees: $2000 charged instead of $1000
System logs: Everything looks normal

The Fix: Use atomic operations or database transactions. The complete implementation is in src/lib.rs and src/sqlite_store.rs in my GitHub project.

Anti-Pattern 3: Not Handling Concurrent In-Progress Requests

// THIS DOESN'T HANDLE CONCURRENT REQUESTS PROPERLY
func HandleRequest(key string, req Request) Response {
    if cached := cache.Get(key); cached != nil {
        return cached
    }
    
    // What if another request with same key arrives NOW?
    result := processRequest(req)
    cache.Set(key, result)
    return result
}

When a request takes 5 seconds to process and a client retries after 2 seconds, both requests execute the business logic. This is exactly what happened in my duplicate payment scenarios.

The Fix: Return a specific status for in-progress requests:

match record.status {
    Status::Pending => {
        // Return 409 Conflict with Retry-After header
        return Err(ApiError::RequestInProgress { retry_after: 2 });
    }
    Status::Completed => {
        // Return the cached response
        return Ok(record.cached_response);
    }
}

See examples/axum_server.rs for a complete integration example.

Anti-Pattern 4: Optional Idempotency Keys

// THIS IS WRONG - Makes idempotency optional
message CreatePaymentRequest {
    optional string idempotency_key = 1;  // WRONG!
    required string amount = 2;
}

Making idempotency keys optional is like making seatbelts optional—technically possible, but you’ll regret it when things go wrong.

The Fix:

// CORRECT - Required for all mutating operations
message CreatePaymentRequest {
    required string idempotency_key = 1;  // Client MUST provide
    required string amount = 2;
}

For REST APIs, return 400 Bad Request if the Idempotency-Key header is missing on POST/PUT/PATCH requests.

Anti-Pattern 5: Not Preserving Original Failed Responses

// WRONG - Doesn't cache failures
if (result.isSuccess()) {
    cache.put(key, result);
    return result;
} else {
    // Not caching failures means retries might succeed!
    return result;
}

A validation error (400) on the first attempt might pass on retry if validation rules change or external state changes. This creates inconsistent behavior that’s impossible to debug.

The Fix: Cache deterministic failures:

  • Always cache: 2xx success responses and 4xx client errors
  • Never cache: 5xx server errors (allow retries)
  • Consider caching: Business logic failures like insufficient funds

Anti-Pattern 6: Using Non-ACID Storage for Idempotency Keys

Using eventually consistent stores like DynamoDB (without strong consistency) or Cassandra creates race conditions even with “correct” code:

Request 1 arrives ? Check key in DynamoDB ? Key not found (stale read)
Request 2 arrives ? Check key in DynamoDB ? Key not found (stale read)
Both requests process ? DUPLICATE TRANSACTION!

Amazon was one of the first major companies to adopt NoSQL at scale, using it for their shopping cart system. In the early days, I recall seeing items duplicate in my cart or mysteriously disappear and reappear. Amazon eventually solved this by moving to stronger consistency models for critical operations and implementing sophisticated conflict resolution.

Required Properties:

  • Strong Consistency: Reads must see all previously committed writes
  • Atomic Compare-and-Set: INSERT IF NOT EXISTS must be atomic
  • Transaction Support: Key insertion and business logic must be atomic

The Fix: Use ACID-compliant stores like PostgreSQL, MySQL, or Redis with Lua scripts. The src/sqlite_store.rs implementation shows the correct pattern.

Anti-Pattern 7: Orphaned “PENDING” States Without Recovery

When servers crash mid-processing, PENDING records become eternal blockers:

// Server inserts PENDING record
idempotencyStore.insert(key, "PENDING");
// SERVER CRASHES HERE
processPayment(); // Never executed
idempotencyStore.update(key, "COMPLETED"); // Never reached

This blocks all future retries indefinitely—a silent killer that’s hard to detect until customers complain.

The Fix: Implement timeout-based recovery:

if record.status == Status::Pending {
    if record.locked_until < now() {
        // Expired PENDING - safe to retry
        return Ok(LockResult::Acquired);
    } else {
        // Still processing
        return Ok(LockResult::InProgress { retry_after: 30 });
    }
}

Anti-Pattern 8: Missing Request Fingerprinting

Without request fingerprinting, a client bug can reuse a key with different payloads:

# Same key, different amounts - should be rejected!
create_payment(key="abc123", amount=100)  # First request
create_payment(key="abc123", amount=200)  # Bug: reused key with different amount

The server sees the cached key, assumes it’s a retry, and returns the first response ($100 charged) while the client thinks it charged $200.

The Fix: Generate and verify request fingerprints:

pub fn generate_fingerprint<T: Serialize>(request: &T) -> String {
    let json = serde_json::to_string(request).unwrap_or_default();
    let mut hasher = Sha256::new();
    hasher.update(json.as_bytes());
    format!("{:x}", hasher.finalize())
}

The complete implementation is in src/lib.rs.

Anti-Pattern 9: Ambiguous Infrastructure Failure Handling

When the idempotency store itself fails (network timeout, database down), services lack a consistent strategy:

// WRONG - Ambiguous error handling
if let Err(e) = IsIdempotentCreateTrade(...) {
    return err; // Is this a duplicate or a DB failure? Different handling needed!
}

The Fix: Always fail-closed for financial operations:

  • Return 503 Service Unavailable for infrastructure failures
  • Return 409 Conflict for duplicates
  • Include retry-after headers when appropriate

Anti-Pattern 10: Missing Transaction Rollback on Idempotency Save Failure

// BROKEN - Business logic succeeds but idempotency save fails
dbTransaction1.begin();
processPayment(); // SUCCESS
dbTransaction1.commit(); 

// Separate transaction for idempotency (WRONG!)
dbTransaction2.begin();
saveIdempotencyRecord(key, response); // FAILS!
dbTransaction2.commit(); 

// Now payment processed but not recorded as idempotent

The Fix: Everything in one transaction. See src/sqlite_store.rs for the atomic pattern.

Anti-Pattern 11: Insufficient Idempotency Windows

Purging idempotency records too quickly breaks realistic retry scenarios:

  • Mobile apps with poor connectivity might retry after 5 minutes
  • Batch jobs might retry failed records after 1 hour
  • Manual intervention might happen the next business day

The Fix: Follow Stripe’s 24-hour retention window. Balance storage costs with real-world retry patterns.

Anti-Pattern 12: No Correlation Between Related Idempotent Operations

Complex workflows require multiple idempotent operations, but there’s no way to track their relationship:

Create Order (key-1) ? Charge Payment (key-2) ? Allocate Inventory (key-3)

If step 2 fails, how do you retry the entire workflow without duplicating step 1?

The Fix: Implement workflow-level idempotency that tracks related operations and allows resumption from failure points.

The Correct Implementation: Following Stripe’s Pattern

After analyzing production failures across multiple companies, I built a complete implementation following Stripe’s proven patterns. The core insight is that idempotency requires atomic lock acquisition:

// From src/lib.rs - The correct atomic pattern
pub async fn process_request<Req, Res, F, Fut>(
    &self,
    idempotency_key: Option<String>,
    user_id: String,
    request_path: String,
    request_method: String,
    request: &Req,
    handler: F,
) -> Result<CachedResponse, IdempotencyError>
{
    let key = idempotency_key.ok_or(IdempotencyError::MissingIdempotencyKey)?;
    let fingerprint = Self::generate_fingerprint(request);
    
    // Step 1: Atomically try to acquire lock
    let lock_result = self.store.try_acquire_lock(record).await?;
    
    match lock_result {
        LockResult::Acquired => {
            // Execute business logic
            match handler().await {
                Ok((status_code, headers, response)) => {
                    // Atomically complete the request
                    self.store.complete_with_response(
                        &key, &user_id, final_status, Some(cached_response)
                    ).await?;
                    Ok(cached_response)
                }
                Err(e) => {
                    // Release lock to allow retry
                    self.store.release_lock_on_failure(&key, &user_id, true, None).await?;
                    Err(e)
                }
            }
        }
        LockResult::AlreadyCompleted(response) => Ok(response),
        LockResult::InProgress { retry_after } => {
            Err(IdempotencyError::RequestInProgress { retry_after })
        }
        LockResult::KeyReused => {
            Err(IdempotencyError::KeyReusedWithDifferentRequest)
        }
    }
}

The complete implementation includes:

The Core Implementation

Here’s the actual sample code that implements the atomic pattern correctly:

Core Middleware (src/lib.rs)

use async_trait::async_trait;
use chrono::{DateTime, Duration, Utc};
use serde::{Deserialize, Serialize};
use sha2::{Digest, Sha256};
use std::collections::HashMap;
use thiserror::Error;
use uuid::Uuid;

#[derive(Error, Debug)]
pub enum IdempotencyError {
    #[error("Request in progress (retry after {retry_after} seconds)")]
    RequestInProgress { retry_after: u64 },

    #[error("Idempotency key reused with different request")]
    KeyReusedWithDifferentRequest,

    #[error("Missing idempotency key")]
    MissingIdempotencyKey,

    #[error("Storage error: {0}")]
    StorageError(String),

    #[error("Invalid idempotency key format")]
    InvalidKeyFormat,

    #[error("Transaction failed: {0}")]
    TransactionFailed(String),

    #[error("Concurrent request conflict")]
    ConcurrentRequestConflict,

    #[error("Handler execution failed: {0}")]
    HandlerFailed(String),
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub enum IdempotencyStatus {
    Pending,
    Completed,
    Failed { is_retryable: bool },
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct IdempotencyRecord {
    pub key: String,
    pub user_id: String,  // Scope keys to user/tenant
    pub request_path: String,
    pub request_method: String,
    pub request_fingerprint: String,
    pub status: IdempotencyStatus,
    pub response: Option<CachedResponse>,
    pub created_at: DateTime<Utc>,
    pub expires_at: DateTime<Utc>,
    pub locked_until: Option<DateTime<Utc>>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CachedResponse {
    pub status_code: u16,
    pub headers: HashMap<String, String>,
    pub body: Vec<u8>,
}

/// Result of attempting to acquire an idempotency lock
#[derive(Debug)]
pub enum LockResult {
    /// Lock acquired successfully, safe to proceed
    Acquired,
    /// Request already completed, return cached response
    AlreadyCompleted(CachedResponse),
    /// Request is currently being processed by another worker
    InProgress { retry_after: u64 },
    /// Key reused with different request payload
    KeyReused,
    /// Failed permanently, return cached error response
    FailedPermanently(CachedResponse),
}

/// Trait for idempotency storage backends
#[async_trait]
pub trait IdempotencyStore: Send + Sync {
    /// Atomically attempt to acquire a lock for processing
    /// This must be an atomic operation that either:
    /// 1. Creates a new PENDING record and returns Acquired
    /// 2. Returns the current state if record exists
    async fn try_acquire_lock(
        &self,
        record: IdempotencyRecord,
    ) -> Result<LockResult, IdempotencyError>;

    /// Atomically update record with final result and release lock
    /// This must happen in a single transaction with business logic
    async fn complete_with_response(
        &self,
        key: &str,
        user_id: &str,
        status: IdempotencyStatus,
        response: Option<CachedResponse>,
    ) -> Result<(), IdempotencyError>;

    /// Atomically release lock on failure (for retryable errors)
    async fn release_lock_on_failure(
        &self,
        key: &str,
        user_id: &str,
        is_retryable: bool,
        response: Option<CachedResponse>,
    ) -> Result<(), IdempotencyError>;

    /// Get a record by key and user_id (for debugging/monitoring)
    async fn get(
        &self,
        key: &str,
        user_id: &str,
    ) -> Result<Option<IdempotencyRecord>, IdempotencyError>;

    /// Delete expired records (maintenance operation)
    async fn cleanup_expired(&self) -> Result<usize, IdempotencyError>;

    /// Execute within a transaction (for stores that support it)
    async fn execute_in_transaction<F, T>(&self, f: F) -> Result<T, IdempotencyError>
    where
        F: FnOnce() -> std::pin::Pin<Box<dyn std::future::Future<Output = Result<T, IdempotencyError>> + Send>> + Send,
        T: Send;
}

/// Main idempotency middleware
#[derive(Clone)]
pub struct IdempotencyMiddleware<S: IdempotencyStore + Clone> {
    store: S,
    ttl: Duration,
    lock_timeout: Duration,
}

impl<S: IdempotencyStore + Clone> IdempotencyMiddleware<S> {
    pub fn new(store: S) -> Self {
        Self {
            store,
            ttl: Duration::hours(24),  // Stripe's 24-hour retention
            lock_timeout: Duration::seconds(30),  // Max time to hold lock
        }
    }

    pub fn with_config(store: S, ttl: Duration, lock_timeout: Duration) -> Self {
        Self {
            store,
            ttl,
            lock_timeout,
        }
    }

    /// Get access to the underlying store (for testing)
    #[cfg(test)]
    pub fn store(&self) -> &S {
        &self.store
    }

    /// Validate idempotency key format (UUID v4)
    fn validate_key(key: &str) -> Result<(), IdempotencyError> {
        Uuid::parse_str(key)
            .map_err(|_| IdempotencyError::InvalidKeyFormat)?;
        Ok(())
    }

    /// Generate request fingerprint using SHA-256
    pub fn generate_fingerprint<T: Serialize>(request: &T) -> String {
        let json = serde_json::to_string(request).unwrap_or_default();
        let mut hasher = Sha256::new();
        hasher.update(json.as_bytes());
        format!("{:x}", hasher.finalize())
    }

    /// Process a request with idempotency guarantees
    /// This implements the correct atomic pattern to avoid all race conditions
    pub async fn process_request<Req, Res, F, Fut>(
        &self,
        idempotency_key: Option<String>,
        user_id: String,
        request_path: String,
        request_method: String,
        request: &Req,
        handler: F,
    ) -> Result<CachedResponse, IdempotencyError>
    where
        Req: Serialize,
        Res: Serialize,
        F: FnOnce() -> Fut,
        Fut: std::future::Future<Output = Result<(u16, HashMap<String, String>, Res), IdempotencyError>>,
        S: Clone,
    {
        // Require idempotency key for mutating operations
        let key = idempotency_key
            .ok_or(IdempotencyError::MissingIdempotencyKey)?;

        Self::validate_key(&key)?;

        let fingerprint = Self::generate_fingerprint(request);
        let now = Utc::now();

        // Create the record we want to insert
        let record = IdempotencyRecord {
            key: key.clone(),
            user_id: user_id.clone(),
            request_path: request_path.clone(),
            request_method: request_method.clone(),
            request_fingerprint: fingerprint.clone(),
            status: IdempotencyStatus::Pending,
            response: None,
            created_at: now,
            expires_at: now + self.ttl,
            locked_until: Some(now + self.lock_timeout),
        };

        // Step 1: Atomically try to acquire lock
        let lock_result = self.store.try_acquire_lock(record).await?;

        match lock_result {
            LockResult::Acquired => {
                // We got the lock - safe to proceed with business logic
                tracing::debug!("Lock acquired for key: {}", key);
                
                // Execute business logic
                match handler().await {
                    Ok((status_code, headers, response)) => {
                        // Success - cache the response
                        let response_body = serde_json::to_vec(&response)
                            .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

                        let cached_response = CachedResponse {
                            status_code,
                            headers,
                            body: response_body,
                        };

                        // Determine final status based on HTTP status code
                        let final_status = if status_code >= 500 {
                            IdempotencyStatus::Failed { is_retryable: true }
                        } else if status_code >= 400 {
                            IdempotencyStatus::Failed { is_retryable: false }
                        } else {
                            IdempotencyStatus::Completed
                        };

                        // Atomically complete the request
                        self.store.complete_with_response(
                            &key,
                            &user_id,
                            final_status,
                            Some(cached_response.clone()),
                        ).await?;

                        tracing::debug!("Request completed successfully for key: {}", key);
                        Ok(cached_response)
                    }
                    Err(e) => {
                        // Handler failed - determine if retryable
                        let is_retryable = match &e {
                            IdempotencyError::StorageError(_) => true,
                            IdempotencyError::TransactionFailed(_) => true,
                            IdempotencyError::HandlerFailed(_) => true,
                            _ => false,
                        };

                        // Release lock to allow retry
                        self.store.release_lock_on_failure(
                            &key,
                            &user_id,
                            is_retryable,
                            None, // No response to cache for errors
                        ).await?;

                        tracing::warn!("Handler failed for key: {} - error: {}", key, e);
                        Err(e)
                    }
                }
            }
            LockResult::AlreadyCompleted(response) => {
                // Request was already processed successfully
                tracing::debug!("Returning cached response for key: {}", key);
                Ok(response)
            }
            LockResult::InProgress { retry_after } => {
                // Another request is currently processing this key
                tracing::debug!("Request in progress for key: {}, retry after: {}s", key, retry_after);
                Err(IdempotencyError::RequestInProgress { retry_after })
            }
            LockResult::KeyReused => {
                // Key was reused with different request payload
                tracing::warn!("Key reused with different request for key: {}", key);
                Err(IdempotencyError::KeyReusedWithDifferentRequest)
            }
            LockResult::FailedPermanently(response) => {
                // Request failed permanently, return cached error
                tracing::debug!("Returning cached permanent failure for key: {}", key);
                Ok(response)
            }
        }
    }
}

// Storage implementations
pub mod sqlite_store;

#[cfg(feature = "axum-integration")]
pub mod axum_integration;

#[cfg(feature = "grpc")]
pub mod grpc_integration;

// Re-export for convenience
pub use sqlite_store::SqliteIdempotencyStore;

#[cfg(test)]
mod tests;

Storage Backend (src/sqlite_store.rs)

use crate::{
    IdempotencyError, IdempotencyRecord, IdempotencyStatus, IdempotencyStore, 
    CachedResponse, LockResult
};
use async_trait::async_trait;
use chrono::{DateTime, Utc};
use sqlx::{Pool, Sqlite, SqlitePool, Row};
use std::sync::Arc;
use tokio::sync::Mutex;

#[derive(Clone)]
pub struct SqliteIdempotencyStore {
    pool: Pool<Sqlite>,
    // In-memory lock for the entire store to ensure atomicity
    transaction_lock: Arc<Mutex<()>>,
}

impl SqliteIdempotencyStore {
    pub async fn new(database_url: &str) -> Result<Self, IdempotencyError> {
        let pool = SqlitePool::connect(database_url)
            .await
            .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

        // Create tables with proper indexes
        sqlx::query(
            r#"
            CREATE TABLE IF NOT EXISTS idempotency_records (
                key TEXT NOT NULL,
                user_id TEXT NOT NULL,
                request_path TEXT NOT NULL,
                request_method TEXT NOT NULL,
                request_fingerprint TEXT NOT NULL,
                status TEXT NOT NULL,
                response_status_code INTEGER,
                response_headers TEXT,
                response_body BLOB,
                created_at TEXT NOT NULL,
                expires_at TEXT NOT NULL,
                locked_until TEXT,
                PRIMARY KEY (key, user_id)
            );

            CREATE INDEX IF NOT EXISTS idx_expires_at ON idempotency_records(expires_at);
            CREATE INDEX IF NOT EXISTS idx_user_id ON idempotency_records(user_id);
            CREATE INDEX IF NOT EXISTS idx_locked_until ON idempotency_records(locked_until);
            "#
        )
        .execute(&pool)
        .await
        .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

        Ok(Self { 
            pool,
            transaction_lock: Arc::new(Mutex::new(())),
        })
    }

    fn serialize_status(status: &IdempotencyStatus) -> String {
        match status {
            IdempotencyStatus::Pending => "pending".to_string(),
            IdempotencyStatus::Completed => "completed".to_string(),
            IdempotencyStatus::Failed { is_retryable } => {
                format!("failed:{}", if *is_retryable { "retryable" } else { "permanent" })
            }
        }
    }

    fn deserialize_status(status: &str) -> IdempotencyStatus {
        match status {
            "pending" => IdempotencyStatus::Pending,
            "completed" => IdempotencyStatus::Completed,
            "failed:retryable" => IdempotencyStatus::Failed { is_retryable: true },
            "failed:permanent" => IdempotencyStatus::Failed { is_retryable: false },
            _ => IdempotencyStatus::Pending,
        }
    }

    async fn record_from_row(row: &sqlx::sqlite::SqliteRow) -> Result<IdempotencyRecord, IdempotencyError> {
        let status_str: String = row.try_get("status")
            .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;
        let status = Self::deserialize_status(&status_str);

        let response = if let Some(status_code) = row.try_get::<Option<i32>, _>("response_status_code")
            .map_err(|e| IdempotencyError::StorageError(e.to_string()))? 
        {
            let headers_json: Option<String> = row.try_get("response_headers")
                .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;
            let headers = headers_json
                .and_then(|h| serde_json::from_str(&h).ok())
                .unwrap_or_default();

            let body: Option<Vec<u8>> = row.try_get("response_body")
                .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

            Some(CachedResponse {
                status_code: status_code as u16,
                headers,
                body: body.unwrap_or_default(),
            })
        } else {
            None
        };

        Ok(IdempotencyRecord {
            key: row.try_get("key").map_err(|e| IdempotencyError::StorageError(e.to_string()))?,
            user_id: row.try_get("user_id").map_err(|e| IdempotencyError::StorageError(e.to_string()))?,
            request_path: row.try_get("request_path").map_err(|e| IdempotencyError::StorageError(e.to_string()))?,
            request_method: row.try_get("request_method").map_err(|e| IdempotencyError::StorageError(e.to_string()))?,
            request_fingerprint: row.try_get("request_fingerprint").map_err(|e| IdempotencyError::StorageError(e.to_string()))?,
            status,
            response,
            created_at: {
                let dt_str: String = row.try_get("created_at").map_err(|e| IdempotencyError::StorageError(e.to_string()))?;
                DateTime::parse_from_rfc3339(&dt_str)
                    .map_err(|e| IdempotencyError::StorageError(e.to_string()))?
                    .with_timezone(&Utc)
            },
            expires_at: {
                let dt_str: String = row.try_get("expires_at").map_err(|e| IdempotencyError::StorageError(e.to_string()))?;
                DateTime::parse_from_rfc3339(&dt_str)
                    .map_err(|e| IdempotencyError::StorageError(e.to_string()))?
                    .with_timezone(&Utc)
            },
            locked_until: {
                let dt_str: Option<String> = row.try_get("locked_until").map_err(|e| IdempotencyError::StorageError(e.to_string()))?;
                dt_str
                    .and_then(|s| DateTime::parse_from_rfc3339(&s).ok())
                    .map(|dt| dt.with_timezone(&Utc))
            },
        })
    }
}

#[async_trait]
impl IdempotencyStore for SqliteIdempotencyStore {
    /// Atomically attempt to acquire a lock for processing
    async fn try_acquire_lock(
        &self,
        record: IdempotencyRecord,
    ) -> Result<LockResult, IdempotencyError> {
        // Use a global lock to ensure atomicity (in production, rely on DB transactions)
        let _lock = self.transaction_lock.lock().await;
        
        let mut tx = self.pool.begin()
            .await
            .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

        let now = Utc::now();

        // First, check if record exists
        let existing_row = sqlx::query(
            r#"
            SELECT key, user_id, request_path, request_method,
                   request_fingerprint, status, response_status_code,
                   response_headers, response_body, created_at,
                   expires_at, locked_until
            FROM idempotency_records
            WHERE key = ? AND user_id = ?
            "#
        )
        .bind(&record.key)
        .bind(&record.user_id)
        .fetch_optional(&mut *tx)
        .await
        .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

        let result = if let Some(row) = existing_row {
            let existing = Self::record_from_row(&row).await?;
            
            // Check fingerprint match
            if existing.request_fingerprint != record.request_fingerprint {
                Ok(LockResult::KeyReused)
            } else {
                // Check current status and lock
                match existing.status {
                    IdempotencyStatus::Completed => {
                        if let Some(response) = existing.response {
                            Ok(LockResult::AlreadyCompleted(response))
                        } else {
                            // If completed but no response, need to reprocess
                            // Update existing record to pending with new lock
                            sqlx::query(
                                r#"
                                UPDATE idempotency_records
                                SET status = ?, locked_until = ?, created_at = ?
                                WHERE key = ? AND user_id = ?
                                "#
                            )
                            .bind(Self::serialize_status(&IdempotencyStatus::Pending))
                            .bind(record.locked_until.map(|dt| dt.to_rfc3339()))
                            .bind(record.created_at.to_rfc3339())
                            .bind(&record.key)
                            .bind(&record.user_id)
                            .execute(&mut *tx)
                            .await
                            .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

                            Ok(LockResult::Acquired)
                        }
                    }
                    IdempotencyStatus::Failed { is_retryable: false } => {
                        if let Some(response) = existing.response {
                            Ok(LockResult::FailedPermanently(response))
                        } else {
                            // If failed but no response, need to reprocess
                            // Update existing record to pending with new lock
                            sqlx::query(
                                r#"
                                UPDATE idempotency_records
                                SET status = ?, locked_until = ?, created_at = ?
                                WHERE key = ? AND user_id = ?
                                "#
                            )
                            .bind(Self::serialize_status(&IdempotencyStatus::Pending))
                            .bind(record.locked_until.map(|dt| dt.to_rfc3339()))
                            .bind(record.created_at.to_rfc3339())
                            .bind(&record.key)
                            .bind(&record.user_id)
                            .execute(&mut *tx)
                            .await
                            .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

                            Ok(LockResult::Acquired)
                        }
                    }
                    IdempotencyStatus::Failed { is_retryable: true } => {
                        // Allow retry for retryable failures
                        // Update existing record to pending with new lock
                        sqlx::query(
                            r#"
                            UPDATE idempotency_records
                            SET status = ?, locked_until = ?, created_at = ?
                            WHERE key = ? AND user_id = ?
                            "#
                        )
                        .bind(Self::serialize_status(&IdempotencyStatus::Pending))
                        .bind(record.locked_until.map(|dt| dt.to_rfc3339()))
                        .bind(record.created_at.to_rfc3339())
                        .bind(&record.key)
                        .bind(&record.user_id)
                        .execute(&mut *tx)
                        .await
                        .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

                        Ok(LockResult::Acquired)
                    }
                    IdempotencyStatus::Pending => {
                        // Check if lock is still active
                        if let Some(locked_until) = existing.locked_until {
                            if locked_until > now {
                                let retry_after = (locked_until - now).num_seconds() as u64;
                                Ok(LockResult::InProgress { retry_after })
                            } else {
                                // Lock expired, allow reprocessing
                                // Update existing record to pending with new lock
                                sqlx::query(
                                    r#"
                                    UPDATE idempotency_records
                                    SET status = ?, locked_until = ?, created_at = ?
                                    WHERE key = ? AND user_id = ?
                                    "#
                                )
                                .bind(Self::serialize_status(&IdempotencyStatus::Pending))
                                .bind(record.locked_until.map(|dt| dt.to_rfc3339()))
                                .bind(record.created_at.to_rfc3339())
                                .bind(&record.key)
                                .bind(&record.user_id)
                                .execute(&mut *tx)
                                .await
                                .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

                                Ok(LockResult::Acquired)
                            }
                        } else {
                            // No lock timeout, allow reprocessing
                            // Update existing record to pending with new lock
                            sqlx::query(
                                r#"
                                UPDATE idempotency_records
                                SET status = ?, locked_until = ?, created_at = ?
                                WHERE key = ? AND user_id = ?
                                "#
                            )
                            .bind(Self::serialize_status(&IdempotencyStatus::Pending))
                            .bind(record.locked_until.map(|dt| dt.to_rfc3339()))
                            .bind(record.created_at.to_rfc3339())
                            .bind(&record.key)
                            .bind(&record.user_id)
                            .execute(&mut *tx)
                            .await
                            .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

                            Ok(LockResult::Acquired)
                        }
                    }
                }
            }
        } else {
            // Insert new record
            let status = Self::serialize_status(&record.status);
            let headers_json = record.response.as_ref()
                .map(|r| serde_json::to_string(&r.headers).unwrap_or_default());

            sqlx::query(
                r#"
                INSERT INTO idempotency_records (
                    key, user_id, request_path, request_method,
                    request_fingerprint, status, response_status_code,
                    response_headers, response_body, created_at,
                    expires_at, locked_until
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                "#
            )
            .bind(&record.key)
            .bind(&record.user_id)
            .bind(&record.request_path)
            .bind(&record.request_method)
            .bind(&record.request_fingerprint)
            .bind(status)
            .bind(record.response.as_ref().map(|r| r.status_code as i32))
            .bind(headers_json)
            .bind(record.response.as_ref().map(|r| r.body.clone()))
            .bind(record.created_at.to_rfc3339())
            .bind(record.expires_at.to_rfc3339())
            .bind(record.locked_until.map(|dt| dt.to_rfc3339()))
            .execute(&mut *tx)
            .await
            .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

            Ok(LockResult::Acquired)
        };

        // Handle transaction based on result
        match &result {
            Ok(LockResult::KeyReused) | Ok(LockResult::InProgress { .. }) => {
                // These cases don't modify the database, rollback to be safe
                tx.rollback().await.map_err(|e| IdempotencyError::StorageError(e.to_string()))?;
            }
            Ok(LockResult::AlreadyCompleted(_)) | Ok(LockResult::FailedPermanently(_)) => {
                // These cases just read data, rollback to be safe
                tx.rollback().await.map_err(|e| IdempotencyError::StorageError(e.to_string()))?;
            }
            Ok(LockResult::Acquired) => {
                // Successfully acquired lock, commit the changes
                tx.commit().await.map_err(|e| IdempotencyError::StorageError(e.to_string()))?;
            }
            Err(_) => {
                // Error occurred, rollback
                tx.rollback().await.map_err(|e| IdempotencyError::StorageError(e.to_string()))?;
            }
        }

        result
    }

    /// Atomically update record with final result and release lock
    async fn complete_with_response(
        &self,
        key: &str,
        user_id: &str,
        status: IdempotencyStatus,
        response: Option<CachedResponse>,
    ) -> Result<(), IdempotencyError> {
        let _lock = self.transaction_lock.lock().await;
        
        let mut tx = self.pool.begin()
            .await
            .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

        let status_str = Self::serialize_status(&status);
        let headers_json = response.as_ref()
            .map(|r| serde_json::to_string(&r.headers).unwrap_or_default());

        sqlx::query(
            r#"
            UPDATE idempotency_records
            SET status = ?,
                response_status_code = ?,
                response_headers = ?,
                response_body = ?,
                locked_until = NULL
            WHERE key = ? AND user_id = ?
            "#
        )
        .bind(status_str)
        .bind(response.as_ref().map(|r| r.status_code as i32))
        .bind(headers_json)
        .bind(response.as_ref().map(|r| r.body.clone()))
        .bind(key)
        .bind(user_id)
        .execute(&mut *tx)
        .await
        .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

        tx.commit().await.map_err(|e| IdempotencyError::StorageError(e.to_string()))?;
        Ok(())
    }

    /// Atomically release lock on failure
    async fn release_lock_on_failure(
        &self,
        key: &str,
        user_id: &str,
        is_retryable: bool,
        response: Option<CachedResponse>,
    ) -> Result<(), IdempotencyError> {
        let _lock = self.transaction_lock.lock().await;
        
        let mut tx = self.pool.begin()
            .await
            .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

        let status = IdempotencyStatus::Failed { is_retryable };
        let status_str = Self::serialize_status(&status);
        let headers_json = response.as_ref()
            .map(|r| serde_json::to_string(&r.headers).unwrap_or_default());

        sqlx::query(
            r#"
            UPDATE idempotency_records
            SET status = ?,
                response_status_code = ?,
                response_headers = ?,
                response_body = ?,
                locked_until = NULL
            WHERE key = ? AND user_id = ?
            "#
        )
        .bind(status_str)
        .bind(response.as_ref().map(|r| r.status_code as i32))
        .bind(headers_json)
        .bind(response.as_ref().map(|r| r.body.clone()))
        .bind(key)
        .bind(user_id)
        .execute(&mut *tx)
        .await
        .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

        tx.commit().await.map_err(|e| IdempotencyError::StorageError(e.to_string()))?;
        Ok(())
    }

    async fn get(
        &self,
        key: &str,
        user_id: &str,
    ) -> Result<Option<IdempotencyRecord>, IdempotencyError> {
        let row = sqlx::query(
            r#"
            SELECT key, user_id, request_path, request_method,
                   request_fingerprint, status, response_status_code,
                   response_headers, response_body, created_at,
                   expires_at, locked_until
            FROM idempotency_records
            WHERE key = ? AND user_id = ?
            "#
        )
        .bind(key)
        .bind(user_id)
        .fetch_optional(&self.pool)
        .await
        .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

        match row {
            Some(row) => Ok(Some(Self::record_from_row(&row).await?)),
            None => Ok(None),
        }
    }

    async fn cleanup_expired(&self) -> Result<usize, IdempotencyError> {
        let now = Utc::now().to_rfc3339();

        let result = sqlx::query(
            "DELETE FROM idempotency_records WHERE expires_at < ?"
        )
        .bind(now)
        .execute(&self.pool)
        .await
        .map_err(|e| IdempotencyError::StorageError(e.to_string()))?;

        Ok(result.rows_affected() as usize)
    }

    async fn execute_in_transaction<F, T>(&self, f: F) -> Result<T, IdempotencyError>
    where
        F: FnOnce() -> std::pin::Pin<Box<dyn std::future::Future<Output = Result<T, IdempotencyError>> + Send>> + Send,
        T: Send,
    {
        let _lock = self.transaction_lock.lock().await;
        
        let tx = self.pool.begin()
            .await
            .map_err(|e| IdempotencyError::TransactionFailed(e.to_string()))?;

        let result = f().await;

        match result {
            Ok(value) => {
                tx.commit().await.map_err(|e| IdempotencyError::TransactionFailed(e.to_string()))?;
                Ok(value)
            }
            Err(e) => {
                tx.rollback().await.map_err(|e| IdempotencyError::TransactionFailed(e.to_string()))?;
                Err(e)
            }
        }
    }
}

The SDK Solution: Making Idempotency Invisible

While requiring clients to generate and manage idempotency keys is correct, it places a burden on developers. Modern cloud providers solve this by embedding idempotency into their SDKs, making it transparent to developers. AWS automatically generates client tokens:

# AWS SDK automatically handles idempotency
ec2_client = boto3.client('ec2')

# The SDK generates a ClientToken internally
response = ec2_client.run_instances(
    ImageId='ami-12345',
    MinCount=1,
    MaxCount=1,
    # No idempotency key needed - SDK handles it
)

# On network failure, the SDK safely retries with the same ClientToken

You can build the same experience. The src/client_sdk.rs file shows how to implement transparent idempotency in a client SDK:

// Users don't need to think about idempotency
let client = IdempotentClient::new("https://api.example.com");

// The SDK handles everything:
// - Generates idempotency key
// - Retries with exponential backoff + jitter  
// - Reuses same key for retries
// - Respects rate limits
let order = client.call(
    "POST",
    "/orders", 
    &CreateOrderRequest {
        symbol: "AAPL",
        quantity: 100,
    }
).await?;

Industry Lessons

The Hidden Cost

Idempotency failures rarely cause outages, which is why they often go unnoticed until they accumulate into bigger problems. But the cost is real:

  • Customer trust erosion: Duplicate charges damage credibility
  • Support overhead: Each duplicate transaction generates support tickets
  • Regulatory risk: Financial duplicate transactions can trigger compliance issues
  • Data corruption: Inconsistent state that’s expensive to clean up

The False Sense of Security

Most teams implement basic duplicate detection and call it “idempotency.” They check a box on their architecture review and move on. Meanwhile, the race conditions and edge cases silently create problems that surface weeks or months later.

The Operational Reality

Even when implemented correctly, idempotency requires operational discipline:

  • Monitoring key collision rates (possible client bugs)
  • Alerting on lock timeout occurrences (performance issues)
  • Tracking retry patterns (client behavior insights)
  • Regular cleanup of expired records (storage management)

The Bottom Line

True idempotency isn’t about preventing duplicates—it’s about providing a consistent, predictable API that clients can safely retry. The difference between “duplicate detection” and real idempotency is the difference between a system that mostly works and one that always works. After seeing too many production incidents caused by misunderstood idempotency, I hope this guide prevents others from making the same expensive mistakes. The patterns I’ve shown here are battle-tested across multiple companies and handle the edge cases that trip up most implementations.


The complete implementation with a storage backend, framework integrations, and deployment examples is available at github.com/bhatti/idempotency-rs.

September 13, 2025

Task Scheduling Algorithms in Distributed Orchestration Systems

Filed under: Computing,Concurrency — admin @ 3:03 pm

Modern distributed systems face a fundamental challenge: how to efficiently schedule and execute thousands of tasks across heterogeneous resources while maximizing throughput, minimizing latency, and ensuring fair resource allocation. This challenge becomes even more complex when dealing with workflows, dependencies, and varying resource requirements. I have written about building Formicary, an open-source distributed orchestration engine before and in this post, I’ll explore task scheduling algorithms for executing background tasks, jobs, and workflows through the lens of Formicary. We’ll examine how theoretical scheduling concepts translate into practical implementations in a production-ready system.

Formicary Architecture Overview

Before diving into scheduling algorithms, let’s understand Formicary’s architecture. The system follows a Leader-Follower pattern with two main components:

The Queen (Leader/Server)

  • API & UI Controllers: RESTful APIs and web dashboard
  • Job Scheduler: Leader-elected service that polls for pending jobs
  • Resource Manager: Tracks available ant workers and their capabilities
  • Job Supervisor: Orchestrates job execution as a DAG
  • Task Supervisor: Manages individual task lifecycle

The Ants (Followers/Workers)

  • Executors: Support for Docker, Kubernetes, Shell, HTTP, WebSocket, and custom protocols
  • Registration System: Workers advertise capabilities via tags and methods
  • Artifact Management: Handle dependencies and outputs

Key Features Supporting Advanced Scheduling

Formicary includes several features that enable sophisticated scheduling strategies:

  • Tags & Labels: Route tasks to specific workers based on capabilities
  • Priority Levels: Jobs can have different priority levels for execution order
  • Resource Constraints: Define CPU, memory, and storage requirements
  • Tenant Isolation: Multi-tenant support with quota management
  • Cron Scheduling: Time-based job scheduling
  • Concurrent Limits: Control maximum parallel job execution
  • Dynamic Scaling: Kubernetes-based auto-scaling support

Scheduling Algorithm Decision Flow

Before diving into specific algorithms, let’s visualize how different scheduling strategies route jobs through the system:

Job Execution Lifecycle

Understanding how jobs flow through the system helps illustrate where different scheduling algorithms take effect:

Wait Time Estimation Algorithm

Formicary implements a sophisticated wait time estimation system that helps users understand queue delays and plan accordingly. The algorithm considers multiple factors:

// Simplified wait time calculation
func CalculateWaitTime(jobRequest, queuePosition, historicalStats, availableWorkers) time.Duration {
    // 1. Find position in priority-ordered queue
    queuePosition := findQueuePosition(jobRequest.Priority, jobRequest.SubmissionTime)
    
    // 2. Calculate jobs ahead in queue (70% of executing jobs assumed near completion)
    jobsAhead := queuePosition + int(float64(executingJobs) * 0.7)
    
    // 3. Estimate based on historical average execution time
    if historicalAverage > 0 && availableWorkers > 0 {
        estimatedWait := time.Duration(float64(jobsAhead) / float64(availableWorkers)) * 
                        time.Duration(historicalAverage) * time.Millisecond
    }
    
    // 4. Account for scheduled future execution
    if jobRequest.ScheduledAt.After(time.Now()) {
        scheduleDiff := time.Until(jobRequest.ScheduledAt)
        if scheduleDiff > estimatedWait {
            estimatedWait = scheduleDiff
        }
    }
    
    return estimatedWait
}

Formicary uses JobStatsRegistry to track execution patterns:

type JobStats struct {
    SucceededJobsAverage int64  // Average execution time
    ExecutingJobs       int32   // Currently running
    AntsCapacity        int     // Available worker capacity
    AntsAvailable       bool    // Worker availability status
}

It considers worker availability and capacity constraints:

  • Calculates minimum available workers across all required task types
  • Accounts for tag-based routing restrictions
  • Factors in Kubernetes resource quotas

Formicary orders pending jobs by priority and submission time:

sort.Slice(pendingJobs, func(i, j int) bool {
    if job1.Priority == job2.Priority {
        return job1.CreatedAt.Before(job2.CreatedAt)  // FCFS within priority
    }
    return job1.Priority > job2.Priority  // Higher priority first
})

This estimation helps with:

  • SLA Management: Predict if jobs will meet deadlines
  • Capacity Planning: Identify when to scale worker pools
  • User Experience: Provide realistic wait time expectations
  • Load Balancing: Route jobs to less congested worker pools

Task Scheduling Algorithms in Practice

Now let’s examine how various scheduling algorithms are implemented or can be achieved in Formicary:

1. First-Come First-Serve (FCFS)

FCFS processes tasks in arrival order using a simple FIFO queue. The algorithm maintains fairness by ensuring no task is starved, but suffers from the “convoy effect” where short jobs wait behind long-running ones. Its characteristics include:

  • Average waiting time: (sum of waiting times) / number of jobs
  • Turnaround time: completion_time - arrival_time
  • No preemption – jobs run to completion

Formicary Implementation: This is Formicary’s default behavior. When jobs are submitted, they’re placed in a PENDING state and processed by the Job Scheduler in submission order.

# Job requests are processed in submission order
job_type: data-processing
description: FCFS example - processed in submission order
tasks:
  - task_type: process-data
    method: DOCKER
    container:
      image: python:3.9
    script:
      - python process_data.py

Pros: Simple, predictable, no starvation Cons: Long-running jobs can block shorter ones, poor average wait times

2. Priority Scheduling

Each job has an assigned priority, with higher priority jobs scheduled first. Priority assignment can be static or dynamic based on various factors. A drawback of of this algorithm is starvation of low-priority jobs but it can be addressed with following techniques:

  • Aging: Gradually increase priority of waiting jobs
  • Priority Inversion Protection: Temporary priority boost for resource holders
  • Fair Share: Ensure each user/tenant gets minimum resource allocation

Formicary Implementation: Jobs support priority levels, and the scheduler considers priority when selecting the next job to execute.

job_type: critical-analysis
priority: 10  # Higher priority job
description: Critical security analysis
tasks:
  - task_type: vulnerability-scan
    method: KUBERNETES
    container:
      image: security-scanner:latest
# Submitting jobs with different priorities
curl -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"job_type": "critical-analysis", "job_priority": 10}' \
  $SERVER/api/jobs/requests

curl -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"job_type": "routine-backup", "job_priority": 1}' \
  $SERVER/api/jobs/requests

Implementation Details: The Job Scheduler queries pending jobs ordered by priority, ensuring high-priority jobs are scheduled first when resources become available.

3. Multilevel Queues – Tag-Based Routing

This algorithm partitions jobs into separate queues based on characteristics (interactive, batch, system). Each queue can use different scheduling algorithms, with inter-queue scheduling typically using fixed priorities or time slicing. Common queue classification strategies include:

  • Job Type: Interactive, batch, system, real-time
  • Resource Requirements: CPU-intensive, I/O-intensive, memory-intensive
  • Duration: Short, medium, long-running jobs
  • User Class: Premium, standard, background users

Formicary Implementation: Using tags and labels, we can effectively create multilevel queues by routing different job types to specialized worker pools.

# Short-running analysis jobs
job_type: quick-scan
tags: ["fast-worker", "analysis"]
tasks:
  - task_type: preflight-check
    method: DOCKER
    tags: ["cpu-optimized"]
    container:
      image: lightweight-scanner:latest

---
# Medium-duration static analysis
job_type: static-analysis
tags: ["medium-worker", "analysis"] 
tasks:
  - task_type: code-analysis
    method: KUBERNETES
    tags: ["memory-optimized"]
    container:
      image: static-analyzer:latest
      memory_limit: "4Gi"

---
# Long-running dynamic analysis
job_type: dynamic-analysis
tags: ["heavy-worker", "analysis"]
tasks:
  - task_type: device-testing
    method: KUBERNETES
    tags: ["gpu-enabled", "device-farm"]
    container:
      image: dynamic-analyzer:latest
      resources:
        cpu_limit: "8"
        memory_limit: "16Gi"

At a mobile security company, I implemented this pattern with three distinct worker pools:

  • Fast Workers: Preflight analysis (seconds to minutes)
  • Medium Workers: Static analysis (a few minutes)
  • Heavy Workers: Dynamic analysis on device farms (multiple minutes to hours)

4. Resource-Aware Scheduling

This algorithm makes scheduling decisions based on current and predicted resource availability (CPU, memory, storage, network). It considers both resource requirements and system capacity to prevent oversubscription:

  • Multi-dimensional: CPU, memory, storage, network, custom resources
  • Quality of Service: Guaranteed, burstable, best-effort resource classes
  • Affinity/Anti-affinity (e.g.,Kubernetes Scheduler): Placement preferences and constraints

Advanced techniques include:

  • Bin Packing: First-fit, best-fit, worst-fit algorithms
  • Resource Fragmentation: Avoid unusable resource leftovers
  • Overcommitment: Schedule based on statistical usage patterns

Formicary Implementation: Integration with Kubernetes resource management and custom resource tracking.

job_type: ml-training
description: Resource-aware ML model training
tasks:
  - task_type: train-model
    method: KUBERNETES
    tags: ["gpu-node", "ml-workload"]
    container:
      image: tensorflow/tensorflow:latest-gpu
      cpu_request: "4"
      cpu_limit: "8"
      memory_request: "8Gi"
      memory_limit: "16Gi"
      ephemeral_storage_request: "10Gi"
    node_selector:
      hardware: "gpu-enabled"
      instance-type: "ml-optimized"
    tolerations:
      - key: "gpu-workload"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

The Resource Manager tracks worker capabilities and current load, ensuring tasks are only scheduled when adequate resources are available.

5. Matchmaking Scheduler – Affinity-Based Routing

It matches jobs to workers based on capabilities, data locality, and preferences. Uses constraint satisfaction to find optimal job-worker pairings. Matching algorithms include:

  • Hungarian Algorithm: Optimal assignment for bipartite matching
  • Market-based: Economic models with bids and auctions
  • Constraint Satisfaction (Kubernetes/Apache Spark): Match job requirements to worker capabilities

Common locality considerations include:

  • Data Locality: Schedule jobs where data resides
  • Network Topology: Minimize network hops and bandwidth usage
  • Hardware Affinity: GPU jobs to GPU nodes, FPGA workloads to FPGA nodes

Formicary Implementation: Using tags, labels, and Kubernetes affinity rules to achieve data locality and capability matching.

job_type: geo-distributed-processing
description: Process data close to its source
tasks:
  - task_type: process-eu-data
    method: KUBERNETES
    tags: ["eu-region", "gdpr-compliant"]
    container:
      image: data-processor:latest
    affinity:
      node_affinity:
        required_during_scheduling_ignored_during_execution:
          node_selector_terms:
            - match_expressions:
                - key: "region"
                  operator: In
                  values: ["eu-west-1", "eu-central-1"]
                - key: "compliance"
                  operator: In
                  values: ["gdpr"]
    variables:
      DATA_REGION: "eu"
      COMPLIANCE_MODE: "strict"

In mobile security analysis company, I used matchmaking scheduling to manage physical device farms where each device has unique characteristics. The system implemented two-phase matchmaking: first reserving devices based on requirements like platform (iOS/Android), OS version ranges, device type (phone/tablet), and capabilities (SMS, camera, GPS), then using affinity rules to route jobs to the specific reserved device.

Pros: Optimal resource matching, data locality, flexibility Cons: Complex matching logic, potential for suboptimal assignments under constraints

6. Delay Scheduler – Temporal Control

Delay scheduling deliberately postpones task execution until optimal conditions are met, such as data locality, resource availability, or specific timing requirements. The algorithm balances waiting for better conditions against potential starvation, often using configurable timeout thresholds.

Optimal_Delay = min(Max_Wait_Time, Expected_Benefit_Time)
Where:
- Max_Wait_Time = configured upper bound to prevent starvation
- Expected_Benefit_Time = estimated time until optimal conditions
- Locality_Benefit = (Remote_Cost - Local_Cost) / Transfer_Rate

Common delay strategies include:

  • Data Locality Delay: Wait for data to become available on local nodes
  • Resource Availability Delay: Wait for preferred resource types to become free
  • Temporal Delay: Execute at specific times (off-peak hours, scheduled windows)
  • Condition-Based Delay: Wait for external system states or events

Formicary’s Delay Implementations:

  1. Time-Based Scheduling:
job_type: nightly-etl
cron_trigger: "0 2 * * *"  # 2 AM daily
scheduled_at: "2024-12-25T02:00:00Z"  # One-time future execution
  1. Condition-Based Polling:
job_type: external-dependency-wait
tasks:
  - task_type: wait-for-api
    method: HTTP_GET
    url: https://api.service.com/status
    delay_between_retries: 30s
    retry: 20  # Maximum 10 minutes of polling
    on_exit_code:
      "200": proceed-with-processing    # Service ready
      "404": EXECUTING                  # Keep polling
      "503": EXECUTING                  # Service temporarily unavailable
      "FAILED": abort-job               # Permanent failure
  1. Resource Availability Delay:
job_type: gpu-intensive-training
tasks:
  - task_type: training
    method: KUBERNETES
    tags: ["gpu-v100", "high-memory"]
    timeout: 6h
    # Will delay until specific GPU resources become available

Pros:

  • Improved data locality and reduced network I/O
  • Better resource utilization through temporal load balancing
  • Flexible execution timing for cost optimization
  • Support for external dependency coordination

Cons:

  • Increased job latency and scheduling complexity
  • Risk of starvation without proper timeout mechanisms
  • Difficulty in predicting optimal delay periods
  • Potential for cascading delays in dependent workflows

7. Capacity Scheduler – Resource Quotas

Capacity scheduling partitions cluster resources into hierarchical queues with guaranteed minimum capacities and configurable maximum limits. Each queue can elastically use unused capacity from other queues while respecting absolute limits and priority policies.

Queue_Capacity = (Allocated_Resources / Total_Cluster_Resources) × 100%
Effective_Capacity = min(Max_Capacity, Guaranteed_Capacity + Available_Borrowed_Capacity)
Resource_Utilization = Used_Resources / Effective_Capacity

Common principles include:

  • Capacity Guarantees: Each queue has minimum guaranteed resources
  • Elastic Sharing: Unused capacity can be borrowed by other queues
  • Preemption: Higher priority queues can reclaim borrowed resources
  • Hierarchical Organization: Nested queues for organizational structure

Common queue management strategies include:

  • FIFO within Queues: Simple first-come-first-served within capacity limits
  • Priority Ordering: High-priority jobs within queues get preference
  • Fair Share: Proportional resource distribution among queue users
  • Preemption Policies: Graceful vs. aggressive resource reclamation

Formicary Implementation Features:

# Organization-level capacity limits
tenant_limits:
  max_concurrent_jobs: 50
  max_cpu_hours_per_day: 200
  max_storage_gb: 500

# Job-level concurrency control  
job_type: batch-processing
max_concurrency: 3  # Limit concurrent instances

Capacity enforcement mechanisms include:

  • Hard Limits: Absolute maximum resource consumption
  • Soft Limits: Warning thresholds with potential throttling
  • Burst Capacity: Temporary exceeding of limits during low contention
  • Quota Reset Periods: Time-based quota renewals (daily, weekly, monthly)

Pros: Predictable resource allocation, multi-tenant isolation, elastic resource sharing, hierarchical management Cons: Complex configuration, potential resource fragmentation, underutilization during low demand, administrative overhead

8. Fair Scheduler – Multi-Tenant Fairness

It ensures proportional resource sharing among users, groups, or tenants over time. Uses techniques like weighted fair queueing and deficit round-robin to achieve long-term fairness while maintaining efficiency. Common metrics include:

  • Proportional Share (Hadoop Fair Scheduler): Resources allocated based on weights/quotas
  • Max-Min Fairness (Kubernetes): Maximize minimum allocation across users
  • Deadline Fairness: Ensure SLA compliance across tenants

Advanced fair sharing includes:

  • Hierarchical Fair Sharing: Nested user groups and organizations
  • Dominant Resource Fairness: Fair allocation across multiple resource types
  • Lottery Scheduling: Probabilistic fairness using tickets

Formicary Implementation: It implements tenant isolation with quota enforcement.

// Formicary's Fair Scheduling based on actual implementation
type FairScheduler struct {
    jobStatsRegistry *JobStatsRegistry
    serverConfig     *ServerConfig
}

func (fs *FairScheduler) CheckFairSchedulingConstraints(request *JobRequest) error {
    // Multi-level concurrency checking: Organization ? User ? Job-level
    
    // Level 1: Organization concurrency limits
    userExecuting, orgExecuting := fs.jobStatsRegistry.UserOrgExecuting(request)
    
    if orgExecuting >= fs.getMaxConcurrentOrgJobs(request.OrganizationID) {
        return fs.delayJobForConcurrencyExceeded(request, "organization", orgExecuting)
    }
    
    // Level 2: User concurrency limits  
    if userExecuting >= fs.getMaxConcurrentUserJobs(request.UserID) {
        return fs.delayJobForConcurrencyExceeded(request, "user", userExecuting)
    }
    
    // Level 3: Job-type concurrency limits
    executionCount := fs.jobStatsRegistry.GetExecutionCount(request.GetUserJobTypeKey())
    if executionCount >= request.GetMaxConcurrency() {
        return fs.delayJobForConcurrencyExceeded(request, "job-type", int(executionCount))
    }
    
    return nil
}

func (fs *FairScheduler) delayJobForConcurrencyExceeded(request *JobRequest, limitType string, currentCount int) error {
    // Intelligent delay calculation based on historical data
    avgCompletionTime := fs.jobStatsRegistry.GetAverageCompletionTime(request.JobType)
    
    // Dynamic wait factor: 25% of average completion time, bounded between 15-60 seconds
    waitFactor := min(max(avgCompletionTime/4, 15*time.Second), 60*time.Second)
    
    // Randomized delay to prevent thundering herd
    randomDelay := time.Duration(rand.Intn(int(waitFactor))) + waitFactor
    
    // Reschedule with delay
    request.ScheduledAt = time.Now().Add(randomDelay)
    
    // Logarithmic priority degradation (inspired by mobile security company approach)
    if request.Priority > 0 {
        // Priority degradation: log_e(original_priority), minimum of 1
        newPriority := max(1, int(math.Log(float64(request.Priority))))
        request.Priority = newPriority
        
        // Allow zero-priority jobs to bypass concurrency limits (emergency valve)
        if newPriority <= 0 {
            return nil // Allow execution despite limits
        }
    }
    
    // Update schedule attempts counter with exponential backoff
    request.ScheduleAttempts++
    
    return fmt.Errorf("%s concurrency limit exceeded: %d jobs running, rescheduling with %v delay", 
                     limitType, currentCount, randomDelay)
}

// Enhanced concurrency tracking from mobile security company experience
func (fs *FairScheduler) trackConcurrencyMetrics(request *JobRequest) {
    // Real-time metrics for monitoring fairness
    fs.metricsRegistry.Gauge("org_concurrent_jobs", map[string]string{
        "org_id": request.OrganizationID,
        "job_type": request.JobType,
    })
    
    fs.metricsRegistry.Gauge("user_concurrent_jobs", map[string]string{
        "user_id": request.UserID, 
        "job_type": request.JobType,
    })
}

Pros: Prevents monopolization, guarantees minimum service levels Cons: May sacrifice efficiency for fairness, complex weight management

9. Earliest Deadline First (EDF)

Dynamic priority algorithm that assigns highest priority to tasks with earliest deadlines. Optimal for single-processor real-time scheduling if total utilization ? 100%. It uses deadline as the primary scheduling criterion. SJF selects the job with the smallest estimated execution time, minimizing average waiting time.

Schedulability Test: ?(Ci/Ti) ? 1
Where Ci = execution time, Ti = period for periodic tasks

EDF is a dynamic priority scheduling algorithm that assigns highest priority to tasks with the earliest absolute deadlines. It’s provably optimal for single-processor real-time scheduling when total CPU utilization ? 100%, providing maximum schedulability under deadline constraints.

Priority(task) = 1 / (Deadline - Current_Time)
Schedulability_Test: ?(Execution_Time_i / Period_i) ? 1
Laxity = Deadline - Current_Time - Remaining_Execution_Time

Core characteristics include:

  • Dynamic Priority: Priorities change as deadlines approach
  • Work-Conserving: Never idles processor when tasks are ready
  • Deadline-Driven: Scheduling decisions based purely on temporal constraints
  • Optimal Utilization: Achieves 100% processor utilization when schedulable

Failure modes include:

  • Domino Effect: Single deadline miss can cascade to subsequent tasks
  • Thrashing: Excessive context switching under overload conditions
  • Unpredictable Overload: Graceful degradation requires additional mechanisms

While Formicary does not natively supported, SJF can be approximated using separate queues for different job duration categories:

# Short jobs queue
job_type: quick-validation
tags: ["short-queue"]
estimated_runtime: "5m"

# Long jobs queue  
job_type: full-analysis
tags: ["long-queue"] 
estimated_runtime: "2h"

Deadline assignment strategies include:

Relative_Deadline = Period (for periodic tasks)
Absolute_Deadline = Arrival_Time + Relative_Deadline
Critical_Instant = Simultaneous release of all tasks

It can also be simulated using priority scheduling combined with deadline-aware job submission:

# Simulated EDF using priority and scheduled execution
job_type: time-critical-analysis
priority: {{.UrgencyScore}}  # Calculated based on deadline proximity
scheduled_at: "2024-12-31T23:59:59Z"
timeout: 2h
tasks:
  - task_type: urgent-processing
    method: KUBERNETES
    tags: ["priority-worker"]

Implementation Approach: Calculate priority dynamically based on (current_time - deadline) / estimated_runtime to ensure jobs closer to their deadlines receive higher priority.

Pros: Optimal schedulability, maximum resource utilization, simple algorithm, responsive to urgent tasks Cons: Domino effect failures, requires accurate execution time estimates, poor overload behavior, high context switching overhead

10. Speculative Scheduler

It launches multiple instances of slow-running tasks to reduce tail latency. Uses statistical analysis of execution times to detect stragglers and make speculative execution decisions. Balances resource cost against latency improvement. Algorithms include:

  • Progress-based: Monitor task completion percentage
  • Time-based: Tasks running longer than percentile threshold
  • Resource-based: Launch backup only if resources available

Pros: Reduces tail latency, improves user experience, fault tolerance Cons: Resource waste, coordination overhead, may not help heterogeneous workloads

Formicary Status: Not implemented, but the system provides foundation through:

  • Task execution monitoring
  • Historical performance data collection
  • Resource availability tracking

Conceptual Implementation:

job_type: speculative-execution
tasks:
  - task_type: main-task
    method: KUBERNETES
    timeout: 30m
    speculative_backup:
      enabled: true
      delay_threshold: "150%"  # Start backup if 50% slower than average
      resource_threshold: 0.3  # Only if 30%+ resources available

11: Polling and Sensors

Beyond time-based delays, Formicary supports condition-based scheduling through polling sensors that wait for external conditions to be met:

job_type: sensor-job
description: Wait for external conditions before proceeding
tasks:
  - task_type: wait-for-resource
    method: HTTP_GET
    url: https://api.example.com/resource/123
    delay_between_retries: 15s
    retry: 20  # Poll up to 20 times (5 minutes total)
    timeout: 15s
    on_exit_code:
      "200": process-resource    # Success - proceed
      "404": EXECUTING          # Resource not ready - poll again  
      "FAILED": FATAL           # Server error - abort job

  - task_type: process-resource
    method: DOCKER
    container:
      image: data-processor:latest
    script:
      - echo "Resource is now available, processing..."

The key insight is using EXECUTING as an exit code value, which keeps the task in a polling loop rather than completing or failing.

12. Gang Scheduling

Gang scheduling coordinates simultaneous execution of related tasks that need to communicate or synchronize. Instead of scheduling tasks independently, the system reserves resources for all tasks in a “gang” and schedules them together to avoid partial execution and resource deadlocks. Key principles include:

  • All-or-Nothing (MPI Applications): Either all tasks in the gang get scheduled or none do
  • Synchronized Start: Tasks begin execution simultaneously
  • Resource Reservation (Kubernetes Jobs): Pre-allocate resources for the entire task group
  • Communication Optimization: Minimize synchronization delays between related tasks
Gang_Size = max(task_count, critical_path_parallelism)
Resource_Requirement = ?(individual_task_resources) for all gang members
Schedulability = available_resources >= Resource_Requirement

Formicary’s Gang Scheduling Implementation: Formicary implements gang scheduling at the job level through its Resource Manager. When a job is scheduled, resources are pre-allocated for ALL tasks before execution begins:

// Core gang scheduling logic from ResourceManager
func (rm *ManagerImpl) doReserveJobResources(
	requestID string,
	def *types.JobDefinition,
	dryRun bool) (reservations map[string]*common.AntReservation, err error) {
	
	reservations = make(map[string]*common.AntReservation)
	var alloc *common.AntReservation
	
	// Try to reserve resources for each task
	for _, task := range def.Tasks {
		alloc, err = rm.doReserve(requestID, task.TaskType, task.Method, task.Tags, dryRun)
		if err == nil {
			reservations[task.TaskType] = alloc
		} else {
			if !dryRun {
				// ALL-OR-NOTHING: Release all allocations and fail entire job
				_ = rm.ReleaseJobResources(requestID)
			}
			return nil, err
		}
	}
	return reservations, nil
}

Two-Phase Gang Scheduling Process:

  1. Resource Check Phase (Dry Run):
// Check if all job resources are available without allocating
func (rm *ManagerImpl) CheckJobResources(job *types.JobDefinition) ([]*common.AntReservation, error) {
	if reservationsByTask, err := rm.doReserveJobResources("", job, true); err != nil {
		return nil, err // Gang scheduling not possible
	}
	// All tasks can be scheduled - proceed to actual reservation
}
  1. Resource Reservation Phase (Actual Allocation):
// Atomically reserve resources for all tasks
func (rm *ManagerImpl) ReserveJobResources(requestID string, def *types.JobDefinition) (map[string]*common.AntReservation, error) {
	return rm.doReserveJobResources(requestID, def, false)
}

Scheduler Integration with Gang Scheduling:

The Job Scheduler uses gang scheduling through a two-step verification process:

// Step 1: Check if gang scheduling is possible
if err = jobStateMachine.CheckAntResourcesAndConcurrencyForJob(); err != nil {
	// Gang scheduling failed - implement backoff strategy
	scheduleAttempts := request.ScheduleAttempts + 1
	scheduleSecs := math.Min(int(maxWait.Seconds()), scheduleAttempts*5)
	
	// Exponential backoff with priority degradation
	if scheduleAttempts >= 5 && scheduleAttempts%5 == 0 && request.JobPriority > 5 {
		request.JobPriority-- // Degrade priority every 5th attempt
	}
	
	request.ScheduledAt = request.ScheduledAt.Add(time.Duration(scheduleSecs) * time.Second)
	return fmt.Errorf("gang scheduling failed - will retry")
}

// Step 2: Perform actual gang scheduling
if err = jobStateMachine.ReserveJobResources(); err != nil {
	// Even after check, allocation failed - very rare race condition
	return fmt.Errorf("gang allocation failed after successful check")
}

Example Job with Gang Scheduling:

job_type: distributed-ml-training
description: Gang scheduled ML training requiring coordinated execution
tasks:
  - task_type: parameter-server
    method: KUBERNETES
    tags: ["ml-cluster", "coordinator"]
    container:
      image: tensorflow:latest
      cpu_request: "4"
      memory_request: "8Gi"
    
  - task_type: worker-node-1
    method: KUBERNETES  
    tags: ["ml-cluster", "gpu-enabled"]
    container:
      image: tensorflow:gpu
      cpu_request: "8"
      memory_request: "16Gi"
      
  - task_type: worker-node-2
    method: KUBERNETES
    tags: ["ml-cluster", "gpu-enabled"]
    container:
      image: tensorflow:gpu
      cpu_request: "8" 
      memory_request: "16Gi"

  - task_type: aggregator
    method: KUBERNETES
    tags: ["ml-cluster"]
    dependencies: ["parameter-server", "worker-node-1", "worker-node-2"]

Formicary’s Gang Scheduling Features:

  1. Atomic Resource Allocation: All task resources are reserved simultaneously
  2. Automatic Rollback: Failed gang allocation releases all previously reserved resources
  3. Backoff Strategy: Jobs that can’t be gang-scheduled use exponential backoff
  4. Priority Degradation: Long-waiting jobs have priority reduced to prevent resource hogging
  5. Resource Fragmentation Prevention: Avoids partial allocations that waste resources

Formicary tracks gang scheduling effectiveness through metrics:

  • scheduler_failed_total: Jobs that couldn’t be gang-scheduled
  • scheduler_no_more_jobs_total: Scheduler iterations with no schedulable jobs
  • Schedule attempts per job to identify resource contention patterns

Pros:

  • Eliminates partial execution and resource deadlocks
  • Optimal for tightly-coupled distributed workloads
  • Automatic retry with intelligent backoff
  • Priority-based fairness with degradation

Cons:

  • Can lead to resource underutilization
  • Higher latency for large job gangs
  • Complex resource accounting and cleanup
  • May cause convoy effects for large jobs

Integration with Other Algorithms: Gang scheduling in Formicary works alongside:

  • Priority Scheduling: Higher priority gangs get resources first
  • Resource-Aware Scheduling: Considers total gang resource requirements
  • Fair Scheduling: Gang resource consumption counted toward tenant quotas
  • Capacity Scheduling: Gangs compete for available cluster capacity

Advanced Scheduling Patterns

Hybrid Scheduling Strategy

In practice, Formicary often combines multiple scheduling algorithms:

job_type: hybrid-ml-pipeline
priority: 8                    # Priority scheduling
max_concurrency: 3             # Capacity scheduling  
tags: ["gpu-cluster"]          # Matchmaking
cron_trigger: "0 */6 * * *"    # Delay scheduling
tasks:
  - task_type: data-preprocessing
    tags: ["cpu-optimized"]     # Multilevel queues
    method: KUBERNETES
    container:
      cpu_request: "2"          # Resource-aware
      memory_request: "4Gi"
  
  - task_type: model-training
    tags: ["gpu-optimized"]
    method: KUBERNETES
    container:
      image: tensorflow:gpu
    dependencies: ["data-preprocessing"]

Fork-Join Pattern for Parallel Processing

Formicary supports sophisticated parallel execution patterns:

job_type: parallel-video-encoding
description: Parallel processing with fork-join
tasks:
  - task_type: split-video
    method: DOCKER
    container:
      image: ffmpeg:latest
    script:
      - ffmpeg -i input.mp4 -f segment segment_%03d.mp4
    on_completed: fork-encode

  - task_type: fork-encode
    method: FORK_JOB
    fork_job_type: encode-segment
    variables:
      segment_count: 8
    on_completed: await-encoding

  - task_type: await-encoding  
    method: AWAIT_FORKED_JOB
    await_forked_tasks: ["fork-encode"]
    on_completed: merge-video

  - task_type: merge-video
    method: DOCKER
    container:
      image: ffmpeg:latest
    script:
      - ffmpeg -f concat -i segments.txt -c copy output.mp4

Performance Optimizations

Cache-Aware Scheduling

Formicary supports dependency caching to improve scheduling efficiency:

job_type: node-build
tasks:
  - task_type: install-deps
    method: DOCKER
    container:
      image: node:16
    cache:
      key: "node-deps-{{checksum 'package-lock.json'}}"
      paths:
        - node_modules
    script:
      - npm ci

Artifact-Based Dependencies

Smart scheduling based on artifact availability:

job_type: deployment-pipeline
tasks:
  - task_type: build
    artifacts:
      paths: ["dist/"]
    
  - task_type: test
    dependencies: ["build"]  # Waits for build artifacts
    
  - task_type: deploy
    dependencies: ["test"]
    method: KUBERNETES

Monitoring and Observability

Effective scheduling requires comprehensive monitoring:

# Built-in metrics and alerts
metrics:
  - queue_depth_by_priority
  - average_wait_time_by_tag
  - resource_utilization_by_worker
  - job_completion_rate_by_tenant

alerts:
  - name: high_queue_depth
    condition: queue_depth > 1000
    action: scale_workers
    
  - name: poor_resource_utilization  
    condition: cpu_utilization < 30%
    action: consolidate_workers

Real-World Case Study: Mobile Security Analysis Platform

At a mobile security company, I implemented a similar system with three-tier scheduling:

Tier 1: Preflight Analysis (Fast Queue)

  • Duration: 30 seconds – 2 minutes
  • Workers: CPU-optimized containers
  • Algorithm: Priority + FCFS
  • Use Case: Basic file validation, metadata extraction

Tier 2: Static Analysis (Medium Queue)

  • Duration: 5 minutes – 1 hour
  • Workers: Memory-optimized containers
  • Algorithm: Resource-aware + Fair scheduling
  • Use Case: Code analysis, vulnerability scanning

Tier 3: Dynamic Analysis (Heavy Queue)

  • Duration: 1 – 8 hours
  • Workers: GPU-enabled device farm
  • Algorithm: Matchmaking + Capacity scheduling
  • Use Case: Runtime behavior analysis, ML inference

This architecture processed over 100,000 mobile apps daily with 99.9% availability and optimal resource utilization.

Best Practices and Lessons Learned

1. Start Simple, Scale Gradually

Begin with FCFS and basic priority scheduling. Add complexity as your workload characteristics become clear.

2. Observability

Track queue depths, wait times, resource utilization, and job completion rates by different dimensions (tenant, job type, worker pool).

3. Design for Elasticity

Use Kubernetes HPA and custom metrics to automatically scale worker pools based on queue depth and resource utilization.

4. Implement Circuit Breakers

Prevent cascading failures when downstream services are unavailable.

5. Use Dead Letter Queues

Handle persistently failing tasks gracefully:

error_handling:
  max_retries: 3
  dead_letter_queue: "failed-jobs"
  alert_on_dlq: true
  • Earliest Deadline First: For time-sensitive workflows
  • Speculative Execution: For fault tolerance and performance
  • Gang Scheduling: For tightly-coupled parallel jobs

Conclusion

Formicary demonstrates how theoretical scheduling algorithms translate into practical distributed systems. It combines multiple strategies—priority scheduling, resource awareness, fair sharing, and intelligent routing for handling diverse workloads while maintaining predictable performance. The key insight is that real-world schedulers rarely use a single algorithm. Instead, they combine multiple approaches, leveraging the strengths of each for different aspects of the scheduling problem. Tags and labels provide the flexibility to implement sophisticated routing logic, while Kubernetes integration enables resource-aware scheduling at scale.

Whether you’re building CI/CD pipelines, data processing workflows, or ML training systems, understanding these scheduling patterns and their trade-offs is crucial for designing systems that scale efficiently and reliably.


Formicary is open source and available at github.com/bhatti/formicary. Try it out for your next workflow automation project!

September 12, 2025

The Byzantine Generals Problem: A Modern Performance Analysis in Elixir, Erlang, and Rust

Filed under: Computing,Concurrency — admin @ 2:35 pm

Introduction

In 2007, I wrote about implementing Leslie Lamport’s Byzantine Generals Problem algorithm across several programming languages. At the time, this seemed like an interesting theoretical exercise in distributed computing. I didn’t realize that a year later, Satoshi Nakamoto would publish the Bitcoin whitepaper, introducing a decentralized, Sybil resistant digital currency that solved Byzantine fault tolerance at unprecedented scale.

Nearly two decades later, I’m returning to the Byzantine Generals Problem with the perspective that only hindsight provides. This updated post implements the algorithm in modern languages—Rust, Elixir, and contemporary Erlang.

The Byzantine Generals Problem: A Refresher

The Byzantine Generals Problem, first formalized by Leslie Lamport, addresses a fundamental challenge in distributed computing: how can distributed parties reach consensus when some parties may be unreliable or malicious? For example, imagine several divisions of the Byzantine army camped outside an enemy city, each commanded by a general. The generals must coordinate to either attack or retreat, but they can only communicate by messenger. The challenge: some generals might be traitors who will try to confuse the others by sending conflicting messages. For a solution to work, two conditions must be met:

  1. IC1: All loyal lieutenants obey the same order
  2. IC2: If the commanding general is loyal, then every loyal lieutenant obeys the order he sends

One of the most striking results is that no solution exists with fewer than 3m + 1 generals to handle m traitors. With only three generals, no algorithm can handle even a single traitor.

Why This Matters

When I originally wrote about this problem in 2007, Bitcoin didn’t exist. Satoshi Nakamoto’s whitepaper was published in 2008, and the first Bitcoin block wasn’t mined in 2009. Bitcoin’s proof-of-work consensus mechanism essentially solves the Byzantine Generals Problem in a novel way:

  • Generals = Miners: Each miner is like a general trying to reach consensus
  • Orders = Transactions: The “order” is which transactions to include in the next block
  • Traitors = Malicious Miners: Some miners might try to double-spend or create invalid blocks
  • Solution = Longest Chain: The network accepts the longest valid chain as truth

Bitcoin’s brilliant insight was using computational work (proof-of-work) as a way to make it economically expensive to be a “traitor.” As long as honest miners control more than 50% of the computing power, the system remains secure.

Modern Applications Beyond Blockchain

The Byzantine Generals Problem isn’t just about cryptocurrency. It’s fundamental to many critical systems:

  • Aircraft Control Systems: Multiple redundant computers must agree on flight controls
  • Satellite Networks: Space-based systems need fault tolerance against radiation-induced failures
  • Missile Defense: Critical decisions must be made reliably even with component failures
  • Distributed Databases: Systems like Apache Cassandra and MongoDB use Byzantine fault-tolerant algorithms
  • Container Orchestration: Kubernetes uses etcd, which implements Byzantine fault-tolerant consensus
  • Central Bank Digital Currencies (CBDCs): Many countries are exploring blockchain-based national currencies
  • Cross-Border Payments: Systems like Ripple use Byzantine fault-tolerant consensus

Implementation: Modern Languages for a Classic Problem

Let’s implement the Byzantine Generals Problem in three modern languages: Rust, Elixir, and updated Erlang. Each brings different strengths to distributed computing.

Why These Languages?

  • Rust: Memory safety without garbage collection, excellent for systems programming
  • Elixir: Built on the Actor model, designed for fault-tolerant distributed systems
  • Erlang: The original Actor model language, battle-tested in telecom systems

Core Algorithm

We’ll implement the OM(m) algorithm (Oral Messages with m traitors) that works for 3m + 1 or more generals.

Rust Implementation

use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};
use tracing::{debug, info};

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum Value {
    Zero,
    One,
    Retreat, // Default value
}

impl std::fmt::Display for Value {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Value::Zero => write!(f, "ZERO"),
            Value::One => write!(f, "ONE"),
            Value::Retreat => write!(f, "RETREAT"),
        }
    }
}

#[derive(Clone, Debug)]
pub struct Configuration {
    pub source: usize,
    pub num_rounds: usize,
    pub num_processes: usize,
}

pub struct ByzantineEngine {
    config: Configuration,
    processes: Vec<Arc<Mutex<Process>>>,
    message_count: Arc<Mutex<usize>>,
}

pub struct Process {
    id: usize,
    config: Configuration,
    values: HashMap<String, Value>,
    is_faulty: bool,
}

impl Process {
    pub fn new(id: usize, config: Configuration) -> Self {
        let is_faulty = id == config.source || id == 2; // Configure faulty processes
        Process {
            id,
            config,
            values: HashMap::new(),
            is_faulty,
        }
    }

    pub fn receive_message(&mut self, path: String, value: Value) {
        debug!("Process {} received message: path={}, value={:?}", self.id, path, value);
        self.values.insert(path, value);
    }

    pub fn send_messages(&self, round: usize, processes: &[Arc<Mutex<Process>>], 
                        message_count: Arc<Mutex<usize>>) {
        if round == 0 && self.id == self.config.source {
            self.send_initial_messages(processes, message_count);
        } else if round > 0 {
            self.relay_messages(round, processes, message_count);
        }
    }

    fn send_initial_messages(&self, processes: &[Arc<Mutex<Process>>], 
                           message_count: Arc<Mutex<usize>>) {
        let base_value = Value::Zero;
        
        for (i, process) in processes.iter().enumerate() {
            if i != self.id {
                let value = if self.is_faulty {
                    // Faulty commander sends different values to different processes
                    if i % 2 == 0 { Value::Zero } else { Value::One }
                } else {
                    base_value.clone()
                };
                
                let value_for_log = value.clone(); // Clone for logging
                let mut proc = process.lock().unwrap();
                proc.receive_message(self.id.to_string(), value);
                *message_count.lock().unwrap() += 1;
                
                debug!("Commander {} sent {:?} to process {}", self.id, value_for_log, i);
            }
        }
    }

    fn relay_messages(&self, round: usize, processes: &[Arc<Mutex<Process>>], 
                     message_count: Arc<Mutex<usize>>) {
        let paths = self.get_paths_for_round(round - 1);
        
        for path in paths {
            if let Some(value) = self.values.get(&path) {
                let new_value = self.transform_value(value.clone());
                let new_path = format!("{}{}", path, self.id);
                
                for (i, process) in processes.iter().enumerate() {
                    if i != self.id && !self.path_contains_process(&new_path, i) {
                        let mut proc = process.lock().unwrap();
                        proc.receive_message(new_path.clone(), new_value.clone());
                        *message_count.lock().unwrap() += 1;
                        
                        debug!("Process {} relayed {:?} to process {} with path {}", 
                               self.id, new_value, i, new_path);
                    }
                }
            }
        }
    }

    fn transform_value(&self, value: Value) -> Value {
        if self.is_faulty && self.id == 2 {
            Value::One // Process 2 always sends One when faulty
        } else {
            value
        }
    }

    fn get_paths_for_round(&self, round: usize) -> Vec<String> {
        if round == 0 {
            vec![self.config.source.to_string()]
        } else {
            self.values.keys()
                .filter(|path| path.len() == round + 1)
                .cloned()
                .collect()
        }
    }

    fn path_contains_process(&self, path: &str, process_id: usize) -> bool {
        path.contains(&process_id.to_string())
    }

    pub fn decide(&self) -> Value {
        if self.id == self.config.source {
            // Source process uses its own value
            return if self.is_faulty { Value::One } else { Value::Zero };
        }

        self.majority_vote()
    }

    fn majority_vote(&self) -> Value {
        let mut counts = HashMap::new();
        counts.insert(Value::Zero, 0);
        counts.insert(Value::One, 0);
        counts.insert(Value::Retreat, 0);

        // Count values from the final round paths
        let final_paths: Vec<_> = self.values.keys()
            .filter(|path| path.len() == self.config.num_rounds + 1)
            .collect();

        if final_paths.is_empty() {
            // Count all available values if no final round paths
            for value in self.values.values() {
                *counts.entry(value.clone()).or_insert(0) += 1;
            }
        } else {
            for path in final_paths {
                if let Some(value) = self.values.get(path) {
                    *counts.entry(value.clone()).or_insert(0) += 1;
                }
            }
        }

        debug!("Process {} vote counts: {:?}", self.id, counts);

        // Find majority
        let total_votes: usize = counts.values().sum();
        if total_votes == 0 {
            return Value::Retreat;
        }

        let majority_threshold = total_votes / 2;
        
        for (value, count) in counts {
            if count > majority_threshold {
                return value;
            }
        }

        Value::Retreat // Default if no majority
    }

    pub fn is_faulty(&self) -> bool {
        self.is_faulty
    }

    pub fn is_source(&self) -> bool {
        self.id == self.config.source
    }
}

impl ByzantineEngine {
    pub fn new(source: usize, num_rounds: usize, num_processes: usize) -> Self {
        let config = Configuration { source, num_rounds, num_processes };
        let processes: Vec<Arc<Mutex<Process>>> = (0..num_processes)
            .map(|id| Arc::new(Mutex::new(Process::new(id, config.clone()))))
            .collect();

        ByzantineEngine {
            config,
            processes,
            message_count: Arc::new(Mutex::new(0)),
        }
    }

    pub fn run(&self) -> (Duration, usize) {
        info!("Starting Byzantine Generals algorithm with {} processes, {} rounds", 
              self.config.num_processes, self.config.num_rounds);
        
        let start = Instant::now();
        
        for round in 0..self.config.num_rounds {
            debug!("Starting round {}", round);
            
            let handles: Vec<_> = self.processes.iter().enumerate().map(|(_id, process)| {
                let process = Arc::clone(process);
                let processes = self.processes.clone();
                let message_count = Arc::clone(&self.message_count);
                
                thread::spawn(move || {
                    let proc = process.lock().unwrap();
                    proc.send_messages(round, &processes, message_count);
                })
            }).collect();

            for handle in handles {
                // Add timeout to prevent hanging
                if handle.join().is_err() {
                    eprintln!("Warning: Thread failed in round {}", round);
                }
            }
            
            debug!("Completed round {}", round);
            // Small delay to ensure message ordering
            thread::sleep(Duration::from_millis(10));
        }

        let duration = start.elapsed();
        let messages = *self.message_count.lock().unwrap();
        
        info!("Algorithm completed in {:.2}ms with {} messages", 
              duration.as_millis(), messages);
        
        self.print_results();
        
        (duration, messages)
    }

    fn print_results(&self) {
        println!("\nByzantine Generals Results:");
        println!("===========================");
        
        for (id, process) in self.processes.iter().enumerate() {
            let proc = process.lock().unwrap();
            if proc.is_source() {
                print!("Source ");
            }
            print!("Process {}", id);
            if proc.is_faulty() {
                println!(" is faulty");
            } else {
                println!(" decides on value {}", proc.decide());
            }
        }
        println!();
    }
}

pub fn benchmark_comprehensive(max_processes: usize) {
    let test_cases = generate_test_cases(max_processes);
    
    for (processes, rounds) in test_cases {
        if processes < 4 {
            continue; // Skip invalid cases
        }
        
        let source = processes / 3;
        let engine = ByzantineEngine::new(source, rounds, processes);
        
        let start_memory = get_memory_usage();
        let start = Instant::now();
        let (duration, messages) = engine.run();
        let _total_duration = start.elapsed();
        let end_memory = get_memory_usage();
        
        let memory_used = end_memory.saturating_sub(start_memory);
        
        println!("Rust,{},{},{},{:.2},{:.2}", 
                processes, rounds, messages, 
                duration.as_millis(), memory_used as f64 / 1024.0 / 1024.0);
    }
}

fn generate_test_cases(max_processes: usize) -> Vec<(usize, usize)> {
    let mut cases = Vec::new();
    
    for n in (4..=max_processes).step_by(3) {
        for m in 1..=3 {
            if 3 * m + 1 <= n {
                cases.push((n, m));
            }
        }
    }
    
    cases
}

fn get_memory_usage() -> usize {
    // Simplified memory usage - would need platform-specific code for accurate measurement
    std::process::id() as usize * 1024 // Placeholder
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_value_display() {
        assert_eq!(format!("{}", Value::Zero), "ZERO");
        assert_eq!(format!("{}", Value::One), "ONE");
        assert_eq!(format!("{}", Value::Retreat), "RETREAT");
    }

    #[test]
    fn test_process_creation() {
        let config = Configuration {
            source: 0,
            num_rounds: 2,
            num_processes: 4,
        };
        
        let process = Process::new(0, config.clone());
        assert!(process.is_source());
        assert!(process.is_faulty()); // Source is faulty in our test setup
        
        let process2 = Process::new(1, config);
        assert!(!process2.is_source());
        assert!(!process2.is_faulty());
    }

    #[test]
    fn test_engine_creation() {
        let engine = ByzantineEngine::new(0, 2, 4);
        assert_eq!(engine.config.source, 0);
        assert_eq!(engine.config.num_rounds, 2);
        assert_eq!(engine.config.num_processes, 4);
        assert_eq!(engine.processes.len(), 4);
    }

    #[test]
    fn test_minimum_byzantine_case() {
        let engine = ByzantineEngine::new(0, 1, 4);
        let (duration, messages) = engine.run();
        
        assert!(duration.as_nanos() > 0);
        assert!(messages > 0);
    }
}

Elixir Implementation

defmodule ByzantineGenerals do
  @moduledoc """
  Byzantine Generals Problem implementation in Elixir
  Leverages the Actor model for natural distributed computing
  """

  require Logger

  defmodule Configuration do
    @moduledoc "Configuration for Byzantine Generals algorithm"
    defstruct [:source, :num_rounds, :num_processes]

    @type t :: %__MODULE__{
      source: non_neg_integer(),
      num_rounds: non_neg_integer(),
      num_processes: non_neg_integer()
    }
  end

  defmodule Process do
    @moduledoc "Individual process (general) in the Byzantine Generals algorithm"
    use GenServer
    require Logger

    defstruct [:id, :config, :values, :is_faulty, :message_count, :processes]

    @type value :: :zero | :one | :retreat
    @type path :: String.t()

    # Client API

    def start_link(%{id: id, config: config}) do
      GenServer.start_link(__MODULE__, %{id: id, config: config}, name: :"process_#{id}")
    end

    def receive_message(pid, path, value) do
      GenServer.call(pid, {:receive_message, path, value}, 10_000)
    end

    def send_messages(pid, round, processes) do
      GenServer.call(pid, {:send_messages, round, processes}, 10_000)
    end

    def decide(pid) do
      GenServer.call(pid, :decide, 5_000)
    end

    def is_faulty?(pid) do
      GenServer.call(pid, :is_faulty, 1_000)
    end

    def is_source?(pid) do
      GenServer.call(pid, :is_source, 1_000)
    end

    def get_message_count(pid) do
      GenServer.call(pid, :get_message_count, 1_000)
    end

    def get_values(pid) do
      GenServer.call(pid, :get_values, 1_000)
    end

    # Server callbacks

    @impl true
    def init(%{id: id, config: config}) do
      is_faulty = id == config.source || id == 2
      
      state = %__MODULE__{
        id: id,
        config: config,
        values: %{},
        is_faulty: is_faulty,
        message_count: 0,
        processes: []
      }
      
      Logger.debug("Process #{id} initialized, faulty: #{is_faulty}")
      {:ok, state}
    end

    @impl true
    def handle_call({:receive_message, path, value}, _from, state) do
      Logger.debug("Process #{state.id} received message: path=#{path}, value=#{value}")
      
      new_values = Map.put(state.values, path, value)
      new_count = state.message_count + 1
      
      {:reply, :ok, %{state | values: new_values, message_count: new_count}}
    end

    @impl true
    def handle_call({:send_messages, round, processes}, _from, state) do
      new_state = %{state | processes: processes}
      
      cond do
        round == 0 && state.id == state.config.source ->
          send_initial_messages(new_state)
        round > 0 ->
          relay_messages(new_state, round)
        true ->
          {:reply, :ok, new_state}
      end
    end

    @impl true
    def handle_call(:decide, _from, state) do
      decision = if state.id == state.config.source do
        # Source process uses its own value
        if state.is_faulty, do: :one, else: :zero
      else
        majority_vote(state)
      end
      
      {:reply, decision, state}
    end

    @impl true
    def handle_call(:is_faulty, _from, state) do
      {:reply, state.is_faulty, state}
    end

    @impl true
    def handle_call(:is_source, _from, state) do
      {:reply, state.id == state.config.source, state}
    end

    @impl true
    def handle_call(:get_message_count, _from, state) do
      {:reply, state.message_count, state}
    end

    @impl true
    def handle_call(:get_values, _from, state) do
      {:reply, state.values, state}
    end

    # Private functions

    defp send_initial_messages(state) do
      base_value = :zero
      
      Enum.each(state.processes, fn {id, pid} ->
        if id != state.id do
          value = if state.is_faulty do
            # Faulty commander sends different values
            if rem(id, 2) == 0, do: :zero, else: :one
          else
            base_value
          end
          
          receive_message(pid, Integer.to_string(state.id), value)
          Logger.debug("Commander #{state.id} sent #{value} to process #{id}")
        end
      end)
      
      {:reply, :ok, state}
    end

    defp relay_messages(state, round) do
      paths = get_paths_for_round(state, round - 1)
      
      Enum.each(paths, fn path ->
        case Map.get(state.values, path) do
          nil -> 
            :ok
          value ->
            new_value = transform_value(state, value)
            new_path = path <> Integer.to_string(state.id)
            
            Enum.each(state.processes, fn {id, pid} ->
              if id != state.id && !String.contains?(new_path, Integer.to_string(id)) do
                receive_message(pid, new_path, new_value)
                Logger.debug("Process #{state.id} relayed #{new_value} to #{id}, path: #{new_path}")
              end
            end)
        end
      end)
      
      {:reply, :ok, state}
    end

    defp transform_value(state, value) do
      if state.is_faulty && state.id == 2 do
        :one
      else
        value
      end
    end

    defp get_paths_for_round(state, round) do
      if round == 0 do
        [Integer.to_string(state.config.source)]
      else
        state.values
        |> Map.keys()
        |> Enum.filter(&(String.length(&1) == round + 1))
      end
    end

    defp majority_vote(state) do
      counts = Enum.reduce(state.values, %{zero: 0, one: 0, retreat: 0}, fn {_path, value}, acc ->
        Map.update!(acc, value, &(&1 + 1))
      end)
      
      Logger.debug("Process #{state.id} vote counts: #{inspect(counts)}")
      
      total_votes = Map.values(counts) |> Enum.sum()
      
      if total_votes == 0 do
        :retreat
      else
        majority_threshold = div(total_votes, 2)
        
        case Enum.find(counts, fn {_value, count} -> count > majority_threshold end) do
          {value, _count} -> value
          nil -> :retreat
        end
      end
    end
  end

  defmodule Engine do
    @moduledoc "Engine that orchestrates the Byzantine Generals algorithm"
    
    require Logger

    def run(source, num_rounds, num_processes, opts \\ []) do
      config = %Configuration{
        source: source,
        num_rounds: num_rounds,
        num_processes: num_processes
      }

      verbose = Keyword.get(opts, :verbose, true)
      
      if verbose do
        Logger.info("Starting Byzantine Generals: #{num_processes} processes, #{num_rounds} rounds, source: #{source}")
      end

      # Start processes
      processes = start_processes(config)
      
      start_time = :os.system_time(:millisecond)
      
      # Run algorithm rounds
      run_rounds(processes, num_rounds)
      
      end_time = :os.system_time(:millisecond)
      duration = end_time - start_time

      # Collect results
      {results, total_messages} = collect_results(processes, config, verbose)
      
      # Clean up
      cleanup_processes(processes)

      {duration, total_messages, results}
    end

    defp start_processes(config) do
      for id <- 0..(config.num_processes - 1) do
        {:ok, pid} = Process.start_link(%{id: id, config: config})
        {id, pid}
      end
    end

    defp run_rounds(processes, num_rounds, timeout \\ 30_000) do
      for round <- 0..(num_rounds - 1) do
        Logger.debug("Starting round #{round}")
        
        tasks = Enum.map(processes, fn {_id, pid} ->
          Task.async(fn -> 
            Process.send_messages(pid, round, processes)
          end)
        end)
        
        try do
          Task.await_many(tasks, timeout)
          # Small delay to ensure message ordering
          :timer.sleep(10)
        catch
          :exit, {:timeout, _} -> 
            Logger.error("Round #{round} timed out")
            throw(:timeout)
        end
      end
      :ok
    end

    defp collect_results(processes, _config, verbose) do
      total_messages = Enum.sum(Enum.map(processes, fn {_id, pid} ->
        Process.get_message_count(pid)
      end))

      results = Enum.map(processes, fn {id, pid} ->
        is_source = Process.is_source?(pid)
        is_faulty = Process.is_faulty?(pid)
        decision = if is_faulty, do: nil, else: Process.decide(pid)
        
        result = %{
          id: id,
          is_source: is_source,
          is_faulty: is_faulty,
          decision: decision
        }
        
        if verbose do
          print_process_result(result)
        end
        
        result
      end)

      {results, total_messages}
    end

    defp print_process_result(%{id: id, is_source: is_source, is_faulty: is_faulty, decision: decision}) do
      prefix = if is_source, do: "Source ", else: ""
      
      if is_faulty do
        IO.puts("#{prefix}Process #{id} is faulty")
      else
        IO.puts("#{prefix}Process #{id} decides on value #{decision}")
      end
    end

    defp cleanup_processes(processes) do
      Enum.each(processes, fn {_id, pid} -> 
        GenServer.stop(pid, :normal, 1000)
      end)
    end

    def benchmark(max_processes, opts \\ []) do
      verbose = Keyword.get(opts, :verbose, true)
      
      if verbose do
        IO.puts("Elixir Byzantine Generals Benchmark")
        IO.puts("===================================")
        IO.puts("Language,Processes,Rounds,Messages,Time(ms)")
      end
      
      test_cases = generate_test_cases(max_processes)
      
      results = Enum.map(test_cases, fn {processes, rounds} ->
        source = div(processes, 3)
        {time, messages, _results} = run(source, rounds, processes, verbose: false)
        
        result = %{
          language: "Elixir",
          processes: processes,
          rounds: rounds,
          messages: messages,
          time_ms: time
        }
        
        if verbose do
          IO.puts("Elixir,#{processes},#{rounds},#{messages},#{time}")
        end
        
        result
      end)
      
      results
    end

    defp generate_test_cases(max_processes) do
      for n <- 4..max_processes, rem(n - 1, 3) == 0 do
        for m <- 1..3, 3 * m + 1 <= n do
          {n, m}
        end
      end
      |> List.flatten()
    end
  end

  # Main module functions

  def run(source, num_rounds, num_processes, opts \\ []) do
    Engine.run(source, num_rounds, num_processes, opts)
  end

  def benchmark(max_processes \\ 20, opts \\ []) do
    Engine.benchmark(max_processes, opts)
  end

  def quick_test do
    IO.puts("Running quick test with 4 processes, 1 round...")
    {time, messages, results} = run(0, 1, 4)
    
    IO.puts("\nTest Results:")
    IO.puts("Time: #{time}ms")
    IO.puts("Messages: #{messages}")
    IO.puts("Processes reached consensus: #{length(results)}")
    IO.puts("? Test completed successfully")
    
    :ok
  end
end

defmodule ByzantineGenerals.Application do
  @moduledoc false
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Add supervised processes here if needed
    ]

    opts = [strategy: :one_for_one, name: ByzantineGenerals.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

Testing Erlang Implementation

-module(byzantine_generals).
-export([run/3, benchmark/1, quick_test/0, start/0, stop/0]).
-include_lib("kernel/include/logger.hrl").

-record(config, {source, num_rounds, num_processes}).
-record(process_state, {id, config, values, is_faulty, message_count, processes}).

%% Public API

%% Start the application
start() ->
    application:start(byzantine_generals).

%% Stop the application  
stop() ->
    application:stop(byzantine_generals).

%% Run the Byzantine Generals algorithm
run(Source, NumRounds, NumProcesses) ->
    ?LOG_INFO("Starting Byzantine Generals: ~p processes, ~p rounds, source: ~p", 
              [NumProcesses, NumRounds, Source]),
    
    Config = #config{source = Source, num_rounds = NumRounds, num_processes = NumProcesses},
    
    % Validate configuration
    case validate_config(Config) of
        ok -> 
            run_algorithm(Config);
        {error, Reason} ->
            ?LOG_ERROR("Invalid configuration: ~p", [Reason]),
            {error, Reason}
    end.

%% Run benchmark with different configurations
benchmark(MaxProcesses) ->
    ?LOG_INFO("Running Erlang Byzantine Generals Benchmark up to ~p processes", [MaxProcesses]),
    
    io:format("Erlang Byzantine Generals Benchmark~n"),
    io:format("===================================~n"),
    io:format("Language,Processes,Rounds,Messages,Time(ms)~n"),
    
    TestCases = generate_test_cases(MaxProcesses),
    
    Results = lists:map(fun({Processes, Rounds}) ->
        Source = Processes div 3,
        case run(Source, Rounds, Processes) of
            {ok, Time, Messages, _ProcessResults} ->
                io:format("Erlang,~p,~p,~p,~p~n", [Processes, Rounds, Messages, Time]),
                #{language => erlang, processes => Processes, rounds => Rounds, 
                  messages => Messages, time_ms => Time};
            {error, _Reason} ->
                #{error => true, processes => Processes, rounds => Rounds}
        end
    end, TestCases),
    
    Results.

%% Quick test function
quick_test() ->
    io:format("Running quick test with 4 processes, 1 round...~n"),
    case run(0, 1, 4) of
        {ok, Time, Messages, _Results} ->
            io:format("~nTest Results:~n"),
            io:format("Time: ~pms~n", [Time]),
            io:format("Messages: ~p~n", [Messages]),
            io:format("? Test completed successfully~n"),
            ok;
        {error, Reason} ->
            io:format("? Test failed: ~p~n", [Reason]),
            error
    end.

%% Internal functions

validate_config(#config{source = Source, num_rounds = NumRounds, num_processes = NumProcesses}) ->
    if
        NumProcesses < 4 ->
            {error, "Need at least 4 processes for Byzantine Generals Problem"};
        Source >= NumProcesses ->
            {error, "Source must be less than number of processes"};
        NumRounds < 1 ->
            {error, "Need at least 1 round"};
        true ->
            ok
    end.

run_algorithm(Config) ->
    % Start message counter
    CounterPid = spawn_link(fun() -> counter_loop(0) end),
    register(message_counter, CounterPid),
    
    StartTime = erlang:system_time(millisecond),
    
    try
        % Start processes
        ProcessPids = start_processes(Config),
        
        % Initialize processes with neighbor information
        initialize_processes(ProcessPids, Config),
        
        % Run algorithm rounds
        run_rounds(ProcessPids, Config),
        
        % Wait for completion
        timer:sleep(100),
        
        EndTime = erlang:system_time(millisecond),
        Duration = EndTime - StartTime,
        
        % Collect results
        {TotalMessages, ProcessResults} = collect_results(ProcessPids, Config),
        
        % Cleanup
        cleanup_processes(ProcessPids),
        unregister(message_counter),
        exit(CounterPid, normal),
        
        {ok, Duration, TotalMessages, ProcessResults}
        
    catch
        Class:Reason:Stacktrace ->
            ?LOG_ERROR("Algorithm failed: ~p:~p~n~p", [Class, Reason, Stacktrace]),
            % Cleanup on error
            catch unregister(message_counter),
            catch exit(CounterPid, kill),
            {error, {Class, Reason}}
    end.

start_processes(Config) ->
    NumProcesses = Config#config.num_processes,
    lists:map(fun(Id) -> 
        Pid = spawn_link(fun() -> process_loop(Id, Config) end),
        {Id, Pid}
    end, lists:seq(0, NumProcesses - 1)).

initialize_processes(ProcessPids, Config) ->
    lists:foreach(fun({_Id, Pid}) -> 
        Pid ! {init, ProcessPids, Config}
    end, ProcessPids).

run_rounds(ProcessPids, Config) ->
    NumRounds = Config#config.num_rounds,
    lists:foreach(fun(Round) ->
        ?LOG_DEBUG("Starting round ~p", [Round]),
        
        % Send messages for this round
        lists:foreach(fun({_Id, Pid}) ->
            Pid ! {send_messages, Round, self()}
        end, ProcessPids),
        
        % Wait for all processes to complete the round
        lists:foreach(fun({_Id, _Pid}) ->
            receive
                {round_complete, Round} -> ok
            after 5000 ->
                ?LOG_WARNING("Timeout waiting for round ~p completion", [Round])
            end
        end, ProcessPids),
        
        % Small delay between rounds
        timer:sleep(10)
    end, lists:seq(0, NumRounds - 1)).

collect_results(ProcessPids, Config) ->
    % Get total message count
    TotalMessages = get_message_count(),
    
    % Get process results
    ProcessResults = lists:map(fun({Id, Pid}) ->
        Pid ! {get_result, self()},
        receive
            {result, Id, Result} -> 
                print_process_result(Id, Result, Config#config.source),
                Result
        after 2000 ->
            ?LOG_WARNING("Timeout getting result from process ~p", [Id]),
            #{id => Id, error => timeout}
        end
    end, ProcessPids),
    
    {TotalMessages, ProcessResults}.

print_process_result(Id, Result, Source) ->
    Prefix = case Id of
        Source -> "Source ";
        _ -> ""
    end,
    
    case maps:get(is_faulty, Result, false) of
        true ->
            io:format("~sProcess ~p is faulty~n", [Prefix, Id]);
        false ->
            Decision = maps:get(decision, Result, retreat),
            io:format("~sProcess ~p decides on value ~p~n", [Prefix, Id, Decision])
    end.

cleanup_processes(ProcessPids) ->
    lists:foreach(fun({_Id, Pid}) -> 
        Pid ! stop,
        % Don't wait for exit - let them clean up
        ok
    end, ProcessPids).

generate_test_cases(MaxProcesses) ->
    lists:flatten([
        [{N, M} || M <- lists:seq(1, 3), 3 * M + 1 =< N]
        || N <- lists:seq(4, MaxProcesses, 3)
    ]).

%% Process implementation

process_loop(Id, Config) ->
    IsFaulty = (Id =:= Config#config.source) orelse (Id =:= 2),
    State = #process_state{
        id = Id, 
        config = Config, 
        values = #{}, 
        is_faulty = IsFaulty,
        message_count = 0,
        processes = []
    },
    ?LOG_DEBUG("Process ~p initialized, faulty: ~p", [Id, IsFaulty]),
    process_loop(State).

process_loop(State) ->
    receive
        {init, ProcessPids, Config} ->
            NewState = State#process_state{processes = ProcessPids, config = Config},
            process_loop(NewState);
            
        {receive_message, Path, Value} ->
            NewValues = maps:put(Path, Value, State#process_state.values),
            NewState = State#process_state{
                values = NewValues,
                message_count = State#process_state.message_count + 1
            },
            increment_message_count(),
            ?LOG_DEBUG("Process ~p received message: path=~s, value=~p", 
                      [State#process_state.id, Path, Value]),
            process_loop(NewState);
            
        {send_messages, Round, From} ->
            NewState = handle_send_messages(State, Round),
            From ! {round_complete, Round},
            process_loop(NewState);
            
        {get_result, From} ->
            Result = create_result(State),
            From ! {result, State#process_state.id, Result},
            process_loop(State);
            
        stop ->
            ?LOG_DEBUG("Process ~p stopping", [State#process_state.id]),
            ok;
            
        Other ->
            ?LOG_WARNING("Process ~p received unexpected message: ~p", 
                        [State#process_state.id, Other]),
            process_loop(State)
    end.

handle_send_messages(State, Round) ->
    Id = State#process_state.id,
    Config = State#process_state.config,
    
    if 
        Round =:= 0 andalso Id =:= Config#config.source ->
            send_initial_messages(State);
        Round > 0 ->
            relay_messages(State, Round);
        true ->
            State
    end.

send_initial_messages(State) ->
    BaseValue = zero,
    ProcessPids = State#process_state.processes,
    
    lists:foreach(fun({Id, Pid}) ->
        if Id =/= State#process_state.id ->
            Value = case State#process_state.is_faulty of
                true -> 
                    % Faulty commander sends different values
                    case Id rem 2 of
                        0 -> zero;
                        1 -> one
                    end;
                false -> 
                    BaseValue
            end,
            
            Pid ! {receive_message, integer_to_list(State#process_state.id), Value},
            ?LOG_DEBUG("Commander ~p sent ~p to process ~p", 
                      [State#process_state.id, Value, Id]);
        true -> 
            ok
        end
    end, ProcessPids),
    
    State.

relay_messages(State, Round) ->
    Paths = get_paths_for_round(State, Round - 1),
    ProcessPids = State#process_state.processes,
    
    lists:foreach(fun(Path) ->
        case maps:get(Path, State#process_state.values, undefined) of
            undefined -> 
                ok;
            Value ->
                NewValue = transform_value(State, Value),
                NewPath = Path ++ integer_to_list(State#process_state.id),
                
                lists:foreach(fun({Id, Pid}) ->
                    IdStr = integer_to_list(Id),
                    case Id =/= State#process_state.id of
                        true ->
                            case string:str(NewPath, IdStr) of
                                0 -> % IdStr not found in NewPath
                                    Pid ! {receive_message, NewPath, NewValue},
                                    ?LOG_DEBUG("Process ~p relayed ~p to ~p, path: ~s", 
                                              [State#process_state.id, NewValue, Id, NewPath]);
                                _ -> % IdStr found in NewPath, skip
                                    ok
                            end;
                        false -> % Same process, skip
                            ok
                    end
                end, ProcessPids)
        end
    end, Paths),
    
    State.

transform_value(State, Value) ->
    if State#process_state.is_faulty andalso State#process_state.id =:= 2 ->
        one;
    true ->
        Value
    end.

get_paths_for_round(State, Round) ->
    if Round =:= 0 ->
        [integer_to_list((State#process_state.config)#config.source)];
    true ->
        maps:fold(fun(Path, _Value, Acc) ->
            case length(Path) of
                Len when Len =:= Round + 1 -> [Path | Acc];
                _ -> Acc
            end
        end, [], State#process_state.values)
    end.

create_result(State) ->
    Decision = if State#process_state.id =:= (State#process_state.config)#config.source ->
        % Source process uses its own value
        case State#process_state.is_faulty of
            true -> one;
            false -> zero
        end;
    true ->
        majority_vote(State)
    end,
    
    #{
        id => State#process_state.id,
        is_source => State#process_state.id =:= (State#process_state.config)#config.source,
        is_faulty => State#process_state.is_faulty,
        decision => Decision,
        message_count => State#process_state.message_count
    }.

majority_vote(State) ->
    Values = maps:values(State#process_state.values),
    Counts = lists:foldl(fun(Value, Acc) ->
        maps:update_with(Value, fun(Count) -> Count + 1 end, 1, Acc)
    end, #{zero => 0, one => 0, retreat => 0}, Values),
    
    ?LOG_DEBUG("Process ~p vote counts: ~p", [State#process_state.id, Counts]),
    
    TotalVotes = maps:fold(fun(_Value, Count, Sum) -> Sum + Count end, 0, Counts),
    
    if TotalVotes =:= 0 ->
        retreat;
    true ->
        MajorityThreshold = TotalVotes div 2,
        case maps:fold(fun(Value, Count, Acc) ->
            if Count > MajorityThreshold -> Value;
            true -> Acc
            end
        end, retreat, Counts) of
            retreat -> retreat;
            Value -> Value
        end
    end.

%% Message counter implementation

counter_loop(Count) ->
    receive
        increment ->
            counter_loop(Count + 1);
        {get_count, From} ->
            From ! {count, Count},
            counter_loop(Count);
        reset ->
            counter_loop(0);
        stop ->
            ok;
        _ ->
            counter_loop(Count)
    end.

increment_message_count() ->
    case whereis(message_counter) of
        undefined -> ok;
        Pid -> Pid ! increment
    end.

get_message_count() ->
    case whereis(message_counter) of
        undefined -> 0;
        Pid ->
            Pid ! {get_count, self()},
            receive
                {count, Count} -> Count
            after 1000 -> 0
            end
    end.

Performance Analysis and Benchmarking

To properly benchmark these implementations, we need to consider several factors:

Metrics to Measure

  1. Execution Time: How long does the algorithm take?
  2. Message Count: How many messages are exchanged?
  3. Memory Usage: Peak memory consumption
  4. Scalability: How performance degrades with increasing generals
  5. CPU Utilization: How efficiently the languages use system resources

Modern Benchmarking Approach

// Example comprehensive benchmark
pub struct BenchmarkResults {
    pub language: String,
    pub num_processes: usize,
    pub num_rounds: usize,
    pub execution_time_ms: f64,
    pub messages_sent: usize,
    pub memory_peak_mb: f64,
    pub cpu_utilization: f64,
}

pub fn comprehensive_benchmark() {
    let test_cases = vec![
        (4, 1),   // Minimum viable case
        (7, 2),   // Small scale
        (10, 3),  // Medium scale
        (16, 5),  // Larger scale
    ];

    for (processes, rounds) in test_cases {
        // Rust benchmark
        let rust_result = benchmark_rust_detailed(processes, rounds);
        
        // Elixir benchmark (would call via Port)
        let elixir_result = benchmark_elixir_detailed(processes, rounds);
        
        // Erlang benchmark (would call via Port)
        let erlang_result = benchmark_erlang_detailed(processes, rounds);
        
        compare_results(vec![rust_result, elixir_result, erlang_result]);
    }
}

Real-World Implications

The performance characteristics matter significantly in different contexts:

Blockchain Applications

  • Latency-Critical: Rust’s performance advantage matters for high-frequency trading
  • Node Count: Elixir/Erlang’s superior scaling helps with large blockchain networks
  • Fault Tolerance: Actor model languages excel at handling network partitions

IoT and Edge Computing

  • Resource Constraints: Rust’s low memory footprint is crucial
  • Device Coordination: Byzantine fault tolerance becomes critical for autonomous systems

Financial Systems

  • Regulatory Requirements: Provable consensus algorithms are increasingly required
  • High Availability: Erlang’s fault tolerance model aligns with financial system needs

Future Directions

Looking ahead, several trends will likely shape how we think about Byzantine fault tolerance:

  • Quantum Computing: Post-quantum cryptography will change how we implement Byzantine fault-tolerant signatures and may require new consensus mechanisms.
  • Climate Considerations: Energy-efficient consensus mechanisms (like Proof of Stake) are becoming increasingly important as environmental concerns grow.
  • Regulatory Frameworks: Government regulations around cryptocurrencies and distributed systems may influence which Byzantine fault-tolerant algorithms are acceptable in different contexts.
  • Edge and IoT: As computing moves to the edge, Byzantine fault tolerance becomes crucial for coordinating potentially millions of small, unreliable devices.

Performance Analysis

To compare the implementations, I measured complete wall-clock execution time including language runtime startup and algorithm execution across different process counts (10 to 2000 processes) with 1 round each. Each configuration was tested 3 times to ensure consistency. These benchmarks focus on demonstrating algorithmic correctness and relative performance characteristics rather than highly optimized production implementations.

All source code is available at https://github.com/bhatti/byz-sample for those interested in running or improving these implementations.

Results Summary

Complete Execution Time (Wall-Clock) – Updated Results:

  1. Elixir: 535ms average (range: 455-762ms)
  2. Rust: 577ms average (range: 521-667ms)
  3. Erlang: 1460ms average (range: 1401-1629ms)

Detailed Performance Breakdown

ConfigurationElixir (ms)Rust (ms)Erlang (ms)Messages
10 processes47153314078
50 processes476545140647
100 processes528587142091
200 processes4825501425199
1000 processes5685911497998
2000 processes68766116101999

Key Findings

  • Elixir maintained consistent performance across different process counts, showing good scalability characteristics
  • Rust delivered predictable performance with minimal variance, demonstrating excellent memory safety guarantees
  • Erlang showed significantly higher execution times but maintained reliability across all test configurations
  • Message counts remained consistent across languages for equivalent configurations, confirming algorithmic correctness

The results show that as process count increases from 10 to 2000:

  • Elixir scales relatively well, with execution time increasing by ~45%
  • Rust shows similar scaling characteristics, with ~24% increase
  • Erlang maintains consistent performance overhead regardless of scale

Note: These benchmarks measure wall-clock time including runtime startup overhead. The performance differences may be influenced by implementation patterns (GenServer vs raw message passing) and language-specific optimizations rather than fundamental runtime capabilities.

Try It Yourself

The complete implementation is available at https://github.com/bhatti/byz-sample with:

# Clone and run benchmarks
git clone https://github.com/bhatti/byz-sample
cd byz-sample
make benchmark

Disclaimer: Above implementation of the Byzantine Generals Problem serves as a case study for evaluating distributed computing approaches across different programming paradigms rather than benchmarking specific implementations in these languages.

Conclusion

The Byzantine Generals Problem exemplifies how fundamental computer science research can unexpectedly become the foundation for revolutionary technology. What began as an abstract theoretical exercise in 1982 became the backbone of Bitcoin in 2008 and continues to be crucial for modern distributed systems. My 2007 exploration of this problem was motivated by curiosity about distributed computing and language performance. Today, understanding Byzantine fault tolerance is essential for anyone working with blockchain technology, distributed databases, or fault-tolerant systems.

Try the implementations yourself: https://github.com/bhatti/byz-sample


September 9, 2025

Dynamic Facets and Runtime Behavior Composition: Beyond Adaptive Object Models

Filed under: Computing — admin @ 7:28 pm

Background

In my previous blog of the Adaptive Object Model (AOM) pattern, I focused on dynamic schema evolution and metadata-driven architectures. However, there’s a complementary pattern that addresses a different but equally important challenge: how to compose behavior dynamically at runtime without modifying existing objects. I first saw this pattern in Voyager ORB’s “Dynamic Aggregation” and San Francisco Design Patterns: Blueprints for Business Software (Part-IV Dynamic Behavioral Patterns) in early 2000s, which has profound implications for building extensible systems. The facets pattern, also known as dynamic aggregation or extension objects, allows secondary objects (facets) to be attached to primary objects at runtime, effectively extending their capabilities without inheritance or modification. Unlike AOM, which focuses on schema flexibility, facets address behavioral composition – the ability to mix and match capabilities based on runtime requirements.

Facets Pattern

The facets pattern emerged from several key observations about real-world software systems:

  • Interface Segregation: Not every object needs every capability all the time. A User object might need audit trail capabilities in some contexts, caching in others, and validation in yet others.
  • Runtime Composition: The specific mix of capabilities often depends on runtime context – user permissions, configuration settings, or environmental factors that cannot be determined at compile time.
  • Separation of Concerns: Cross-cutting concerns like logging, security, and persistence should be composable without polluting domain objects.

Voyager ORB’s implementation demonstrated these principles elegantly:

// Voyager ORB example - attaching an account facet to an employee
IEmployee employee = new Employee("joe", "234-44-2678");
IFacets facets = Facets.of(employee);
IAccount account = (IAccount) facets.of(IAccount.class);
account.deposit(2000);

The beauty of this approach is that the Employee class knows nothing about accounting capabilities, yet the object can seamlessly provide financial operations when needed.

Modern Implementations

Let’s explore how this pattern can be implemented in modern languages, taking advantage of their unique strengths while maintaining the core principles.

Rust Implementation: Type-Safe Facet Composition

Rust’s type system and trait system provide excellent foundations for type-safe facet composition:

use std::collections::HashMap;
use std::any::{Any, TypeId};
use std::sync::RwLock;

// Core facet trait that all facets must implement
pub trait Facet: Any + Send + Sync {
    fn as_any(&self) -> &dyn Any;
    fn as_any_mut(&mut self) -> &mut dyn Any;
}

// Faceted object that can have facets attached
pub struct FacetedObject {
    facets: RwLock<HashMap<TypeId, Box<dyn Facet>>>,
    core_object: Box<dyn Any + Send + Sync>,
}

impl FacetedObject {
    pub fn new<T: Any + Send + Sync>(core: T) -> Self {
        Self {
            facets: RwLock::new(HashMap::new()),
            core_object: Box::new(core),
        }
    }

    // Attach a facet to this object
    pub fn attach_facet<F: Facet + 'static>(&self, facet: F) -> Result<(), String> {
        let type_id = TypeId::of::<F>();
        let mut facets = self.facets.write()
            .map_err(|_| "Failed to acquire write lock")?;
        
        if facets.contains_key(&type_id) {
            return Err(format!("Facet of type {:?} already attached", type_id));
        }
        
        facets.insert(type_id, Box::new(facet));
        Ok(())
    }

    // Execute an operation that requires a specific facet (safe callback pattern)
    pub fn with_facet<F: Facet + 'static, R>(
        &self, 
        operation: impl FnOnce(&F) -> R
    ) -> Result<R, String> {
        let facets = self.facets.read()
            .map_err(|_| "Failed to acquire read lock")?;
        let type_id = TypeId::of::<F>();
        
        if let Some(facet) = facets.get(&type_id) {
            if let Some(typed_facet) = facet.as_any().downcast_ref::<F>() {
                Ok(operation(typed_facet))
            } else {
                Err("Failed to downcast facet".to_string())
            }
        } else {
            Err(format!("Required facet not found: {:?}", type_id))
        }
    }

    // Execute a mutable operation on a facet
    pub fn with_facet_mut<F: Facet + 'static, R>(
        &self,
        operation: impl FnOnce(&mut F) -> R
    ) -> Result<R, String> {
        let mut facets = self.facets.write()
            .map_err(|_| "Failed to acquire write lock")?;
        let type_id = TypeId::of::<F>();
        
        if let Some(facet) = facets.get_mut(&type_id) {
            if let Some(typed_facet) = facet.as_any_mut().downcast_mut::<F>() {
                Ok(operation(typed_facet))
            } else {
                Err("Failed to downcast facet".to_string())
            }
        } else {
            Err(format!("Required facet not found: {:?}", type_id))
        }
    }

    // Check if a facet is attached
    pub fn has_facet<F: Facet + 'static>(&self) -> bool {
        let facets = self.facets.read().unwrap();
        let type_id = TypeId::of::<F>();
        facets.contains_key(&type_id)
    }

    // Get the core object
    pub fn get_core<T: 'static>(&self) -> Option<&T> {
        self.core_object.downcast_ref::<T>()
    }
}

// Example domain object
#[derive(Debug)]
pub struct Employee {
    pub name: String,
    pub id: String,
    pub department: String,
}

impl Employee {
    pub fn new(name: &str, id: &str, department: &str) -> Self {
        Self {
            name: name.to_string(),
            id: id.to_string(),
            department: department.to_string(),
        }
    }
}

// Account facet for financial operations
#[derive(Debug)]
pub struct AccountFacet {
    balance: f64,
    account_number: String,
}

impl AccountFacet {
    pub fn new(account_number: &str) -> Self {
        Self {
            balance: 0.0,
            account_number: account_number.to_string(),
        }
    }

    pub fn deposit(&mut self, amount: f64) -> Result<f64, String> {
        if amount <= 0.0 {
            return Err("Deposit amount must be positive".to_string());
        }
        self.balance += amount;
        Ok(self.balance)
    }

    pub fn withdraw(&mut self, amount: f64) -> Result<f64, String> {
        if amount <= 0.0 {
            return Err("Withdrawal amount must be positive".to_string());
        }
        if amount > self.balance {
            return Err("Insufficient funds".to_string());
        }
        self.balance -= amount;
        Ok(self.balance)
    }

    pub fn get_balance(&self) -> f64 {
        self.balance
    }

    pub fn get_account_number(&self) -> &str {
        &self.account_number
    }
}

impl Facet for AccountFacet {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
}

// Audit trail facet for tracking operations
#[derive(Debug)]
pub struct AuditFacet {
    entries: Vec<AuditEntry>,
}

#[derive(Debug, Clone)]
pub struct AuditEntry {
    timestamp: std::time::SystemTime,
    operation: String,
    details: String,
}

impl AuditFacet {
    pub fn new() -> Self {
        Self {
            entries: Vec::new(),
        }
    }

    pub fn log_operation(&mut self, operation: &str, details: &str) {
        self.entries.push(AuditEntry {
            timestamp: std::time::SystemTime::now(),
            operation: operation.to_string(),
            details: details.to_string(),
        });
    }

    pub fn get_audit_trail(&self) -> &[AuditEntry] {
        &self.entries
    }

    pub fn get_recent_entries(&self, count: usize) -> &[AuditEntry] {
        let start = if self.entries.len() > count {
            self.entries.len() - count
        } else {
            0
        };
        &self.entries[start..]
    }
}

impl Facet for AuditFacet {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
}

// Permission facet for access control
#[derive(Debug)]
pub struct PermissionFacet {
    permissions: HashMap<String, bool>,
    role: String,
}

impl PermissionFacet {
    pub fn new(role: &str) -> Self {
        let mut permissions = HashMap::new();
        
        // Define role-based permissions
        match role {
            "admin" => {
                permissions.insert("read".to_string(), true);
                permissions.insert("write".to_string(), true);
                permissions.insert("delete".to_string(), true);
                permissions.insert("financial_operations".to_string(), true);
            },
            "manager" => {
                permissions.insert("read".to_string(), true);
                permissions.insert("write".to_string(), true);
                permissions.insert("financial_operations".to_string(), true);
            },
            "employee" => {
                permissions.insert("read".to_string(), true);
            },
            _ => {}
        }

        Self {
            permissions,
            role: role.to_string(),
        }
    }

    pub fn has_permission(&self, permission: &str) -> bool {
        self.permissions.get(permission).copied().unwrap_or(false)
    }

    pub fn grant_permission(&mut self, permission: &str) {
        self.permissions.insert(permission.to_string(), true);
    }

    pub fn revoke_permission(&mut self, permission: &str) {
        self.permissions.insert(permission.to_string(), false);
    }

    pub fn get_role(&self) -> &str {
        &self.role
    }
}

impl Facet for PermissionFacet {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
}

// Composite operations that work across facets
pub struct EmployeeOperations;

impl EmployeeOperations {
    pub fn perform_financial_operation<F>(
        employee_obj: &FacetedObject,
        mut operation: F,
    ) -> Result<String, String> 
    where
        F: FnMut(&mut AccountFacet) -> Result<f64, String>,
    {
        // Check permissions first
        let has_permission = employee_obj.with_facet::<PermissionFacet, bool>(|permissions| {
            permissions.has_permission("financial_operations")
        }).unwrap_or(false);

        if !has_permission {
            return Err("Access denied: insufficient permissions for financial operations".to_string());
        }

        // Get employee info for logging
        let employee_name = employee_obj.get_core::<Employee>()
            .map(|emp| emp.name.clone())
            .unwrap_or_else(|| "Unknown".to_string());

        // Perform the operation
        let result = employee_obj.with_facet_mut::<AccountFacet, Result<f64, String>>(|account| {
            operation(account)
        })?;

        let balance = result?;

        // Log the operation if audit facet is present
        let _ = employee_obj.with_facet_mut::<AuditFacet, ()>(|audit| {
            audit.log_operation("financial_operation", &format!("New balance: {}", balance));
        });

        Ok(format!("Financial operation completed for {}. New balance: {}", employee_name, balance))
    }

    pub fn get_employee_summary(employee_obj: &FacetedObject) -> String {
        let mut summary = String::new();

        // Core employee information
        if let Some(employee) = employee_obj.get_core::<Employee>() {
            summary.push_str(&format!("Employee: {} (ID: {})\n", employee.name, employee.id));
            summary.push_str(&format!("Department: {}\n", employee.department));
        }

        // Account information if available
        let account_info = employee_obj.with_facet::<AccountFacet, String>(|account| {
            format!("Account: {} (Balance: ${:.2})\n", 
                account.get_account_number(), account.get_balance())
        }).unwrap_or_else(|_| "No account information\n".to_string());
        summary.push_str(&account_info);

        // Permission information if available
        let permission_info = employee_obj.with_facet::<PermissionFacet, String>(|permissions| {
            format!("Role: {}\n", permissions.get_role())
        }).unwrap_or_else(|_| "No permission information\n".to_string());
        summary.push_str(&permission_info);

        // Audit information if available
        let audit_info = employee_obj.with_facet::<AuditFacet, String>(|audit| {
            let recent_entries = audit.get_recent_entries(3);
            if !recent_entries.is_empty() {
                let mut info = "Recent Activity:\n".to_string();
                for entry in recent_entries {
                    info.push_str(&format!("  - {:?}: {} ({})\n", 
                        entry.timestamp,
                        entry.operation, 
                        entry.details));
                }
                info
            } else {
                "No recent activity\n".to_string()
            }
        }).unwrap_or_else(|_| "No audit information\n".to_string());
        summary.push_str(&audit_info);

        summary
    }
}

// Usage example
fn example_usage() -> Result<(), String> {
    println!("=== Dynamic Facets Example ===");

    // Create an employee
    let employee = Employee::new("Alice Johnson", "EMP001", "Engineering");
    let employee_obj = FacetedObject::new(employee);

    // Attach different facets based on requirements
    employee_obj.attach_facet(AccountFacet::new("ACC001"))?;
    employee_obj.attach_facet(PermissionFacet::new("manager"))?;
    employee_obj.attach_facet(AuditFacet::new())?;

    println!("Facets attached successfully!");

    // Use facets through the composite object
    let summary = EmployeeOperations::get_employee_summary(&employee_obj);
    println!("\nEmployee Summary:\n{}", summary);

    // Attempt financial operation (deposit)
    let result = EmployeeOperations::perform_financial_operation(
        &employee_obj,
        |account| account.deposit(1000.0)
    )?;
    println!("Deposit result: {}", result);

    // Attempt another financial operation (withdrawal)
    let result = EmployeeOperations::perform_financial_operation(
        &employee_obj,
        |account| account.withdraw(250.0)
    )?;
    println!("Withdrawal result: {}", result);

    // Display final summary
    let final_summary = EmployeeOperations::get_employee_summary(&employee_obj);
    println!("\nFinal Employee Summary:\n{}", final_summary);

    Ok(())
}

fn main() {
    match example_usage() {
        Ok(_) => println!("\nFacet composition example completed successfully."),
        Err(e) => eprintln!("Error: {}", e),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_facet_attachment() {
        let employee = Employee::new("Test User", "TEST001", "Engineering");
        let employee_obj = FacetedObject::new(employee);

        // Test attaching facets
        assert!(employee_obj.attach_facet(AccountFacet::new("ACC001")).is_ok());
        assert!(employee_obj.has_facet::<AccountFacet>());

        // Test duplicate attachment fails
        assert!(employee_obj.attach_facet(AccountFacet::new("ACC002")).is_err());
    }

    #[test]
    fn test_financial_operations() {
        let employee = Employee::new("Test User", "TEST001", "Engineering");
        let employee_obj = FacetedObject::new(employee);

        employee_obj.attach_facet(AccountFacet::new("ACC001")).unwrap();
        employee_obj.attach_facet(PermissionFacet::new("manager")).unwrap();

        // Test deposit
        let result = employee_obj.with_facet_mut::<AccountFacet, Result<f64, String>>(|account| {
            account.deposit(1000.0)
        }).unwrap();

        assert_eq!(result.unwrap(), 1000.0);

        // Test balance check
        let balance = employee_obj.with_facet::<AccountFacet, f64>(|account| {
            account.get_balance()
        }).unwrap();

        assert_eq!(balance, 1000.0);
    }

    #[test]
    fn test_permission_checking() {
        let employee = Employee::new("Test User", "TEST001", "Engineering");
        let employee_obj = FacetedObject::new(employee);

        employee_obj.attach_facet(PermissionFacet::new("employee")).unwrap();

        let has_financial = employee_obj.with_facet::<PermissionFacet, bool>(|permissions| {
            permissions.has_permission("financial_operations")
        }).unwrap();

        assert_eq!(has_financial, false);

        let has_read = employee_obj.with_facet::<PermissionFacet, bool>(|permissions| {
            permissions.has_permission("read")
        }).unwrap();

        assert_eq!(has_read, true);
    }
}

The Rust implementation provides several key advantages:

  • Type Safety: The type system ensures that facets can only be cast to their correct types
  • Memory Safety: Rust’s ownership model prevents common issues with shared mutable state
  • Performance: Zero-cost abstractions mean the facet system has minimal runtime overhead
  • Concurrency: Built-in thread safety through Send and Sync traits

TypeScript Implementation: Dynamic Composition with Type Safety

TypeScript’s type system allows for sophisticated compile-time checking while maintaining JavaScript’s dynamic nature:

// Base interfaces for the facet system

// Base interfaces for the facet system
interface Facet {
  readonly facetType: string;
}

interface FacetConstructor<T extends Facet> {
  new(...args: any[]): T;
  readonly facetType: string;
}

// Core faceted object implementation
class FacetedObject<TCore = any> {
  private facets: Map<string, Facet> = new Map();
  private core: TCore;

  constructor(core: TCore) {
    this.core = core;
  }

  // Attach a facet to this object
  attachFacet<T extends Facet>(FacetClass: FacetConstructor<T>, ...args: any[]): T {
    const facet = new FacetClass(...args);
    
    if (this.facets.has(FacetClass.facetType)) {
      throw new Error(`Facet ${FacetClass.facetType} already attached`);
    }
    
    this.facets.set(FacetClass.facetType, facet);
    return facet;
  }

  // Get a facet by its constructor
  getFacet<T extends Facet>(FacetClass: FacetConstructor<T>): T | undefined {
    const facet = this.facets.get(FacetClass.facetType);
    return facet as T | undefined;
  }

  // Check if a facet is attached
  hasFacet<T extends Facet>(FacetClass: FacetConstructor<T>): boolean {
    return this.facets.has(FacetClass.facetType);
  }

  // Remove a facet
  removeFacet<T extends Facet>(FacetClass: FacetConstructor<T>): boolean {
    return this.facets.delete(FacetClass.facetType);
  }

  // Get the core object
  getCore(): TCore {
    return this.core;
  }

  // Execute operation with facet requirement checking
  withFacet<T extends Facet, R>(
    FacetClass: FacetConstructor<T>,
    operation: (facet: T) => R
  ): R {
    const facet = this.getFacet(FacetClass);
    if (!facet) {
      throw new Error(`Required facet ${FacetClass.facetType} not found`);
    }
    return operation(facet);
  }

  // Get all attached facet types
  getAttachedFacetTypes(): string[] {
    return Array.from(this.facets.keys());
  }
}

// Example domain objects
interface Employee {
  name: string;
  id: string;
  department: string;
  email: string;
}

class EmployeeImpl implements Employee {
  constructor(
    public name: string,
    public id: string,
    public department: string,
    public email: string
  ) {}
}

// Account facet for financial operations
class AccountFacet implements Facet {
  static readonly facetType = 'account';
  readonly facetType = AccountFacet.facetType;

  private balance: number = 0;
  private accountNumber: string;
  private transactions: Transaction[] = [];

  constructor(accountNumber: string, initialBalance: number = 0) {
    this.accountNumber = accountNumber;
    this.balance = initialBalance;
  }

  deposit(amount: number): number {
    if (amount <= 0) {
      throw new Error('Deposit amount must be positive');
    }
    
    this.balance += amount;
    this.transactions.push({
      type: 'deposit',
      amount,
      timestamp: new Date(),
      balanceAfter: this.balance
    });
    
    return this.balance;
  }

  withdraw(amount: number): number {
    if (amount <= 0) {
      throw new Error('Withdrawal amount must be positive');
    }
    
    if (amount > this.balance) {
      throw new Error('Insufficient funds');
    }
    
    this.balance -= amount;
    this.transactions.push({
      type: 'withdrawal',
      amount,
      timestamp: new Date(),
      balanceAfter: this.balance
    });
    
    return this.balance;
  }

  getBalance(): number {
    return this.balance;
  }

  getAccountNumber(): string {
    return this.accountNumber;
  }

  getTransactionHistory(): Transaction[] {
    return [...this.transactions];
  }

  getRecentTransactions(count: number): Transaction[] {
    return this.transactions.slice(-count);
  }
}

interface Transaction {
  type: 'deposit' | 'withdrawal';
  amount: number;
  timestamp: Date;
  balanceAfter: number;
}

// Notification facet for alerting
class NotificationFacet implements Facet {
  static readonly facetType = 'notification';
  readonly facetType = NotificationFacet.facetType;

  private subscribers: Map<string, NotificationHandler[]> = new Map();

  subscribe(eventType: string, handler: NotificationHandler): void {
    if (!this.subscribers.has(eventType)) {
      this.subscribers.set(eventType, []);
    }
    this.subscribers.get(eventType)!.push(handler);
  }

  unsubscribe(eventType: string, handler: NotificationHandler): boolean {
    const handlers = this.subscribers.get(eventType);
    if (!handlers) return false;
    
    const index = handlers.indexOf(handler);
    if (index !== -1) {
      handlers.splice(index, 1);
      return true;
    }
    return false;
  }

  notify(eventType: string, data: any): void {
    const handlers = this.subscribers.get(eventType) || [];
    handlers.forEach(handler => {
      try {
        handler(eventType, data);
      } catch (error) {
        const errorMessage = error instanceof Error ? error.message : 'Unknown error';
        console.error(`Notification handler error for ${eventType}:`, errorMessage);
      }
    });
  }

  getSubscriberCount(eventType: string): number {
    return this.subscribers.get(eventType)?.length || 0;
  }
}

type NotificationHandler = (eventType: string, data: any) => void;

// Cache facet for performance optimization
class CacheFacet implements Facet {
  static readonly facetType = 'cache';
  readonly facetType = CacheFacet.facetType;

  private cache: Map<string, CacheEntry> = new Map();
  private maxSize: number;
  private defaultTTL: number;

  constructor(maxSize: number = 100, defaultTTL: number = 300000) { // 5 minutes default
    this.maxSize = maxSize;
    this.defaultTTL = defaultTTL;
  }

  set<T>(key: string, value: T, ttl?: number): void {
    // Remove oldest entries if cache is full
    if (this.cache.size >= this.maxSize) {
      const oldestKey = this.cache.keys().next().value;
      if (oldestKey !== undefined) {
        this.cache.delete(oldestKey);
      }
    }

    this.cache.set(key, {
      value,
      timestamp: Date.now(),
      ttl: ttl || this.defaultTTL
    });
  }

  get<T>(key: string): T | undefined {
    const entry = this.cache.get(key);
    if (!entry) return undefined;

    // Check if entry has expired
    if (Date.now() - entry.timestamp > entry.ttl) {
      this.cache.delete(key);
      return undefined;
    }

    return entry.value as T;
  }

  has(key: string): boolean {
    const entry = this.cache.get(key);
    if (!entry) return false;

    // Check if entry has expired
    if (Date.now() - entry.timestamp > entry.ttl) {
      this.cache.delete(key);
      return false;
    }

    return true;
  }

  invalidate(key: string): boolean {
    return this.cache.delete(key);
  }

  clear(): void {
    this.cache.clear();
  }

  getStats(): CacheStats {
    return {
      size: this.cache.size,
      maxSize: this.maxSize,
      hitRate: 0 // Would need to track hits/misses for real implementation
    };
  }
}

interface CacheEntry {
  value: any;
  timestamp: number;
  ttl: number;
}

interface CacheStats {
  size: number;
  maxSize: number;
  hitRate: number;
}

// Permission facet with role-based access control
class PermissionFacet implements Facet {
  static readonly facetType = 'permission';
  readonly facetType = PermissionFacet.facetType;

  private permissions: Set<string> = new Set();
  private role: string;

  constructor(role: string) {
    this.role = role;
    this.initializeRolePermissions(role);
  }

  private initializeRolePermissions(role: string): void {
    const rolePermissions: Record<string, string[]> = {
      'admin': ['read', 'write', 'delete', 'financial', 'admin'],
      'manager': ['read', 'write', 'financial', 'manage_team'],
      'employee': ['read', 'view_profile'],
      'guest': ['read']
    };

    const perms = rolePermissions[role] || [];
    perms.forEach(perm => this.permissions.add(perm));
  }

  hasPermission(permission: string): boolean {
    return this.permissions.has(permission);
  }

  grantPermission(permission: string): void {
    this.permissions.add(permission);
  }

  revokePermission(permission: string): void {
    this.permissions.delete(permission);
  }

  getPermissions(): string[] {
    return Array.from(this.permissions);
  }

  getRole(): string {
    return this.role;
  }

  requirePermission(permission: string): void {
    if (!this.hasPermission(permission)) {
      throw new Error(`Access denied: missing permission '${permission}'`);
    }
  }
}

// Composite operations using multiple facets
class EmployeeService {
  static performSecureFinancialOperation(
    employeeObj: FacetedObject<Employee>,
    operation: (account: AccountFacet) => number,
    operationType: string
  ): number {
    // Check permissions
    const permissions = employeeObj.getFacet(PermissionFacet);
    if (permissions) {
      permissions.requirePermission('financial');
    }

    // Perform operation
    const result = employeeObj.withFacet(AccountFacet, operation);

    // Send notification if facet is available
    const notifications = employeeObj.getFacet(NotificationFacet);
    if (notifications) {
      notifications.notify('financial_operation', {
        employee: employeeObj.getCore().name,
        operation: operationType,
        timestamp: new Date()
      });
    }

    // Invalidate related cache entries
    const cache = employeeObj.getFacet(CacheFacet);
    if (cache) {
      cache.invalidate(`balance_${employeeObj.getCore().id}`);
      cache.invalidate(`transactions_${employeeObj.getCore().id}`);
    }

    return result;
  }

  static getEmployeeSummary(employeeObj: FacetedObject<Employee>): string {
    const employee = employeeObj.getCore();
    const facetTypes = employeeObj.getAttachedFacetTypes();
    
    let summary = `Employee: ${employee.name} (${employee.id})\n`;
    summary += `Department: ${employee.department}\n`;
    summary += `Email: ${employee.email}\n`;
    summary += `Active Facets: ${facetTypes.join(', ')}\n`;

    // Add account information if available
    const account = employeeObj.getFacet(AccountFacet);
    if (account) {
      summary += `Account: ${account.getAccountNumber()} (Balance: $${account.getBalance().toFixed(2)})\n`;
      
      const recentTransactions = account.getRecentTransactions(3);
      if (recentTransactions.length > 0) {
        summary += 'Recent Transactions:\n';
        recentTransactions.forEach(tx => {
          summary += `  ${tx.type}: $${tx.amount.toFixed(2)} on ${tx.timestamp.toLocaleString()}\n`;
        });
      }
    }

    // Add permission information if available
    const permissions = employeeObj.getFacet(PermissionFacet);
    if (permissions) {
      summary += `Role: ${permissions.getRole()}\n`;
      summary += `Permissions: ${permissions.getPermissions().join(', ')}\n`;
    }

    // Add cache stats if available
    const cache = employeeObj.getFacet(CacheFacet);
    if (cache) {
      const stats = cache.getStats();
      summary += `Cache: ${stats.size}/${stats.maxSize} entries\n`;
    }

    return summary;
  }

  static configureEmployeeCapabilities(
    employeeObj: FacetedObject<Employee>,
    config: EmployeeConfig
  ): void {
    // Attach facets based on configuration
    if (config.hasAccount) {
      employeeObj.attachFacet(AccountFacet, config.accountNumber, config.initialBalance);
    }

    if (config.role) {
      employeeObj.attachFacet(PermissionFacet, config.role);
    }

    if (config.enableNotifications) {
      const notifications = employeeObj.attachFacet(NotificationFacet);
      
      // Set up default notification handlers
      notifications.subscribe('financial_operation', (eventType, data) => {
        console.log(`Financial operation performed: ${JSON.stringify(data)}`);
      });
    }

    if (config.enableCaching) {
      employeeObj.attachFacet(CacheFacet, config.cacheSize, config.cacheTTL);
    }
  }
}

interface EmployeeConfig {
  hasAccount?: boolean;
  accountNumber?: string;
  initialBalance?: number;
  role?: string;
  enableNotifications?: boolean;
  enableCaching?: boolean;
  cacheSize?: number;
  cacheTTL?: number;
}

// Usage example
function demonstrateFacetComposition(): void {
  console.log('=== Dynamic Facet Composition Demo ===');

  // Create an employee
  const employee = new EmployeeImpl('Bob Smith', 'EMP002', 'Finance', 'bob.smith@company.com');
  const employeeObj = new FacetedObject(employee);

  // Configure capabilities based on requirements
  EmployeeService.configureEmployeeCapabilities(employeeObj, {
    hasAccount: true,
    accountNumber: 'ACC002',
    initialBalance: 500,
    role: 'manager',
    enableNotifications: true,
    enableCaching: true,
    cacheSize: 50,
    cacheTTL: 600000 // 10 minutes
  });

  // Display initial summary
  console.log('\nInitial Employee Summary:');
  console.log(EmployeeService.getEmployeeSummary(employeeObj));

  // Perform financial operations
  try {
    const newBalance = EmployeeService.performSecureFinancialOperation(
      employeeObj,
      (account) => account.deposit(1000),
      'deposit'
    );
    console.log(`Deposit successful. New balance: $${newBalance.toFixed(2)}`);

    const finalBalance = EmployeeService.performSecureFinancialOperation(
      employeeObj,
      (account) => account.withdraw(200),
      'withdrawal'
    );
    console.log(`Withdrawal successful. Final balance: $${finalBalance.toFixed(2)}`);

  } catch (error) {
    const errorMessage = error instanceof Error ? error.message : 'Unknown error occurred';
    console.error('Operation failed:', errorMessage);
  }

  // Display final summary
  console.log('\nFinal Employee Summary:');
  console.log(EmployeeService.getEmployeeSummary(employeeObj));
}

// Run the demonstration
demonstrateFacetComposition();

The TypeScript implementation provides:

  • Type Safety: Compile-time type checking for facet operations
  • IntelliSense Support: Rich IDE support with autocompletion and error detection
  • Interface Segregation: Clean separation between different capabilities
  • Dynamic Composition: Runtime attachment and detachment of behaviors

Ruby Implementation: Metaprogramming-Powered Facets

Ruby’s metaprogramming capabilities make facet implementation particularly elegant:

require 'date'
require 'set'
require 'json'

# Core facet module that all facets include
module Facet
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    def facet_type
      @facet_type ||= name.downcase.gsub(/facet$/, '')
    end

    def facet_type=(type)
      @facet_type = type
    end
  end

  def facet_type
    self.class.facet_type
  end
end

# Main faceted object implementation
class FacetedObject
  def initialize(core_object)
    @core_object = core_object
    @facets = {}
    @method_cache = {}
    
    # Enable method delegation
    extend_with_facet_methods
  end

  def attach_facet(facet_instance)
    facet_type = facet_instance.facet_type
    
    if @facets.key?(facet_type)
      raise ArgumentError, "Facet '#{facet_type}' already attached"
    end

    @facets[facet_type] = facet_instance
    
    # Add facet methods to this instance
    add_facet_methods(facet_instance)
    
    # Call initialization hook if facet defines it
    facet_instance.on_attached(self) if facet_instance.respond_to?(:on_attached)
    
    facet_instance
  end

  def detach_facet(facet_type_or_class)
    facet_type = case facet_type_or_class
                 when String
                   facet_type_or_class
                 when Class
                   facet_type_or_class.facet_type
                 else
                   facet_type_or_class.facet_type
                 end

    facet = @facets.delete(facet_type)
    
    if facet
      # Remove facet methods
      remove_facet_methods(facet)
      
      # Call cleanup hook if facet defines it
      facet.on_detached(self) if facet.respond_to?(:on_detached)
    end
    
    facet
  end

  def get_facet(facet_type_or_class)
    facet_type = case facet_type_or_class
                 when String
                   facet_type_or_class
                 when Class
                   facet_type_or_class.facet_type
                 else
                   facet_type_or_class.facet_type
                 end

    @facets[facet_type]
  end

  def has_facet?(facet_type_or_class)
    !get_facet(facet_type_or_class).nil?
  end

  def facet_types
    @facets.keys
  end

  def core_object
    @core_object
  end

  def with_facet(facet_type_or_class)
    facet = get_facet(facet_type_or_class)
    raise ArgumentError, "Facet not found: #{facet_type_or_class}" unless facet
    
    yield(facet)
  end

  # Require specific facets for an operation
  def requires_facets(*facet_types, &block)
    missing_facets = facet_types.select { |type| !has_facet?(type) }
    
    unless missing_facets.empty?
      raise ArgumentError, "Missing required facets: #{missing_facets.join(', ')}"
    end
    
    block.call(self) if block_given?
  end

  private

  def extend_with_facet_methods
    # Add method_missing to handle facet method calls
    singleton_class.class_eval do
      define_method :method_missing do |method_name, *args, &block|
        # Try to find the method in attached facets
        @facets.values.each do |facet|
          if facet.respond_to?(method_name)
            return facet.send(method_name, *args, &block)
          end
        end
        
        # Try the core object
        if @core_object.respond_to?(method_name)
          return @core_object.send(method_name, *args, &block)
        end
        
        super(method_name, *args, &block)
      end

      define_method :respond_to_missing? do |method_name, include_private = false|
        @facets.values.any? { |facet| facet.respond_to?(method_name, include_private) } ||
          @core_object.respond_to?(method_name, include_private) ||
          super(method_name, include_private)
      end
    end
  end

  def add_facet_methods(facet)
    facet.public_methods(false).each do |method_name|
      next if method_name == :facet_type

      # Create a delegating method for each public method of the facet
      singleton_class.class_eval do
        define_method("#{facet.facet_type}_#{method_name}") do |*args, &block|
          facet.send(method_name, *args, &block)
        end
      end
    end
  end

  def remove_facet_methods(facet)
    facet.public_methods(false).each do |method_name|
      method_to_remove = "#{facet.facet_type}_#{method_name}"
      
      if respond_to?(method_to_remove)
        singleton_class.class_eval do
          remove_method(method_to_remove) if method_defined?(method_to_remove)
        end
      end
    end
  end
end

# Example domain class
class Employee
  attr_accessor :name, :id, :department, :email, :hire_date

  def initialize(name, id, department, email, hire_date = Date.today)
    @name = name
    @id = id
    @department = department
    @email = email
    @hire_date = hire_date
  end

  def years_of_service
    ((Date.today - @hire_date) / 365.25).to_i
  end

  def to_h
    {
      name: @name,
      id: @id,
      department: @department,
      email: @email,
      hire_date: @hire_date,
      years_of_service: years_of_service
    }
  end
end

# Account facet for financial operations
class AccountFacet
  include Facet
  
  attr_reader :account_number, :balance

  def initialize(account_number, initial_balance = 0)
    @account_number = account_number
    @balance = initial_balance.to_f
    @transactions = []
  end

  def deposit(amount)
    raise ArgumentError, "Amount must be positive" unless amount > 0
    
    @balance += amount
    log_transaction('deposit', amount)
    @balance
  end

  def withdraw(amount)
    raise ArgumentError, "Amount must be positive" unless amount > 0
    raise ArgumentError, "Insufficient funds" if amount > @balance
    
    @balance -= amount
    log_transaction('withdrawal', amount)
    @balance
  end

  def transfer_to(target_account_number, amount)
    raise ArgumentError, "Cannot transfer to same account" if target_account_number == @account_number
    
    withdraw(amount)
    log_transaction('transfer_out', amount, target_account_number)
    amount
  end

  def receive_transfer(from_account_number, amount)
    deposit(amount)
    log_transaction('transfer_in', amount, from_account_number)
    @balance
  end

  def transaction_history(limit = nil)
    limit ? @transactions.last(limit) : @transactions.dup
  end

  def monthly_summary(year, month)
    start_date = Date.new(year, month, 1)
    end_date = start_date.next_month - 1
    
    monthly_transactions = @transactions.select do |tx|
      tx[:timestamp].to_date.between?(start_date, end_date)
    end

    {
      period: "#{year}-#{month.to_s.rjust(2, '0')}",
      transactions: monthly_transactions,
      total_deposits: monthly_transactions.select { |tx| tx[:type] == 'deposit' }.sum { |tx| tx[:amount] },
      total_withdrawals: monthly_transactions.select { |tx| tx[:type] == 'withdrawal' }.sum { |tx| tx[:amount] }
    }
  end

  private

  def log_transaction(type, amount, reference = nil)
    @transactions << {
      type: type,
      amount: amount,
      balance_after: @balance,
      timestamp: Time.now,
      reference: reference
    }
  end
end

# Performance tracking facet
class PerformanceFacet
  include Facet
  
  def initialize
    @metrics = {}
    @goals = {}
    @reviews = []
  end

  def set_metric(name, value, period = Date.today)
    @metrics[name] ||= []
    @metrics[name] << { value: value, period: period, timestamp: Time.now }
  end

  def get_metric(name, period = nil)
    return nil unless @metrics[name]
    
    if period
      @metrics[name].find { |m| m[:period] == period }&.fetch(:value)
    else
      @metrics[name].last&.fetch(:value)
    end
  end

  def set_goal(name, target_value, deadline)
    @goals[name] = { target: target_value, deadline: deadline, set_on: Date.today }
  end

  def goal_progress(name)
    goal = @goals[name]
    return nil unless goal
    
    current_value = get_metric(name)
    return nil unless current_value
    
    progress = (current_value.to_f / goal[:target]) * 100
    {
      goal: goal,
      current_value: current_value,
      progress_percentage: progress.round(2),
      days_remaining: (goal[:deadline] - Date.today).to_i
    }
  end

  def add_review(rating, comments, reviewer, review_date = Date.today)
    @reviews << {
      rating: rating,
      comments: comments,
      reviewer: reviewer,
      review_date: review_date,
      timestamp: Time.now
    }
  end

  def average_rating(last_n_reviews = nil)
    reviews_to_consider = last_n_reviews ? @reviews.last(last_n_reviews) : @reviews
    return 0 if reviews_to_consider.empty?
    
    total = reviews_to_consider.sum { |review| review[:rating] }
    (total.to_f / reviews_to_consider.size).round(2)
  end

  def performance_summary
    {
      metrics: @metrics.transform_values { |values| values.last },
      goals: @goals.transform_values { |goal| goal_progress(@goals.key(goal)) },
      recent_reviews: @reviews.last(3),
      average_rating: average_rating,
      total_reviews: @reviews.size
    }
  end
end

# Security facet for access control and audit
class SecurityFacet
  include Facet
  
  def initialize(security_level = 'basic')
    @security_level = security_level
    @access_log = []
    @failed_attempts = []
    @permissions = Set.new
    @active_sessions = {}
    
    setup_default_permissions
  end

  def authenticate(credentials)
    # Simulate authentication
    success = credentials[:password] == 'secret123'
    
    log_access_attempt(credentials[:user_id], success)
    
    if success
      session_id = generate_session_id
      @active_sessions[session_id] = {
        user_id: credentials[:user_id],
        start_time: Time.now,
        last_activity: Time.now
      }
      session_id
    else
      nil
    end
  end

  def validate_session(session_id)
    session = @active_sessions[session_id]
    return false unless session
    
    # Check session timeout (30 minutes)
    if Time.now - session[:last_activity] > 1800
      @active_sessions.delete(session_id)
      return false
    end
    
    session[:last_activity] = Time.now
    true
  end

  def logout(session_id)
    @active_sessions.delete(session_id)
  end

  def grant_permission(permission)
    @permissions.add(permission)
  end

  def revoke_permission(permission)
    @permissions.delete(permission)
  end

  def has_permission?(permission)
    @permissions.include?(permission) || @permissions.include?('admin')
  end

  def require_permission(permission)
    unless has_permission?(permission)
      raise SecurityError, "Access denied: missing permission '#{permission}'"
    end
  end

  def security_report
    {
      security_level: @security_level,
      permissions: @permissions.to_a,
      active_sessions: @active_sessions.size,
      recent_access_attempts: @access_log.last(10),
      failed_attempts_today: failed_attempts_today.size,
      total_access_attempts: @access_log.size
    }
  end

  private

  def setup_default_permissions
    case @security_level
    when 'admin'
      @permissions.merge(['read', 'write', 'delete', 'admin', 'financial'])
    when 'manager'
      @permissions.merge(['read', 'write', 'financial'])
    when 'employee'
      @permissions.merge(['read'])
    end
  end

  def log_access_attempt(user_id, success)
    attempt = {
      user_id: user_id,
      success: success,
      timestamp: Time.now,
      ip_address: '127.0.0.1' # Would be actual IP in real implementation
    }
    
    @access_log << attempt
    @failed_attempts << attempt unless success
  end

  def failed_attempts_today
    today = Date.today
    @failed_attempts.select { |attempt| attempt[:timestamp].to_date == today }
  end

  def generate_session_id
    "session_#{Time.now.to_i}_#{rand(10000)}"
  end
end

# Notification facet for messaging and alerts
class NotificationFacet
  include Facet
  
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
    @message_history = []
    @preferences = {
      email: true,
      sms: false,
      push: true,
      frequency: 'immediate'
    }
  end

  def subscribe(event_type, &handler)
    @subscribers[event_type] << handler
  end

  def unsubscribe(event_type, handler)
    @subscribers[event_type].delete(handler)
  end

  def notify(event_type, data = {})
    timestamp = Time.now
    message = {
      event_type: event_type,
      data: data,
      timestamp: timestamp
    }
    
    @message_history << message
    
    # Deliver to subscribers
    @subscribers[event_type].each do |handler|
      begin
        handler.call(message)
      rescue => e
        puts "Notification handler error: #{e.message}"
      end
    end
    
    # Simulate different delivery channels based on preferences
    deliver_message(message) if should_deliver?(event_type)
  end

  def set_preference(channel, enabled)
    @preferences[channel] = enabled
  end

  def set_frequency(frequency)
    raise ArgumentError, "Invalid frequency" unless %w[immediate hourly daily].include?(frequency)
    @preferences[:frequency] = frequency
  end

  def message_history(limit = nil)
    limit ? @message_history.last(limit) : @message_history.dup
  end

  def unread_count
    # In a real implementation, this would track read status
    @message_history.count { |msg| msg[:timestamp] > Time.now - 3600 } # Last hour
  end

  private

  def should_deliver?(event_type)
    # Simple delivery logic based on preferences
    case @preferences[:frequency]
    when 'immediate'
      true
    when 'hourly'
      @message_history.select { |msg| msg[:timestamp] > Time.now - 3600 }.size <= 1
    when 'daily'
      @message_history.select { |msg| msg[:timestamp] > Time.now - 86400 }.size <= 1
    else
      true
    end
  end

  def deliver_message(message)
    puts "? Email: #{message[:event_type]} - #{message[:data]}" if @preferences[:email]
    puts "? Push: #{message[:event_type]} - #{message[:data]}" if @preferences[:push]
    puts "? SMS: #{message[:event_type]} - #{message[:data]}" if @preferences[:sms]
  end
end

# Service class for coordinated operations
class EmployeeService
  def self.create_employee(name, id, department, email, capabilities = {})
    employee = Employee.new(name, id, department, email)
    faceted_employee = FacetedObject.new(employee)
    
    # Attach facets based on capabilities
    if capabilities[:account]
      account_facet = AccountFacet.new(capabilities[:account][:number], capabilities[:account][:balance])
      faceted_employee.attach_facet(account_facet)
    end
    
    if capabilities[:security]
      security_facet = SecurityFacet.new(capabilities[:security][:level])
      capabilities[:security][:permissions]&.each { |perm| security_facet.grant_permission(perm) }
      faceted_employee.attach_facet(security_facet)
    end
    
    if capabilities[:performance_tracking]
      faceted_employee.attach_facet(PerformanceFacet.new)
    end
    
    if capabilities[:notifications]
      notification_facet = NotificationFacet.new
      
      # Set up default notification handlers
      notification_facet.subscribe('financial_transaction') do |message|
        puts "? Financial Alert: #{message[:data][:type]} of $#{message[:data][:amount]}"
      end
      
      notification_facet.subscribe('performance_update') do |message|
        puts "? Performance Update: #{message[:data][:metric]} = #{message[:data][:value]}"
      end
      
      faceted_employee.attach_facet(notification_facet)
    end
    
    faceted_employee
  end

  def self.perform_secure_transaction(employee_obj, transaction_type, amount)
    employee_obj.requires_facets('security', 'account') do |obj|
      # Authenticate and check permissions
      security = obj.get_facet('security')
      security.require_permission('financial')
      
      # Perform transaction
      account = obj.get_facet('account')
      result = case transaction_type
               when 'deposit'
                 account.deposit(amount)
               when 'withdraw'
                 account.withdraw(amount)
               else
                 raise ArgumentError, "Unknown transaction type: #{transaction_type}"
               end
      
      # Send notification if available
      if obj.has_facet?('notification')
        notification = obj.get_facet('notification')
        notification.notify('financial_transaction', {
          type: transaction_type,
          amount: amount,
          new_balance: result,
          employee: obj.core_object.name
        })
      end
      
      result
    end
  end

  def self.update_performance(employee_obj, metric_name, value)
    employee_obj.with_facet('performance') do |performance|
      performance.set_metric(metric_name, value)
      
      # Notify if notification facet is available
      if employee_obj.has_facet?('notification')
        notification = employee_obj.get_facet('notification')
        notification.notify('performance_update', {
          metric: metric_name,
          value: value,
          employee: employee_obj.core_object.name
        })
      end
    end
  end

  def self.comprehensive_report(employee_obj)
    employee = employee_obj.core_object
    
    report = {
      employee_info: employee.to_h,
      attached_facets: employee_obj.facet_types,
      timestamp: Time.now
    }
    
    # Add facet-specific information
    if employee_obj.has_facet?('account')
      account = employee_obj.get_facet('account')
      report[:financial] = {
        account_number: account.account_number,
        balance: account.balance,
        recent_transactions: account.transaction_history(5)
      }
    end
    
    if employee_obj.has_facet?('performance')
      performance = employee_obj.get_facet('performance')
      report[:performance] = performance.performance_summary
    end
    
    if employee_obj.has_facet?('security')
      security = employee_obj.get_facet('security')
      report[:security] = security.security_report
    end
    
    if employee_obj.has_facet?('notification')
      notification = employee_obj.get_facet('notification')
      report[:notifications] = {
        unread_count: notification.unread_count,
        recent_messages: notification.message_history(3)
      }
    end
    
    report
  end
end

# Usage demonstration
def demonstrate_facet_system
  puts "=== Dynamic Facet Composition Demo ==="
  
  # Create employee with various capabilities
  employee_obj = EmployeeService.create_employee(
    'Sarah Connor', 'EMP003', 'Engineering', 'sarah.connor@company.com',
    {
      account: { number: 'ACC003', balance: 1000 },
      security: { level: 'manager', permissions: ['read', 'write', 'financial'] },
      performance_tracking: true,
      notifications: true
    }
  )
  
  puts "\n--- Initial Employee State ---"
  puts "Attached facets: #{employee_obj.facet_types.join(', ')}"
  
  # Demonstrate financial operations
  puts "\n--- Financial Operations ---"
  begin
    # First authenticate (in a real system)
    security = employee_obj.get_facet('security')
    session_id = security.authenticate(user_id: 'sarah', password: 'secret123')
    puts "Authentication successful: #{session_id}"
    
    # Perform transactions
    new_balance = EmployeeService.perform_secure_transaction(employee_obj, 'deposit', 500)
    puts "Deposit completed. New balance: $#{new_balance}"
    
    new_balance = EmployeeService.perform_secure_transaction(employee_obj, 'withdraw', 200)
    puts "Withdrawal completed. New balance: $#{new_balance}"
    
  rescue => e
    puts "Transaction failed: #{e.message}"
  end
  
  # Demonstrate performance tracking
  puts "\n--- Performance Tracking ---"
  EmployeeService.update_performance(employee_obj, 'projects_completed', 5)
  EmployeeService.update_performance(employee_obj, 'customer_satisfaction', 4.5)
  
  performance = employee_obj.get_facet('performance')
  performance.set_goal('projects_completed', 10, Date.today + 90)
  
  puts "Goal progress: #{performance.goal_progress('projects_completed')}"
  
  # Generate comprehensive report
  puts "\n--- Comprehensive Employee Report ---"
  report = EmployeeService.comprehensive_report(employee_obj)
  puts JSON.pretty_generate(report)
  
  # Demonstrate dynamic facet management
  puts "\n--- Dynamic Facet Management ---"
  puts "Before detachment: #{employee_obj.facet_types.join(', ')}"
  
  # Detach performance facet
  employee_obj.detach_facet('performance')
  puts "After detaching performance: #{employee_obj.facet_types.join(', ')}"
  
  # Try to use detached facet (should fail gracefully)
  begin
    EmployeeService.update_performance(employee_obj, 'test_metric', 1)
  rescue => e
    puts "Expected error when using detached facet: #{e.message}"
  end
end

# Run the demonstration
demonstrate_facet_system

The Ruby implementation showcases:

  • Metaprogramming Power: Dynamic method addition and removal using Ruby’s metaprogramming capabilities
  • Elegant Syntax: Clean, readable code that expresses intent clearly
  • Flexible Composition: Easy attachment and detachment of facets at runtime
  • Duck Typing: Natural method delegation without complex type hierarchies

Real-World Applications

The facets pattern proves particularly valuable in several domains:

Enterprise Software Integration

Modern enterprise systems often need to integrate with multiple external services. Facets allow core business objects to gain integration capabilities dynamically:

// Core customer object
const customer = new Customer('ABC Corp', 'enterprise');
const customerObj = new FacetedObject(customer);

// Attach integration facets based on configuration
if (config.salesforce.enabled) {
  customerObj.attachFacet(SalesforceFacet, config.salesforce.credentials);
}

if (config.stripe.enabled) {
  customerObj.attachFacet(PaymentFacet, config.stripe.apiKey);
}

if (config.analytics.enabled) {
  customerObj.attachFacet(AnalyticsFacet, config.analytics.trackingId);
}

Multi-Tenant SaaS Applications

Different tenants often require different feature sets. Facets enable feature composition based on subscription levels:

// Configure tenant capabilities based on plan
match subscription_plan {
    Plan::Basic => {
        tenant_obj.attach_facet(BasicAnalyticsFacet::new())?;
    },
    Plan::Professional => {
        tenant_obj.attach_facet(AdvancedAnalyticsFacet::new())?;
        tenant_obj.attach_facet(IntegrationFacet::new())?;
    },
    Plan::Enterprise => {
        tenant_obj.attach_facet(AdvancedAnalyticsFacet::new())?;
        tenant_obj.attach_facet(IntegrationFacet::new())?;
        tenant_obj.attach_facet(WhiteLabelFacet::new())?;
        tenant_obj.attach_facet(ApiAccessFacet::new())?;
    }
}

IoT Device Management

IoT devices often have optional capabilities that depend on hardware configuration or runtime conditions:

# Device base configuration
device_obj = FacetedObject.new(IoTDevice.new(device_id, device_type))

# Attach facets based on detected capabilities
if device.has_sensor?('temperature')
  device_obj.attach_facet(TemperatureFacet.new)
end

if device.has_connectivity?('wifi')
  device_obj.attach_facet(WiFiFacet.new)
end

if device.battery_powered?
  device_obj.attach_facet(PowerManagementFacet.new)
end

Performance Considerations

While facets provide tremendous flexibility, they come with performance trade-offs that must be carefully managed:

Method Resolution Overhead

Dynamic method resolution can introduce latency. Caching strategies help mitigate this:

class OptimizedFacetedObject<TCore> extends FacetedObject<TCore> {
  private methodCache: Map<string, Facet> = new Map();
  
  getFacetForMethod(methodName: string): Facet | undefined {
    // Check cache first
    if (this.methodCache.has(methodName)) {
      return this.methodCache.get(methodName);
    }
    
    // Search facets for method
    for (const facet of this.facets.values()) {
      if (typeof (facet as any)[methodName] === 'function') {
        this.methodCache.set(methodName, facet);
        return facet;
      }
    }
    
    return undefined;
  }
}

Memory Management

Facets can create reference cycles. Proper cleanup is essential:

impl Drop for FacetedObject {
    fn drop(&mut self) {
        // Clean up facet references
        for (_, facet) in self.facets.drain() {
            // Perform any necessary cleanup
            // Call facet-specific cleanup if implemented
        }
    }
}

Serialization Challenges

Faceted objects require special handling for persistence:

class FacetedObject
  def to_serializable
    {
      core_object: @core_object,
      facets: @facets.transform_values { |facet| serialize_facet(facet) },
      facet_types: @facets.keys
    }
  end
  
  def self.from_serializable(data)
    obj = new(data[:core_object])
    
    data[:facets].each do |type, facet_data|
      facet_class = Object.const_get("#{type.camelize}Facet")
      facet = facet_class.deserialize(facet_data)
      obj.attach_facet(facet)
    end
    
    obj
  end
  
  private
  
  def serialize_facet(facet)
    if facet.respond_to?(:serialize)
      facet.serialize
    else
      # Default serialization
      facet.instance_variables.each_with_object({}) do |var, hash|
        hash[var] = facet.instance_variable_get(var)
      end
    end
  end
end

Architecture Patterns and Best Practices

Facet Discovery and Registration

Large systems benefit from automatic facet discovery:

class FacetRegistry {
  private static facetClasses: Map<string, FacetConstructor<any>> = new Map();
  
  static register<T extends Facet>(facetClass: FacetConstructor<T>): void {
    this.facetClasses.set(facetClass.facetType, facetClass);
  }
  
  static createFacet<T extends Facet>(
    facetType: string, 
    ...args: any[]
  ): T | undefined {
    const FacetClass = this.facetClasses.get(facetType);
    return FacetClass ? new FacetClass(...args) : undefined;
  }
  
  static getAvailableFacets(): string[] {
    return Array.from(this.facetClasses.keys());
  }
}

// Automatic registration
@RegisterFacet
class EmailFacet implements Facet {
  static readonly facetType = 'email';
  // ...
}

Configuration-Driven Composition

Enable declarative facet composition through configuration:

# facet-config.yml
employee_types:
  manager:
    facets:
      - type: account
        config:
          initial_balance: 1000
      - type: permission
        config:
          role: manager
      - type: notification
        config:
          channels: [email, push]
  
  admin:
    inherits: manager
    facets:
      - type: audit
        config:
          level: detailed
      - type: permission
        config:
          role: admin
pub struct FacetComposer {
    config: HashMap<String, EmployeeTypeConfig>,
}

impl FacetComposer {
    pub fn compose_employee(&self, employee_type: &str, employee: Employee) -> Result<FacetedObject, String> {
        let config = self.config.get(employee_type)
            .ok_or_else(|| format!("Unknown employee type: {}", employee_type))?;
        
        let mut employee_obj = FacetedObject::new(employee);
        
        for facet_config in &config.facets {
            let facet = self.create_facet(&facet_config.facet_type, &facet_config.config)?;
            employee_obj.attach_facet(facet)?;
        }
        
        Ok(employee_obj)
    }
}

Testing Strategies

Faceted objects require comprehensive testing approaches:

RSpec.describe FacetedObject do
  let(:employee) { Employee.new('Test User', 'TEST001', 'Engineering', 'test@example.com') }
  let(:employee_obj) { FacetedObject.new(employee) }
  
  describe 'facet composition' do
    it 'allows dynamic attachment of facets' do
      account_facet = AccountFacet.new('ACC001', 1000)
      employee_obj.attach_facet(account_facet)
      
      expect(employee_obj.has_facet?('account')).to be true
      expect(employee_obj.balance).to eq 1000
    end
    
    it 'prevents duplicate facet attachment' do
      employee_obj.attach_facet(AccountFacet.new('ACC001'))
      
      expect {
        employee_obj.attach_facet(AccountFacet.new('ACC002'))
      }.to raise_error(ArgumentError, /already attached/)
    end
  end
  
  describe 'cross-facet operations' do
    before do
      employee_obj.attach_facet(AccountFacet.new('ACC001', 1000))
      employee_obj.attach_facet(SecurityFacet.new('manager'))
      employee_obj.attach_facet(NotificationFacet.new)
    end
    
    it 'coordinates operations across multiple facets' do
      expect {
        EmployeeService.perform_secure_transaction(employee_obj, 'withdraw', 100)
      }.to change { employee_obj.balance }.by(-100)
        .and output(/Financial Alert/).to_stdout
    end
  end
end

Comparison with Related Patterns

Facets vs Decorators

While both patterns add behavior dynamically, they serve different purposes:

Decorators: Wrap objects to modify or extend their interface Facets: Compose objects from multiple behavioral aspects

// Decorator pattern - wrapping behavior
class LoggingDecorator implements Employee {
  constructor(private wrapped: Employee) {}
  
  performAction(action: string): void {
    console.log(`Performing: ${action}`);
    this.wrapped.performAction(action);
    console.log(`Completed: ${action}`);
  }
}

// Facets pattern - compositional behavior
const employee = new FacetedObject(new EmployeeImpl());
employee.attachFacet(LoggingFacet);
employee.attachFacet(SecurityFacet);
// Employee now has both logging AND security capabilities

Facets vs Mixins

Mixins operate at the class level, facets at the instance level:

# Mixin - class-level composition
module Auditable
  def log_action(action)
    puts "Action: #{action}"
  end
end

class Employee
  include Auditable  # All instances get audit capability
end

# Facets - instance-level composition
employee1 = FacetedObject.new(Employee.new)
employee1.attach_facet(AuditFacet.new)  # Only this instance gets audit capability

employee2 = FacetedObject.new(Employee.new)  # This instance doesn't have audit

Emerging Patterns

AI-Driven Facet Composition

Machine learning could optimize facet composition based on usage patterns:

class IntelligentFacetComposer {
  private usageAnalyzer: UsageAnalyzer;
  private mlModel: FacetRecommendationModel;
  
  async recommendFacets(
    objectType: string, 
    context: CompositionContext
  ): Promise<FacetRecommendation[]> {
    const usagePatterns = await this.usageAnalyzer.analyze(objectType);
    const contextFeatures = this.extractFeatures(context);
    
    return this.mlModel.predict(usagePatterns, contextFeatures);
  }
  
  async optimizeForPerformance(
    facetedObject: FacetedObject<any>
  ): Promise<OptimizationSuggestions> {
    const usage = await this.usageAnalyzer.getObjectUsage(facetedObject);
    
    return {
      facetsToPreload: usage.frequentlyUsedFacets,
      facetsToLazyLoad: usage.rarelyUsedFacets,
      cacheStrategy: usage.recommendedCacheStrategy
    };
  }
}

Blockchain and Distributed Facets

Distributed systems could benefit from blockchain-verified facet capabilities:

pub struct DistributedFacetRegistry {
    blockchain_client: BlockchainClient,
    capability_verifier: CapabilityVerifier,
}

impl DistributedFacetRegistry {
    pub async fn verify_facet_capability(
        &self,
        facet_hash: &str,
        required_permissions: &[String]
    ) -> Result<bool, DistributedError> {
        // Verify facet authenticity on blockchain
        let facet_record = self.blockchain_client
            .get_facet_record(facet_hash).await?;
        
        // Verify permissions
        self.capability_verifier
            .verify_permissions(&facet_record, required_permissions)
    }
}

Conclusion

The facets pattern represents a powerful approach to runtime behavior composition that complements the Adaptive Object Model pattern I discussed previously. While AOM focuses on schema flexibility, facets address the equally important challenge of behavioral composition. The implementations in Rust, TypeScript, and Ruby demonstrate how this pattern can be adapted to different language paradigms while maintaining its core principles. Each language brings unique strengths: Rust’s type safety and performance, TypeScript’s gradual typing and tooling support, and Ruby’s metaprogramming elegance.

Unfortunately, ObjectSpace company that created Voyager went out of business and San Francisco Design Patterns book didn’t gain traction, in part because of its ties to the now-obsolete EJB technology and the performance overhead from using runtime reflection in the extension pattern. Nevertheless, the facets/extension pattern excels in domains requiring high configurability and runtime adaptability. However, it requires careful attention to performance implications and testing strategies. The pattern works best when you have clear separation of concerns and well-defined interfaces between facets. The combination of AOM for schema evolution and facets for behavior composition provides a comprehensive approach to building truly adaptive systems. Together, these patterns enable software that can evolve gracefully with changing requirements while maintaining performance and reliability.

The sample implementations are available at the Dynamic Facets Sample Project, providing working examples in all three languages discussed. These implementations serve as a foundation for building more sophisticated facet-based systems tailored to specific domain requirements.

September 8, 2025

Adaptive Object Model: A Modern Approach with Dynamic Languages and Document Databases

Filed under: Computing,Methodologies — admin @ 11:36 am

Background

I have long been interested in the Adaptive Object Model (AOM) pattern and used it in a couple of projects in early 2000s. I have also written about this pattern earlier, which emerged from the work of Ralph Johnson and his colleagues in the late 1990s. It addresses a fundamental challenge in software architecture: how to build systems that can evolve structurally without code changes or downtime. The pattern draws heavily from several foundational concepts in computer science and software engineering. The roots of AOM can be traced back to several influential ideas:

  • Reflection and Metaprogramming: Early Lisp systems showed the power of treating code as data, enabling programs to modify themselves at runtime. This concept heavily influenced the AOM pattern’s approach to treating metadata as first-class objects.
  • Type Theory: The work of pioneers like Alonzo Church and Haskell Curry on type systems provided the theoretical foundation for the “type square” pattern that forms AOM’s core structure, where types themselves become objects that can be manipulated.
  • Database Systems: The entity-attribute-value (EAV) model used in database design influenced AOM’s approach to storing flexible data structures.

Related Patterns

Following are other patterns that are related to AOM:

  • Facade Pattern: AOM often employs facades to provide simplified interfaces over complex meta-object structures, hiding the underlying complexity from client code.
  • Strategy Pattern: The dynamic binding of operations in AOM naturally implements the Strategy pattern, allowing algorithms to be selected and modified at runtime.
  • Composition over Inheritance: AOM uses the principle of favoring composition over inheritance by building complex objects from simpler, configurable components rather than rigid class hierarchies.
  • Domain-Specific Languages (DSLs): Many AOM implementations provide DSLs for defining entity types and relationships, making the system accessible to domain experts rather than just programmers.

Voyager ORB’s Dynamic Aggregation

In late 1990s/early 2000s, I used Voyager ORB for some personal projects that pioneered a concept of “Dynamic Aggregation” – the ability to attach secondary objects, called facets, to primary objects at runtime. This system demonstrated several key principles that later influenced AOM development:

  • Runtime Object Extension: Objects could be extended with new capabilities without modifying their original class definitions:
// Voyager ORB example - attaching an account facet to an employee
IEmployee employee = new Employee("joe", "234-44-2678");
IFacets facets = Facets.of(employee);
IAccount account = (IAccount) facets.of(IAccount.class);
account.deposit(2000);
  • Interface-based Composition: Facets were accessed through interfaces, providing a clean separation between capability and implementation – a principle central to modern AOM.
  • Distributed Object Mobility: Voyager‘s facet system worked seamlessly across network boundaries, allowing objects and their attached capabilities to move between different machines while maintaining their extended functionality.
  • Automatic Proxy Generation: Like modern AOM systems, Voyager automatically generated the necessary plumbing code at runtime, using Java’s reflection and bytecode manipulation capabilities.

The Voyager approach influenced distributed computing patterns and demonstrated that dynamic composition could work reliably in production systems. The idea of attaching behavior at runtime through well-defined interfaces is directly applicable to modern AOM implementations. The key insight from Voyager was that objects don’t need to know about all their potential capabilities at compile time. Instead, capabilities can be discovered, attached, and composed dynamically based on runtime requirements – a principle that AOM extends to entire domain models.

Introduction to Adaptive Object Model

Adaptive Object Model is an architectural pattern used in domains requiring dynamic manipulation of metadata and business rules. Unlike traditional object-oriented design where class structures are fixed at compile time, AOM treats class definitions, attributes, relationships, and even business rules as data that can be modified at runtime.

Consider our vehicle example again. In traditional OO design, you might have:

Vehicle
??? Car
?   ??? Sedan
?   ??? SUV
?   ??? Coupe
??? Motorcycle
??? Truck
    ??? PickupTruck
    ??? SemiTruck

With AOM, instead of predefined inheritance hierarchies, we use the “type square” pattern:

  • EntityType: Represents what would traditionally be a class
  • Entity: Represents what would traditionally be an object instance
  • PropertyType: Defines the schema for attributes
  • Property: Holds actual attribute values

This meta-model allows for unlimited extensibility without code changes, making it ideal for domains with rapidly evolving requirements or where different customers need different data models.

The Database Challenge: From Relational to Document

Traditional relational databases present significant challenges for AOM implementations:

  • Excessive Joins: In a relational AOM implementation, reconstructing a single business object requires joining multiple tables:
    • Entity table (object instances)
    • Property table (attribute values)
    • PropertyType table (attribute metadata)
    • EntityType table (type definitions)
  • Schema Rigidity: Relational schemas require predefined table structures, which conflicts with AOM’s goal of runtime flexibility.
  • Performance Issues: The EAV (Entity-Attribute-Value) pattern commonly used in relational AOM implementations suffers from poor query performance due to the lack of indexing on the “value” column’s varied data types.
  • Complex Queries: Simple business queries become complex multi-table joins with numerous conditions, making the system difficult to optimize and maintain.

The Document Database Solution

Document databases like MongoDB naturally align with AOM principles:

  • Schema Flexibility: Documents can contain arbitrary fields without predefined schemas, allowing entity types to evolve dynamically.
  • Nested Structures: Complex relationships and metadata can be stored within documents, reducing the need for joins.
  • Rich Querying: Modern document databases provide sophisticated query capabilities while maintaining flexibility.
  • Indexing: Flexible indexing strategies can be applied to document fields as needed.

Rust Implementation

Let’s implement AOM in Rust, taking advantage of its type safety while maintaining flexibility through traits and enums. Rust’s ownership model and pattern matching make it particularly well-suited for safe metaprogramming.

use std::collections::HashMap;
use serde::{Serialize, Deserialize};
use std::sync::Arc;

// Type-safe property values using enums
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum PropertyValue {
    String(String),
    Integer(i64),
    Float(f64),
    Boolean(bool),
    Date(chrono::DateTime<chrono::Utc>),
}

impl PropertyValue {
    pub fn type_name(&self) -> &'static str {
        match self {
            PropertyValue::String(_) => "String",
            PropertyValue::Integer(_) => "Integer", 
            PropertyValue::Float(_) => "Float",
            PropertyValue::Boolean(_) => "Boolean",
            PropertyValue::Date(_) => "Date",
        }
    }
}

// Property type definition
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PropertyType {
    pub name: String,
    pub value_type: String,
    pub required: bool,
    pub default_value: Option<PropertyValue>,
}

impl PropertyType {
    pub fn new(name: &str, value_type: &str, required: bool) -> Self {
        Self {
            name: name.to_string(),
            value_type: value_type.to_string(),
            required,
            default_value: None,
        }
    }
}

// Property instance
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Property {
    pub property_type: String, // Reference to PropertyType name
    pub value: PropertyValue,
}

impl Property {
    pub fn new(property_type: &str, value: PropertyValue) -> Self {
        Self {
            property_type: property_type.to_string(),
            value,
        }
    }
}

// Operation trait for dynamic behavior
pub trait Operation: Send + Sync + std::fmt::Debug {
    fn execute(&self, entity: &Entity, args: &[PropertyValue]) -> Result<PropertyValue, String>;
    fn name(&self) -> &str;
}

// Entity type definition
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EntityType {
    pub name: String,
    pub property_types: HashMap<String, PropertyType>,
    #[serde(skip)]
    pub operations: HashMap<String, Arc<dyn Operation>>,
}

impl EntityType {
    pub fn new(name: &str) -> Self {
        Self {
            name: name.to_string(),
            property_types: HashMap::new(),
            operations: HashMap::new(),
        }
    }

    pub fn add_property_type(&mut self, property_type: PropertyType) {
        self.property_types.insert(property_type.name.clone(), property_type);
    }

    pub fn add_operation(&mut self, operation: Arc<dyn Operation>) {
        self.operations.insert(operation.name().to_string(), operation);
    }

    pub fn get_property_type(&self, name: &str) -> Option<&PropertyType> {
        self.property_types.get(name)
    }
}

// Entity instance
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Entity {
    pub entity_type: String, // Reference to EntityType name
    pub properties: HashMap<String, Property>,
}

impl Entity {
    pub fn new(entity_type: &str) -> Self {
        Self {
            entity_type: entity_type.to_string(),
            properties: HashMap::new(),
        }
    }

    pub fn add_property(&mut self, property: Property) {
        self.properties.insert(property.property_type.clone(), property);
    }

    pub fn get_property(&self, name: &str) -> Option<&PropertyValue> {
        self.properties.get(name).map(|p| &p.value)
    }

    pub fn set_property(&mut self, name: &str, value: PropertyValue) {
        if let Some(property) = self.properties.get_mut(name) {
            property.value = value;
        }
    }
}

// Registry to manage types and instances
pub struct EntityRegistry {
    entity_types: HashMap<String, EntityType>,
    entities: HashMap<String, Entity>,
}

impl EntityRegistry {
    pub fn new() -> Self {
        Self {
            entity_types: HashMap::new(),
            entities: HashMap::new(),
        }
    }

    pub fn register_type(&mut self, entity_type: EntityType) {
        self.entity_types.insert(entity_type.name.clone(), entity_type);
    }

    pub fn create_entity(&mut self, type_name: &str, id: &str) -> Result<(), String> {
        if !self.entity_types.contains_key(type_name) {
            return Err(format!("Unknown entity type: {}", type_name));
        }
        
        let entity = Entity::new(type_name);
        self.entities.insert(id.to_string(), entity);
        Ok(())
    }

    // New method to get a mutable reference to an entity
    pub fn get_entity_mut(&mut self, id: &str) -> Option<&mut Entity> {
        self.entities.get_mut(id)
    }

    pub fn execute_operation(
        &self, 
        entity_id: &str, 
        operation_name: &str, 
        args: &[PropertyValue]
    ) -> Result<PropertyValue, String> {
        let entity = self.entities.get(entity_id)
            .ok_or_else(|| format!("Entity not found: {}", entity_id))?;
        
        let entity_type = self.entity_types.get(&entity.entity_type)
            .ok_or_else(|| format!("Entity type not found: {}", entity.entity_type))?;
        
        let operation = entity_type.operations.get(operation_name)
            .ok_or_else(|| format!("Operation not found: {}", operation_name))?;
        
        operation.execute(entity, args)
    }
}

// Example operations
#[derive(Debug)]
struct DriveOperation;

impl Operation for DriveOperation {
    fn execute(&self, entity: &Entity, _args: &[PropertyValue]) -> Result<PropertyValue, String> {
        if let Some(PropertyValue::String(maker)) = entity.get_property("maker") {
            Ok(PropertyValue::String(format!("Driving the {} vehicle", maker)))
        } else {
            Ok(PropertyValue::String("Driving vehicle".to_string()))
        }
    }

    fn name(&self) -> &str {
        "drive"
    }
}

#[derive(Debug)]
struct MaintenanceOperation;

impl Operation for MaintenanceOperation {
    fn execute(&self, entity: &Entity, _args: &[PropertyValue]) -> Result<PropertyValue, String> {
        if let Some(PropertyValue::Integer(miles)) = entity.get_property("miles") {
            let next_maintenance = miles + 5000;
            Ok(PropertyValue::String(format!("Next maintenance due at {} miles", next_maintenance)))
        } else {
            Ok(PropertyValue::String("Maintenance scheduled".to_string()))
        }
    }

    fn name(&self) -> &str {
        "perform_maintenance"
    }
}

// Usage example
fn example_usage() -> Result<(), String> {
    let mut registry = EntityRegistry::new();

    // Define vehicle type
    let mut vehicle_type = EntityType::new("Vehicle");
    vehicle_type.add_property_type(PropertyType::new("maker", "String", true));
    vehicle_type.add_property_type(PropertyType::new("model", "String", true));
    vehicle_type.add_property_type(PropertyType::new("year", "Integer", true));
    vehicle_type.add_property_type(PropertyType::new("miles", "Integer", false));
    
    vehicle_type.add_operation(Arc::new(DriveOperation));
    vehicle_type.add_operation(Arc::new(MaintenanceOperation));

    registry.register_type(vehicle_type);

    // Create a new entity instance
    registry.create_entity("Vehicle", "vehicle_1")?;
    
    // Get a mutable reference to the new entity and set its properties
    if let Some(car) = registry.get_entity_mut("vehicle_1") {
        car.add_property(Property::new("maker", PropertyValue::String("Tesla".to_string())));
        car.add_property(Property::new("model", PropertyValue::String("Model 3".to_string())));
        car.add_property(Property::new("year", PropertyValue::Integer(2022)));
        car.add_property(Property::new("miles", PropertyValue::Integer(15000)));
    }

    // Execute the drive operation and print the result
    let drive_result = registry.execute_operation("vehicle_1", "drive", &[])?;
    println!("Drive operation result: {:?}", drive_result);

    // Execute the maintenance operation and print the result
    let maintenance_result = registry.execute_operation("vehicle_1", "perform_maintenance", &[])?;
    println!("Maintenance operation result: {:?}", maintenance_result);

    Ok(())
}

fn main() {
    match example_usage() {
        Ok(_) => println!("Example completed successfully."),
        Err(e) => eprintln!("Error: {}", e),
    }
}

The Rust implementation provides several advantages:

  • Type Safety: Enum-based property values ensure type safety while maintaining flexibility.
  • Memory Safety: Rust’s ownership model prevents common memory issues found in dynamic systems.
  • Performance: Zero-cost abstractions and compile-time optimizations.
  • Concurrency: Built-in support for safe concurrent access to shared data.

TypeScript Implementation

TypeScript brings static typing to JavaScript’s dynamic nature, providing an excellent balance for AOM implementations:

// Type definitions for property values
type PropertyValue = string | number | boolean | Date;

interface PropertyType {
  name: string;
  valueType: string;
  required: boolean;
  defaultValue?: PropertyValue;
}

interface Property {
  propertyType: string;
  value: PropertyValue;
}

// Operation interface with proper typing
interface Operation {
  name: string;
  execute(entity: Entity, args: PropertyValue[]): PropertyValue;
}

// Generic constraint for entity properties
interface PropertyMap {
  [key: string]: PropertyValue;
}

class EntityType {
  private propertyTypes: Map<string, PropertyType> = new Map();
  private operations: Map<string, Operation> = new Map();

  constructor(public readonly typeName: string) {}

  addPropertyType(propertyType: PropertyType): void {
    this.propertyTypes.set(propertyType.name, propertyType);
  }

  addOperation(operation: Operation): void {
    this.operations.set(operation.name, operation);
  }

  getPropertyType(name: string): PropertyType | undefined {
    return this.propertyTypes.get(name);
  }

  getOperation(name: string): Operation | undefined {
    return this.operations.get(name);
  }

  getAllPropertyTypes(): PropertyType[] {
    return Array.from(this.propertyTypes.values());
  }

  // Type guard for property validation
  validateProperty(name: string, value: PropertyValue): boolean {
    const propertyType = this.getPropertyType(name);
    if (!propertyType) return false;

    switch (propertyType.valueType) {
      case 'string':
        return typeof value === 'string';
      case 'number':
        return typeof value === 'number';
      case 'boolean':
        return typeof value === 'boolean';
      case 'date':
        return value instanceof Date;
      default:
        return false;
    }
  }
}

class Entity {
  private properties: Map<string, Property> = new Map();

  constructor(public readonly entityType: EntityType) {
    // Initialize with default values
    entityType.getAllPropertyTypes().forEach(propType => {
      if (propType.defaultValue !== undefined) {
        this.setProperty(propType.name, propType.defaultValue);
      }
    });
  }

  setProperty(name: string, value: PropertyValue): boolean {
    if (!this.entityType.validateProperty(name, value)) {
      throw new Error(`Invalid property: ${name} with value ${value}`);
    }

    const propertyType = this.entityType.getPropertyType(name);
    if (!propertyType) {
      throw new Error(`Unknown property type: ${name}`);
    }

    this.properties.set(name, {
      propertyType: name,
      value
    });

    return true;
  }

  getProperty<T extends PropertyValue>(name: string): T | undefined {
    const property = this.properties.get(name);
    return property?.value as T;
  }

  executeOperation(operationName: string, args: PropertyValue[] = []): PropertyValue {
    const operation = this.entityType.getOperation(operationName);
    if (!operation) {
      throw new Error(`Unknown operation: ${operationName}`);
    }
    return operation.execute(this, args);
  }

  // Dynamic property access with Proxy
  static withDynamicAccess(entity: Entity): Entity & PropertyMap {
    return new Proxy(entity, {
      get(target, prop: string) {
        if (prop in target) {
          return (target as any)[prop];
        }
        return target.getProperty(prop);
      },
      set(target, prop: string, value: PropertyValue) {
        try {
          target.setProperty(prop, value);
          return true;
        } catch {
          return false;
        }
      }
    }) as Entity & PropertyMap;
  }
}

// Enhanced operation implementations
class DriveOperation implements Operation {
  name = 'drive';

  execute(entity: Entity, args: PropertyValue[]): PropertyValue {
    const maker = entity.getProperty<string>('maker') || 'Unknown';
    const speed = args[0] as number || 60;
    return `Driving the ${maker} at ${speed} mph`;
  }
}

class MaintenanceOperation implements Operation {
  name = 'performMaintenance';

  execute(entity: Entity, args: PropertyValue[]): PropertyValue {
    const miles = entity.getProperty<number>('miles') || 0;
    const maintenanceType = args[0] as string || 'basic';
    
    // Business logic for maintenance
    const cost = maintenanceType === 'premium' ? 150 : 75;
    const nextDue = miles + (maintenanceType === 'premium' ? 10000 : 5000);
    
    return `${maintenanceType} maintenance completed. Cost: $${cost}. Next due: ${nextDue} miles`;
  }
}

// Factory for creating entities with fluent interface
class EntityFactory {
  private types: Map<string, EntityType> = new Map();

  defineType(name: string): TypeBuilder {
    return new TypeBuilder(name, this);
  }

  registerType(entityType: EntityType): void {
    this.types.set(entityType.typeName, entityType);
  }

  createEntity(typeName: string): Entity {
    const type = this.types.get(typeName);
    if (!type) {
      throw new Error(`Unknown entity type: ${typeName}`);
    }
    return Entity.withDynamicAccess(new Entity(type));
  }
}

class TypeBuilder {
  private entityType: EntityType;

  constructor(typeName: string, private factory: EntityFactory) {
    this.entityType = new EntityType(typeName);
  }

  property(name: string, type: string, required = false, defaultValue?: PropertyValue): TypeBuilder {
    this.entityType.addPropertyType({ name, valueType: type, required, defaultValue });
    return this;
  }

  operation(operation: Operation): TypeBuilder {
    this.entityType.addOperation(operation);
    return this;
  }

  build(): EntityType {
    this.factory.registerType(this.entityType);
    return this.entityType;
  }
}

// Usage example with modern TypeScript features
const factory = new EntityFactory();

// Define vehicle type with fluent interface
factory.defineType('Vehicle')
  .property('maker', 'string', true)
  .property('model', 'string', true)
  .property('year', 'number', true, 2024)
  .property('miles', 'number', false, 0)
  .property('isElectric', 'boolean', false, false)
  .operation(new DriveOperation())
  .operation(new MaintenanceOperation())
  .build();

// Create and use vehicle with dynamic property access
const vehicle = factory.createEntity('Vehicle') as Entity & PropertyMap;

// Type-safe property access
vehicle.maker = 'Tesla';
vehicle.model = 'Model 3';
vehicle.isElectric = true;

console.log(vehicle.executeOperation('drive', [75]));
console.log(vehicle.executeOperation('performMaintenance', ['premium']));

// Dynamic property enumeration
Object.keys(vehicle).forEach(key => {
  console.log(`${key}: ${vehicle[key]}`);
});

The TypeScript implementation provides:

  • Gradual Typing: Mix dynamic and static typing as needed.
  • Modern Language Features: Generics, type guards, Proxy objects, and fluent interfaces.
  • Developer Experience: Excellent tooling support with autocomplete and type checking.
  • Flexibility: Easy migration from JavaScript while adding type safety incrementally.

Enhanced Ruby Implementation

Ruby’s metaprogramming capabilities make it particularly well-suited for AOM. Let’s enhance the original implementation with modern Ruby features:

require 'date'
require 'json'
require 'securerandom'

# Enhanced PropertyType with validation
class PropertyType
  attr_reader :name, :type, :required, :validator

  def initialize(name, type, required: false, default: nil, &validator)
    @name = name
    @type = type
    @required = required
    @default = default
    @validator = validator
  end

  def valid?(value)
    return false if @required && value.nil?
    return true if value.nil? && !@required
    
    type_valid = case @type
                 when :string then value.is_a?(String)
                 when :integer then value.is_a?(Integer)
                 when :float then value.is_a?(Float) || value.is_a?(Integer)
                 when :boolean then [true, false].include?(value)
                 when :date then value.is_a?(Date) || value.is_a?(Time)
                 else true
                 end
    
    type_valid && (@validator.nil? || @validator.call(value))
  end

  def default_value
    @default.respond_to?(:call) ? @default.call : @default
  end
end

# Enhanced EntityType with DSL
class EntityType
  attr_reader :name, :property_types, :operations, :validations

  def initialize(name, &block)
    @name = name
    @property_types = {}
    @operations = {}
    @validations = []
    
    instance_eval(&block) if block_given?
  end

  # DSL methods
  def property(name, type, **options, &validator)
    @property_types[name] = PropertyType.new(name, type, **options, &validator)
  end

  def operation(name, &block)
    @operations[name] = block
  end

  def validate(&block)
    @validations << block
  end

  def valid_entity?(entity)
    @validations.all? { |validation| validation.call(entity) }
  end

  def create_entity(**attributes)
    Entity.new(self, attributes)
  end
end

# Enhanced Entity with method delegation and validations
class Entity
  attr_reader :entity_type, :id

  def initialize(entity_type, attributes = {})
    @entity_type = entity_type
    @properties = {}
    @id = attributes.delete(:id) || SecureRandom.uuid
    
    # Set default values
    @entity_type.property_types.each do |name, prop_type|
      @properties[name] = prop_type.default_value unless prop_type.default_value.nil?
    end
    
    # Set provided attributes
    attributes.each { |name, value| set_property(name, value) }
    
    # Add dynamic methods for properties
    create_property_methods
    
    # Validate entity
    validate!
  end

  def set_property(name, value)
    prop_type = @entity_type.property_types[name]
    raise ArgumentError, "Unknown property: #{name}" unless prop_type
    raise ArgumentError, "Invalid value for #{name}" unless prop_type.valid?(value)
    
    @properties[name] = value
    # Removed the line `validate! if defined?(@properties)`
  end

  def get_property(name)
    @properties[name]
  end

  def execute_operation(name, *args)
    operation = @entity_type.operations[name]
    raise ArgumentError, "Unknown operation: #{name}" unless operation
    
    instance_exec(*args, &operation)
  end

  def to_h
    @properties.dup.merge(entity_type: @entity_type.name, id: @id)
  end

  def to_json(*args)
    to_h.to_json(*args)
  end

  private

  def create_property_methods
    @entity_type.property_types.each do |name, _|
      # Getter
      define_singleton_method(name) { get_property(name) }
      
      # Setter
      define_singleton_method("#{name}=") { |value| set_property(name, value) }
      
      # Predicate method for boolean properties
      if @entity_type.property_types[name].type == :boolean
        define_singleton_method("#{name}?") { !!get_property(name) }
      end
    end
  end

  def validate!
    # Check required properties
    @entity_type.property_types.each do |name, prop_type|
      if prop_type.required && @properties[name].nil?
        raise ArgumentError, "Required property missing: #{name}"
      end
    end
    
    # Run entity-level validations
    unless @entity_type.valid_entity?(self)
      raise ArgumentError, "Entity validation failed"
    end
  end
end

# Registry with persistence capabilities
class EntityRegistry
  def initialize
    @entity_types = {}
    @entities = {}
  end

  def define_type(name, &block)
    @entity_types[name] = EntityType.new(name, &block)
  end

  def create_entity(type_name, **attributes)
    entity_type = @entity_types[type_name]
    raise ArgumentError, "Unknown entity type: #{type_name}" unless entity_type
    
    entity = entity_type.create_entity(**attributes)
    @entities[entity.id] = entity
    entity
  end

  def find_entity(id)
    @entities[id]
  end

  def find_entities_by_type(type_name)
    @entities.values.select { |entity| entity.entity_type.name == type_name }
  end

  def export_to_json
    {
      entity_types: @entity_types.keys,
      entities: @entities.values.map(&:to_h)
    }.to_json
  end
end

# Usage example with modern Ruby features
registry = EntityRegistry.new

# Define vehicle type with validations
registry.define_type('Vehicle') do
  property :maker, :string, required: true
  property :model, :string, required: true
  property :year, :integer, required: true do |year|
    year.between?(1900, Date.today.year + 1)
  end
  property :miles, :integer, default: 0 do |miles|
    miles >= 0
  end
  property :electric, :boolean, default: false
  
  # Entity-level validation
  validate do |entity|
    # Electric vehicles should have zero emissions
    !entity.electric? || entity.year >= 2010
  end
  
  operation :drive do |distance = 10|
    current_miles = miles || 0
    self.miles = current_miles + distance
    "Drove #{distance} miles in #{maker} #{model}. Total miles: #{miles}"
  end
  
  operation :maintenance do |type = 'basic'|
    cost = type == 'premium' ? 150 : 75
    next_due = miles + (type == 'premium' ? 10000 : 5000)
    
    "#{type.capitalize} maintenance completed for #{maker} #{model}. " \
    "Cost: $#{cost}. Next maintenance due at #{next_due} miles."
  end
end

# Create and use vehicles
tesla = registry.create_entity('Vehicle', 
  maker: 'Tesla', 
  model: 'Model S', 
  year: 2023, 
  electric: true
)

toyota = registry.create_entity('Vehicle',
  maker: 'Toyota',
  model: 'Camry',
  year: 2022
)

# Use dynamic methods
puts tesla.execute_operation(:drive, 50)
puts toyota.execute_operation(:maintenance, 'premium')

# Access properties naturally
puts "#{tesla.maker} #{tesla.model} is electric: #{tesla.electric?}"
puts "Toyota has #{toyota.miles} miles"

# Export to JSON
puts registry.export_to_json

MongoDB Integration

Modern document databases like MongoDB provide natural storage for AOM entities. Here’s how to integrate AOM with MongoDB:

import { MongoClient, Collection, Db } from 'mongodb';

interface MongoEntity {
  _id?: string;
  entityType: string;
  properties: Record<string, any>;
  createdAt: Date;
  updatedAt: Date;
}

interface MongoEntityType {
  _id?: string;
  name: string;
  propertyTypes: Record<string, any>;
  version: number;
  createdAt: Date;
}

class MongoEntityStore {
  private db: Db;
  private entitiesCollection: Collection<MongoEntity>;
  private typesCollection: Collection<MongoEntityType>;

  constructor(db: Db) {
    this.db = db;
    this.entitiesCollection = db.collection('entities');
    this.typesCollection = db.collection('entity_types');
  }

  async saveEntityType(entityType: EntityType): Promise<void> {
    const mongoType: MongoEntityType = {
      name: entityType.typeName,
      propertyTypes: Object.fromEntries(
        entityType.getAllPropertyTypes().map(pt => [pt.name, pt])
      ),
      version: 1,
      createdAt: new Date()
    };

    await this.typesCollection.replaceOne(
      { name: entityType.typeName },
      mongoType,
      { upsert: true }
    );
  }

  async saveEntity(entity: Entity): Promise<string> {
    const mongoEntity: MongoEntity = {
      entityType: entity.entityType.typeName,
      properties: this.serializeProperties(entity),
      createdAt: new Date(),
      updatedAt: new Date()
    };

    const result = await this.entitiesCollection.insertOne(mongoEntity);
    return result.insertedId.toString();
  }

  async findEntitiesByType(typeName: string): Promise<any[]> {
    return await this.entitiesCollection
      .find({ entityType: typeName })
      .toArray();
  }

  async findEntity(id: string): Promise<MongoEntity | null> {
    return await this.entitiesCollection.findOne({ _id: id as any });
  }

  async updateEntity(id: string, updates: Record<string, any>): Promise<void> {
    await this.entitiesCollection.updateOne(
      { _id: id as any },
      { 
        $set: { 
          ...updates, 
          updatedAt: new Date() 
        } 
      }
    );
  }

  // Complex queries using MongoDB aggregation
  async getEntityStatistics(typeName: string): Promise<any> {
    return await this.entitiesCollection.aggregate([
      { $match: { entityType: typeName } },
      {
        $group: {
          _id: '$entityType',
          count: { $sum: 1 },
          avgMiles: { $avg: '$properties.miles' },
          makers: { $addToSet: '$properties.maker' }
        }
      }
    ]).toArray();
  }

  // Full-text search across entities
  async searchEntities(query: string): Promise<MongoEntity[]> {
    return await this.entitiesCollection
      .find({ $text: { $search: query } })
      .toArray();
  }

  private serializeProperties(entity: Entity): Record<string, any> {
    const result: Record<string, any> = {};
    entity.entityType.getAllPropertyTypes().forEach(pt => {
      const value = entity.getProperty(pt.name);
      if (value !== undefined) {
        result[pt.name] = value;
      }
    });
    return result;
  }
}

// Usage with indexes for performance
async function setupDatabase() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  
  const db = client.db('aom_example');
  const store = new MongoEntityStore(db);

  // Create indexes for better performance
  await db.collection('entities').createIndex({ entityType: 1 });
  await db.collection('entities').createIndex({ 'properties.maker': 1 });
  await db.collection('entities').createIndex({ 'properties.year': 1 });
  await db.collection('entities').createIndex(
    { 
      'properties.maker': 'text', 
      'properties.model': 'text' 
    }
  );

  return store;
}

Benefits of Document Storage

  • Schema Evolution: MongoDB’s flexible schema allows entity types to evolve without database migrations.
  • Rich Querying: MongoDB’s query language supports complex operations on nested documents.
  • Indexing Strategy: Flexible indexing on any field, including nested properties.
  • Aggregation Pipeline: Powerful analytics capabilities for business intelligence.
  • Horizontal Scaling: Built-in sharding support for handling large datasets.

Modern Applications and Future Directions

Contemporary Usage Patterns

  • Configuration Management: Modern applications use AOM-like patterns for feature flags, A/B testing configurations, and user preference systems.
  • API Gateway Configuration: Services like Kong and AWS API Gateway use dynamic configuration patterns similar to AOM.
  • Workflow Engines: Business process management systems employ AOM patterns to define configurable workflows.
  • Multi-tenant SaaS: AOM enables SaaS applications to provide customizable data models per tenant.

Emerging Technologies

  • GraphQL Schema Stitching: Dynamic schema composition shares conceptual similarities with AOM’s type composition.
  • Serverless Functions: Event-driven architectures benefit from AOM’s dynamic behavior binding.
  • Container Orchestration: Kubernetes uses similar patterns for dynamic resource management and configuration.
  • Low-Code Platforms: Modern low-code solutions extensively use AOM principles for visual application building.

Performance Considerations and Optimizations

Caching Strategies

class CachedEntityStore {
  private cache: Map<string, Entity> = new Map();
  private typeCache: Map<string, EntityType> = new Map();

  async getEntity(id: string): Promise<Entity | null> {
    // Check cache first
    if (this.cache.has(id)) {
      return this.cache.get(id)!;
    }

    // Load from database
    const entity = await this.store.findEntity(id);
    if (entity) {
      this.cache.set(id, entity);
    }
    
    return entity;
  }

  invalidateEntity(id: string): void {
    this.cache.delete(id);
  }
}

Lazy Loading and Materialized Views

For complex entity relationships, implement lazy loading and consider materialized views for frequently accessed computed properties.

Schema Evolution and Versioning

One of the most critical aspects of production AOM systems is managing schema evolution over time. Unlike traditional systems where database migrations handle schema changes, AOM systems must support dynamic evolution while maintaining data integrity and backward compatibility.

Version Management Strategy

interface EntityTypeVersion {
  version: number;
  entityTypeName: string;
  changes: SchemaChange[];
  compatibleWith: number[];
  deprecatedIn?: number;
  migrations: Migration[];
  createdAt: Date;
}

interface SchemaChange {
  type: 'ADD_PROPERTY' | 'REMOVE_PROPERTY' | 'MODIFY_PROPERTY' | 'ADD_OPERATION';
  propertyName?: string;
  oldType?: string;
  newType?: string;
  defaultValue?: any;
  migrationRequired: boolean;
}

interface Migration {
  fromVersion: number;
  toVersion: number;
  transform: (entity: any) => any;
  reversible: boolean;
}

Backward Compatibility Patterns

Additive Changes: New properties should be optional with sensible defaults:

// Safe evolution - adding optional property
let mut vehicle_type_v2 = vehicle_type_v1.clone();
vehicle_type_v2.add_property_type(PropertyType::new(
    "fuel_efficiency", 
    "Float", 
    false // not required
));
vehicle_type_v2.version = 2;

Property Type Changes: Handle type evolution gracefully:

class PropertyMigration {
  static migrateStringToEnum(oldValue: string, enumValues: string[]): string {
    // Attempt intelligent mapping
    const lowercaseValue = oldValue.toLowerCase();
    const match = enumValues.find(val => 
      val.toLowerCase().includes(lowercaseValue) ||
      lowercaseValue.includes(val.toLowerCase())
    );
    return match || enumValues[0]; // fallback to first enum value
  }
}

Multi-Version Support: Systems should support multiple schema versions simultaneously:

class EntityStore
  def save_entity(entity, force_version: nil)
    target_version = force_version || @current_schema_version
    
    if entity.schema_version != target_version
      migrated_entity = migrate_entity(entity, target_version)
      store_with_version(migrated_entity, target_version)
    else
      store_with_version(entity, entity.schema_version)
    end
  end
  
  private def migrate_entity(entity, target_version)
    current_version = entity.schema_version
    
    while current_version < target_version
      migration = find_migration(current_version, current_version + 1)
      entity = migration.transform(entity)
      current_version += 1
    end
    
    entity.schema_version = target_version
    entity
  end
end

Deployment Strategies

Blue-Green Schema Deployment: Deploy new schemas alongside existing ones, gradually migrating entities:

  1. Deploy new schema version to “green” environment
  2. Run both old and new versions in parallel
  3. Migrate entities in batches with rollback capability
  4. Switch traffic to new version
  5. Decommission old version after validation period

Feature Flags for Schema Changes: Control schema availability through configuration:

class SchemaFeatureFlags {
  private flags: Map<string, boolean> = new Map();
  
  enableSchemaVersion(entityType: string, version: number): void {
    this.flags.set(`${entityType}_v${version}`, true);
  }
  
  isSchemaVersionEnabled(entityType: string, version: number): boolean {
    return this.flags.get(`${entityType}_v${version}`) || false;
  }
}

Performance Optimization Deep Dive

AOM systems face unique performance challenges due to their dynamic nature. However, careful optimization can achieve performance comparable to traditional systems while maintaining flexibility.

Caching Strategies

Entity Type Definition Caching: Cache compiled entity types to avoid repeated parsing:

use std::sync::{Arc, RwLock};
use std::collections::HashMap;

pub struct EntityTypeCache {
    types: RwLock<HashMap<String, Arc<EntityType>>>,
    compiled_operations: RwLock<HashMap<String, CompiledOperation>>,
}

impl EntityTypeCache {
    pub fn get_or_compile(&self, type_name: &str) -> Arc<EntityType> {
        // Try read lock first
        {
            let cache = self.types.read().unwrap();
            if let Some(entity_type) = cache.get(type_name) {
                return entity_type.clone();
            }
        }
        
        // Compile with write lock
        let mut cache = self.types.write().unwrap();
        // Double-check pattern to avoid race conditions
        if let Some(entity_type) = cache.get(type_name) {
            return entity_type.clone();
        }
        
        let compiled_type = self.compile_entity_type(type_name);
        let arc_type = Arc::new(compiled_type);
        cache.insert(type_name.to_string(), arc_type.clone());
        arc_type
    }
}

Property Access Optimization: Use property maps with optimized access patterns:

class OptimizedEntity {
  private propertyCache: Map<string, any> = new Map();
  private accessCounts: Map<string, number> = new Map();
  
  getProperty<T>(name: string): T | undefined {
    // Track access patterns for optimization
    this.accessCounts.set(name, (this.accessCounts.get(name) || 0) + 1);
    
    // Check cache first
    if (this.propertyCache.has(name)) {
      return this.propertyCache.get(name);
    }
    
    // Load from storage and cache frequently accessed properties
    const value = this.loadPropertyFromStorage(name);
    if (this.accessCounts.get(name)! > 3) {
      this.propertyCache.set(name, value);
    }
    
    return value;
  }
}

Database Optimization

Strategic Indexing: Create indexes based on query patterns rather than all properties:

// MongoDB optimization for AOM queries
await db.collection('entities').createIndex({
  'entityType': 1,
  'properties.status': 1,
  'updatedAt': -1
}, {
  name: 'entity_status_time_idx',
  partialFilterExpression: {
    'properties.status': { $exists: true }
  }
});

// Compound index for common query patterns
await db.collection('entities').createIndex({
  'entityType': 1,
  'properties.category': 1,
  'properties.priority': 1
});

Query Optimization Patterns: Use aggregation pipelines for complex queries:

class OptimizedEntityStore {
  async findEntitiesWithAggregation(criteria) {
    return await this.collection.aggregate([
      // Match stage - use indexes
      {
        $match: {
          entityType: criteria.type,
          'properties.status': { $in: criteria.statuses }
        }
      },
      
      // Project only needed fields early
      {
        $project: {
          _id: 1,
          entityType: 1,
          'properties.name': 1,
          'properties.status': 1,
          'properties.priority': 1
        }
      },
      
      // Sort with index support
      {
        $sort: { 'properties.priority': -1, _id: 1 }
      },
      
      // Limit results early
      { $limit: criteria.limit || 100 }
    ]).toArray();
  }
}

Connection Pooling and Read Replicas: Optimize database connections for high-throughput scenarios:

class DatabaseManager {
  private writePool: ConnectionPool;
  private readPools: ConnectionPool[];
  
  async saveEntity(entity: Entity): Promise<void> {
    // Use write connection for mutations
    const connection = await this.writePool.getConnection();
    try {
      await connection.save(entity);
    } finally {
      this.writePool.releaseConnection(connection);
    }
  }
  
  async findEntities(query: any): Promise<Entity[]> {
    // Use read replicas for queries
    const readPool = this.selectOptimalReadPool();
    const connection = await readPool.getConnection();
    try {
      return await connection.find(query);
    } finally {
      readPool.releaseConnection(connection);
    }
  }
}

Memory Management

Lazy Loading: Load entity properties on demand:

class LazyEntity
  def initialize(entity_type, id)
    @entity_type = entity_type
    @id = id
    @loaded_properties = {}
    @all_loaded = false
  end
  
  def method_missing(method_name, *args)
    property_name = method_name.to_s
    
    if @entity_type.has_property?(property_name)
      load_property(property_name) unless @loaded_properties.key?(property_name)
      @loaded_properties[property_name]
    else
      super
    end
  end
  
  private def load_property(property_name)
    # Load single property from database
    value = Database.load_property(@id, property_name)
    @loaded_properties[property_name] = value
  end
end

Weak References for Caches: Prevent memory leaks in entity caches:

use std::sync::Weak;
use std::collections::HashMap;

pub struct WeakEntityCache {
    entities: HashMap<String, Weak<Entity>>,
}

impl WeakEntityCache {
    pub fn get(&mut self, id: &str) -> Option<Arc<Entity>> {
        // Clean up dead references periodically
        if let Some(weak_ref) = self.entities.get(id) {
            if let Some(entity) = weak_ref.upgrade() {
                return Some(entity);
            } else {
                self.entities.remove(id);
            }
        }
        None
    }
    
    pub fn insert(&mut self, id: String, entity: Arc<Entity>) {
        self.entities.insert(id, Arc::downgrade(&entity));
    }
}

Security and Validation Framework

Security in AOM systems is critical due to the dynamic nature of schema and operations. Traditional security models must be extended to handle runtime modifications safely.

Authorization Framework

Schema Modification Permissions: Control who can modify entity types:

interface SchemaPermission {
  principal: string; // user or role
  entityType: string;
  actions: SchemaAction[];
  conditions?: PermissionCondition[];
}

enum SchemaAction {
  CREATE_TYPE = 'CREATE_TYPE',
  MODIFY_TYPE = 'MODIFY_TYPE',
  DELETE_TYPE = 'DELETE_TYPE',
  ADD_PROPERTY = 'ADD_PROPERTY',
  REMOVE_PROPERTY = 'REMOVE_PROPERTY',
  ADD_OPERATION = 'ADD_OPERATION'
}

class SchemaAuthorizationService {
  checkPermission(
    principal: string, 
    action: SchemaAction, 
    entityType: string
  ): boolean {
    const permissions = this.getPermissions(principal);
    
    return permissions.some(permission => 
      permission.entityType === entityType &&
      permission.actions.includes(action) &&
      this.evaluateConditions(permission.conditions)
    );
  }
}

Property-Level Access Control: Fine-grained access control for sensitive properties:

use serde::{Serialize, Deserialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PropertyAccess {
    pub property_name: String,
    pub read_roles: Vec<String>,
    pub write_roles: Vec<String>,
    pub sensitive: bool,
}

impl Entity {
    pub fn get_property_secure(&self, name: &str, user_roles: &[String]) -> Result<Option<&PropertyValue>, SecurityError> {
        let access = self.entity_type.get_property_access(name)
            .ok_or(SecurityError::PropertyNotFound)?;
        
        if !access.read_roles.iter().any(|role| user_roles.contains(role)) {
            return Err(SecurityError::InsufficientPermissions);
        }
        
        if access.sensitive {
            self.audit_property_access(name, user_roles);
        }
        
        Ok(self.properties.get(name).map(|p| &p.value))
    }
}

Input Validation and Sanitization

Dynamic Property Validation: Validate properties based on runtime type definitions:

class PropertyValidator {
  static validate(
    property: Property, 
    propertyType: PropertyType, 
    context: ValidationContext
  ): ValidationResult {
    const errors: string[] = [];
    
    // Type validation
    if (!this.isValidType(property.value, propertyType.valueType)) {
      errors.push(`Invalid type for ${propertyType.name}`);
    }
    
    // Custom validation rules
    if (propertyType.validator) {
      try {
        const isValid = propertyType.validator(property.value, context);
        if (!isValid) {
          errors.push(`Custom validation failed for ${propertyType.name}`);
        }
      } catch (error) {
        errors.push(`Validation error: ${error.message}`);
      }
    }
    
    // Sanitization for string properties
    if (typeof property.value === 'string') {
      property.value = this.sanitizeString(property.value);
    }
    
    return {
      valid: errors.length === 0,
      errors,
      sanitizedValue: property.value
    };
  }
  
  private static sanitizeString(input: string): string {
    // Remove potentially dangerous content
    return input
      .replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '')
      .replace(/javascript:/gi, '')
      .replace(/on\w+\s*=/gi, '');
  }
}

Business Rule Enforcement: Implement complex validation rules across entities:

class BusinessRuleEngine
  def initialize
    @rules = {}
  end
  
  def add_rule(entity_type, rule_name, &block)
    @rules[entity_type] ||= {}
    @rules[entity_type][rule_name] = block
  end
  
  def validate_entity(entity)
    errors = []
    
    if rules = @rules[entity.entity_type.name]
      rules.each do |rule_name, rule_block|
        begin
          result = rule_block.call(entity)
          unless result.valid?
            errors.concat(result.errors.map { |e| "#{rule_name}: #{e}" })
          end
        rescue => e
          errors << "Rule #{rule_name} failed: #{e.message}"
        end
      end
    end
    
    ValidationResult.new(errors.empty?, errors)
  end
end

# Usage example
rule_engine = BusinessRuleEngine.new

rule_engine.add_rule('Vehicle', 'valid_year') do |entity|
  year = entity.get_property('year')
  if year && (year < 1900 || year > Date.current.year + 1)
    ValidationResult.new(false, ['Year must be between 1900 and next year'])
  else
    ValidationResult.new(true, [])
  end
end

Operation Security

Safe Operation Binding: Ensure operations cannot execute arbitrary code:

class SecureOperationBinder {
  private allowedOperations: Set<string> = new Set();
  private operationSandbox: OperationSandbox;
  
  constructor() {
    // Whitelist of safe operations
    this.allowedOperations.add('calculate');
    this.allowedOperations.add('format');
    this.allowedOperations.add('validate');
    
    this.operationSandbox = new OperationSandbox({
      allowedGlobals: ['Math', 'Date'],
      timeoutMs: 5000,
      memoryLimitMB: 10
    });
  }
  
  bindOperation(name: string, code: string): Operation {
    if (!this.allowedOperations.has(name)) {
      throw new Error(`Operation ${name} not in whitelist`);
    }
    
    // Static analysis for dangerous patterns
    if (this.containsDangerousPatterns(code)) {
      throw new Error('Operation contains dangerous patterns');
    }
    
    return this.operationSandbox.compile(code);
  }
  
  private containsDangerousPatterns(code: string): boolean {
    const dangerousPatterns = [
      /eval\s*\(/,
      /Function\s*\(/,
      /require\s*\(/,
      /import\s+/,
      /process\./,
      /global\./,
      /window\./
    ];
    
    return dangerousPatterns.some(pattern => pattern.test(code));
  }
}

Anti-patterns and Common Pitfalls

Learning from failures is crucial for successful AOM implementations. Here are the most common anti-patterns and how to avoid them.

1. Over-Engineering Stable Domains

Anti-pattern: Applying AOM to domains that rarely change

// DON'T: Using AOM for basic user authentication
const userType = new EntityType('User');
userType.addProperty('username', 'string');
userType.addProperty('passwordHash', 'string');
userType.addProperty('email', 'string');

// Better: Use traditional class for stable domain
class User {
  constructor(
    public username: string,
    public passwordHash: string,
    public email: string
  ) {}
}

When to avoid AOM:

  • Core business entities that haven’t changed in years
  • Performance-critical code paths
  • Simple CRUD operations
  • Well-established domain models

2. Performance Neglect

Anti-pattern: Ignoring performance implications of dynamic queries

// DON'T: Loading all entity properties for simple operations
async function getEntityName(id) {
  const entity = await entityStore.loadFullEntity(id); // Loads everything
  return entity.getProperty('name');
}

// Better: Load only needed properties
async function getEntityName(id) {
  return await entityStore.loadProperty(id, 'name');
}

Performance Guidelines:

  • Monitor query performance continuously
  • Use database profiling tools
  • Implement property-level lazy loading
  • Cache frequently accessed entity types

3. Type Explosion

Anti-pattern: Creating too many similar entity types instead of using properties

// DON'T: Creating separate types for minor variations
const sedanType = new EntityType('Sedan');
const suvType = new EntityType('SUV');
const truckType = new EntityType('Truck');

// Better: Use discriminator properties
const vehicleType = new EntityType('Vehicle');
vehicleType.addProperty('bodyType', 'enum', {
  values: ['sedan', 'suv', 'truck']
});

Type Design Guidelines:

  • Prefer composition over type proliferation
  • Use enums and discriminator fields
  • Consider type hierarchies carefully
  • Regular type audits to identify similar types

4. Missing Business Constraints

Anti-pattern: Focusing on technical flexibility while ignoring business rules

# DON'T: Allowing any combination of properties
vehicle = registry.create_entity('Vehicle',
  maker: 'Tesla',
  fuel_type: 'gasoline',  # This makes no sense!
  electric: true
)

# Better: Implement cross-property validation
class VehicleValidator
  def validate(entity)
    if entity.electric? && entity.fuel_type != 'electric'
      raise ValidationError, "Electric vehicles cannot have gasoline fuel type"
    end
  end
end

Constraint Guidelines:

  • Define business rules explicitly
  • Implement cross-property validation
  • Use state machines for complex business logic
  • Regular business rule audits

5. Cache Invalidation Problems

Anti-pattern: Inconsistent cache invalidation strategies

// DON'T: Forgetting to invalidate dependent caches
impl EntityStore {
    fn update_entity_type(&mut self, entity_type: EntityType) {
        self.entity_types.insert(entity_type.name.clone(), entity_type);
        // Forgot to invalidate entity instances cache!
    }
}

// Better: Comprehensive invalidation strategy
impl EntityStore {
    fn update_entity_type(&mut self, entity_type: EntityType) {
        let type_name = entity_type.name.clone();
        
        // Update type cache
        self.entity_types.insert(type_name.clone(), entity_type);
        
        // Invalidate all dependent caches
        self.entity_cache.invalidate_by_type(&type_name);
        self.query_cache.invalidate_by_type(&type_name);
        self.compiled_operations.remove(&type_name);
        
        // Notify cache invalidation to other systems
        self.event_bus.publish(CacheInvalidationEvent::new(type_name));
    }
}

6. Inadequate Error Handling

Anti-pattern: Generic error messages that don’t help debugging

// DON'T: Vague error messages
throw new Error('Property validation failed');

// Better: Detailed, actionable error messages
throw new PropertyValidationError({
  entityType: 'Vehicle',
  entityId: 'vehicle_123',
  property: 'year',
  value: 1850,
  constraint: 'must be between 1900 and 2025',
  suggestedFix: 'Check data source for year property'
});

7. Security Oversights

Anti-pattern: Treating dynamic properties like static ones for security

# DON'T: No access control on dynamic properties
def get_property(entity_id, property_name):
    entity = load_entity(entity_id)
    return entity.get_property(property_name)  # No security check!

# Better: Property-level security
def get_property(entity_id, property_name, user_context):
    entity = load_entity(entity_id)
    
    if not has_property_access(user_context, entity.type, property_name):
        raise SecurityError(f"Access denied to property {property_name}")
    
    if is_sensitive_property(entity.type, property_name):
        audit_log.record_access(user_context, entity_id, property_name)
    
    return entity.get_property(property_name)

8. Testing Gaps

Anti-pattern: Only testing the happy path with AOM systems

// DON'T: Only test valid configurations
test('creates vehicle entity', () => {
  const vehicle = factory.createEntity('Vehicle', {
    maker: 'Toyota',
    model: 'Camry'
  });
  expect(vehicle.maker).toBe('Toyota');
});

// Better: Test edge cases and error conditions
describe('Vehicle Entity', () => {
  test('rejects invalid property types', () => {
    expect(() => {
      factory.createEntity('Vehicle', {
        maker: 123, // Should be string
        model: 'Camry'
      });
    }).toThrow('Invalid property type');
  });
  
  test('handles missing required properties', () => {
    expect(() => {
      factory.createEntity('Vehicle', {
        model: 'Camry' // Missing required 'maker'
      });
    }).toThrow('Required property missing: maker');
  });
});

Prevention Strategies

  • Regular Architecture Reviews: Schedule periodic reviews of entity type proliferation and usage patterns.
  • Performance Monitoring: Implement continuous monitoring of query performance and cache hit rates.
  • Security Audits: Regular audits of property access patterns and operation bindings.
  • Automated Testing: Comprehensive test suites covering edge cases and error conditions.
  • Documentation Standards: Maintain clear documentation of business rules and constraints.

Practical Implementation

To demonstrate these concepts in practice, I’ve created a sample project with working implementations in all three languages discussed: AOM Sample Project.

The repository includes:

  • Rust implementation (cargo run) – Type-safe AOM with memory safety
  • TypeScript implementation (npx ts-node app.ts) – Gradual typing with modern JavaScript features
  • Ruby implementation (ruby app.rb) – Metaprogramming-powered flexibility

Conclusion

The Adaptive Object Model pattern continues to evolve with modern programming languages and database technologies. While the core concepts remain the same, implementation approaches have been refined to take advantage of:

  • Type safety in languages like Rust and TypeScript
  • Better performance through caching and optimized data structures
  • Improved developer experience with modern tooling and language features
  • Scalable persistence using document databases and modern storage patterns

The combination of dynamic languages with flexible type systems and schema-less databases provides a powerful foundation for building adaptable systems. From my consulting experience implementing AOM on large projects, I’ve seen mixed results that highlight important considerations. The pattern’s flexibility is both its greatest strength and potential weakness. Without proper architectural discipline, teams can easily create overly complex systems with inconsistent entity types and validation rules. The dynamic nature that makes AOM powerful also requires more sophisticated debugging skills and comprehensive testing strategies than traditional static systems. In my early implementations using relational databases, we suffered from performance issues due to the excessive joins required to reconstruct entities from the normalized AOM tables. This was before NoSQL and document-oriented databases became mainstream. Modern document databases have fundamentally changed the viability equation by storing AOM entities naturally without the join penalties that plagued earlier implementations.

The practical implementations available at https://github.com/bhatti/aom-sample demonstrate that AOM is not just theoretical but a viable architectural approach for real-world systems. By studying these examples and adapting them to your specific domain requirements, you can build systems that gracefully evolve with changing business needs.

August 30, 2025

Bridging HTTP and gRPC: A Standardized Approach to Header Mapping in Microservices

Filed under: Computing,Web Services — admin @ 10:49 pm

Modern microservices architectures often require supporting both HTTP REST APIs and gRPC services simultaneously. While Google’s gRPC-Gateway provides HTTP and gRPC transcoding capabilities, the challenge of bidirectional header mapping between these protocols remains a common source of inconsistency, bugs, and maintenance overhead across services. This article explores the technical challenges of HTTP-gRPC header mapping, examines current approaches and their limitations, and presents a standardized middleware solution that addresses these issues.

Understanding gRPC AIP and HTTP/gRPC Transcoding

Google’s Application Programming Interface Improvement (AIP) standards define how to build consistent, intuitive APIs. For example, AIP-127: HTTP and gRPC Transcoding enables a single service implementation to serve both HTTP REST and gRPC traffic through protocol transcoding.

How gRPC-Gateway Transcoding Works

The gRPC-Gateway acts as a reverse proxy that translates HTTP requests into gRPC calls:

HTTP Client ? gRPC-Gateway ? gRPC Server
     ?              ?            ?
REST Request   Proto Message   gRPC Service

Following is the transcoding process:

  1. URL Path to RPC Method: HTTP paths map to gRPC service methods
  2. HTTP Body to Proto Message: JSON payloads become protobuf messages
  3. Query Parameters to Fields: URL parameters populate message fields
  4. HTTP Headers to gRPC Metadata: Headers become gRPC metadata key-value pairs

The Header Mapping Challenge

While gRPC-Gateway handles most transcoding automatically, header mapping requires explicit configuration. Consider this common scenario:

HTTP Request:

POST /v1/users
Authorization: Bearer abc123
X-Request-ID: req-456
X-User-Role: admin
Content-Type: application/json

Desired gRPC Metadata:

metadata.MD{
    "authorization": []string{"Bearer abc123"},
    "request-id":    []string{"req-456"}, 
    "user-role":     []string{"admin"},
}

Response Headers Needed:

X-Request-ID: req-456
X-Processing-Time: 150ms
X-Server-Version: v1.2.0

Without proper configuration, headers are lost, inconsistently mapped, or require custom code in each service.

Current Problems and Anti-Patterns

Problem 1: Fragmented Header Mapping Solutions

Most services implement header mapping ad-hoc:

// Service A approach
func (s *ServiceA) CreateUser(ctx context.Context, req *pb.CreateUserRequest) (*pb.User, error) {
    md, _ := metadata.FromIncomingContext(ctx)
    authHeader := md.Get("authorization")
    userID := md.Get("x-user-id")
    // ... custom mapping logic
}

// Service B approach  
func (s *ServiceB) GetOrder(ctx context.Context, req *pb.GetOrderRequest) (*pb.Order, error) {
    // Different header names, different extraction logic
    md, _ := metadata.FromIncomingContext(ctx)
    auth := md.Get("auth")  // Different from Service A!
    requestID := md.Get("request_id")  // Different format!
}

This leads to:

  • Inconsistent header naming across services
  • Duplicated mapping logic in every service
  • Maintenance burden when headers change
  • Testing complexity due to custom implementations

Problem 2: Context Abuse and Memory Issues

I have often observed misuse of Go’s context for storing large amounts of data that puts the service at risk of being killed due to OOM:

// ANTI-PATTERN: Storing large objects in context
type UserContext struct {
    User        *User           // Large user object
    Permissions []Permission    // Array of permissions  
    Preferences *UserPrefs      // User preferences
    AuditLog    []AuditEntry   // Historical data
}

func StoreUserInContext(ctx context.Context, user *UserContext) context.Context {
    return context.WithValue(ctx, "user", user)  // BAD: Large object in context
}

Why This Causes Problems:

  1. Memory Leaks: Contexts are passed through the entire request chain and may not be garbage collected promptly
  2. Performance Degradation: Large context objects increase allocation pressure
  3. Goroutine Overhead: Each concurrent request carries this memory burden
  4. Service Instability: Under load, memory usage can spike and cause OOM kills

Proper Pattern:

// GOOD: Store only identifiers in context  
func StoreUserIDInContext(ctx context.Context, userID string) context.Context {
    return context.WithValue(ctx, "user_id", userID)  // Small string only
}

// Fetch data when needed from database/cache
func GetUserFromContext(ctx context.Context) (*User, error) {
    userID := ctx.Value("user_id").(string)
    return userService.GetUser(userID)  // Fetch from datastore
}

Problem 3: Inconsistent Response Header Handling

Setting response headers requires different approaches across the stack:

// gRPC: Set headers via metadata
grpc.SendHeader(ctx, metadata.New(map[string]string{
    "x-server-version": "v1.2.0",
}))

// HTTP: Set headers on ResponseWriter  
w.Header().Set("X-Server-Version", "v1.2.0")

// gRPC-Gateway: Headers must be set in specific metadata format
grpc.SetHeader(ctx, metadata.New(map[string]string{
    "grpc-metadata-x-server-version": "v1.2.0",  // Prefix required
}))

This complexity leads to missing response headers and inconsistent client experiences.

Solution: Standardized Header Mapping Middleware

The solution is a dedicated middleware that handles bidirectional header mapping declaratively, allowing services to focus on business logic while ensuring consistent header handling across the entire API surface.

Core Architecture

HTTP Request ? Gateway Middleware ? gRPC Interceptor ? Service
     ?              ?                    ?              ?
HTTP Headers ? Metadata Annotation ? Context Metadata ? Business Logic
                                                         ?
HTTP Response ? Response Modifier ? Header Metadata ? Service Response

The middleware operates at two key points:

  1. Gateway Level: Maps HTTP headers to gRPC metadata for incoming requests
  2. Interceptor Level: Processes metadata and manages response header mapping

Configuration-Driven Approach

Instead of custom code, header mapping is configured declaratively:

mapper := headermapper.NewBuilder().
    // Authentication headers
    AddIncomingMapping("Authorization", "authorization").WithRequired(true).
    AddIncomingMapping("X-API-Key", "api-key").
    
    // Request tracking (bidirectional)  
    AddBidirectionalMapping("X-Request-ID", "request-id").
    AddBidirectionalMapping("X-Trace-ID", "trace-id").
    
    // Response headers
    AddOutgoingMapping("processing-time", "X-Processing-Time").
    AddOutgoingMapping("server-version", "X-Server-Version").
    
    // Transformations
    AddIncomingMapping("Authorization", "auth-token").
    WithTransform(headermapper.ChainTransforms(
        headermapper.TrimSpace,
        headermapper.RemovePrefix("Bearer "),
    )).
    
    Build()

This configuration drives all header mapping behavior without requiring service-specific code.

How The Middleware Works: Step-by-Step

Step 1: HTTP Request Processing

When an HTTP request arrives at the gRPC-Gateway:

POST /v1/users HTTP/1.1
Authorization: Bearer abc123
X-Request-ID: req-456
X-User-Role: admin
Content-Type: application/json

The MetadataAnnotator processes configured incoming mappings:

func (hm *HeaderMapper) MetadataAnnotator() func(context.Context, *http.Request) metadata.MD {
    return func(ctx context.Context, req *http.Request) metadata.MD {
        md := metadata.New(map[string]string{})
        
        for _, mapping := range hm.config.Mappings {
            if mapping.Direction == Outgoing {
                continue  // Skip outgoing-only mappings
            }
            
            headerValue := req.Header.Get(mapping.HTTPHeader)
            if headerValue != "" {
                // Apply transformations if configured
                if mapping.Transform != nil {
                    headerValue = mapping.Transform(headerValue)
                }
                md.Set(mapping.GRPCMetadata, headerValue)
            }
        }
        return md
    }
}

Result: HTTP headers become gRPC metadata:

metadata.MD{
    "authorization": []string{"Bearer abc123"},
    "auth-token":    []string{"abc123"},        // Transformed  
    "request-id":    []string{"req-456"},
    "user-role":     []string{"admin"},
}

Step 2: gRPC Interceptor Processing

The gRPC unary interceptor receives the enhanced context:

func (hm *HeaderMapper) UnaryServerInterceptor() grpc.UnaryServerInterceptor {
    return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
        // Context already contains mapped metadata from Step 1
        
        // Call the actual service method
        resp, err := handler(ctx, req)
        
        // Response headers are handled by ResponseModifier
        return resp, err
    }
}

Step 3: Service Implementation

The service method accesses headers through standard gRPC metadata APIs:

func (s *UserService) CreateUser(ctx context.Context, req *pb.CreateUserRequest) (*pb.User, error) {
    md, _ := metadata.FromIncomingContext(ctx)
    
    // Headers are consistently available
    authToken := getFirstValue(md, "auth-token")      // "abc123" (transformed)
    requestID := getFirstValue(md, "request-id")      // "req-456"  
    userRole := getFirstValue(md, "user-role")        // "admin"
    
    // Set response headers
    grpc.SetHeader(ctx, metadata.New(map[string]string{
        "processing-time": "150",
        "server-version": "v1.2.0",  
        "request-id": requestID,     // Echo back request ID
    }))
    
    return &pb.User{...}, nil
}

Step 4: Response Header Processing

The ResponseModifier maps gRPC metadata to HTTP response headers:

func (hm *HeaderMapper) ResponseModifier() func(context.Context, http.ResponseWriter, proto.Message) error {
    return func(ctx context.Context, w http.ResponseWriter, msg proto.Message) error {
        md, ok := runtime.ServerMetadataFromContext(ctx)
        if !ok {
            return nil
        }
        
        for _, mapping := range hm.config.Mappings {
            if mapping.Direction == Incoming {
                continue  // Skip incoming-only mappings  
            }
            
            values := md.HeaderMD.Get(mapping.GRPCMetadata)
            if len(values) > 0 {
                headerValue := values[0]
                
                // Apply transformations
                if mapping.Transform != nil {
                    headerValue = mapping.Transform(headerValue)  
                }
                
                w.Header().Set(mapping.HTTPHeader, headerValue)
            }
        }
        return nil
    }
}

Final HTTP Response:

HTTP/1.1 200 OK
X-Request-ID: req-456
X-Processing-Time: 150ms  
X-Server-Version: v1.2.0
Content-Type: application/json

{"user": {...}}

Advanced Features

Header Transformations

The middleware supports header value transformations:

// Extract JWT tokens
AddIncomingMapping("Authorization", "jwt-token").
WithTransform(headermapper.ChainTransforms(
    headermapper.TrimSpace,
    headermapper.RemovePrefix("Bearer "),
    headermapper.Truncate(100),  // Prevent large tokens
))

// Sanitize user agents
AddIncomingMapping("User-Agent", "client-info").  
WithTransform(headermapper.RegexReplace(`\d+\.\d+(\.\d+)*`, "x.x.x"))

// Format timestamps
AddOutgoingMapping("response-time", "X-Response-Time").
WithTransform(headermapper.AddSuffix("ms"))

Configuration from Files

For complex deployments, configuration can be externalized:

# header-mapping.yaml
mappings:
  - http_header: "Authorization"
    grpc_metadata: "authorization" 
    direction: 0  # Incoming
    required: true
    
  - http_header: "X-Request-ID"
    grpc_metadata: "request-id"
    direction: 2  # Bidirectional
    default_value: "auto-generated"

skip_paths:
  - "/health"
  - "/metrics"
  
debug: false
config, err := headermapper.LoadConfigFromFile("header-mapping.yaml")
if err != nil {
    log.Fatal("Failed to load config:", err)
}

mapper := headermapper.NewHeaderMapper(config)

Path-Based Filtering

Skip header processing for specific endpoints:

mapper := headermapper.NewBuilder().
    AddIncomingMapping("Authorization", "authorization").
    SkipPaths("/health", "/metrics", "/debug").  // No auth required
    Build()

Integration Guide

Basic Integration

package main

import (
    "github.com/your-org/grpc-header-mapper/headermapper"
    "github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
)

func main() {
    // Create header mapper
    mapper := headermapper.NewBuilder().
        AddIncomingMapping("Authorization", "authorization").
        AddBidirectionalMapping("X-Request-ID", "request-id").
        Build()
    
    // Configure gRPC server
    grpcServer := grpc.NewServer(
        grpc.UnaryInterceptor(mapper.UnaryServerInterceptor()),
    )
    
    // Configure HTTP gateway
    mux := headermapper.CreateGatewayMux(mapper)
    
    // Register services...
}

Production Deployment

func createProductionMapper() *headermapper.HeaderMapper {
    return headermapper.NewBuilder().
        // Authentication
        AddIncomingMapping("Authorization", "authorization").WithRequired(true).
        AddIncomingMapping("X-API-Key", "api-key").
        
        // Request correlation
        AddBidirectionalMapping("X-Request-ID", "request-id").
        AddBidirectionalMapping("X-Correlation-ID", "correlation-id"). 
        AddBidirectionalMapping("X-Trace-ID", "trace-id").
        
        // Client information
        AddIncomingMapping("User-Agent", "user-agent").
        AddIncomingMapping("X-Client-Version", "client-version").
        
        // Response headers
        AddOutgoingMapping("processing-time-ms", "X-Processing-Time").
        AddOutgoingMapping("server-version", "X-Server-Version").
        AddOutgoingMapping("rate-limit-remaining", "X-RateLimit-Remaining").
        
        // Security headers
        AddOutgoingMapping("content-security-policy", "Content-Security-Policy").
        WithDefault("default-src 'self'").
        
        // Skip system endpoints
        SkipPaths("/health", "/metrics", "/debug", "/admin").
        
        // Production settings
        Debug(false).
        OverwriteExisting(true).
        Build()
}

Performance and Reliability Benefits

Consistent Memory Usage

By standardizing header extraction and avoiding context abuse, services maintain predictable memory profiles:

// Before: Inconsistent, potentially large context values
ctx = context.WithValue(ctx, "user", largeUserObject)      // BAD
ctx = context.WithValue(ctx, "permissions", permissionList) // BAD

// After: Consistent, minimal context usage  
// Headers extracted to standard metadata, large objects fetched on-demand
func GetUserFromContext(ctx context.Context) (*User, error) {
    userID := getMetadata(ctx, "user-id")
    return userCache.Get(userID)  // Cached lookup
}

Reduced Code Duplication

Header mapping logic is centralized, eliminating per-service implementations:

Improved Observability

Consistent header handling enables better monitoring:

// All services automatically have request correlation
func (s *AnyService) AnyMethod(ctx context.Context, req *AnyRequest) (*AnyResponse, error) {
    requestID := getMetadata(ctx, "request-id")  // Always available
    log.WithField("request_id", requestID).Info("Processing request")
    
    // Business logic...
    
    return response, nil
}

Testing Benefits

Standardized header mapping simplifies integration testing:

func TestServiceWithHeaders(t *testing.T) {
    // Headers work consistently across all services
    client := pb.NewUserServiceClient(conn)
    
    ctx := metadata.NewOutgoingContext(context.Background(), metadata.New(map[string]string{
        "authorization": "Bearer test-token",
        "request-id":    "test-req-123",
    }))
    
    resp, err := client.CreateUser(ctx, &pb.CreateUserRequest{...})
    
    // Response headers are consistently available
    md, _ := metadata.FromIncomingContext(ctx)
    requestID := getMetadata(md, "request-id")  // "test-req-123"
}

Security Considerations

Header Validation

The middleware supports header validation and sanitization:

mapper := headermapper.NewBuilder().
    AddIncomingMapping("Authorization", "authorization").
    WithTransform(headermapper.ChainTransforms(
        headermapper.TrimSpace,
        headermapper.Truncate(512),  // Prevent oversized headers
        validateJWTFormat,           // Custom validation
    )).
    Build()

func validateJWTFormat(token string) string {
    if !strings.HasPrefix(token, "Bearer ") {
        return "invalid"  // Reject malformed tokens
    }
    return token
}

Sensitive Data Handling

Headers containing sensitive data can be masked in logs:

AddIncomingMapping("Authorization", "authorization").
WithTransform(headermapper.MaskSensitive(4)).  // Show first/last 4 chars

Rate Limiting Integration

Response headers can include rate limiting information:

AddOutgoingMapping("rate-limit-remaining", "X-RateLimit-Remaining").
AddOutgoingMapping("rate-limit-reset", "X-RateLimit-Reset").

Monitoring and Debugging

Debug Mode

Enable debug logging to verify header mapping:

mapper := headermapper.NewBuilder().
    Debug(true).  // Enable detailed logging
    Build()

mapper.SetLogger(customLogger)  // Use your logging framework

Debug Output:

[DEBUG] [HeaderMapper] Mapped incoming headers: map[authorization:[Bearer abc123] request-id:[req-456]]
[DEBUG] [HeaderMapper] Mapped outgoing headers to response  

Metrics Integration

The middleware can integrate with monitoring systems:

stats := mapper.GetStats()
prometheus.IncomingHeadersMappedCounter.Add(stats.IncomingMappings)
prometheus.OutgoingHeadersMappedCounter.Add(stats.OutgoingMappings)
prometheus.MappingErrorsCounter.Add(stats.FailedMappings)

Why This Matters

Microservices Consistency

In large microservices architectures, inconsistent header handling creates operational overhead:

  • Debugging becomes difficult when services use different header names
  • Client libraries must handle different header formats per service
  • Security policies cannot be uniformly enforced
  • Observability suffers from inconsistent request correlation

Standardized header mapping addresses these issues by ensuring consistency across the entire service mesh.

Developer Productivity

Developers spend significant time on infrastructure concerns rather than business logic. This middleware eliminates:

  • Boilerplate code for header extraction and response setting
  • Testing complexity around header handling edge cases
  • Documentation overhead for service-specific header requirements
  • Bug investigation related to missing or malformed headers

Operational Excellence

Standard header mapping enables:

  • Automated monitoring with consistent request correlation
  • Security scanning with predictable header formats
  • Performance analysis across service boundaries
  • Compliance auditing with standardized access logging

Conclusion

HTTP and gRPC transcoding is a powerful pattern for modern APIs, but header mapping complexity has been a persistent challenge. The gRPC Header Mapper middleware presented in this article provides a solution that enables true bidirectional header mapping between HTTP and gRPC protocols.

By providing a standardized, configuration-driven middleware solution available at github.com/bhatti/grpc-header-mapper, teams can:

  1. Eliminate inconsistencies across services with bidirectional header mapping
  2. Reduce maintenance burden through centralized configuration
  3. Improve reliability by avoiding context misuse and memory leaks
  4. Enhance developer productivity by removing boilerplate code
  5. Support complex transformations with built-in and custom transformation functions

The middleware’s bidirectional mapping capability means that headers flow seamlessly in both directions – HTTP requests to gRPC metadata for service processing, and gRPC metadata back to HTTP response headers for client consumption. This eliminates the common problem where request headers are available to services but response headers are lost or inconsistently handled.

The complete implementation, examples, and documentation are available at github.com/bhatti/grpc-header-mapper.

August 25, 2025

Beyond Vibe Coding: Using TLA+ and Executable Specifications with Claude

Filed under: Computing,Uncategorized — admin @ 9:45 pm

TL;DR: The Problem and Solution

Problem: AI-assisted coding fails when modifying existing systems because we give AI vague specifications.

Solution: Use TLA+ formal specifications as precise contracts that Claude can implement reliably.

Result: Transform Claude from a code generator into a reliable engineering partner that reasons about complex systems.

After months of using Claude for development, I discovered most AI-assisted coding fails not because the AI isn’t smart enough, but because we’re asking it to work from vague specifications. This post shows you how to move beyond “vibe coding” using executable specifications that turn Claude into a reliable engineering partner.

Here’s what changes when you use TLA+ with Claude:

Before (Vibe Coding):

  • “Create a task management API”
  • Claude guesses at requirements
  • Inconsistent behavior across edge cases
  • Bugs in corner cases

After (TLA+ Specifications):

  • Precise mathematical specification
  • Claude implements exactly what you specified
  • All edge cases defined upfront
  • Properties verified before deployment

The Vibe Coding Problem

AI assistants like Claude are primarily trained on greenfield development patterns. They excel at:

  • Writing new functions from scratch
  • Implementing well-known algorithms
  • Creating boilerplate code

But they struggle with:

  • Understanding implicit behavioral contracts in existing code
  • Maintaining invariants across system modifications
  • Reasoning about state transitions and edge cases
  • Preserving non-functional requirements (performance, security, etc.)

The solution isn’t better prompts – it’s better specifications.

Enter Executable Specifications

An executable specification is a formal description of system behavior that can be:

  1. Verified – Checked for logical consistency
  2. Validated – Tested against real-world scenarios
  3. Executed – Run to generate test cases or even implementations

I’ve tried many approaches to precise specifications over the years:

UML and Model Driven Development (2000s-2010s): I used tools like Rational Rose and Visual Paradigm in early 2000s that promised complete code generation from UML models. The reality was different:

  • Visual complexity: UML diagrams became unwieldy for anything non-trivial
  • Tool lock-in: Proprietary formats and expensive tooling
  • Impedance mismatch: The gap between UML models and real code was huge
  • Maintenance nightmare: Keeping models and code synchronized was nearly impossible
  • Limited expressiveness: UML couldn’t capture complex behavioral contracts

BDD and Gherkin (mid-2000s): I used BDD and Gherkin in mid 2000s, which were better than UML for behavioral specifications, but still limited:

  • Structured natural language: Readable but not truly executable
  • No logical reasoning: Couldn’t catch design contradictions
  • Testing focused: Good for acceptance criteria, poor for system design

TLA+ (present): Takes executable specifications to their logical conclusion:

  • Mathematical precision: Eliminates ambiguity completely
  • Model checking: Explores all possible execution paths
  • Tool independence: Plain text specifications, open source tools
  • Behavioral focus: Designed specifically for concurrent and distributed systems

Why TLA+ with Claude?

The magic happens when you combine TLA+’s precision with Claude’s implementation capabilities:

  1. TLA+ eliminates ambiguity – There’s only one way to interpret a formal specification
  2. Claude can read TLA+ – It understands the formal syntax and can translate it to code
  3. Verification catches design flaws – TLA+ model checking finds edge cases you’d miss
  4. Generated traces become tests – TLA+ execution paths become your test suite

Setting Up Your Claude and TLA+ Environment

Installing Claude Desktop

First, let’s get Claude running on your machine:

# Install via Homebrew (macOS)
brew install --cask claude

# Or download directly from Anthropic
# https://claude.ai/download
  • Set up project-specific contexts in ~/.claude/
  • Create TLA+ syntax rules for better code generation
  • Configure memory settings for specification patterns

Configuring Your Workspace

Once installed, I recommend creating a dedicated workspace structure. Here’s what works for me:

# Create a Claude workspace directory
mkdir -p ~/claude-workspace/{projects,templates,context}

# Add a context file for your coding standards
cat > ~/claude-workspace/context/coding-standards.md << 'EOF'
# My Coding Standards

- Use descriptive variable names
- Functions should do one thing well
- Write tests for all new features
- Handle errors explicitly
- Document complex logic
EOF

Installing TLA+ Tools

Choose based on your workflow

  • GUI users: TLA+ Toolbox for visual model checking
  • CLI users: tla2tools.jar for CI integration
  • Both: VS Code extension for syntax highlighting
# Download TLA+ Tools from https://github.com/tlaplus/tlaplus/releases
# Or use Homebrew on macOS
brew install --cask tla-plus-toolbox

# For command-line usage (recommended for CI)
wget https://github.com/tlaplus/tlaplus/releases/download/v1.8.0/tla2tools.jar

VS Code Extension

Install the TLA+ extension for syntax highlighting and basic validation:

code --install-extension alygin.vscode-tlaplus

Your First TLA+ Specification

Let’s start with a simple example to understand the syntax:

--------------------------- MODULE SimpleCounter ---------------------------
VARIABLE counter

Init == counter = 0

Increment == counter' = counter + 1

Decrement == counter' = counter - 1

Next == Increment \/ Decrement

Spec == Init /\ [][Next]_counter

TypeInvariant == counter \in Int

=============================================================================

This specification defines:

  • State: A counter variable
  • Initial condition: Counter starts at 0
  • Actions: Increment or decrement operations
  • Next state relation: Either action can occur
  • Invariant: Counter is always an integer

Real-World Example: Task Management API

Now let’s build something real. We’ll create a task management API using TLA+ specifications that Claude can implement in Go.

Step 1: Define the System State

First, we model what our system looks like (TaskManagement.tla):

--------------------------- MODULE TaskManagement ---------------------------
EXTENDS Integers, Sequences, FiniteSets, TLC

CONSTANTS
    Users,          \* Set of users
    MaxTasks,       \* Maximum number of tasks
    MaxTime,        \* Maximum time value for simulation
    Titles,         \* Set of possible task titles
    Descriptions    \* Set of possible task descriptions

VARIABLES
    tasks,          \* Function from task ID to task record
    userTasks,      \* Function from user ID to set of task IDs
    nextTaskId,     \* Counter for generating unique task IDs
    currentUser,    \* Currently authenticated user
    clock,          \* Global clock for timestamps
    sessions        \* Active user sessions

\* Task states enumeration with valid transitions
TaskStates == {"pending", "in_progress", "completed", "cancelled", "blocked"}

\* Priority levels
Priorities == {"low", "medium", "high", "critical"}

\* Valid state transitions
ValidTransitions == {
    <<"pending", "in_progress">>,
    <<"pending", "cancelled">>,
    <<"pending", "blocked">>,
    <<"in_progress", "completed">>,
    <<"in_progress", "cancelled">>,
    <<"in_progress", "blocked">>,
    <<"in_progress", "pending">>,      \* Allow reverting to pending
    <<"blocked", "pending">>,
    <<"blocked", "in_progress">>,
    <<"blocked", "cancelled">>
}


TaskRecord == [
    id: Nat,
    title: STRING,
    description: STRING,
    status: TaskStates,
    priority: {"low", "medium", "high"},
    assignee: Users,
    createdAt: Nat,
    dueDate: Nat \cup {NULL}
]

\* Type invariants
TypeInvariant == 
    /\ tasks \in [Nat -> TaskRecord]
    /\ userTasks \in [Users -> SUBSET Nat]
    /\ nextTaskId \in Nat
    /\ currentUser \in Users \cup {NULL}

Step 2: Define System Actions

Now we specify what operations are possible (TaskManagement.tla):

\* System initialization
Init ==
    /\ tasks = [i \in {} |-> CHOOSE x : FALSE]  \* Empty function
    /\ userTasks = [u \in Users |-> {}]
    /\ nextTaskId = 1
    /\ currentUser = "NULL"
    /\ clock = 0
    /\ sessions = [u \in Users |-> FALSE]

\* User authentication
Authenticate(user) ==
    /\ user \in Users
    /\ ~sessions[user]  \* User not already logged in
    /\ currentUser' = user
    /\ sessions' = [sessions EXCEPT ![user] = TRUE]
    /\ UNCHANGED <<tasks, userTasks, nextTaskId, clock>>

\* Create a new task
CreateTask(title, description, priority, dueDate) ==
    /\ currentUser # NULL
    /\ nextTaskId <= MaxTasks
    /\ LET newTask == [
           id |-> nextTaskId,
           title |-> title,
           description |-> description,
           status |-> "pending",
           priority |-> priority,
           assignee |-> currentUser,
           createdAt |-> nextTaskId, \* Simplified timestamp
           dueDate |-> dueDate
       ] IN
       /\ tasks' = tasks @@ (nextTaskId :> newTask)
       /\ userTasks' = [userTasks EXCEPT ![currentUser] = @ \cup {nextTaskId}]
       /\ nextTaskId' = nextTaskId + 1
       /\ UNCHANGED currentUser

\* Update task status
UpdateTaskStatus(taskId, newStatus) ==
    /\ currentUser # NULL
    /\ taskId \in DOMAIN tasks
    /\ taskId \in userTasks[currentUser]
    /\ newStatus \in TaskStates
    /\ tasks' = [tasks EXCEPT ![taskId].status = newStatus]
    /\ UNCHANGED <<userTasks, nextTaskId, currentUser>>

\* Delete a task
DeleteTask(taskId) ==
    /\ currentUser # NULL
    /\ taskId \in DOMAIN tasks
    /\ taskId \in userTasks[currentUser]
    /\ tasks' = [id \in (DOMAIN tasks \ {taskId}) |-> tasks[id]]
    /\ userTasks' = [userTasks EXCEPT ![currentUser] = @ \ {taskId}]
    /\ UNCHANGED <<nextTaskId, currentUser>>

Step 3: Safety and Liveness Properties

TLA+ shines when defining system properties (TaskManagement.tla):

\* Safety properties
NoOrphanTasks ==
    \A taskId \in DOMAIN tasks :
        \E user \in Users : taskId \in GetUserTasks(user)

TaskOwnership ==
    \A taskId \in DOMAIN tasks :
        tasks[taskId].assignee \in Users /\
        taskId \in GetUserTasks(tasks[taskId].assignee)

ValidTaskIds ==
    \A taskId \in DOMAIN tasks : 
        /\ taskId < nextTaskId
        /\ taskId >= 1

NoDuplicateTaskIds ==
    \A t1, t2 \in DOMAIN tasks :
        t1 = t2 \/ tasks[t1].id # tasks[t2].id

ValidStateTransitionsInvariant ==
    \A taskId \in DOMAIN tasks :
        tasks[taskId].status \in TaskStates

ConsistentTimestamps ==
    \A taskId \in DOMAIN tasks :
        /\ tasks[taskId].createdAt <= tasks[taskId].updatedAt
        /\ tasks[taskId].updatedAt <= clock

NoCyclicDependencies ==
    LET
        \* Transitive closure of dependencies
        RECURSIVE TransitiveDeps(_)
        TransitiveDeps(taskId) ==
            IF ~TaskExists(taskId) THEN {}
            ELSE LET directDeps == tasks[taskId].dependencies IN
                 directDeps \cup 
                 UNION {TransitiveDeps(dep) : dep \in directDeps}
    IN
    \A taskId \in DOMAIN tasks :
        taskId \notin TransitiveDeps(taskId)

AuthenticationRequired ==
    \* All task operations require authentication
    \A taskId \in DOMAIN tasks :
        tasks[taskId].createdBy \in Users

SafetyInvariant ==
    /\ NoOrphanTasks
    /\ TaskOwnership
    /\ ValidTaskIds
    /\ NoDuplicateTaskIds
    /\ ValidStateTransitionsInvariant
    /\ ConsistentTimestamps
    /\ NoCyclicDependencies
    /\ AuthenticationRequired

\* Next state relation
Next ==
    \/ AdvanceTime
    \/ \E user \in Users : Authenticate(user)
    \/ Logout
    \/ \E t \in Titles, d \in Descriptions, p \in Priorities, 
         u \in Users, dd \in 0..MaxTime \cup {"NULL"},
         tags \in SUBSET {"bug", "feature", "enhancement", "documentation"},
         deps \in SUBSET DOMAIN tasks :
       CreateTask(t, d, p, u, dd, tags, deps)
    \/ \E taskId \in DOMAIN tasks, newStatus \in TaskStates :
       UpdateTaskStatus(taskId, newStatus)
    \/ \E taskId \in DOMAIN tasks, newPriority \in Priorities :
       UpdateTaskPriority(taskId, newPriority)
    \/ \E taskId \in DOMAIN tasks, newAssignee \in Users :
       ReassignTask(taskId, newAssignee)
    \/ \E taskId \in DOMAIN tasks, t \in Titles, 
         d \in Descriptions, dd \in 0..MaxTime \cup {"NULL"} :
       UpdateTaskDetails(taskId, t, d, dd)
    \/ \E taskId \in DOMAIN tasks : DeleteTask(taskId)
    \/ CheckDependencies
    \/ \E taskIds \in SUBSET DOMAIN tasks, newStatus \in TaskStates :
       taskIds # {} /\ BulkUpdateStatus(taskIds, newStatus)

\* Properties to check
THEOREM TypeCorrectness == Spec => []TypeInvariant
THEOREM SafetyHolds == Spec => []SafetyInvariant
THEOREM LivenessHolds == Spec => (EventualCompletion /\ FairProgress)
THEOREM NoDeadlock == Spec => []<>Next
THEOREM Termination == Spec => <>(\A taskId \in DOMAIN tasks : 
                                    tasks[taskId].status \in {"completed", "cancelled"})
=============================================================================

Step 4: Model Checking and Trace Generation

Now we can run TLA+ model checking to verify our specification (TaskManagement.cfg):

\* Model configuration for TaskManagementImproved module
SPECIFICATION Spec

\* Constants definition
CONSTANTS
    Users = {alice, bob, charlie}
    MaxTasks = 5
    MaxTime = 20
    Titles = {task1, task2, task3, task4, task5}
    Descriptions = {desc1, desc2, desc3}

\* Model values for special constants
CONSTANT
    NULL = NULL
    EMPTY_STRING = EMPTY_STRING

\* Initial state constraint
CONSTRAINT
    /\ nextTaskId <= MaxTasks + 1
    /\ clock <= MaxTime
    /\ Cardinality(DOMAIN tasks) <= MaxTasks

\* State space reduction (optional, for faster checking)
ACTION_CONSTRAINT
    \* Limit number of active sessions
    /\ Cardinality({u \in Users : sessions[u] = TRUE}) <= 2
    \* Prevent creating too many tasks at once
    /\ nextTaskId <= MaxTasks

\* Invariants to check
INVARIANT TypeInvariant
INVARIANT SafetyInvariant
INVARIANT NoOrphanTasks
INVARIANT TaskOwnership
INVARIANT ValidTaskIds
INVARIANT NoDuplicateTaskIds
INVARIANT ValidStateTransitionsInvariant
INVARIANT ConsistentTimestamps
INVARIANT NoCyclicDependencies
INVARIANT AuthenticationRequired

\* Properties to check
PROPERTY EventualCompletion
PROPERTY FairProgress
PROPERTY EventualUnblocking
PROPERTY EventualAuthentication
PROPERTY NoStarvation

\* Check for deadlocks
CHECK_DEADLOCK TRUE

\* View for debugging (optional)
VIEW <<nextTaskId, Cardinality(DOMAIN tasks), clock>>

\* Alias for better state visualization
ALIAS TaskSummary == [
    totalTasks |-> Cardinality(DOMAIN tasks),
    pendingTasks |-> Cardinality({t \in DOMAIN tasks : tasks[t].status = "pending"}),
    inProgressTasks |-> Cardinality({t \in DOMAIN tasks : tasks[t].status = "in_progress"}),
    completedTasks |-> Cardinality({t \in DOMAIN tasks : tasks[t].status = "completed"}),
    blockedTasks |-> Cardinality({t \in DOMAIN tasks : tasks[t].status = "blocked"}),
    activeSessions |-> {u \in Users : sessions[u] = TRUE},
    currentUser |-> currentUser,
    time |-> clock
]
# Run the model checker
java -jar tla2tools.jar -config TaskManagement.cfg TaskManagement.tla

TLA+ will explore all possible execution paths and either:

  1. Verify that all properties hold
  2. Generate counterexamples showing property violations

Step 5: Claude Implementation with TLA+ Context

Now comes the magic. Instead of vague natural language, we give Claude the precise TLA+ specification:

My prompt to Claude:

Using this TLA+ specification for a task management system, implement a Go REST API 
that satisfies all the specified behaviors and properties.

[Include the full TLA+ specification here]

Requirements:
1. Implement all actions defined in the TLA+ spec
2. Ensure all safety properties hold at runtime
3. Use Go's type system to enforce the data structures
4. Include comprehensive tests that cover the execution traces
5. Add proper error handling for violation of preconditions

Focus on behavioral correctness over performance optimization.

Claude generates something like (task.go):

// Package domain contains the core business entities matching the TLA+ specification
package domain

import (
	"fmt"
	"time"
)

// TaskID represents a unique task identifier (maps to TLA+ task ID)
type TaskID int

// UserID represents a user identifier (maps to TLA+ Users set)
type UserID string

// TaskStatus represents the state of a task (maps to TLA+ TaskStates)
type TaskStatus string

const (
	StatusPending    TaskStatus = "pending"
	StatusInProgress TaskStatus = "in_progress"
	StatusCompleted  TaskStatus = "completed"
	StatusCancelled  TaskStatus = "cancelled"
	StatusBlocked    TaskStatus = "blocked"
)

// Priority represents task priority levels (maps to TLA+ Priorities)
type Priority string

const (
	PriorityLow      Priority = "low"
	PriorityMedium   Priority = "medium"
	PriorityHigh     Priority = "high"
	PriorityCritical Priority = "critical"
)

// Tag represents task categories (maps to TLA+ tags subset)
type Tag string

const (
	TagBug           Tag = "bug"
	TagFeature       Tag = "feature"
	TagEnhancement   Tag = "enhancement"
	TagDocumentation Tag = "documentation"
)

// Task represents a task entity (maps to TLA+ task record)
type Task struct {
	ID           TaskID            `json:"id"`
	Title        string            `json:"title"`
	Description  string            `json:"description"`
	Status       TaskStatus        `json:"status"`
	Priority     Priority          `json:"priority"`
	Assignee     UserID            `json:"assignee"`
	CreatedBy    UserID            `json:"created_by"`
	CreatedAt    time.Time         `json:"created_at"`
	UpdatedAt    time.Time         `json:"updated_at"`
	DueDate      *time.Time        `json:"due_date,omitempty"`
	Tags         []Tag             `json:"tags"`
	Dependencies map[TaskID]bool   `json:"dependencies"`
}

// ValidTransition represents a valid state transition (maps to TLA+ ValidTransitions)
type ValidTransition struct {
	From TaskStatus
	To   TaskStatus
}

// ValidTransitions defines all allowed state transitions
var ValidTransitions = map[ValidTransition]bool{
	{StatusPending, StatusInProgress}:    true,
	{StatusPending, StatusCancelled}:     true,
	{StatusPending, StatusBlocked}:       true,
	{StatusInProgress, StatusCompleted}:  true,
	{StatusInProgress, StatusCancelled}:  true,
	{StatusInProgress, StatusBlocked}:    true,
	{StatusInProgress, StatusPending}:    true, // Allow reverting
	{StatusBlocked, StatusPending}:       true,
	{StatusBlocked, StatusInProgress}:    true,
	{StatusBlocked, StatusCancelled}:     true,
}

// IsValidTransition checks if a state transition is valid (maps to TLA+ IsValidTransition)
func IsValidTransition(from, to TaskStatus) bool {
	return ValidTransitions[ValidTransition{From: from, To: to}]
}

// CanDelete checks if a task can be deleted (only completed or cancelled)
func (t *Task) CanDelete() bool {
	return t.Status == StatusCompleted || t.Status == StatusCancelled
}

// IsBlocked checks if task should be blocked based on dependencies
func (t *Task) IsBlocked(allTasks map[TaskID]*Task) bool {
	if len(t.Dependencies) == 0 {
		return false
	}
	
	for depID := range t.Dependencies {
		if dep, exists := allTasks[depID]; exists {
			if dep.Status != StatusCompleted {
				return true
			}
		}
	}
	return false
}

// ShouldUnblock checks if a blocked task can be unblocked
func (t *Task) ShouldUnblock(allTasks map[TaskID]*Task) bool {
	if t.Status != StatusBlocked {
		return false
	}
	
	for depID := range t.Dependencies {
		if dep, exists := allTasks[depID]; exists {
			if dep.Status != StatusCompleted {
				return false
			}
		}
	}
	return true
}

// Validate performs domain validation on the task
func (t *Task) Validate() error {
	if t.Title == "" {
		return fmt.Errorf("task title cannot be empty")
	}
	if t.Description == "" {
		return fmt.Errorf("task description cannot be empty")
	}
	if !isValidStatus(t.Status) {
		return fmt.Errorf("invalid task status: %s", t.Status)
	}
	if !isValidPriority(t.Priority) {
		return fmt.Errorf("invalid task priority: %s", t.Priority)
	}
	if t.Assignee == "" {
		return fmt.Errorf("task must have an assignee")
	}
	if t.CreatedBy == "" {
		return fmt.Errorf("task must have a creator")
	}
	if t.CreatedAt.After(t.UpdatedAt) {
		return fmt.Errorf("created time cannot be after updated time")
	}
	for _, tag := range t.Tags {
		if !isValidTag(tag) {
			return fmt.Errorf("invalid tag: %s", tag)
		}
	}
	return nil
}

func isValidStatus(status TaskStatus) bool {
	switch status {
	case StatusPending, StatusInProgress, StatusCompleted, StatusCancelled, StatusBlocked:
		return true
	default:
		return false
	}
}

func isValidPriority(priority Priority) bool {
	switch priority {
	case PriorityLow, PriorityMedium, PriorityHigh, PriorityCritical:
		return true
	default:
		return false
	}
}

func isValidTag(tag Tag) bool {
	switch tag {
	case TagBug, TagFeature, TagEnhancement, TagDocumentation:
		return true
	default:
		return false
	}
}
// Package usecase implements the TLA+ actions as use cases
package usecase

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
	
	"github.com/bhatti/sample-task-management/internal/domain"
	"github.com/bhatti/sample-task-management/internal/repository"
)

// TaskUseCase implements task-related TLA+ actions
type TaskUseCase struct {
	uow              repository.UnitOfWork
	invariantChecker InvariantChecker
}

// InvariantChecker interface for runtime invariant validation
type InvariantChecker interface {
	CheckAllInvariants(state *domain.SystemState) error
	CheckTaskInvariants(task *domain.Task, state *domain.SystemState) error
	CheckTransitionInvariant(from, to domain.TaskStatus) error
}

// NewTaskUseCase creates a new task use case
func NewTaskUseCase(uow repository.UnitOfWork, checker InvariantChecker) *TaskUseCase {
	return &TaskUseCase{
		uow:              uow,
		invariantChecker: checker,
	}
}

// Authenticate implements TLA+ Authenticate action
func (uc *TaskUseCase) Authenticate(userID domain.UserID) (*domain.Session, error) {
	// Preconditions from TLA+:
	// - user \in Users
	// - ~sessions[user]
	
	user, err := uc.uow.Users().GetUser(userID)
	if err != nil {
		return nil, fmt.Errorf("user not found: %w", err)
	}
	
	// Check if user already has an active session
	existingSession, _ := uc.uow.Sessions().GetSessionByUser(userID)
	if existingSession != nil && existingSession.IsValid() {
		return nil, fmt.Errorf("user %s already has an active session", userID)
	}
	
	// Create new session
	token := generateToken()
	session := &domain.Session{
		UserID:    user.ID,
		Token:     token,
		Active:    true,
		CreatedAt: time.Now(),
		ExpiresAt: time.Now().Add(24 * time.Hour),
	}
	
	// Update state
	if err := uc.uow.Sessions().CreateSession(session); err != nil {
		return nil, fmt.Errorf("failed to create session: %w", err)
	}
	
	if err := uc.uow.SystemState().SetCurrentUser(&userID); err != nil {
		return nil, fmt.Errorf("failed to set current user: %w", err)
	}
	
	// Check invariants
	state, _ := uc.uow.SystemState().GetSystemState()
	if err := uc.invariantChecker.CheckAllInvariants(state); err != nil {
		uc.uow.Rollback()
		return nil, fmt.Errorf("invariant violation: %w", err)
	}
	
	return session, nil
}


// CreateTask implements TLA+ CreateTask action
func (uc *TaskUseCase) CreateTask(
	title, description string,
	priority domain.Priority,
	assignee domain.UserID,
	dueDate *time.Time,
	tags []domain.Tag,
	dependencies []domain.TaskID,
) (*domain.Task, error) {
	// Preconditions from TLA+:
	// - currentUser # NULL
	// - currentUser \in Users
	// - nextTaskId <= MaxTasks
	// - deps \subseteq DOMAIN tasks
	// - \A dep \in deps : tasks[dep].status # "cancelled"
	
	currentUser, err := uc.uow.SystemState().GetCurrentUser()
	if err != nil || currentUser == nil {
		return nil, fmt.Errorf("authentication required")
	}
	
	// Check max tasks limit
	nextID, err := uc.uow.SystemState().GetNextTaskID()
	if err != nil {
		return nil, fmt.Errorf("failed to get next task ID: %w", err)
	}
	
	if nextID > domain.MaxTasks {
		return nil, fmt.Errorf("maximum number of tasks (%d) reached", domain.MaxTasks)
	}
	
	// Validate dependencies
	allTasks, err := uc.uow.Tasks().GetAllTasks()
	if err != nil {
		return nil, fmt.Errorf("failed to get tasks: %w", err)
	}
	
	depMap := make(map[domain.TaskID]bool)
	for _, depID := range dependencies {
		depTask, exists := allTasks[depID]
		if !exists {
			return nil, fmt.Errorf("dependency task %d does not exist", depID)
		}
		if depTask.Status == domain.StatusCancelled {
			return nil, fmt.Errorf("cannot depend on cancelled task %d", depID)
		}
		depMap[depID] = true
	}
	
	// Check for cyclic dependencies
	if err := uc.checkCyclicDependencies(nextID, depMap, allTasks); err != nil {
		return nil, err
	}
	
	// Determine initial status based on dependencies
	status := domain.StatusPending
	if len(dependencies) > 0 {
		// Check if all dependencies are completed
		allCompleted := true
		for depID := range depMap {
			if allTasks[depID].Status != domain.StatusCompleted {
				allCompleted = false
				break
			}
		}
		if !allCompleted {
			status = domain.StatusBlocked
		}
	}
	
	// Create task
	task := &domain.Task{
		ID:           nextID,
		Title:        title,
		Description:  description,
		Status:       status,
		Priority:     priority,
		Assignee:     assignee,
		CreatedBy:    *currentUser,
		CreatedAt:    time.Now(),
		UpdatedAt:    time.Now(),
		DueDate:      dueDate,
		Tags:         tags,
		Dependencies: depMap,
	}
	
	// Validate task
	if err := task.Validate(); err != nil {
		return nil, fmt.Errorf("task validation failed: %w", err)
	}
	
	// Save task
	if err := uc.uow.Tasks().CreateTask(task); err != nil {
		return nil, fmt.Errorf("failed to create task: %w", err)
	}
	
	// Increment next task ID
	if _, err := uc.uow.SystemState().IncrementNextTaskID(); err != nil {
		return nil, fmt.Errorf("failed to increment task ID: %w", err)
	}
	
	// Check invariants
	state, _ := uc.uow.SystemState().GetSystemState()
	if err := uc.invariantChecker.CheckAllInvariants(state); err != nil {
		uc.uow.Rollback()
		return nil, fmt.Errorf("invariant violation after task creation: %w", err)
	}
	
	return task, nil
}

// UpdateTaskStatus implements TLA+ UpdateTaskStatus action
func (uc *TaskUseCase) UpdateTaskStatus(taskID domain.TaskID, newStatus domain.TaskStatus) error {
	// Preconditions from TLA+:
	// - currentUser # NULL
	// - TaskExists(taskId)
	// - taskId \in GetUserTasks(currentUser)
	// - IsValidTransition(tasks[taskId].status, newStatus)
	// - newStatus = "in_progress" => all dependencies completed
	
	currentUser, err := uc.uow.SystemState().GetCurrentUser()
	if err != nil || currentUser == nil {
		return fmt.Errorf("authentication required")
	}
	
	task, err := uc.uow.Tasks().GetTask(taskID)
	if err != nil {
		return fmt.Errorf("task not found: %w", err)
	}
	
	// Check user owns the task
	userTasks, err := uc.uow.SystemState().GetUserTasks(*currentUser)
	if err != nil {
		return fmt.Errorf("failed to get user tasks: %w", err)
	}
	
	hasTask := false
	for _, id := range userTasks {
		if id == taskID {
			hasTask = true
			break
		}
	}
	
	if !hasTask {
		return fmt.Errorf("user does not have access to task %d", taskID)
	}
	
	// Check valid transition
	if !domain.IsValidTransition(task.Status, newStatus) {
		return fmt.Errorf("invalid transition from %s to %s", task.Status, newStatus)
	}
	
	// Check dependencies if moving to in_progress
	if newStatus == domain.StatusInProgress {
		allTasks, _ := uc.uow.Tasks().GetAllTasks()
		for depID := range task.Dependencies {
			if depTask, exists := allTasks[depID]; exists {
				if depTask.Status != domain.StatusCompleted {
					return fmt.Errorf("cannot start task: dependency %d is not completed", depID)
				}
			}
		}
	}
	
	// Update status
	task.Status = newStatus
	task.UpdatedAt = time.Now()
	
	if err := uc.uow.Tasks().UpdateTask(task); err != nil {
		return fmt.Errorf("failed to update task: %w", err)
	}
	
	// Check invariants
	state, _ := uc.uow.SystemState().GetSystemState()
	if err := uc.invariantChecker.CheckAllInvariants(state); err != nil {
		uc.uow.Rollback()
		return fmt.Errorf("invariant violation: %w", err)
	}
	
	return nil
}
...

Step 6: TLA+ Generated Tests

The real power comes when we use TLA+ execution traces to generate comprehensive tests:

My prompt to Claude:

Generate Go tests that verify the implementation satisfies the TLA+ specification.
Create test cases that:

1. Test all TLA+ actions with valid preconditions
2. Test safety property violations
3. Test edge cases from the TLA+ model boundary conditions
4. Use property-based testing where appropriate

Include tests that would catch the execution traces TLA+ model checker explores.

Claude generates (invariant_test.go):

// Package property implements property-based tests for TLA+ invariants
package property

import (
	"math/rand"
	"testing"
	"time"

	"github.com/bhatti/sample-task-management/internal/domain"
	"github.com/bhatti/sample-task-management/internal/infrastructure/memory"
	"github.com/bhatti/sample-task-management/internal/usecase"
	"github.com/bhatti/sample-task-management/pkg/invariants"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// TestInvariantsHoldAfterOperations verifies invariants hold after each operation
func TestInvariantsHoldAfterOperations(t *testing.T) {
	repo := memory.NewMemoryRepository()
	uow := memory.NewMemoryUnitOfWork(repo)
	checker := invariants.NewInvariantChecker()
	uc := usecase.NewTaskUseCase(uow, checker)

	// Setup initial users
	users := []domain.UserID{"alice", "bob", "charlie"}
	for _, userID := range users {
		user := &domain.User{
			ID:       userID,
			Name:     string(userID),
			Email:    string(userID) + "@example.com",
			JoinedAt: time.Now(),
		}
		require.NoError(t, repo.CreateUser(user))
	}

	// Property: Invariants hold after authentication
	t.Run("InvariantsAfterAuthentication", func(t *testing.T) {
		for _, userID := range users {
			session, err := uc.Authenticate(userID)
			assert.NoError(t, err)
			assert.NotNil(t, session)

			state, _ := repo.GetSystemState()
			assert.NoError(t, checker.CheckAllInvariants(state))

			// Cleanup
			_ = uc.Logout(userID)
		}
	})

	// Property: Invariants hold after task creation
	t.Run("InvariantsAfterTaskCreation", func(t *testing.T) {
		uc.Authenticate("alice")

		for i := 0; i < 10; i++ {
			task, err := uc.CreateTask(
				"Task "+string(rune(i)),
				"Description",
				randomPriority(),
				randomUser(users),
				randomDueDate(),
				randomTags(),
				[]domain.TaskID{}, // No dependencies initially
			)

			assert.NoError(t, err)
			assert.NotNil(t, task)

			state, _ := repo.GetSystemState()
			assert.NoError(t, checker.CheckAllInvariants(state))
		}
	})

	// Property: Invariants hold after status transitions
	t.Run("InvariantsAfterStatusTransitions", func(t *testing.T) {
		uc.Authenticate("alice")

		// Create a task
		task, _ := uc.CreateTask(
			"Test Task",
			"Description",
			domain.PriorityMedium,
			"alice",
			nil,
			[]domain.Tag{domain.TagFeature},
			[]domain.TaskID{},
		)

		// Valid transitions
		validTransitions := []domain.TaskStatus{
			domain.StatusInProgress,
			domain.StatusCompleted,
		}

		for _, status := range validTransitions {
			err := uc.UpdateTaskStatus(task.ID, status)
			if err == nil {
				state, _ := repo.GetSystemState()
				assert.NoError(t, checker.CheckAllInvariants(state))
			}
		}
	})

	// Property: No cyclic dependencies can be created
	t.Run("NoCyclicDependencies", func(t *testing.T) {
		uc.Authenticate("alice")

		// Create tasks with potential cycles
		task1, _ := uc.CreateTask("Task1", "Desc", domain.PriorityLow, "alice", nil, nil, []domain.TaskID{})
		task2, _ := uc.CreateTask("Task2", "Desc", domain.PriorityLow, "alice", nil, nil, []domain.TaskID{task1.ID})
		task3, _ := uc.CreateTask("Task3", "Desc", domain.PriorityLow, "alice", nil, nil, []domain.TaskID{task2.ID})

		// Attempting to create a cycle should fail
		_, err := uc.CreateTask("Task4", "Desc", domain.PriorityLow, "alice", nil, nil,
			[]domain.TaskID{task3.ID, task1.ID}) // This would create a cycle
		assert.NoError(t, err)

		// Even if it doesn't fail explicitly, invariants should catch it
		state, _ := repo.GetSystemState()
		assert.NoError(t, checker.CheckAllInvariants(state))
	})
}

// TestTransitionInvariants tests state transition validity
func TestTransitionInvariants(t *testing.T) {
	checker := invariants.NewInvariantChecker()

	// Test all valid transitions
	validTransitions := []struct {
		from domain.TaskStatus
		to   domain.TaskStatus
	}{
		{domain.StatusPending, domain.StatusInProgress},
		{domain.StatusPending, domain.StatusCancelled},
		{domain.StatusInProgress, domain.StatusCompleted},
		{domain.StatusInProgress, domain.StatusCancelled},
		{domain.StatusBlocked, domain.StatusPending},
		{domain.StatusBlocked, domain.StatusCancelled},
	}

	for _, trans := range validTransitions {
		t.Run(string(trans.from)+"_to_"+string(trans.to), func(t *testing.T) {
			err := checker.CheckTransitionInvariant(trans.from, trans.to)
			assert.NoError(t, err)
		})
	}

	// Test invalid transitions
	invalidTransitions := []struct {
		from domain.TaskStatus
		to   domain.TaskStatus
	}{
		{domain.StatusCompleted, domain.StatusPending},
		{domain.StatusCompleted, domain.StatusInProgress},
		{domain.StatusCancelled, domain.StatusInProgress},
		{domain.StatusPending, domain.StatusCompleted}, // Must go through in_progress
	}

	for _, trans := range invalidTransitions {
		t.Run("Invalid_"+string(trans.from)+"_to_"+string(trans.to), func(t *testing.T) {
			err := checker.CheckTransitionInvariant(trans.from, trans.to)
			assert.Error(t, err)
		})
	}
}

// TestPropertyTaskOwnership verifies task ownership invariants
func TestPropertyTaskOwnership(t *testing.T) {
	repo := memory.NewMemoryRepository()
	uow := memory.NewMemoryUnitOfWork(repo)
	checker := invariants.NewInvariantChecker()
	uc := usecase.NewTaskUseCase(uow, checker)

	// Setup users
	users := []domain.UserID{"alice", "bob"}
	for _, userID := range users {
		user := &domain.User{
			ID:       userID,
			Name:     string(userID),
			Email:    string(userID) + "@example.com",
			JoinedAt: time.Now(),
		}
		repo.CreateUser(user)
	}

	// Property: Task reassignment maintains ownership invariants
	t.Run("ReassignmentMaintainsOwnership", func(t *testing.T) {
		uc.Authenticate("alice")

		// Create task assigned to Alice
		task, err := uc.CreateTask(
			"Test Task",
			"Description",
			domain.PriorityHigh,
			"alice",
			nil,
			[]domain.Tag{domain.TagBug},
			[]domain.TaskID{},
		)
		require.NoError(t, err)

		// Check initial ownership
		state, _ := repo.GetSystemState()
		assert.NoError(t, checker.CheckAllInvariants(state))

		aliceTasks := state.GetUserTasks("alice")
		assert.Contains(t, aliceTasks, task.ID)

		// Reassign to Bob
		err = uc.ReassignTask(task.ID, "bob")
		require.NoError(t, err)

		// Check ownership after reassignment
		state, _ = repo.GetSystemState()
		assert.NoError(t, checker.CheckAllInvariants(state))

		aliceTasks = state.GetUserTasks("alice")
		bobTasks := state.GetUserTasks("bob")
		assert.NotContains(t, aliceTasks, task.ID)
		assert.Contains(t, bobTasks, task.ID)
	})
}

// TestPropertyConcurrentOperations tests invariants under concurrent operations
func TestPropertyConcurrentOperations(t *testing.T) {
	repo := memory.NewMemoryRepository()
	uow := memory.NewMemoryUnitOfWork(repo)
	checker := invariants.NewInvariantChecker()

	// Setup users
	users := []domain.UserID{"user1", "user2", "user3"}
	for _, userID := range users {
		user := &domain.User{
			ID:       userID,
			Name:     string(userID),
			Email:    string(userID) + "@example.com",
			JoinedAt: time.Now(),
		}
		repo.CreateUser(user)
	}

	// Run concurrent operations
	done := make(chan bool, len(users))

	for _, userID := range users {
		go func(uid domain.UserID) {
			uc := usecase.NewTaskUseCase(uow, checker)

			// Authenticate
			uc.Authenticate(uid)

			// Create multiple tasks
			for i := 0; i < 5; i++ {
				uc.CreateTask(
					"Task",
					"Description",
					randomPriority(),
					uid,
					nil,
					randomTags(),
					[]domain.TaskID{},
				)

				// Random delay
				time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond)
			}

			done <- true
		}(userID)
	}

	// Wait for all goroutines
	for i := 0; i < len(users); i++ {
		<-done
	}

	// Check invariants after concurrent operations
	state, _ := repo.GetSystemState()
	assert.NoError(t, checker.CheckAllInvariants(state))
}

// Helper functions

func randomPriority() domain.Priority {
	priorities := []domain.Priority{
		domain.PriorityLow,
		domain.PriorityMedium,
		domain.PriorityHigh,
		domain.PriorityCritical,
	}
	return priorities[rand.Intn(len(priorities))]
}

func randomUser(users []domain.UserID) domain.UserID {
	return users[rand.Intn(len(users))]
}

func randomDueDate() *time.Time {
	if rand.Float32() < 0.5 {
		return nil
	}
	due := time.Now().Add(time.Duration(rand.Intn(30)) * 24 * time.Hour)
	return &due
}

func randomTags() []domain.Tag {
	allTags := []domain.Tag{
		domain.TagBug,
		domain.TagFeature,
		domain.TagEnhancement,
		domain.TagDocumentation,
	}

	numTags := rand.Intn(len(allTags) + 1)
	if numTags == 0 {
		return nil
	}

	tags := make([]domain.Tag, 0, numTags)
	used := make(map[domain.Tag]bool)

	for len(tags) < numTags {
		tag := allTags[rand.Intn(len(allTags))]
		if !used[tag] {
			tags = append(tags, tag)
			used[tag] = true
		}
	}

	return tags
}

Advanced TLA+ Patterns with Claude

Modeling Concurrent Operations

One of TLA+’s strengths is modeling concurrent systems. Let’s extend our specification to handle concurrent task updates:

\* Concurrent task updates with conflict resolution
ConcurrentUpdateTask(taskId, newStatus, version) ==
    /\ currentUser # NULL
    /\ taskId \in DOMAIN tasks
    /\ taskId \in userTasks[currentUser]
    /\ tasks[taskId].version = version  \* Optimistic concurrency control
    /\ tasks' = [tasks EXCEPT ![taskId] = [
                     @ EXCEPT 
                     !.status = newStatus,
                     !.version = @ + 1,
                     !.lastModified = currentUser
                 ]]
    /\ UNCHANGED <<userTasks, nextTaskId, currentUser>>

Prompt to Claude:

Implement optimistic concurrency control for the task updates based on this 
TLA+ specification. Include version tracking and conflict detection.

Modeling Complex Business Rules

TLA+ excels at capturing complex business logic:

\* Business rule: High priority tasks cannot be cancelled directly
ValidStatusTransition(currentStatus, newStatus, priority) ==
    \/ newStatus = currentStatus
    \/ /\ currentStatus = "pending" 
       /\ newStatus \in {"in_progress", "cancelled"}
    \/ /\ currentStatus = "in_progress"
       /\ newStatus \in {"completed", "pending"}
    \/ /\ currentStatus = "in_progress"
       /\ newStatus = "cancelled"
       /\ priority # "high"  \* High priority tasks cannot be cancelled

Lessons Learned

After applying this TLA+ approach to several experimental projects, here are the key insights:

1. Start Small

Begin with core actions and properties. TLA+ specifications can grow complex quickly, so start with the essential behaviors:

\* Start with basic CRUD
Init, CreateTask, UpdateTask, DeleteTask

\* Add complexity incrementally  
Authentication, Authorization, Concurrency, Business Rules

Avoid Initially: Complex distributed systems, performance-critical algorithms

Graduate To: Multi-service interactions, complex business logic

2. Properties Drive Design

Writing TLA+ properties often reveals design flaws before implementation:

\* This property might fail, revealing a design issue
ConsistencyProperty == 
    \A user \in Users:
        \A taskId \in userTasks[user]:
            /\ taskId \in DOMAIN tasks
            /\ tasks[taskId].assignee = user
            /\ tasks[taskId].status # "deleted"  \* Soft delete consideration

3. Model Checking Finds Edge Cases

TLA+ model checking explores execution paths you’d never think to test:

# TLA+ finds this counterexample:
# Step 1: User1 creates Task1
# Step 2: User1 deletes Task1  
# Step 3: User2 creates Task2 (gets same ID due to reuse)
# Step 4: User1 tries to update Task1 -> Security violation!

This led to using UUIDs instead of incrementing integers for task IDs.

4. Generated Tests Are Comprehensive

TLA+ execution traces become your regression test suite. When Claude implements based on TLA+ specs, you get:

  • Complete coverage – All specification paths tested
  • Edge case detection – Boundary conditions from model checking
  • Behavioral contracts – Tests verify actual system properties

Documentation Generation

Prompt to Claude:

Generate API documentation from this TLA+ specification that includes:
1. Endpoint descriptions derived from TLA+ actions
2. Request/response schemas from TLA+ data structures  
3. Error conditions from TLA+ preconditions
4. Behavioral guarantees from TLA+ properties

Code Review Guidelines

With TLA+ specifications, code reviews become more focused:

  1. Does implementation satisfy the TLA+ spec?
  2. Are all preconditions checked?
  3. Do safety properties hold?
  4. Are error conditions handled as specified?

Comparing Specification Approaches

ApproachPrecisionAI EffectivenessMaintenanceLearning CurveTool ComplexityCode Generation
Vibe CodingLowInconsistentHighLowLowN/A
UML/MDDMediumPoorVery HighHighVery HighBrittle
BDD/GherkinMediumBetterMediumMediumLowLimited
TLA+ SpecsHighExcellentLowHighLowReliable

Tools and Resources

Essential TLA+ Resources

  • Learn TLA+: https://learntla.com – Interactive tutorial
  • TLA+ Video Course: Leslie Lamport’s official course
  • Practical TLA+: Hillel Wayne’s book – focus on software systems
  • TLA+ Examples: https://github.com/tlaplus/Examples

Common Mistakes

1. Avoid These Mistakes

? Writing TLA+ like code

\* Wrong - this looks like pseudocode
CreateTask == 
    if currentUser != null then
        task = new Task()

? Writing TLA+ as mathematical relations

\* Right - mathematical specification  
CreateTask == 
    /\ currentUser # NULL
    /\ tasks' = tasks @@ (nextTaskId :> newTask)

? Asking Claude to “fix the TLA+ to match the code”

The spec is the truth – fix the code to match the spec

? Asking Claude to “implement this TLA+ specification correctly”

? Specification scope creep: Starting with entire system architecture ? Incremental approach: Begin with one core workflow, expand gradually

2. Claude Integration Pitfalls

? “Fix the spec to match my code”: Treating specifications as documentation ? “Fix the code to match the spec”: Specifications are the source of truth

3. The Context Overload Trap

Problem: Dumping too much information at once
Solution: Break complex features into smaller, focused requests

4. The “Fix My Test” Antipattern

Problem: When tests fail, asking Claude to modify the test instead of the code
Solution: Always fix the implementation, not the test (unless the test is genuinely wrong)

5. The Blind Trust Mistake

Problem: Accepting generated code without understanding it
Solution: Always review and understand the code before committing

Proven Patterns

1. Save effective prompts:

# ~/.claude/tla-prompts/implementation.md
Implement [language] code that satisfies this TLA+ specification:

[SPEC]

Requirements:
- All TLA+ actions become functions/methods
- All preconditions become runtime checks  
- All data structures match TLA+ types
- Include comprehensive tests covering specification traces

Create specification templates:

--------------------------- MODULE [ModuleName] ---------------------------
EXTENDS Integers, Sequences, FiniteSets

CONSTANTS [Constants]

VARIABLES [StateVariables]

[TypeDefinitions]

Init == [InitialConditions]

[Actions]

Next == [ActionDisjunction]

Spec == Init /\ [][Next]_[StateVariables]

[SafetyProperties]

[LivenessProperties]

=============================================================================

2. The “Explain First” Pattern

Before asking Claude to implement something complex, I ask for an explanation:

Explain how you would implement real-time task updates using WebSockets. 
What are the trade-offs between Socket.io and native WebSockets?
What state management challenges should I consider?

3. The “Progressive Enhancement” Pattern

Start simple, then add complexity:

1. First: "Create a basic task model with CRUD operations"
2. Then: "Add validation and error handling"
3. Then: "Add authentication and authorization"
4. Finally: "Add real-time updates and notifications"

4. The “Code Review” Pattern

After implementation, I ask Claude to review its own code:

Review the task API implementation for:
- Security vulnerabilities
- Performance issues
- Code style consistency
- Missing error cases

Be critical and suggest improvements.

What’s Next

As I’ve developed this TLA+/Claude workflow, I’ve realized we’re approaching something profound: specifications as the primary artifact. Instead of writing code and hoping it’s correct, we’re defining correct behavior formally and letting AI generate the implementation. This inverts the traditional relationship between specification and code.

Implications for Software Engineering

  1. Design-first development becomes natural
  2. Bug prevention replaces bug fixing
  3. Refactoring becomes re-implementation from stable specs
  4. Documentation is always up-to-date (it’s the spec)

I’m currently experimenting with:

  • TLA+ to test case generation – Automated comprehensive testing
  • Multi-language implementations – Same spec, different languages
  • Specification composition – Building larger systems from verified components
  • Quint specifications – A modern executable specification language with simpler syntax than TLA+

Conclusion: The End of Vibe Coding

After using TLA+ with Claude, I can’t go back to vibe coding. The precision, reliability, and confidence that comes from executable specifications has transformed how I build software. The complete working example—TLA+ specs, Go implementation, comprehensive tests, and CI/CD pipeline—is available at github.com/bhatti/sample-task-management.

Yes, there’s a learning curve. Yes, writing TLA+ specifications takes time upfront. But the payoff—in terms of correctness, maintainability, and development speed—is extraordinary. Claude becomes not just a code generator, but a reliable engineering partner that can reason about complex systems precisely because we’ve given it precise specifications to work from. We’re moving from “code and hope” to “specify and know”—and that changes everything.


August 16, 2025

The Complete Guide to gRPC Load Balancing in Kubernetes and Istio

Filed under: Computing,Web Services — Tags: , — admin @ 12:05 pm

TL;DR – The Test Results Matrix

ConfigurationLoad BalancingWhy
Local gRPC? NoneSingle server instance
Kubernetes + gRPC? NoneConnection-level LB only
Kubernetes + Istio? PerfectL7 proxy with request-level LB
Client-side LB?? LimitedRequires multiple endpoints
kubectl port-forward + Istio? NoneBypasses service mesh

Complete test suite ?


Introduction: The gRPC Load Balancing Problem

When you deploy a gRPC service in Kubernetes with multiple replicas, you expect load balancing. You won’t get it. This guide tests every possible configuration to prove why, and shows exactly how to fix it. According to the official gRPC documentation:

“gRPC uses HTTP/2, which multiplexes multiple calls on a single TCP connection. This means that once the connection is established, all gRPC calls will go to the same backend.”


Complete Test Matrix

We’ll test 6 different configurations:

  1. Baseline: Local Testing (Single server)
  2. Kubernetes without Istio (Standard deployment)
  3. Kubernetes with Istio (Service mesh solution)
  4. Client-side Load Balancing (gRPC built-in)
  5. Advanced Connection Testing (Multiple connections)
  6. Real-time Monitoring (Live traffic analysis)

Prerequisites

git clone https://github.com/bhatti/grpc-lb-test
cd grpc-lb-test

# Build all components
make build

Test 1: Baseline – Local Testing

Purpose: Establish baseline behavior with a single server.

# Terminal 1: Start local server
./bin/server

# Terminal 2: Test with basic client
./bin/client -target localhost:50051 -requests 50

Expected Result:

? Load Distribution Results:
Server: unknown-1755316152
Pod: unknown (IP: unknown)
Requests: 50 (100.0%)
????????????????????
? Total servers hit: 1
?? WARNING: All requests went to a single server!
This indicates NO load balancing is happening.

Analysis: This confirms our client implementation works correctly and establishes the baseline.


Test 2: Kubernetes Without Istio

Purpose: Prove that standard Kubernetes doesn’t provide gRPC request-level load balancing.

Deploy the Service

# Deploy 5 replicas without Istio
./scripts/test-without-istio.sh

The k8s/without-istio/deployment.yaml creates:

  • 5 gRPC server replicas
  • Standard Kubernetes Service
  • No Istio annotations

Test Results

???? Load Distribution Results:
================================
Server: grpc-echo-server-5b657689db-gh5z5-1755316388
  Pod: grpc-echo-server-5b657689db-gh5z5 (IP: 10.1.4.148)
  Requests: 30 (100.0%)
  ????????????????????

???? Total servers hit: 1
??  WARNING: All requests went to a single server!
   This indicates NO load balancing is happening.

???? Connection Analysis:
Without Istio, gRPC maintains a single TCP connection to the Kubernetes Service IP.
The kube-proxy performs L4 load balancing, but gRPC reuses the same connection.

???? Cleaning up...
deployment.apps "grpc-echo-server" deleted
service "grpc-echo-service" deleted
./scripts/test-without-istio.sh: line 57: 17836 Terminated: 15   
kubectl port-forward service/grpc-echo-service 50051:50051 > /dev/null 2>&1

??  RESULT: No load balancing observed - all requests went to single pod!

Why This Happens

The Kubernetes Service documentation explains:

“For each Service, kube-proxy installs iptables rules which capture traffic to the Service’s clusterIP and port, and redirect that traffic to one of the Service’s backend endpoints.”

Kubernetes Services perform L4 (connection-level) load balancing, but gRPC maintains persistent connections.

Connection Analysis

Run the analysis tool to see connection behavior:

./bin/analysis -target localhost:50051 -requests 100 -test-scenarios true

Result:

? NO LOAD BALANCING: All requests to single server

???? Connection Reuse Analysis:
  Average requests per connection: 1.00
  ??  Low connection reuse (many short connections)

? Connection analysis complete!

Test 3: Kubernetes With Istio

Purpose: Demonstrate how Istio’s L7 proxy solves the load balancing problem.

Install Istio

./scripts/install-istio.sh

This follows Istio’s official installation guide:

istioctl install --set profile=demo -y
kubectl label namespace default istio-injection=enabled

Deploy With Istio

./scripts/test-with-istio.sh

The k8s/with-istio/deployment.yaml includes:

annotations:
  sidecar.istio.io/inject: "true"
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-echo-service
spec:
  host: grpc-echo-service
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 100
        maxRequestsPerConnection: 10
    loadBalancer:
      simple: ROUND_ROBIN

Critical Testing Gotcha

? Wrong way (what most people do):

kubectl port-forward service/grpc-echo-service 50051:50051
./bin/client -target localhost:50051 -requests 50
# Result: Still no load balancing!

According to Istio’s architecture docs, kubectl port-forward bypasses the Envoy sidecar proxy.

? Correct Testing Method

Test from inside the service mesh:

./scripts/test-with-istio.sh

Test Results With Istio

???? Load Distribution Results:
================================

Server: grpc-echo-server-579dfbc76b-m2v7x-1755357769
  Pod: grpc-echo-server-579dfbc76b-m2v7x (IP: 10.1.4.237)
  Requests: 10 (20.0%)
  ????????

Server: grpc-echo-server-579dfbc76b-fkgkk-1755357769
  Pod: grpc-echo-server-579dfbc76b-fkgkk (IP: 10.1.4.240)
  Requests: 10 (20.0%)
  ????????

Server: grpc-echo-server-579dfbc76b-bsjdv-1755357769
  Pod: grpc-echo-server-579dfbc76b-bsjdv (IP: 10.1.4.241)
  Requests: 10 (20.0%)
  ????????

Server: grpc-echo-server-579dfbc76b-dw2m7-1755357770
  Pod: grpc-echo-server-579dfbc76b-dw2m7 (IP: 10.1.4.236)
  Requests: 10 (20.0%)
  ????????

Server: grpc-echo-server-579dfbc76b-x85jm-1755357769
  Pod: grpc-echo-server-579dfbc76b-x85jm (IP: 10.1.4.238)
  Requests: 10 (20.0%)
  ????????

???? Total unique servers: 5

? Load balancing detected across 5 servers!
   Expected requests per server: 10.0
   Distribution variance: 0.00

How Istio Solves This

From Istio’s traffic management documentation:

“Envoy proxies are deployed as sidecars to services, logically augmenting the services with traffic management capabilities… Envoy proxies are the only Istio components that interact with data plane traffic.”

Istio’s solution:

  1. Envoy sidecar intercepts all traffic
  2. Performs L7 (application-level) load balancing
  3. Maintains connection pools to all backends
  4. Routes each request independently

Test 4: Client-Side Load Balancing

Purpose: Test gRPC’s built-in client-side load balancing capabilities.

Standard Client-Side LB

./scripts/test-client-lb.sh

The cmd/client-lb/main.go implements gRPC’s native load balancing:

conn, err := grpc.Dial(
    "dns:///"+target,
    grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
    grpc.WithTransportCredentials(insecure.NewCredentials()),
)

Results and Limitations

 Load Distribution Results:
================================
Server: grpc-echo-server-5b657689db-g9pbw-1755359830
  Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
  Requests: 10 (100.0%)
  ????????????????????

???? Total servers hit: 1
??  WARNING: All requests went to a single server!
   This indicates NO load balancing is happening.
? Normal client works - service is accessible

???? Test 2: Client-side round-robin (from inside cluster)
?????????????????????????????????????????????????????
Creating test pod inside cluster for proper DNS resolution...
pod "client-lb-test" deleted
./scripts/test-client-lb.sh: line 71: 48208 Terminated: 15          kubectl port-forward service/grpc-echo-service 50051:50051 > /dev/null 2>&1

??  Client-side LB limitation explanation:
   gRPC client-side round-robin expects multiple A records
   But Kubernetes Services return only one ClusterIP
   Result: 'no children to pick from' error

???? What happens with client-side LB:
   1. Client asks DNS for: grpc-echo-service
   2. DNS returns: 10.105.177.23 (single IP)
   3. gRPC round-robin needs: multiple IPs for load balancing
   4. Result: Error 'no children to pick from'

? This proves client-side LB doesn't work with K8s Services!

???? Test 3: Demonstrating the DNS limitation
?????????????????????????????????????????????
What gRPC client-side LB sees:
   Service name: grpc-echo-service:50051
   DNS resolution: 10.105.177.23:50051
   Available endpoints: 1 (needs multiple for round-robin)

What gRPC client-side LB needs:
   Multiple A records from DNS, like:
   grpc-echo-service ? 10.1.4.241:50051
   grpc-echo-service ? 10.1.4.240:50051
   grpc-echo-service ? 10.1.4.238:50051
   (But Kubernetes Services don't provide this)

???? Test 4: Alternative - Multiple connections
????????????????????????????????????????????
Testing alternative approach with multiple connections...

???? Configuration:
   Target: localhost:50052
   API: grpc.Dial
   Load Balancing: round-robin
   Multi-endpoint: true
   Requests: 20

???? Using multi-endpoint resolver

???? Sending 20 unary requests...

? Request 1 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 2 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 3 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 4 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 5 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 6 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 7 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 8 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 9 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 10 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
? Request 11 -> Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)

? Successful requests: 20/20

???? Load Distribution Results:
================================

Server: grpc-echo-server-5b657689db-g9pbw-1755359830
  Pod: grpc-echo-server-5b657689db-g9pbw (IP: 10.1.4.242)
  Requests: 20 (100.0%)
  ????????????????????????????????????????

???? Total unique servers: 1

??  WARNING: All requests went to a single server!
   This indicates NO load balancing is happening.
   This is expected for gRPC without Istio or special configuration.
? Multi-connection approach works!
   (This simulates multiple endpoints for testing)

???????????????????????????????????????????????????????????????
                         SUMMARY
???????????????????????????????????????????????????????????????

? KEY FINDINGS:
   • Standard gRPC client: Works (uses single connection)
   • Client-side round-robin: Fails (needs multiple IPs)
   • Kubernetes DNS: Returns single ClusterIP only
   • Alternative: Multiple connections can work

???? CONCLUSION:
   Client-side load balancing doesn't work with standard
   Kubernetes Services because they provide only one IP address.
   This proves why Istio (L7 proxy) is needed for gRPC load balancing!

Why this fails: Kubernetes Services provide a single ClusterIP, not multiple IPs for DNS resolution.

From the gRPC load balancing documentation:

“The gRPC client will use the list of IP addresses returned by the name resolver and distribute RPCs among them.”

Alternative: Multiple Connections

Start five instances of servers with different ports:

# Terminal 1
GRPC_PORT=50051 ./bin/server

# Terminal 2  
GRPC_PORT=50052 ./bin/server

# Terminal 3
GRPC_PORT=50053 ./bin/server

# Terminal 4
GRPC_PORT=50054 ./bin/server

# Terminal 5
GRPC_PORT=50055 ./bin/server

The cmd/client-v2/main.go implements manual connection management:

./bin/client-v2 -target localhost:50051 -requests 50 -multi-endpoint

Results:

???? Load Distribution Results:
================================

Server: unknown-1755360953
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ????????

Server: unknown-1755360963
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ????????

Server: unknown-1755360970
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ????????

Server: unknown-1755360980
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ????????

Server: unknown-1755360945
  Pod: unknown (IP: unknown)
  Requests: 10 (20.0%)
  ????????

???? Total unique servers: 5

? Load balancing detected across 5 servers!
   Expected requests per server: 10.0
   Distribution variance: 0.00

Test 5: Advanced Connection Testing

Purpose: Analyze connection patterns and performance implications.

Multiple Connection Strategy

./bin/advanced-client \
  -target localhost:50051 \
  -requests 1000 \
  -clients 10 \
  -connections 5

Results:

???? Detailed Load Distribution Results:
=====================================
Test Duration: 48.303709ms
Total Requests: 1000
Failed Requests: 0
Requests/sec: 20702.34

Server Distribution:

Server: unknown-1755360945
  Pod: unknown (IP: unknown)
  Requests: 1000 (100.0%)
  First seen: 09:18:51.842
  Last seen: 09:18:51.874
  ????????????????????????????????????????

???? Analysis:
Total unique servers: 1
Average requests per server: 1000.00
Standard deviation: 0.00

??  WARNING: All requests went to a single server!
   This indicates NO load balancing is happening.
   This is expected behavior for gRPC without Istio.

Even sophisticated connection pooling can’t overcome the fundamental issue:
• Multiple connections to SAME endpoint = same server
• Advanced client techniques ? load balancing
• Connection management ? request distribution

Performance Comparison

./scripts/benchmark.sh

???? Key Insights:
• Single server: High performance, no load balancing
• Multiple connections: Same performance, still no LB
• Kubernetes: Small overhead, still no LB
• Istio: Small additional overhead, but enables LB
• Client-side LB: Complex setup, limited effectiveness


Official Documentation References

gRPC Load Balancing

From the official gRPC blog:

“Load balancing within gRPC happens on a per-call basis, not a per-connection basis. In other words, even if all requests come from a single client, we want to distribute them across all servers.”

The problem: Standard deployments don’t achieve per-call balancing.

Istio’s Solution

From Istio’s service mesh documentation:

“Istio’s data plane is composed of a set of intelligent proxies (Envoy) deployed as sidecars. These proxies mediate and control all network communication between microservices.”

Kubernetes Service Limitations

From Kubernetes networking concepts:

“kube-proxy… only supports TCP and UDP… doesn’t understand HTTP and doesn’t provide load balancing for HTTP requests.”


Complete Test Results Summary

After running comprehensive tests across all possible gRPC load balancing configurations, here are the definitive results that prove the fundamental limitations and solutions:

???? Core Test Matrix Results

ConfigurationLoad BalancingServers HitDistributionKey Insight
Local gRPC? None1/1 (100%)Single serverBaseline behavior confirmed
Kubernetes + gRPC? None1/5 (100%)Single podK8s Services don’t solve it
Kubernetes + Istio? Perfect5/5 (20% each)Even distributionIstio enables true LB
Client-side LB? Failed1/5 (100%)Single podDNS limitation fatal
kubectl port-forward + Istio? None1/5 (100%)Single podTesting methodology matters
Advanced multi-connection? None1/1 (100%)Single endpointComplex ? effective

???? Detailed Test Scenario Analysis

Scenario 1: Baseline Tests

Local single server:     ? PASS - 50 requests ? 1 server (100%)
Local multiple conn:     ? PASS - 1000 requests ? 1 server (100%)

Insight: Confirms gRPC’s connection persistence behavior. Multiple connections to same endpoint don’t change distribution.

Scenario 2: Kubernetes Standard Deployment

K8s without Istio:      ? PASS - 50 requests ? 1 pod (100%)
Expected behavior:      ? NO load balancing
Actual behavior:        ? NO load balancing  

Insight: Standard Kubernetes deployment with 5 replicas provides zero request-level load balancing for gRPC services.

Scenario 3: Istio Service Mesh

K8s with Istio (port-forward):  ??  BYPASS - 50 requests ? 1 pod (100%)
K8s with Istio (in-mesh):       ? SUCCESS - 50 requests ? 5 pods (20% each)

Insight: Istio provides perfect load balancing when tested correctly. Port-forward testing gives false negatives.

Scenario 4: Client-Side Approaches

DNS round-robin:        ? FAIL - "no children to pick from"
Multi-endpoint client:  ? PARTIAL - Works with manual endpoint management
Advanced connections:   ? FAIL - Still single endpoint limitation

Insight: Client-side solutions are complex, fragile, and limited in Kubernetes environments.

???? Deep Technical Analysis

The DNS Problem (Root Cause)

Our testing revealed the fundamental architectural issue:

# What Kubernetes provides
nslookup grpc-echo-service
? 10.105.177.23 (single ClusterIP)

# What gRPC client-side LB needs  
nslookup grpc-echo-service
? 10.1.4.241, 10.1.4.242, 10.1.4.243, 10.1.4.244, 10.1.4.245 (multiple IPs)

Impact: This single vs. multiple IP difference makes client-side load balancing architecturally impossible with standard Kubernetes Services.

Connection Persistence Evidence

Our advanced client test with 1000 requests, 10 concurrent clients, and 5 connections:

Test Duration: 48ms
Requests/sec: 20,702
Servers Hit: 1 (100%)
Connection Reuse: Perfect (efficient but unbalanced)

Conclusion: Even sophisticated connection management can’t overcome the single-endpoint limitation.

Istio’s L7 Magic

Comparing the same test scenario:

# Without Istio
50 requests ? grpc-echo-server-abc123 (100%)

# With Istio  
50 requests ? 5 different pods (20% each)
Distribution variance: 0.00 (perfect)

Technical Detail: Istio’s Envoy sidecar performs request-level routing, creating independent routing decisions for each gRPC call.

? Performance Impact Analysis

Based on our benchmark results:

ConfigurationReq/sOverheadLoad BalancingProduction Suitable
Local baseline~25,0000%None? Not scalable
K8s standard~22,00012%None? Unbalanced
K8s + Istio~20,00020%Perfect? Recommended
Client-side~23,0008%Complex?? Maintenance burden

Insight: Istio’s 20% performance overhead is a reasonable trade-off for enabling proper load balancing and gaining a production-ready service mesh.


Production Recommendations

For Development Teams:

  1. Standard Kubernetes deployment of gRPC services will not load balance
  2. Istio is the proven solution for production gRPC load balancing
  3. Client-side approaches add complexity without solving the fundamental issue
  4. Testing methodology critically affects results (avoid port-forward for Istio tests)

For Architecture Decisions:

  1. Plan for Istio if deploying multiple gRPC services
  2. Accept the 20% performance cost for operational benefits
  3. Avoid client-side load balancing in Kubernetes environments
  4. Use proper testing practices to validate service mesh behavior

For Production Readiness:

  1. Istio + DestinationRules provide enterprise-grade gRPC load balancing
  2. Monitoring and observability come built-in with Istio
  3. Circuit breaking and retry policies integrate seamlessly
  4. Zero client-side complexity reduces maintenance burden

???? Primary Recommendation: Istio Service Mesh

Our testing proves Istio is the only solution that provides reliable gRPC load balancing in Kubernetes:

# Production-tested DestinationRule configuration
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-service-production
spec:
  host: grpc-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10  # Tested: Ensures request distribution
        connectTimeout: 30s
    loadBalancer:
      simple: LEAST_REQUEST  # Better than ROUND_ROBIN for varying request costs
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Why this configuration works:

  • maxRequestsPerConnection: 10 – Forces connection rotation (tested in our scenario)
  • LEAST_REQUEST – Better performance than round-robin for real workloads
  • outlierDetection – Automatic failure handling (something client-side LB can’t provide)

Expected results based on our testing:

  • ? Perfect 20% distribution across 5 replicas
  • ? ~20% performance overhead (trade-off worth it)
  • ? Built-in observability and monitoring
  • ? Zero client-side complexity

???? Configuration Best Practices

1. Enable Istio Injection Properly

# Enable for entire namespace (recommended)
kubectl label namespace production istio-injection=enabled

# Or per-deployment (more control)
metadata:
  annotations:
    sidecar.istio.io/inject: "true"

2. Validate Load Balancing is Working

# WRONG: This will show false negatives
kubectl port-forward service/grpc-service 50051:50051

# CORRECT: Test from inside the mesh
kubectl run test-client --rm -it --restart=Never \
  --image=your-grpc-client \
  --annotations="sidecar.istio.io/inject=true" \
  -- ./client -target grpc-service:50051 -requests 100

3. Monitor Distribution Quality

# Check Envoy stats for load balancing
kubectl exec deployment/grpc-service -c istio-proxy -- \
  curl localhost:15000/stats | grep upstream_rq_

?? What NOT to Do (Based on Our Test Failures)

1. Don’t Rely on Standard Kubernetes Services

# This WILL NOT load balance gRPC traffic
apiVersion: v1
kind: Service
metadata:
  name: grpc-service
spec:
  ports:
  - port: 50051
  selector:
    app: grpc-server
# Result: 100% traffic to single pod (proven in our tests)

2. Don’t Use Client-Side Load Balancing

// This approach FAILS in Kubernetes (tested and failed)
conn, err := grpc.Dial(
    "dns:///grpc-service:50051",
    grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
)
// Error: "no children to pick from" (proven in our tests)

3. Don’t Implement Complex Connection Pooling

// This adds complexity without solving the core issue
type LoadBalancedClient struct {
    conns []grpc.ClientConnInterface
    next  int64
}
// Still results in 100% traffic to single endpoint (proven in our tests)

???? Alternative Solutions (If Istio Not Available)

If you absolutely cannot use Istio, here are the only viable alternatives (with significant caveats):

Option 1: External Load Balancer with HTTP/2 Support

# Use nginx/envoy/haproxy outside Kubernetes
apiVersion: v1
kind: Service
metadata:
  name: grpc-service-lb
spec:
  type: LoadBalancer
  ports:
  - port: 50051
    targetPort: 50051

Limitations: Requires external infrastructure, loss of Kubernetes-native benefits

Option 2: Headless Service + Custom Service Discovery

apiVersion: v1
kind: Service
metadata:
  name: grpc-service-headless
spec:
  clusterIP: None  # Headless service
  ports:
  - port: 50051
  selector:
    app: grpc-server

Limitations: Complex client implementation, manual health checking


Conclusion

After testing every possible gRPC load balancing configuration in Kubernetes, the evidence is clear and definitive:

  • Standard Kubernetes + gRPC = Zero load balancing (100% traffic to single pod)
  • The problem is architectural, not implementation
  • Client-side solutions fail due to DNS limitations (“no children to pick from”)
  • Complex workarounds add overhead without solving the core issue

???? Istio is the Proven Solution

The evidence overwhelmingly supports Istio as the production solution:

  • ? Perfect load balancing: 20% distribution across 5 pods (0.00 variance)
  • ? Reasonable overhead: 20% performance cost for complete solution
  • ? Production features: Circuit breaking, retries, observability included
  • ? Zero client complexity: Works transparently with existing gRPC clients

???? Critical Testing Insight

Our testing revealed a major pitfall that leads to incorrect conclusions:

  • kubectl port-forward bypasses Istio ? false negative results
  • Most developers get wrong results when testing Istio + gRPC
  • Always test from inside the service mesh for accurate results

Full test suite and results ?

August 15, 2025

Building Robust Error Handling with gRPC and REST APIs

Filed under: Computing,Web Services — admin @ 2:23 pm

Introduction

Error handling is often an afterthought in API development, yet it’s one of the most critical aspects of a good developer experience. For example, a cryptic error message like { "error": "An error occurred" } can lead to hours of frustrating debugging. In this guide, we will build a robust, production-grade error handling framework for a Go application that serves both gRPC and a REST/HTTP proxy based on industry standards like RFC9457 (Problem Details for HTTP APIs) and RFC7807 (obsoleted).

Tenets

Following are tenets of a great API error:

  1. Structured: machine-readable, not just a string.
  2. Actionable: explains the developer why the error occurred and, if possible, how to fix it.
  3. Consistent: all errors, from validation to authentication to server faults, follow the same format.
  4. Secure: never leaks sensitive internal information like stack traces or database schemas.

Our North Star for HTTP errors will be the Problem Details for HTTP APIs (RFC 9457/7807):

{
  "type": "https://example.com/docs/errors/validation-failed",
  "title": "Validation Failed",
  "status": 400,
  "detail": "The request body failed validation.",
  "instance": "/v1/todos",
  "invalid_params": [
    {
      "field": "title",
      "reason": "must not be empty"
    }
  ]
}

We will adapt this model for gRPC by embedding a similar structure in the gRPC status details, creating a single source of truth for all errors.

API Design

Let’s start by defining our TODO API in Protocol Buffers:

syntax = "proto3";

package todo.v1;

import "google/api/annotations.proto";
import "google/api/field_behavior.proto";
import "google/api/resource.proto";
import "google/protobuf/timestamp.proto";
import "google/protobuf/field_mask.proto";
import "buf/validate/validate.proto";

option go_package = "github.com/bhatti/todo-api-errors/api/proto/todo/v1;todo";

// TodoService provides task management operations
service TodoService {
  // CreateTask creates a new task
  rpc CreateTask(CreateTaskRequest) returns (Task) {
    option (google.api.http) = {
      post: "/v1/tasks"
      body: "*"
    };
  }

  // GetTask retrieves a specific task
  rpc GetTask(GetTaskRequest) returns (Task) {
    option (google.api.http) = {
      get: "/v1/{name=tasks/*}"
    };
  }

  // ListTasks retrieves all tasks
  rpc ListTasks(ListTasksRequest) returns (ListTasksResponse) {
    option (google.api.http) = {
      get: "/v1/tasks"
    };
  }

  // UpdateTask updates an existing task
  rpc UpdateTask(UpdateTaskRequest) returns (Task) {
    option (google.api.http) = {
      patch: "/v1/{task.name=tasks/*}"
      body: "task"
    };
  }

  // DeleteTask removes a task
  rpc DeleteTask(DeleteTaskRequest) returns (DeleteTaskResponse) {
    option (google.api.http) = {
      delete: "/v1/{name=tasks/*}"
    };
  }

  // BatchCreateTasks creates multiple tasks at once
  rpc BatchCreateTasks(BatchCreateTasksRequest) returns (BatchCreateTasksResponse) {
    option (google.api.http) = {
      post: "/v1/tasks:batchCreate"
      body: "*"
    };
  }
}

// Task represents a TODO item
message Task {
  option (google.api.resource) = {
    type: "todo.example.com/Task"
    pattern: "tasks/{task}"
    singular: "task"
    plural: "tasks"
  };

  // Resource name of the task
  string name = 1 [
    (google.api.field_behavior) = IDENTIFIER,
    (google.api.field_behavior) = OUTPUT_ONLY
  ];

  // Task title
  string title = 2 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).string = {
      min_len: 1
      max_len: 200
    }
  ];

  // Task description
  string description = 3 [
    (google.api.field_behavior) = OPTIONAL,
    (buf.validate.field).string = {
      max_len: 1000
    }
  ];

  // Task status
  Status status = 4 [
    (google.api.field_behavior) = REQUIRED
  ];

  // Task priority
  Priority priority = 5 [
    (google.api.field_behavior) = OPTIONAL
  ];

  // Due date for the task
  google.protobuf.Timestamp due_date = 6 [
    (google.api.field_behavior) = OPTIONAL,
    (buf.validate.field).timestamp = {
      gt_now: true
    }
  ];

  // Task creation time
  google.protobuf.Timestamp create_time = 7 [
    (google.api.field_behavior) = OUTPUT_ONLY
  ];

  // Task last update time
  google.protobuf.Timestamp update_time = 8 [
    (google.api.field_behavior) = OUTPUT_ONLY
  ];

  // User who created the task
  string created_by = 9 [
    (google.api.field_behavior) = OUTPUT_ONLY
  ];

  // Tags associated with the task
  repeated string tags = 10 [
    (buf.validate.field).repeated = {
      max_items: 10
      items: {
        string: {
          pattern: "^[a-z0-9-]+$"
          max_len: 50
        }
      }
    }
  ];
}

// Task status enumeration
enum Status {
  STATUS_UNSPECIFIED = 0;
  STATUS_PENDING = 1;
  STATUS_IN_PROGRESS = 2;
  STATUS_COMPLETED = 3;
  STATUS_CANCELLED = 4;
}

// Task priority enumeration
enum Priority {
  PRIORITY_UNSPECIFIED = 0;
  PRIORITY_LOW = 1;
  PRIORITY_MEDIUM = 2;
  PRIORITY_HIGH = 3;
  PRIORITY_CRITICAL = 4;
}

// CreateTaskRequest message
message CreateTaskRequest {
  // Task to create
  Task task = 1 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).required = true
  ];
}

// GetTaskRequest message
message GetTaskRequest {
  // Resource name of the task
  string name = 1 [
    (google.api.field_behavior) = REQUIRED,
    (google.api.resource_reference) = {
      type: "todo.example.com/Task"
    },
    (buf.validate.field).string = {
      pattern: "^tasks/[a-zA-Z0-9-]+$"
    }
  ];
}

// ListTasksRequest message
message ListTasksRequest {
  // Maximum number of tasks to return
  int32 page_size = 1 [
    (buf.validate.field).int32 = {
      gte: 0
      lte: 1000
    }
  ];

  // Page token for pagination
  string page_token = 2;

  // Filter expression
  string filter = 3;

  // Order by expression
  string order_by = 4;
}

// ListTasksResponse message
message ListTasksResponse {
  // List of tasks
  repeated Task tasks = 1;

  // Token for next page
  string next_page_token = 2;

  // Total number of tasks
  int32 total_size = 3;
}

// UpdateTaskRequest message
message UpdateTaskRequest {
  // Task to update
  Task task = 1 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).required = true
  ];

  // Fields to update
  google.protobuf.FieldMask update_mask = 2 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).required = true
  ];
}

// DeleteTaskRequest message
message DeleteTaskRequest {
  // Resource name of the task
  string name = 1 [
    (google.api.field_behavior) = REQUIRED,
    (google.api.resource_reference) = {
      type: "todo.example.com/Task"
    }
  ];
}

// DeleteTaskResponse message
message DeleteTaskResponse {
  // Confirmation message
  string message = 1;
}

// BatchCreateTasksRequest message
message BatchCreateTasksRequest {
  // Tasks to create
  repeated CreateTaskRequest requests = 1 [
    (google.api.field_behavior) = REQUIRED,
    (buf.validate.field).repeated = {
      min_items: 1
      max_items: 100
    }
  ];
}

// BatchCreateTasksResponse message
message BatchCreateTasksResponse {
  // Created tasks
  repeated Task tasks = 1;
}
syntax = "proto3";

package errors.v1;

import "google/protobuf/timestamp.proto";
import "google/protobuf/any.proto";

option go_package = "github.com/bhatti/todo-api-errors/api/proto/errors/v1;errors";

// ErrorDetail provides a structured, machine-readable error payload.
// It is designed to be embedded in the `details` field of a `google.rpc.Status` message.
message ErrorDetail {
  // A unique, application-specific error code.
  string code = 1;
  // A short, human-readable summary of the problem type.
  string title = 2;
  // A human-readable explanation specific to this occurrence of the problem.
  string detail = 3;
  // A list of validation errors, useful for INVALID_ARGUMENT responses.
  repeated FieldViolation field_violations = 4;
  // Optional trace ID for request correlation
  string trace_id = 5;
  // Optional timestamp when the error occurred
  google.protobuf.Timestamp timestamp = 6;
  // Optional instance path where the error occurred
  string instance = 7;
  // Optional extensions for additional error context
  map<string, google.protobuf.Any> extensions = 8;
}

// Describes a single validation failure.
message FieldViolation {
  // The path to the field that failed validation, e.g., "title".
  string field = 1;
  // A developer-facing description of the validation rule that failed.
  string description = 2;
  // Application-specific error code for this validation failure
  string code = 3;
}

// AppErrorCode defines a list of standardized, application-specific error codes.
enum AppErrorCode {
  APP_ERROR_CODE_UNSPECIFIED = 0;

  // Validation failures
  VALIDATION_FAILED = 1;
  REQUIRED_FIELD = 2;
  TOO_SHORT = 3;
  TOO_LONG = 4;
  INVALID_FORMAT = 5;
  MUST_BE_FUTURE = 6;
  INVALID_VALUE = 7;
  DUPLICATE_TAG = 8;
  INVALID_TAG_FORMAT = 9;
  OVERDUE_COMPLETION = 10;
  EMPTY_BATCH = 11;
  BATCH_TOO_LARGE = 12;
  DUPLICATE_TITLE = 13;

  // Resource errors
  RESOURCE_NOT_FOUND = 1001;
  RESOURCE_CONFLICT = 1002;

  // Authentication and authorization
  AUTHENTICATION_FAILED = 2001;
  PERMISSION_DENIED = 2002;

  // Rate limiting and service availability
  RATE_LIMIT_EXCEEDED = 3001;
  SERVICE_UNAVAILABLE = 3002;

  // Internal errors
  INTERNAL_ERROR = 9001;
}

Error Handling Implementation

Now let’s implement our error handling framework:

package errors

import (
	"fmt"

	errorspb "github.com/bhatti/todo-api-errors/api/proto/errors/v1"
	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/types/known/anypb"
	"google.golang.org/protobuf/types/known/timestamppb"
)

// AppError is our custom error type using protobuf definitions.
type AppError struct {
	GRPCCode        codes.Code
	AppCode         errorspb.AppErrorCode
	Title           string
	Detail          string
	FieldViolations []*errorspb.FieldViolation
	TraceID         string
	Instance        string
	Extensions      map[string]*anypb.Any
	CausedBy        error // For internal logging
}

func (e *AppError) Error() string {
	return fmt.Sprintf("gRPC Code: %s, App Code: %s, Title: %s, Detail: %s", e.GRPCCode, e.AppCode, e.Title, e.Detail)
}

// ToGRPCStatus converts our AppError into a gRPC status.Status.
func (e *AppError) ToGRPCStatus() *status.Status {
	st := status.New(e.GRPCCode, e.Title)

	errorDetail := &errorspb.ErrorDetail{
		Code:            e.AppCode.String(),
		Title:           e.Title,
		Detail:          e.Detail,
		FieldViolations: e.FieldViolations,
		TraceId:         e.TraceID,
		Timestamp:       timestamppb.Now(),
		Instance:        e.Instance,
		Extensions:      e.Extensions,
	}

	// For validation errors, we also attach the standard BadRequest detail
	// so that gRPC-Gateway and other standard tools can understand it.
	if e.GRPCCode == codes.InvalidArgument && len(e.FieldViolations) > 0 {
		br := &errdetails.BadRequest{}
		for _, fv := range e.FieldViolations {
			br.FieldViolations = append(br.FieldViolations, &errdetails.BadRequest_FieldViolation{
				Field:       fv.Field,
				Description: fv.Description,
			})
		}
		st, _ = st.WithDetails(br, errorDetail)
		return st
	}

	st, _ = st.WithDetails(errorDetail)
	return st
}

// Helper functions for creating common errors

func NewValidationFailed(violations []*errorspb.FieldViolation, traceID string) *AppError {
	return &AppError{
		GRPCCode:        codes.InvalidArgument,
		AppCode:         errorspb.AppErrorCode_VALIDATION_FAILED,
		Title:           "Validation Failed",
		Detail:          fmt.Sprintf("The request contains %d validation errors", len(violations)),
		FieldViolations: violations,
		TraceID:         traceID,
	}
}

func NewNotFound(resource string, id string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.NotFound,
		AppCode:  errorspb.AppErrorCode_RESOURCE_NOT_FOUND,
		Title:    "Resource Not Found",
		Detail:   fmt.Sprintf("%s with ID '%s' was not found.", resource, id),
		TraceID:  traceID,
	}
}

func NewConflict(resource, reason string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.AlreadyExists,
		AppCode:  errorspb.AppErrorCode_RESOURCE_CONFLICT,
		Title:    "Resource Conflict",
		Detail:   fmt.Sprintf("Conflict creating %s: %s", resource, reason),
		TraceID:  traceID,
	}
}

func NewInternal(message string, traceID string, causedBy error) *AppError {
	return &AppError{
		GRPCCode: codes.Internal,
		AppCode:  errorspb.AppErrorCode_INTERNAL_ERROR,
		Title:    "Internal Server Error",
		Detail:   message,
		TraceID:  traceID,
		CausedBy: causedBy,
	}
}

func NewPermissionDenied(resource, action string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.PermissionDenied,
		AppCode:  errorspb.AppErrorCode_PERMISSION_DENIED,
		Title:    "Permission Denied",
		Detail:   fmt.Sprintf("You don't have permission to %s %s", action, resource),
		TraceID:  traceID,
	}
}

func NewServiceUnavailable(message string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.Unavailable,
		AppCode:  errorspb.AppErrorCode_SERVICE_UNAVAILABLE,
		Title:    "Service Unavailable",
		Detail:   message,
		TraceID:  traceID,
	}
}

func NewRequiredField(field, message string, traceID string) *AppError {
	return &AppError{
		GRPCCode: codes.InvalidArgument,
		AppCode:  errorspb.AppErrorCode_VALIDATION_FAILED,
		Title:    "Validation Failed",
		Detail:   "The request contains validation errors",
		FieldViolations: []*errorspb.FieldViolation{
			{
				Field:       field,
				Code:        errorspb.AppErrorCode_REQUIRED_FIELD.String(),
				Description: message,
			},
		},
		TraceID: traceID,
	}
}

Validation Framework

Let’s implement validation that returns all errors at once:

package validation

import (
	"errors"
	"fmt"
	"regexp"
	"strings"

	"buf.build/gen/go/bufbuild/protovalidate/protocolbuffers/go/buf/validate"
	"buf.build/go/protovalidate"
	errorspb "github.com/bhatti/todo-api-errors/api/proto/errors/v1"
	todopb "github.com/bhatti/todo-api-errors/api/proto/todo/v1"
	apperrors "github.com/bhatti/todo-api-errors/internal/errors"
	"google.golang.org/protobuf/proto"
)

var pv protovalidate.Validator

func init() {
	var err error
	pv, err = protovalidate.New()
	if err != nil {
		panic(fmt.Sprintf("failed to initialize protovalidator: %v", err))
	}
}

// ValidateRequest checks a proto message and returns an AppError with all violations.
func ValidateRequest(req proto.Message, traceID string) error {
	if err := pv.Validate(req); err != nil {
		var validationErrs *protovalidate.ValidationError
		if errors.As(err, &validationErrs) {
			var violations []*errorspb.FieldViolation
			for _, violation := range validationErrs.Violations {
				fieldPath := ""
				if violation.Proto.GetField() != nil {
					fieldPath = formatFieldPath(violation.Proto.GetField())
				}

				ruleId := violation.Proto.GetRuleId()
				message := violation.Proto.GetMessage()

				violations = append(violations, &errorspb.FieldViolation{
					Field:       fieldPath,
					Description: message,
					Code:        mapConstraintToCode(ruleId),
				})
			}
			return apperrors.NewValidationFailed(violations, traceID)
		}
		return apperrors.NewInternal("Validation failed", traceID, err)
	}
	return nil
}

// ValidateTask performs additional business logic validation
func ValidateTask(task *todopb.Task, traceID string) error {
	var violations []*errorspb.FieldViolation

	// Proto validation first
	if err := ValidateRequest(task, traceID); err != nil {
		if appErr, ok := err.(*apperrors.AppError); ok {
			violations = append(violations, appErr.FieldViolations...)
		}
	}

	// Additional business rules
	if task.Status == todopb.Status_STATUS_COMPLETED && task.DueDate != nil {
		if task.UpdateTime != nil && task.UpdateTime.AsTime().After(task.DueDate.AsTime()) {
			violations = append(violations, &errorspb.FieldViolation{
				Field:       "due_date",
				Code:        errorspb.AppErrorCode_OVERDUE_COMPLETION.String(),
				Description: "Task was completed after the due date",
			})
		}
	}

	// Validate tags format
	for i, tag := range task.Tags {
		if !isValidTag(tag) {
			violations = append(violations, &errorspb.FieldViolation{
				Field:       fmt.Sprintf("tags[%d]", i),
				Code:        errorspb.AppErrorCode_INVALID_TAG_FORMAT.String(),
				Description: fmt.Sprintf("Tag '%s' must be lowercase letters, numbers, and hyphens only", tag),
			})
		}
	}

	// Check for duplicate tags
	tagMap := make(map[string]bool)
	for i, tag := range task.Tags {
		if tagMap[tag] {
			violations = append(violations, &errorspb.FieldViolation{
				Field:       fmt.Sprintf("tags[%d]", i),
				Code:        errorspb.AppErrorCode_DUPLICATE_TAG.String(),
				Description: fmt.Sprintf("Tag '%s' appears multiple times", tag),
			})
		}
		tagMap[tag] = true
	}

	if len(violations) > 0 {
		return apperrors.NewValidationFailed(violations, traceID)
	}

	return nil
}

// ValidateBatchCreateTasks validates batch operations
func ValidateBatchCreateTasks(req *todopb.BatchCreateTasksRequest, traceID string) error {
	var violations []*errorspb.FieldViolation

	// Check batch size
	if len(req.Requests) == 0 {
		violations = append(violations, &errorspb.FieldViolation{
			Field:       "requests",
			Code:        errorspb.AppErrorCode_EMPTY_BATCH.String(),
			Description: "Batch must contain at least one task",
		})
	}

	if len(req.Requests) > 100 {
		violations = append(violations, &errorspb.FieldViolation{
			Field:       "requests",
			Code:        errorspb.AppErrorCode_BATCH_TOO_LARGE.String(),
			Description: fmt.Sprintf("Batch size %d exceeds maximum of 100", len(req.Requests)),
		})
	}

	// Validate each task
	for i, createReq := range req.Requests {
		if createReq.Task == nil {
			violations = append(violations, &errorspb.FieldViolation{
				Field:       fmt.Sprintf("requests[%d].task", i),
				Code:        errorspb.AppErrorCode_REQUIRED_FIELD.String(),
				Description: "Task is required",
			})
			continue
		}

		// Validate task
		if err := ValidateTask(createReq.Task, traceID); err != nil {
			if appErr, ok := err.(*apperrors.AppError); ok {
				for _, violation := range appErr.FieldViolations {
					violation.Field = fmt.Sprintf("requests[%d].task.%s", i, violation.Field)
					violations = append(violations, violation)
				}
			}
		}
	}

	// Check for duplicate titles
	titleMap := make(map[string][]int)
	for i, createReq := range req.Requests {
		if createReq.Task != nil && createReq.Task.Title != "" {
			titleMap[createReq.Task.Title] = append(titleMap[createReq.Task.Title], i)
		}
	}

	for title, indices := range titleMap {
		if len(indices) > 1 {
			for _, idx := range indices {
				violations = append(violations, &errorspb.FieldViolation{
					Field:       fmt.Sprintf("requests[%d].task.title", idx),
					Code:        errorspb.AppErrorCode_DUPLICATE_TITLE.String(),
					Description: fmt.Sprintf("Title '%s' is used by multiple tasks in the batch", title),
				})
			}
		}
	}

	if len(violations) > 0 {
		return apperrors.NewValidationFailed(violations, traceID)
	}

	return nil
}

// Helper functions
func formatFieldPath(fieldPath *validate.FieldPath) string {
	if fieldPath == nil {
		return ""
	}

	// Build field path from elements
	var parts []string
	for _, element := range fieldPath.GetElements() {
		if element.GetFieldName() != "" {
			parts = append(parts, element.GetFieldName())
		} else if element.GetFieldNumber() != 0 {
			parts = append(parts, fmt.Sprintf("field_%d", element.GetFieldNumber()))
		}
	}

	return strings.Join(parts, ".")
}

func mapConstraintToCode(ruleId string) string {
	switch {
	case strings.Contains(ruleId, "required"):
		return errorspb.AppErrorCode_REQUIRED_FIELD.String()
	case strings.Contains(ruleId, "min_len"):
		return errorspb.AppErrorCode_TOO_SHORT.String()
	case strings.Contains(ruleId, "max_len"):
		return errorspb.AppErrorCode_TOO_LONG.String()
	case strings.Contains(ruleId, "pattern"):
		return errorspb.AppErrorCode_INVALID_FORMAT.String()
	case strings.Contains(ruleId, "gt_now"):
		return errorspb.AppErrorCode_MUST_BE_FUTURE.String()
	case ruleId == "":
		return errorspb.AppErrorCode_VALIDATION_FAILED.String()
	default:
		return errorspb.AppErrorCode_INVALID_VALUE.String()
	}
}

var validTagPattern = regexp.MustCompile(`^[a-z0-9-]+$`)

func isValidTag(tag string) bool {
	return len(tag) <= 50 && validTagPattern.MatchString(tag)
}

Error Handler Middleware

Now let’s create middleware to handle errors consistently:

package middleware

import (
	"context"
	"errors"
	"log"

	apperrors "github.com/bhatti/todo-api-errors/internal/errors"
	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

// UnaryErrorInterceptor translates application errors into gRPC statuses.
func UnaryErrorInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	resp, err := handler(ctx, req)
	if err == nil {
		return resp, nil
	}

	var appErr *apperrors.AppError
	if errors.As(err, &appErr) {
		if appErr.CausedBy != nil {
			log.Printf("ERROR: %s, Original cause: %v", appErr.Title, appErr.CausedBy)
		}
		return nil, appErr.ToGRPCStatus().Err()
	}

	if _, ok := status.FromError(err); ok {
		return nil, err // Already a gRPC status
	}

	log.Printf("UNEXPECTED ERROR: %v", err)
	return nil, apperrors.NewInternal("An unexpected error occurred", "", err).ToGRPCStatus().Err()
}
package middleware

import (
	"context"
	"encoding/json"
	"net/http"
	"runtime/debug"
	"time"

	errorspb "github.com/bhatti/todo-api-errors/api/proto/errors/v1"
	apperrors "github.com/bhatti/todo-api-errors/internal/errors"
	"github.com/google/uuid"
	"github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
	"go.opentelemetry.io/otel/trace"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/encoding/protojson"
)

// HTTPErrorHandler handles errors for HTTP endpoints
func HTTPErrorHandler(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Add trace ID to context
		traceID := r.Header.Get("X-Trace-ID")
		if traceID == "" {
			traceID = uuid.New().String()
		}
		ctx := context.WithValue(r.Context(), "traceID", traceID)
		r = r.WithContext(ctx)

		// Create response wrapper to intercept errors
		wrapped := &responseWriter{
			ResponseWriter: w,
			request:        r,
			traceID:        traceID,
		}

		// Handle panics
		defer func() {
			if err := recover(); err != nil {
				handlePanic(wrapped, err)
			}
		}()

		// Process request
		next.ServeHTTP(wrapped, r)
	})
}

// responseWriter wraps http.ResponseWriter to intercept errors
type responseWriter struct {
	http.ResponseWriter
	request    *http.Request
	traceID    string
	statusCode int
	written    bool
}

func (w *responseWriter) WriteHeader(code int) {
	if !w.written {
		w.statusCode = code
		w.ResponseWriter.WriteHeader(code)
		w.written = true
	}
}

func (w *responseWriter) Write(b []byte) (int, error) {
	if !w.written {
		w.WriteHeader(http.StatusOK)
	}
	return w.ResponseWriter.Write(b)
}

// handlePanic converts panics to proper error responses
func handlePanic(w *responseWriter, recovered interface{}) {
	// Log stack trace
	debug.PrintStack()

	appErr := apperrors.NewInternal("An unexpected error occurred. Please try again later.", w.traceID, nil)
	writeErrorResponse(w, appErr)
}

// CustomHTTPError handles gRPC gateway error responses
func CustomHTTPError(ctx context.Context, mux *runtime.ServeMux,
	marshaler runtime.Marshaler, w http.ResponseWriter, r *http.Request, err error) {

	// Extract trace ID
	traceID := r.Header.Get("X-Trace-ID")
	if traceID == "" {
		if span := trace.SpanFromContext(ctx); span.SpanContext().IsValid() {
			traceID = span.SpanContext().TraceID().String()
		} else {
			traceID = uuid.New().String()
		}
	}

	// Convert gRPC error to HTTP response
	st, _ := status.FromError(err)

	// Check if we have our custom error detail in status details
	for _, detail := range st.Details() {
		if errorDetail, ok := detail.(*errorspb.ErrorDetail); ok {
			// Update the error detail with current request context
			errorDetail.TraceId = traceID
			errorDetail.Instance = r.URL.Path

			// Convert to JSON and write response
			w.Header().Set("Content-Type", "application/problem+json")
			w.WriteHeader(runtime.HTTPStatusFromCode(st.Code()))

			// Create a simplified JSON response that matches RFC 7807
			response := map[string]interface{}{
				"type":      getTypeForCode(errorDetail.Code),
				"title":     errorDetail.Title,
				"status":    runtime.HTTPStatusFromCode(st.Code()),
				"detail":    errorDetail.Detail,
				"instance":  errorDetail.Instance,
				"traceId":   errorDetail.TraceId,
				"timestamp": errorDetail.Timestamp,
			}

			// Add field violations if present
			if len(errorDetail.FieldViolations) > 0 {
				violations := make([]map[string]interface{}, len(errorDetail.FieldViolations))
				for i, fv := range errorDetail.FieldViolations {
					violations[i] = map[string]interface{}{
						"field":   fv.Field,
						"code":    fv.Code,
						"message": fv.Description,
					}
				}
				response["errors"] = violations
			}

			// Add extensions if present
			if len(errorDetail.Extensions) > 0 {
				extensions := make(map[string]interface{})
				for k, v := range errorDetail.Extensions {
					// Convert Any to JSON
					if jsonBytes, err := protojson.Marshal(v); err == nil {
						var jsonData interface{}
						if err := json.Unmarshal(jsonBytes, &jsonData); err == nil {
							extensions[k] = jsonData
						}
					}
				}
				if len(extensions) > 0 {
					response["extensions"] = extensions
				}
			}

			if err := json.NewEncoder(w).Encode(response); err != nil {
				http.Error(w, `{"error": "Failed to encode error response"}`, 500)
			}
			return
		}
	}

	// Fallback: create new error response
	fallbackErr := apperrors.NewInternal(st.Message(), traceID, nil)
	fallbackErr.GRPCCode = st.Code()
	writeAppErrorResponse(w, fallbackErr, r.URL.Path)
}

// Helper functions
func getTypeForCode(code string) string {
	switch code {
	case errorspb.AppErrorCode_VALIDATION_FAILED.String():
		return "https://api.example.com/errors/validation-failed"
	case errorspb.AppErrorCode_RESOURCE_NOT_FOUND.String():
		return "https://api.example.com/errors/resource-not-found"
	case errorspb.AppErrorCode_RESOURCE_CONFLICT.String():
		return "https://api.example.com/errors/resource-conflict"
	case errorspb.AppErrorCode_PERMISSION_DENIED.String():
		return "https://api.example.com/errors/permission-denied"
	case errorspb.AppErrorCode_INTERNAL_ERROR.String():
		return "https://api.example.com/errors/internal-error"
	case errorspb.AppErrorCode_SERVICE_UNAVAILABLE.String():
		return "https://api.example.com/errors/service-unavailable"
	default:
		return "https://api.example.com/errors/unknown"
	}
}

func writeErrorResponse(w http.ResponseWriter, err error) {
	if appErr, ok := err.(*apperrors.AppError); ok {
		writeAppErrorResponse(w, appErr, "")
	} else {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func writeAppErrorResponse(w http.ResponseWriter, appErr *apperrors.AppError, instance string) {
	statusCode := runtime.HTTPStatusFromCode(appErr.GRPCCode)

	response := map[string]interface{}{
		"type":      getTypeForCode(appErr.AppCode.String()),
		"title":     appErr.Title,
		"status":    statusCode,
		"detail":    appErr.Detail,
		"traceId":   appErr.TraceID,
		"timestamp": time.Now(),
	}

	if instance != "" {
		response["instance"] = instance
	}

	if len(appErr.FieldViolations) > 0 {
		violations := make([]map[string]interface{}, len(appErr.FieldViolations))
		for i, fv := range appErr.FieldViolations {
			violations[i] = map[string]interface{}{
				"field":   fv.Field,
				"code":    fv.Code,
				"message": fv.Description,
			}
		}
		response["errors"] = violations
	}

	w.Header().Set("Content-Type", "application/problem+json")
	w.WriteHeader(statusCode)
	json.NewEncoder(w).Encode(response)
}

Service Implementation

Now let’s implement our TODO service with proper error handling:

package service

import (
	"context"
	"fmt"
	todopb "github.com/bhatti/todo-api-errors/api/proto/todo/v1"
	"github.com/bhatti/todo-api-errors/internal/errors"
	"github.com/bhatti/todo-api-errors/internal/repository"
	"github.com/bhatti/todo-api-errors/internal/validation"
	"github.com/google/uuid"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
	"google.golang.org/protobuf/types/known/fieldmaskpb"
	"google.golang.org/protobuf/types/known/timestamppb"
	"strings"
)

var tracer = otel.Tracer("todo-service")

// TodoService implements the TODO API
type TodoService struct {
	todopb.UnimplementedTodoServiceServer
	repo repository.TodoRepository
}

// NewTodoService creates a new TODO service
func NewTodoService(repo repository.TodoRepository) (*TodoService, error) {
	return &TodoService{
		repo: repo,
	}, nil
}

// CreateTask creates a new task
func (s *TodoService) CreateTask(ctx context.Context, req *todopb.CreateTaskRequest) (*todopb.Task, error) {
	ctx, span := tracer.Start(ctx, "CreateTask")
	defer span.End()

	// Get trace ID for error responses
	traceID := span.SpanContext().TraceID().String()

	// Validate request
	if req.Task == nil {
		return nil, errors.NewRequiredField("task", "Task object is required", traceID)
	}

	// Validate task fields using the new validation package
	if err := validation.ValidateTask(req.Task, traceID); err != nil {
		span.SetAttributes(attribute.String("validation.error", err.Error()))
		return nil, err
	}

	// Check for duplicate title
	existing, err := s.repo.GetTaskByTitle(ctx, req.Task.Title)
	if err != nil && !repository.IsNotFound(err) {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	if existing != nil {
		return nil, errors.NewConflict("task", "A task with this title already exists", traceID)
	}

	// Generate task ID
	taskID := uuid.New().String()
	task := &todopb.Task{
		Name:        fmt.Sprintf("tasks/%s", taskID),
		Title:       req.Task.Title,
		Description: req.Task.Description,
		Status:      req.Task.Status,
		Priority:    req.Task.Priority,
		DueDate:     req.Task.DueDate,
		Tags:        req.Task.Tags,
		CreateTime:  timestamppb.Now(),
		UpdateTime:  timestamppb.Now(),
		CreatedBy:   s.getUserFromContext(ctx),
	}

	// Set defaults
	if task.Status == todopb.Status_STATUS_UNSPECIFIED {
		task.Status = todopb.Status_STATUS_PENDING
	}
	if task.Priority == todopb.Priority_PRIORITY_UNSPECIFIED {
		task.Priority = todopb.Priority_PRIORITY_MEDIUM
	}

	// Save to repository
	if err := s.repo.CreateTask(ctx, task); err != nil {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	span.SetAttributes(
		attribute.String("task.id", taskID),
		attribute.String("task.title", task.Title),
	)

	return task, nil
}

// GetTask retrieves a specific task
func (s *TodoService) GetTask(ctx context.Context, req *todopb.GetTaskRequest) (*todopb.Task, error) {
	ctx, span := tracer.Start(ctx, "GetTask")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate request using the new validation package
	if err := validation.ValidateRequest(req, traceID); err != nil {
		return nil, err
	}

	// Extract task ID
	parts := strings.Split(req.Name, "/")
	if len(parts) != 2 || parts[0] != "tasks" {
		return nil, errors.NewRequiredField("name", "Task name must be in format 'tasks/{id}'", traceID)
	}

	taskID := parts[1]
	span.SetAttributes(attribute.String("task.id", taskID))

	// Get from repository
	task, err := s.repo.GetTask(ctx, taskID)
	if err != nil {
		if repository.IsNotFound(err) {
			return nil, errors.NewNotFound("Task", taskID, traceID)
		}
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	// Check permissions
	if !s.canAccessTask(ctx, task) {
		return nil, errors.NewPermissionDenied("task", "read", traceID)
	}

	return task, nil
}

// ListTasks retrieves all tasks
func (s *TodoService) ListTasks(ctx context.Context, req *todopb.ListTasksRequest) (*todopb.ListTasksResponse, error) {
	ctx, span := tracer.Start(ctx, "ListTasks")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate request using the new validation package
	if err := validation.ValidateRequest(req, traceID); err != nil {
		return nil, err
	}

	// Default page size
	pageSize := req.PageSize
	if pageSize == 0 {
		pageSize = 50
	}
	if pageSize > 1000 {
		pageSize = 1000
	}

	span.SetAttributes(
		attribute.Int("page.size", int(pageSize)),
		attribute.String("filter", req.Filter),
	)

	// Parse filter
	filter, err := s.parseFilter(req.Filter)
	if err != nil {
		return nil, errors.NewRequiredField("filter", fmt.Sprintf("Failed to parse filter: %v", err), traceID)
	}

	// Get tasks from repository
	tasks, nextPageToken, err := s.repo.ListTasks(ctx, repository.ListOptions{
		PageSize:  int(pageSize),
		PageToken: req.PageToken,
		Filter:    filter,
		OrderBy:   req.OrderBy,
		UserID:    s.getUserFromContext(ctx),
	})

	if err != nil {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	// Get total count
	totalSize, err := s.repo.CountTasks(ctx, filter, s.getUserFromContext(ctx))
	if err != nil {
		// Log but don't fail the request
		span.RecordError(err)
		totalSize = -1
	}

	return &todopb.ListTasksResponse{
		Tasks:         tasks,
		NextPageToken: nextPageToken,
		TotalSize:     int32(totalSize),
	}, nil
}

// UpdateTask updates an existing task
func (s *TodoService) UpdateTask(ctx context.Context, req *todopb.UpdateTaskRequest) (*todopb.Task, error) {
	ctx, span := tracer.Start(ctx, "UpdateTask")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate request
	if req.Task == nil {
		return nil, errors.NewRequiredField("task", "Task object is required", traceID)
	}

	if req.UpdateMask == nil || len(req.UpdateMask.Paths) == 0 {
		return nil, errors.NewRequiredField("update_mask", "Update mask must specify which fields to update", traceID)
	}

	// Extract task ID
	parts := strings.Split(req.Task.Name, "/")
	if len(parts) != 2 || parts[0] != "tasks" {
		return nil, errors.NewRequiredField("task.name", "Invalid task name format", traceID)
	}

	taskID := parts[1]
	span.SetAttributes(attribute.String("task.id", taskID))

	// Get existing task
	existing, err := s.repo.GetTask(ctx, taskID)
	if err != nil {
		if repository.IsNotFound(err) {
			return nil, errors.NewNotFound("Task", taskID, traceID)
		}
		return nil, s.handleRepositoryError(err, traceID)
	}

	// Check permissions
	if !s.canModifyTask(ctx, existing) {
		return nil, errors.NewPermissionDenied("task", "update", traceID)
	}

	// Apply updates based on field mask
	updated := s.applyFieldMask(existing, req.Task, req.UpdateMask)
	updated.UpdateTime = timestamppb.Now()

	// Validate updated task using the new validation package
	if err := validation.ValidateTask(updated, traceID); err != nil {
		return nil, err
	}

	// Save to repository
	if err := s.repo.UpdateTask(ctx, updated); err != nil {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	return updated, nil
}

// DeleteTask removes a task
func (s *TodoService) DeleteTask(ctx context.Context, req *todopb.DeleteTaskRequest) (*todopb.DeleteTaskResponse, error) {
	ctx, span := tracer.Start(ctx, "DeleteTask")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate request using the new validation package
	if err := validation.ValidateRequest(req, traceID); err != nil {
		return nil, err
	}

	// Extract task ID
	parts := strings.Split(req.Name, "/")
	if len(parts) != 2 || parts[0] != "tasks" {
		return nil, errors.NewRequiredField("name", "Invalid task name format", traceID)
	}

	taskID := parts[1]
	span.SetAttributes(attribute.String("task.id", taskID))

	// Get existing task to check permissions
	existing, err := s.repo.GetTask(ctx, taskID)
	if err != nil {
		if repository.IsNotFound(err) {
			return nil, errors.NewNotFound("Task", taskID, traceID)
		}
		return nil, s.handleRepositoryError(err, traceID)
	}

	// Check permissions
	if !s.canModifyTask(ctx, existing) {
		return nil, errors.NewPermissionDenied("task", "delete", traceID)
	}

	// Delete from repository
	if err := s.repo.DeleteTask(ctx, taskID); err != nil {
		span.RecordError(err)
		return nil, s.handleRepositoryError(err, traceID)
	}

	return &todopb.DeleteTaskResponse{
		Message: fmt.Sprintf("Task %s deleted successfully", req.Name),
	}, nil
}

// BatchCreateTasks creates multiple tasks at once
func (s *TodoService) BatchCreateTasks(ctx context.Context, req *todopb.BatchCreateTasksRequest) (*todopb.BatchCreateTasksResponse, error) {
	ctx, span := tracer.Start(ctx, "BatchCreateTasks")
	defer span.End()

	traceID := span.SpanContext().TraceID().String()

	// Validate batch request using the new validation package
	if err := validation.ValidateBatchCreateTasks(req, traceID); err != nil {
		span.SetAttributes(attribute.String("validation.error", err.Error()))
		return nil, err
	}

	// Process each task
	var created []*todopb.Task
	var batchErrors []string

	for i, createReq := range req.Requests {
		task, err := s.CreateTask(ctx, createReq)
		if err != nil {
			// Collect errors for batch response
			batchErrors = append(batchErrors, fmt.Sprintf("Task %d: %s", i, err.Error()))
			continue
		}
		created = append(created, task)
	}

	// If all tasks failed, return error
	if len(created) == 0 && len(batchErrors) > 0 {
		return nil, errors.NewInternal("All batch operations failed", traceID, nil)
	}

	// Return partial success
	response := &todopb.BatchCreateTasksResponse{
		Tasks: created,
	}

	// Add partial errors to response metadata if any
	if len(batchErrors) > 0 {
		span.SetAttributes(
			attribute.Int("batch.total", len(req.Requests)),
			attribute.Int("batch.success", len(created)),
			attribute.Int("batch.failed", len(batchErrors)),
		)
	}

	return response, nil
}

// Helper methods

func (s *TodoService) handleRepositoryError(err error, traceID string) error {
	if repository.IsConnectionError(err) {
		return errors.NewServiceUnavailable("Unable to connect to the database. Please try again later.", traceID)
	}

	// Log internal error details
	span := trace.SpanFromContext(context.Background())
	if span != nil {
		span.RecordError(err)
	}

	return errors.NewInternal("An unexpected error occurred while processing your request", traceID, err)
}

func (s *TodoService) getUserFromContext(ctx context.Context) string {
	// In a real implementation, this would extract user info from auth context
	if user, ok := ctx.Value("user").(string); ok {
		return user
	}
	return "anonymous"
}

func (s *TodoService) canAccessTask(ctx context.Context, task *todopb.Task) bool {
	// In a real implementation, check if user can access this task
	user := s.getUserFromContext(ctx)
	return user == task.CreatedBy || user == "admin"
}

func (s *TodoService) canModifyTask(ctx context.Context, task *todopb.Task) bool {
	// In a real implementation, check if user can modify this task
	user := s.getUserFromContext(ctx)
	return user == task.CreatedBy || user == "admin"
}

func (s *TodoService) parseFilter(filter string) (map[string]interface{}, error) {
	// Simple filter parser - in production, use a proper parser
	parsed := make(map[string]interface{})

	if filter == "" {
		return parsed, nil
	}

	// Example: "status=COMPLETED AND priority=HIGH"
	parts := strings.Split(filter, " AND ")
	for _, part := range parts {
		kv := strings.Split(strings.TrimSpace(part), "=")
		if len(kv) != 2 {
			return nil, fmt.Errorf("invalid filter expression: %s", part)
		}

		key := strings.TrimSpace(kv[0])
		value := strings.Trim(strings.TrimSpace(kv[1]), "'\"")

		// Validate filter keys
		switch key {
		case "status", "priority", "created_by":
			parsed[key] = value
		default:
			return nil, fmt.Errorf("unknown filter field: %s", key)
		}
	}

	return parsed, nil
}

func (s *TodoService) applyFieldMask(existing, update *todopb.Task, mask *fieldmaskpb.FieldMask) *todopb.Task {
	result := *existing

	for _, path := range mask.Paths {
		switch path {
		case "title":
			result.Title = update.Title
		case "description":
			result.Description = update.Description
		case "status":
			result.Status = update.Status
		case "priority":
			result.Priority = update.Priority
		case "due_date":
			result.DueDate = update.DueDate
		case "tags":
			result.Tags = update.Tags
		}
	}
	return &result
}

Server Implementation

Now let’s put it all together in our server:

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	todopb "github.com/bhatti/todo-api-errors/api/proto/todo/v1"
	"github.com/bhatti/todo-api-errors/internal/middleware"
	"github.com/bhatti/todo-api-errors/internal/monitoring"
	"github.com/bhatti/todo-api-errors/internal/repository"
	"github.com/bhatti/todo-api-errors/internal/service"

	"github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/reflection"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/encoding/protojson"
)

func main() {
	// Initialize monitoring
	if err := monitoring.InitOpenTelemetryMetrics(); err != nil {
		log.Printf("Failed to initialize OpenTelemetry metrics: %v", err)
		// Continue without OpenTelemetry - Prometheus will still work
	}

	// Initialize repository
	repo := repository.NewInMemoryRepository()

	// Initialize service
	todoService, err := service.NewTodoService(repo)
	if err != nil {
		log.Fatalf("Failed to create service: %v", err)
	}

	// Start gRPC server
	grpcPort := ":50051"
	go func() {
		if err := startGRPCServer(grpcPort, todoService); err != nil {
			log.Fatalf("Failed to start gRPC server: %v", err)
		}
	}()

	// Start HTTP gateway
	httpPort := ":8080"
	go func() {
		if err := startHTTPGateway(httpPort, grpcPort); err != nil {
			log.Fatalf("Failed to start HTTP gateway: %v", err)
		}
	}()

	// Start metrics server
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		if err := http.ListenAndServe(":9090", nil); err != nil {
			log.Printf("Failed to start metrics server: %v", err)
		}
	}()

	log.Printf("TODO API server started")
	log.Printf("gRPC server listening on %s", grpcPort)
	log.Printf("HTTP gateway listening on %s", httpPort)
	log.Printf("Metrics available at :9090/metrics")

	// Wait for interrupt signal
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
	<-sigCh

	log.Println("Shutting down...")
}

func startGRPCServer(port string, todoService todopb.TodoServiceServer) error {
	lis, err := net.Listen("tcp", port)
	if err != nil {
		return fmt.Errorf("failed to listen: %w", err)
	}

	// Create gRPC server with interceptors - now using the new UnaryErrorInterceptor
	opts := []grpc.ServerOption{
		grpc.ChainUnaryInterceptor(
			middleware.UnaryErrorInterceptor, // Using new protobuf-based error interceptor
			loggingInterceptor(),
			recoveryInterceptor(),
		),
	}

	server := grpc.NewServer(opts...)

	// Register service
	todopb.RegisterTodoServiceServer(server, todoService)

	// Register reflection for debugging
	reflection.Register(server)

	return server.Serve(lis)
}

func startHTTPGateway(httpPort, grpcPort string) error {
	ctx := context.Background()

	// Create gRPC connection
	conn, err := grpc.DialContext(
		ctx,
		"localhost"+grpcPort,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		return fmt.Errorf("failed to dial gRPC server: %w", err)
	}

	// Create gateway mux with custom error handler
	mux := runtime.NewServeMux(
		runtime.WithErrorHandler(middleware.CustomHTTPError), // Using new protobuf-based error handler
		runtime.WithMarshalerOption(runtime.MIMEWildcard, &runtime.JSONPb{
			MarshalOptions: protojson.MarshalOptions{
				UseProtoNames:   true,
				EmitUnpopulated: false,
			},
			UnmarshalOptions: protojson.UnmarshalOptions{
				DiscardUnknown: true,
			},
		}),
	)

	// Register service handler
	if err := todopb.RegisterTodoServiceHandler(ctx, mux, conn); err != nil {
		return fmt.Errorf("failed to register service handler: %w", err)
	}

	// Create HTTP server with middleware
	handler := middleware.HTTPErrorHandler( // Using new protobuf-based HTTP error handler
		corsMiddleware(
			authMiddleware(
				loggingHTTPMiddleware(mux),
			),
		),
	)

	server := &http.Server{
		Addr:         httpPort,
		Handler:      handler,
		ReadTimeout:  10 * time.Second,
		WriteTimeout: 10 * time.Second,
		IdleTimeout:  120 * time.Second,
	}

	return server.ListenAndServe()
}

// Middleware implementations

func loggingInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		start := time.Now()

		// Call handler
		resp, err := handler(ctx, req)

		// Log request
		duration := time.Since(start)
		statusCode := "OK"
		if err != nil {
			statusCode = status.Code(err).String()
		}

		log.Printf("gRPC: %s %s %s %v", info.FullMethod, statusCode, duration, err)

		return resp, err
	}
}

func recoveryInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (resp interface{}, err error) {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("Recovered from panic: %v", r)
				monitoring.RecordPanicRecovery(ctx)
				err = status.Error(codes.Internal, "Internal server error")
			}
		}()

		return handler(ctx, req)
	}
}

func loggingHTTPMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		// Wrap response writer to capture status
		wrapped := &statusResponseWriter{ResponseWriter: w, statusCode: http.StatusOK}

		// Process request
		next.ServeHTTP(wrapped, r)

		// Log request
		duration := time.Since(start)
		log.Printf("HTTP: %s %s %d %v", r.Method, r.URL.Path, wrapped.statusCode, duration)
	})
}

func corsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", "*")
		w.Header().Set("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE, OPTIONS, PATCH")
		w.Header().Set("Access-Control-Allow-Headers", "Content-Type, Authorization, X-Trace-ID")

		if r.Method == "OPTIONS" {
			w.WriteHeader(http.StatusOK)
			return
		}

		next.ServeHTTP(w, r)
	})
}

func authMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Simple auth for demo - in production use proper authentication
		authHeader := r.Header.Get("Authorization")
		if authHeader == "" {
			authHeader = "Bearer anonymous"
		}

		// Extract user from token
		user := "anonymous"
		if len(authHeader) > 7 && authHeader[:7] == "Bearer " {
			user = authHeader[7:]
		}

		// Add user to context
		ctx := context.WithValue(r.Context(), "user", user)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

type statusResponseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (w *statusResponseWriter) WriteHeader(code int) {
	w.statusCode = code
	w.ResponseWriter.WriteHeader(code)
}

Example API Usage

Let’s see our error handling in action with some example requests:

Example 1: Validation Error with Multiple Issues

Request with multiple validation errors

curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"task": {
"title": "",
"description": "This description is wayyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy too long…",
"status": "INVALID_STATUS",
"tags": ["INVALID TAG", "tag-1", "tag-1"]
}
}'

Response

< HTTP/1.1 422 Unprocessable Entity
< Content-Type: application/problem+json
{
  "detail": "The request contains 5 validation errors",
  "errors": [
    {
      "code": "TOO_SHORT",
      "field": "title",
      "message": "value length must be at least 1 characters"
    },
    {
      "code": "TOO_LONG",
      "field": "description",
      "message": "value length must be at most 100 characters"
    },
    {
      "code": "INVALID_FORMAT",
      "field": "tags",
      "message": "value does not match regex pattern `^[a-z0-9-]+$`"
    },
    {
      "code": "INVALID_TAG_FORMAT",
      "field": "tags[0]",
      "message": "Tag 'INVALID TAG' must be lowercase letters, numbers, and hyphens only"
    },
    {
      "code": "DUPLICATE_TAG",
      "field": "tags[2]",
      "message": "Tag 'tag-1' appears multiple times"
    }
  ],
  "instance": "/v1/tasks",
  "status": 400,
  "timestamp": {
    "seconds": 1755288524,
    "nanos": 484865000
  },
  "title": "Validation Failed",
  "traceId": "eb4bfb3f-9397-4547-8618-ce9952a16067",
  "type": "https://api.example.com/errors/validation-failed"
}

Example 2: Not Found Error

Request for non-existent task

curl http://localhost:8080/v1/tasks/non-existent-id

Response

< HTTP/1.1 404 Not Found
< Content-Type: application/problem+json
{
  "detail": "Task with ID 'non-existent-id' was not found.",
  "instance": "/v1/tasks/non-existent-id",
  "status": 404,
  "timestamp": {
    "seconds": 1755288565,
    "nanos": 904607000
  },
  "title": "Resource Not Found",
  "traceId": "6ce00cd8-d0b7-47f1-b6f6-9fc1375c26a4",
  "type": "https://api.example.com/errors/resource-not-found"
}

Example 3: Conflict Error

curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"task": {
"title": "Existing Task Title"
}
}'

curl -X POST http://localhost:8080/v1/tasks \
-H "Content-Type: application/json" \
-d '{
"task": {
"title": "Existing Task Title"
}
}'

Response

< HTTP/1.1 409 Conflict
< Content-Type: application/problem+json
{
  "detail": "Conflict creating task: A task with this title already exists",
  "instance": "/v1/tasks",
  "status": 409,
  "timestamp": {
    "seconds": 1755288593,
    "nanos": 594458000
  },
  "title": "Resource Conflict",
  "traceId": "ed2e78d2-591d-492a-8d71-6b6843ce86f7",
  "type": "https://api.example.com/errors/resource-conflict"
}

Example 4: Service Unavailable (Transient Error)

When database is down

curl http://localhost:8080/v1/tasks

Response

HTTP/1.1 503 Service Unavailable
Content-Type: application/problem+json
Retry-After: 30
{
  "type": "https://api.example.com/errors/service-unavailable",
  "title": "Service Unavailable",
  "status": 503,
  "detail": "Database connection pool exhausted. Please try again later.",
  "instance": "/v1/tasks",
  "traceId": "db-pool-001",
  "timestamp": "2025-08-15T10:30:00Z",
  "extensions": {
    "retryable": true,
    "retryAfter": "2025-08-15T10:30:30Z",
    "maxRetries": 3,
    "backoffType": "exponential",
    "backoffMs": 1000,
    "errorCategory": "database"
  }
}

Best Practices Summary

Our implementation demonstrates several key best practices:

1. Consistent Error Format

All errors follow RFC 9457 (Problem Details) format, providing:

  • Machine-readable type URIs
  • Human-readable titles and details
  • HTTP status codes
  • Request tracing
  • Extensible metadata

2. Comprehensive Validation

  • All validation errors are returned at once, not one by one
  • Clear field paths for nested objects
  • Descriptive error codes and messages
  • Support for batch operations with partial success

3. Security-Conscious Design

  • No sensitive information in error messages
  • Internal errors are logged but not exposed
  • Generic messages for authentication failures
  • Request IDs for support without exposing internals

4. Developer Experience

  • Clear, actionable error messages
  • Helpful suggestions for fixing issues
  • Consistent error codes across protocols
  • Rich metadata for debugging

5. Protocol Compatibility

  • Seamless translation between gRPC and HTTP
  • Proper status code mapping
  • Preservation of error details across protocols

6. Observability

  • Structured logging with trace IDs
  • Prometheus metrics for monitoring
  • OpenTelemetry integration
  • Error categorization for analysis

Conclusion

This comprehensive guide demonstrates how to build robust error handling for modern APIs. By treating errors as a first-class feature of our API, we’ve achieved several key benefits:

  • Consistency: All errors, regardless of their source, are presented to clients in a predictable format.
  • Clarity: Developers consuming our API get clear, actionable feedback, helping them debug and integrate faster.
  • Developer Ergonomics: Our internal service code is cleaner, as handlers focus on business logic while the middleware handles the boilerplate of error conversion.
  • Security: We have a clear separation between internal error details (for logging) and public error responses, preventing leaks.

Additional Resources

You can find the full source code for this example in this GitHub repository.

July 17, 2025

Zero-Downtime Services with Lifecycle Management on Kubernetes and Istio

Filed under: Computing,Web Services — admin @ 3:12 pm

Introduction

In the world of cloud-native applications, service lifecycle management is often an afterthought—until it causes a production outage. Whether you’re running gRPC or REST APIs on Kubernetes with Istio, proper lifecycle management is the difference between smooth deployments and 3 AM incident calls. Consider these scenarios:

  • Your service takes 45 seconds to warm up its cache, but Kubernetes kills it after 30 seconds of startup wait.
  • During deployments, clients receive connection errors as pods terminate abruptly.
  • A hiccup in a database or dependent service causes your entire service mesh to cascade fail.
  • Your service mesh sidecar shuts down before your application is terminated or drops in-flight requests.
  • A critical service receives SIGKILL during transaction processing, leaving data in inconsistent states.
  • After a regional outage, services restart but data drift goes undetected for hours.
  • Your RTO target is 15 seconds, but services take 30 seconds just to start up properly.

These aren’t edge cases—they’re common problems that proper lifecycle management solves. More critically, unsafe shutdowns can cause data corruption, financial losses, and breach compliance requirements. This guide covers what you need to know about building services that start safely, shut down gracefully, and handle failures intelligently.

The Hidden Complexity of Service Lifecycles

Modern microservices don’t exist in isolation. A typical request might flow through:

Typical Request Flow.

Each layer adds complexity to startup and shutdown sequences. Without proper coordination, you’ll experience:

  • Startup race conditions: Application tries to make network calls before the sidecar proxy is ready
  • Shutdown race conditions: Sidecar terminates while the application is still processing requests
  • Premature traffic: Load balancer routes traffic before the application is truly ready
  • Dropped connections: Abrupt shutdowns leave clients hanging
  • Data corruption: In-flight transactions get interrupted, leaving databases in inconsistent states
  • Compliance violations: Financial services may face regulatory penalties for data integrity failures

Core Concepts: The Three Types of Health Checks

Kubernetes provides three distinct probe types, each serving a specific purpose:

1. Liveness Probe: “Is the process alive?”

  • Detects deadlocks and unrecoverable states
  • Should be fast and simple (e.g., HTTP GET /healthz)
  • Failure triggers container restart
  • Common mistake: Making this check too complex

2. Readiness Probe: “Can the service handle traffic?”

  • Validates all critical dependencies are available
  • Prevents routing traffic to pods that aren’t ready
  • Should perform “deep” checks of dependencies
  • Common mistake: Using the same check as liveness

3. Startup Probe: “Is the application still initializing?”

  • Provides grace period for slow-starting containers
  • Disables liveness/readiness probes until successful
  • Prevents restart loops during initialization
  • Common mistake: Not using it for slow-starting apps

The Hidden Dangers of Unsafe Shutdowns

While graceful shutdown is ideal, it’s not always possible. Kubernetes will send SIGKILL after the termination grace period, and infrastructure failures can terminate pods instantly. This creates serious risks:

Data Corruption Scenarios

Financial Transaction Example:

// DANGEROUS: Non-atomic operation
func (s *PaymentService) ProcessPayment(req *PaymentRequest) error {
    // Step 1: Debit source account
    if err := s.debitAccount(req.FromAccount, req.Amount); err != nil {
        return err
    }
    
    // ???? SIGKILL here leaves money debited but not credited
    // Step 2: Credit destination account  
    if err := s.creditAccount(req.ToAccount, req.Amount); err != nil {
        // Money is lost! Source debited but destination not credited
        return err
    }
    
    // Step 3: Record transaction
    return s.recordTransaction(req)
}

E-commerce Inventory Example:

// DANGEROUS: Race condition during shutdown
func (s *InventoryService) ReserveItem(req *ReserveRequest) error {
    // Check availability
    if s.getStock(req.ItemID) < req.Quantity {
        return ErrInsufficientStock
    }
    
    // ???? SIGKILL here can cause double-reservation
    // Another request might see the same stock level
    
    // Reserve the item
    return s.updateStock(req.ItemID, -req.Quantity)
}

RTO/RPO Impact

Recovery Time Objective (RTO): How quickly can we restore service?

  • Poor lifecycle management increases startup time
  • Services may need manual intervention to reach consistent state
  • Cascading failures extend recovery time across the entire system

Recovery Point Objective (RPO): How much data can we afford to lose?

  • Unsafe shutdowns can corrupt recent transactions
  • Without idempotency, replay of messages may create duplicates
  • Data inconsistencies may not be detected until much later

The Anti-Entropy Solution

Since graceful shutdown isn’t always possible, production systems need reconciliation processes to detect and repair inconsistencies:

// Anti-entropy pattern for data consistency
type ReconciliationService struct {
    paymentDB    PaymentDatabase
    accountDB    AccountDatabase
    auditLog     AuditLogger
    alerting     AlertingService
}

func (r *ReconciliationService) ReconcilePayments(ctx context.Context) error {
    // Find payments without matching account entries
    orphanedPayments, err := r.paymentDB.FindOrphanedPayments(ctx)
    if err != nil {
        return err
    }
    
    for _, payment := range orphanedPayments {
        // Check if this was a partial transaction
        sourceDebit, _ := r.accountDB.GetTransaction(payment.FromAccount, payment.ID)
        destCredit, _ := r.accountDB.GetTransaction(payment.ToAccount, payment.ID)
        
        switch {
        case sourceDebit != nil && destCredit == nil:
            // Complete the transaction
            if err := r.creditAccount(payment.ToAccount, payment.Amount); err != nil {
                r.alerting.SendAlert("Failed to complete orphaned payment", payment.ID)
                continue
            }
            r.auditLog.RecordReconciliation("completed_payment", payment.ID)
            
        case sourceDebit == nil && destCredit != nil:
            // Reverse the credit
            if err := r.debitAccount(payment.ToAccount, payment.Amount); err != nil {
                r.alerting.SendAlert("Failed to reverse orphaned credit", payment.ID)
                continue
            }
            r.auditLog.RecordReconciliation("reversed_credit", payment.ID)
            
        default:
            // Both or neither exist - needs investigation
            r.alerting.SendAlert("Ambiguous payment state", payment.ID)
        }
    }
    
    return nil
}

// Run reconciliation periodically
func (r *ReconciliationService) Start(ctx context.Context) {
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()
    
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if err := r.ReconcilePayments(ctx); err != nil {
                log.Printf("Reconciliation failed: %v", err)
            }
        }
    }
}

Building a Resilient Service: Complete Example

Let’s build a production-ready service that demonstrates all best practices. We’ll create two versions: one with anti-patterns (bad-service) and one with best practices (good-service).

Sequence diagram of a typical API with proper Kubernetes and Istio configuration.

The Application Code

//go:generate protoc --go_out=. --go_opt=paths=source_relative --go-grpc_out=. --go-grpc_opt=paths=source_relative api/demo.proto

package main

import (
    "context"
    "flag"
    "fmt"
    "log"
    "net"
    "net/http"
    "os"
    "os/signal"
    "sync/atomic"
    "syscall"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    health "google.golang.org/grpc/health/grpc_health_v1"
    "google.golang.org/grpc/status"
)

// Service represents our application with health state
type Service struct {
    isHealthy         atomic.Bool
    isShuttingDown    atomic.Bool
    activeRequests    atomic.Int64
    dependencyHealthy atomic.Bool
}

// HealthChecker implements the gRPC health checking protocol
type HealthChecker struct {
    svc *Service
}

func (h *HealthChecker) Check(ctx context.Context, req *health.HealthCheckRequest) (*health.HealthCheckResponse, error) {
    service := req.GetService()
    
    // Liveness: Simple check - is the process responsive?
    if service == "" || service == "liveness" {
        if h.svc.isShuttingDown.Load() {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        return &health.HealthCheckResponse{
            Status: health.HealthCheckResponse_SERVING,
        }, nil
    }
    
    // Readiness: Deep check - can we handle traffic?
    if service == "readiness" {
        // Check application health
        if !h.svc.isHealthy.Load() {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        
        // Check critical dependencies
        if !h.svc.dependencyHealthy.Load() {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        
        // Check if shutting down
        if h.svc.isShuttingDown.Load() {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        
        return &health.HealthCheckResponse{
            Status: health.HealthCheckResponse_SERVING,
        }, nil
    }
    
    // Synthetic readiness: Complex business logic check for monitoring
    if service == "synthetic-readiness" {
        // Simulate a complex health check that validates business logic
        // This would make actual API calls, database queries, etc.
        if !h.performSyntheticCheck(ctx) {
            return &health.HealthCheckResponse{
                Status: health.HealthCheckResponse_NOT_SERVING,
            }, nil
        }
        return &health.HealthCheckResponse{
            Status: health.HealthCheckResponse_SERVING,
        }, nil
    }
    
    return nil, status.Errorf(codes.NotFound, "unknown service: %s", service)
}

func (h *HealthChecker) performSyntheticCheck(ctx context.Context) bool {
    // In a real service, this would:
    // 1. Create a test transaction
    // 2. Query the database
    // 3. Call dependent services
    // 4. Validate the complete flow works
    return h.svc.isHealthy.Load() && h.svc.dependencyHealthy.Load()
}

func (h *HealthChecker) Watch(req *health.HealthCheckRequest, server health.Health_WatchServer) error {
    return status.Error(codes.Unimplemented, "watch not implemented")
}

// DemoServiceServer implements your business logic
type DemoServiceServer struct {
    UnimplementedDemoServiceServer
    svc *Service
}

func (s *DemoServiceServer) ProcessRequest(ctx context.Context, req *ProcessRequest) (*ProcessResponse, error) {
    s.svc.activeRequests.Add(1)
    defer s.svc.activeRequests.Add(-1)
    
    // Simulate processing
    select {
    case <-ctx.Done():
        return nil, ctx.Err()
    case <-time.After(100 * time.Millisecond):
        return &ProcessResponse{
            Result: fmt.Sprintf("Processed: %s", req.GetData()),
        }, nil
    }
}

func main() {
    var (
        port         = flag.Int("port", 8080, "gRPC port")
        mgmtPort     = flag.Int("mgmt-port", 8090, "Management port")
        startupDelay = flag.Duration("startup-delay", 10*time.Second, "Startup delay")
    )
    flag.Parse()
    
    svc := &Service{}
    svc.dependencyHealthy.Store(true) // Assume healthy initially
    
    // Management endpoints for testing
    mux := http.NewServeMux()
    mux.HandleFunc("/toggle-health", func(w http.ResponseWriter, r *http.Request) {
        current := svc.dependencyHealthy.Load()
        svc.dependencyHealthy.Store(!current)
        fmt.Fprintf(w, "Dependency health toggled to: %v\n", !current)
    })
    mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "active_requests %d\n", svc.activeRequests.Load())
        fmt.Fprintf(w, "is_healthy %v\n", svc.isHealthy.Load())
        fmt.Fprintf(w, "is_shutting_down %v\n", svc.isShuttingDown.Load())
    })
    
    mgmtServer := &http.Server{
        Addr:    fmt.Sprintf(":%d", *mgmtPort),
        Handler: mux,
    }
    
    // Start management server
    go func() {
        log.Printf("Management server listening on :%d", *mgmtPort)
        if err := mgmtServer.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatalf("Management server failed: %v", err)
        }
    }()
    
    // Simulate slow startup
    log.Printf("Starting application (startup delay: %v)...", *startupDelay)
    time.Sleep(*startupDelay)
    svc.isHealthy.Store(true)
    log.Println("Application initialized and ready")
    
    // Setup gRPC server
    lis, err := net.Listen("tcp", fmt.Sprintf(":%d", *port))
    if err != nil {
        log.Fatalf("Failed to listen: %v", err)
    }
    
    grpcServer := grpc.NewServer()
    RegisterDemoServiceServer(grpcServer, &DemoServiceServer{svc: svc})
    health.RegisterHealthServer(grpcServer, &HealthChecker{svc: svc})
    
    // Start gRPC server
    go func() {
        log.Printf("gRPC server listening on :%d", *port)
        if err := grpcServer.Serve(lis); err != nil {
            log.Fatalf("gRPC server failed: %v", err)
        }
    }()
    
    // Wait for shutdown signal
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
    sig := <-sigCh
    
    log.Printf("Received signal: %v, starting graceful shutdown...", sig)
    
    // Graceful shutdown sequence
    svc.isShuttingDown.Store(true)
    svc.isHealthy.Store(false) // Fail readiness immediately
    
    // Stop accepting new requests
    grpcServer.GracefulStop()
    
    // Wait for active requests to complete
    timeout := time.After(30 * time.Second)
    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()
    
    for {
        select {
        case <-timeout:
            log.Println("Shutdown timeout reached, forcing exit")
            os.Exit(1)
        case <-ticker.C:
            active := svc.activeRequests.Load()
            if active == 0 {
                log.Println("All requests completed")
                goto shutdown
            }
            log.Printf("Waiting for %d active requests to complete...", active)
        }
    }
    
shutdown:
    // Cleanup
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    mgmtServer.Shutdown(ctx)
    
    log.Println("Graceful shutdown complete")
}

Kubernetes Manifests: Anti-Patterns vs Best Practices

Bad Service (Anti-Patterns)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bad-service
  namespace: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: bad-service
  template:
    metadata:
      labels:
        app: bad-service
      # MISSING: Critical Istio annotations!
    spec:
      # DEFAULT: Only 30s grace period
      containers:
      - name: app
        image: myregistry/demo-service:latest
        ports:
        - containerPort: 8080
          name: grpc
        - containerPort: 8090
          name: mgmt
        args: ["--startup-delay=45s"]  # Longer than default probe timeout!
        
        # ANTI-PATTERN: Identical liveness and readiness probes
        livenessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080"]
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 3  # Will fail after 40s total
          
        readinessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080"]  # Same as liveness!
          initialDelaySeconds: 10
          periodSeconds: 10
        
        # MISSING: No startup probe for slow initialization
        # MISSING: No preStop hook for graceful shutdown

Good Service (Best Practices)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: good-service
  namespace: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: good-service
  template:
    metadata:
      labels:
        app: good-service
      annotations:
        # Critical for Istio/Envoy sidecar lifecycle management
        sidecar.istio.io/holdApplicationUntilProxyStarts: "true"
        proxy.istio.io/config: |
          proxyMetadata:
            EXIT_ON_ZERO_ACTIVE_CONNECTIONS: "true"
        sidecar.istio.io/proxyCPU: "100m"
        sidecar.istio.io/proxyMemory: "128Mi"
    spec:
      # Extended grace period: preStop (15s) + app shutdown (30s) + buffer (20s)
      terminationGracePeriodSeconds: 65
      
      containers:
      - name: app
        image: myregistry/demo-service:latest
        ports:
        - containerPort: 8080
          name: grpc
        - containerPort: 8090
          name: mgmt
        args: ["--startup-delay=45s"]
        
        # Resource management for predictable performance
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        
        # Startup probe for slow initialization
        startupProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080", "-service=readiness"]
          initialDelaySeconds: 0
          periodSeconds: 5
          failureThreshold: 24  # 5s * 24 = 120s total startup time
          successThreshold: 1
        
        # Simple liveness check
        livenessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080", "-service=liveness"]
          initialDelaySeconds: 0  # Startup probe handles initialization
          periodSeconds: 10
          failureThreshold: 3
          timeoutSeconds: 5
        
        # Deep readiness check
        readinessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:8080", "-service=readiness"]
          initialDelaySeconds: 0
          periodSeconds: 5
          failureThreshold: 2
          successThreshold: 1
          timeoutSeconds: 5
        
        # Graceful shutdown coordination
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]  # Allow LB to drain
        
        # Environment variables for cloud provider integration
        env:
        - name: CLOUD_PROVIDER
          value: "auto-detect"  # Works with GCP, AWS, Azure
        - name: ENABLE_PROFILING
          value: "true"

Istio Service Mesh: Beyond Basic Lifecycle Management

While proper health checks and graceful shutdown are foundational, Istio adds critical production-grade capabilities that dramatically improve fault tolerance:

Automatic Retries and Circuit Breaking

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service
  namespace: demo
spec:
  host: payment-service.demo.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 2
    circuitBreaker:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    retryPolicy:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,gateway-error,connect-failure,refused-stream
      retryRemoteLocalities: true

Key Benefits for Production Systems

  1. Automatic Request Retries: If a pod fails or becomes unavailable, Istio automatically retries requests to healthy instances
  2. Circuit Breaking: Prevents cascading failures by temporarily cutting off traffic to unhealthy services
  3. Load Balancing: Distributes traffic intelligently across healthy pods
  4. Mutual TLS: Secures service-to-service communication without code changes
  5. Observability: Provides detailed metrics, tracing, and logging for all inter-service communication
  6. Canary Deployments: Enables safe rollouts with automatic traffic shifting
  7. Rate Limiting: Protects services from being overwhelmed
  8. Timeout Management: Prevents hanging requests with configurable timeouts

Termination Grace Period Calculation

The critical formula for calculating termination grace periods:

terminationGracePeriodSeconds = preStop delay + application shutdown timeout + buffer

Examples:
- Simple service: 10s + 20s + 5s = 35s
- Complex service: 15s + 45s + 5s = 65s
- Batch processor: 30s + 120s + 10s = 160s

Important: Services requiring more than 90-120 seconds to shut down should be re-architected using checkpoint-and-resume patterns.

Advanced Patterns for Production

1. Idempotency: Handling Duplicate Requests

Critical for production: When pods restart or network issues occur, clients may retry requests. Without idempotency, this can cause duplicate transactions, corrupted state, or financial losses. This is mandatory for all state-modifying operations.

package idempotency

import (
    "context"
    "crypto/sha256"
    "encoding/hex"
    "time"
    "sync"
    "errors"
)

var (
    ErrDuplicateRequest = errors.New("duplicate request detected")
    ErrProcessingInProgress = errors.New("request is currently being processed")
)

// IdempotencyStore tracks request execution with persistence
type IdempotencyStore struct {
    mu        sync.RWMutex
    records   map[string]*Record
    persister PersistenceLayer // Database or Redis for durability
}

type Record struct {
    Key         string
    Response    interface{}
    Error       error
    Status      ProcessingStatus
    ExpiresAt   time.Time
    CreatedAt   time.Time
    ProcessedAt *time.Time
}

type ProcessingStatus int

const (
    StatusPending ProcessingStatus = iota
    StatusProcessing
    StatusCompleted
    StatusFailed
)

// ProcessIdempotent ensures exactly-once processing semantics
func (s *IdempotencyStore) ProcessIdempotent(
    ctx context.Context,
    key string,
    ttl time.Duration,
    fn func() (interface{}, error),
) (interface{}, error) {
    // Check if we've seen this request before
    s.mu.RLock()
    record, exists := s.records[key]
    s.mu.RUnlock()
    
    if exists {
        switch record.Status {
        case StatusCompleted:
            if time.Now().Before(record.ExpiresAt) {
                return record.Response, record.Error
            }
        case StatusProcessing:
            return nil, ErrProcessingInProgress
        case StatusFailed:
            if time.Now().Before(record.ExpiresAt) {
                return record.Response, record.Error
            }
        }
    }
    
    // Mark as processing
    record = &Record{
        Key:       key,
        Status:    StatusProcessing,
        ExpiresAt: time.Now().Add(ttl),
        CreatedAt: time.Now(),
    }
    
    s.mu.Lock()
    s.records[key] = record
    s.mu.Unlock()
    
    // Persist the processing state
    if err := s.persister.Save(ctx, record); err != nil {
        return nil, err
    }
    
    // Execute the function
    response, err := fn()
    processedAt := time.Now()
    
    // Update record with result
    s.mu.Lock()
    record.Response = response
    record.Error = err
    record.ProcessedAt = &processedAt
    if err != nil {
        record.Status = StatusFailed
    } else {
        record.Status = StatusCompleted
    }
    s.mu.Unlock()
    
    // Persist the final state
    s.persister.Save(ctx, record)
    
    return response, err
}

// Example: Idempotent payment processing
func (s *PaymentService) ProcessPayment(ctx context.Context, req *PaymentRequest) (*PaymentResponse, error) {
    // Generate idempotency key from request
    key := generateIdempotencyKey(req)
    
    result, err := s.idempotencyStore.ProcessIdempotent(
        ctx,
        key,
        24*time.Hour, // Keep records for 24 hours
        func() (interface{}, error) {
            // Atomic transaction processing
            return s.processPaymentTransaction(ctx, req)
        },
    )
    
    if err != nil {
        return nil, err
    }
    return result.(*PaymentResponse), nil
}

// Atomic transaction processing
func (s *PaymentService) processPaymentTransaction(ctx context.Context, req *PaymentRequest) (*PaymentResponse, error) {
    // Use database transaction for atomicity
    tx, err := s.db.BeginTx(ctx, nil)
    if err != nil {
        return nil, err
    }
    defer tx.Rollback()
    
    // Step 1: Validate accounts
    if err := s.validateAccounts(ctx, tx, req); err != nil {
        return nil, err
    }
    
    // Step 2: Process payment atomically
    paymentID, err := s.executePayment(ctx, tx, req)
    if err != nil {
        return nil, err
    }
    
    // Step 3: Commit transaction
    if err := tx.Commit(); err != nil {
        return nil, err
    }
    
    return &PaymentResponse{
        PaymentID: paymentID,
        Status:    "completed",
        Timestamp: time.Now(),
    }, nil
}

2. Checkpoint and Resume: Long-Running Operations

For operations that may exceed the termination grace period, implement checkpointing:

package checkpoint

import (
    "context"
    "encoding/json"
    "time"
)

type CheckpointStore interface {
    Save(ctx context.Context, id string, state interface{}) error
    Load(ctx context.Context, id string, state interface{}) error
    Delete(ctx context.Context, id string) error
}

type BatchProcessor struct {
    store          CheckpointStore
    checkpointFreq int
}

type BatchState struct {
    JobID      string    `json:"job_id"`
    TotalItems int       `json:"total_items"`
    Processed  int       `json:"processed"`
    LastItem   string    `json:"last_item"`
    StartedAt  time.Time `json:"started_at"`
}

func (p *BatchProcessor) ProcessBatch(ctx context.Context, jobID string, items []string) error {
    // Try to resume from checkpoint
    state := &BatchState{JobID: jobID}
    if err := p.store.Load(ctx, jobID, state); err == nil {
        log.Printf("Resuming job %s from item %d", jobID, state.Processed)
        items = items[state.Processed:]
    } else {
        // New job
        state = &BatchState{
            JobID:      jobID,
            TotalItems: len(items),
            Processed:  0,
            StartedAt:  time.Now(),
        }
    }
    
    // Process items with periodic checkpointing
    for i, item := range items {
        select {
        case <-ctx.Done():
            // Save progress before shutting down
            state.LastItem = item
            return p.store.Save(ctx, jobID, state)
        default:
            // Process item
            if err := p.processItem(ctx, item); err != nil {
                return err
            }
            
            state.Processed++
            state.LastItem = item
            
            // Checkpoint periodically
            if state.Processed%p.checkpointFreq == 0 {
                if err := p.store.Save(ctx, jobID, state); err != nil {
                    log.Printf("Failed to checkpoint: %v", err)
                }
            }
        }
    }
    
    // Job completed, remove checkpoint
    return p.store.Delete(ctx, jobID)
}

3. Circuit Breaker Pattern for Dependencies

Protect your service from cascading failures:

package circuitbreaker

import (
    "context"
    "sync"
    "time"
)

type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    mu              sync.RWMutex
    state           State
    failures        int
    successes       int
    lastFailureTime time.Time
    
    maxFailures      int
    resetTimeout     time.Duration
    halfOpenRequests int
}

func (cb *CircuitBreaker) Call(ctx context.Context, fn func() error) error {
    cb.mu.RLock()
    state := cb.state
    cb.mu.RUnlock()
    
    if state == StateOpen {
        // Check if we should transition to half-open
        cb.mu.Lock()
        if time.Since(cb.lastFailureTime) > cb.resetTimeout {
            cb.state = StateHalfOpen
            cb.successes = 0
            state = StateHalfOpen
        }
        cb.mu.Unlock()
    }
    
    if state == StateOpen {
        return ErrCircuitOpen
    }
    
    err := fn()
    
    cb.mu.Lock()
    defer cb.mu.Unlock()
    
    if err != nil {
        cb.failures++
        cb.lastFailureTime = time.Now()
        
        if cb.failures >= cb.maxFailures {
            cb.state = StateOpen
            log.Printf("Circuit breaker opened after %d failures", cb.failures)
        }
        return err
    }
    
    if state == StateHalfOpen {
        cb.successes++
        if cb.successes >= cb.halfOpenRequests {
            cb.state = StateClosed
            cb.failures = 0
            log.Println("Circuit breaker closed")
        }
    }
    
    return nil
}

Testing Your Implementation

Manual Testing Guide

Test 1: Startup Race Condition

Setup:

# Deploy both services
kubectl apply -f k8s/bad-service.yaml
kubectl apply -f k8s/good-service.yaml

# Watch pods in separate terminal
watch kubectl get pods -n demo

Test the bad service:

# Force restart
kubectl delete pod -l app=bad-service -n demo

# Observe: Pod will enter CrashLoopBackOff due to liveness probe
# killing it before 45s startup completes

Test the good service:

# Force restart
kubectl delete pod -l app=good-service -n demo

# Observe: Pod stays in 0/1 Ready state for ~45s, then becomes ready
# No restarts occur thanks to startup probe

Test 2: Data Consistency Under Failure

Setup:

# Deploy payment service with reconciliation enabled
kubectl apply -f k8s/payment-service.yaml

# Start payment traffic generator
kubectl run payment-generator --image=payment-client:latest \
  --restart=Never --rm -it -- \
  --target=payment-service.demo.svc.cluster.local:8080 \
  --rate=10 --duration=60s

Simulate SIGKILL during transactions:

# In another terminal, kill pods abruptly
while true; do
  kubectl delete pod -l app=payment-service -n demo --force --grace-period=0
  sleep 30
done

Verify reconciliation:

# Check for data inconsistencies
kubectl logs -l app=payment-service -n demo | grep "inconsistency"

# Monitor reconciliation metrics
kubectl port-forward svc/payment-service 8090:8090
curl http://localhost:8090/metrics | grep consistency

Test 3: RTO/RPO Validation

Disaster Recovery Simulation:

# Simulate regional failure
kubectl patch deployment payment-service -n demo \
  --patch '{"spec":{"replicas":0}}'

# Measure RTO - time to restore service
start_time=$(date +%s)
kubectl patch deployment payment-service -n demo \
  --patch '{"spec":{"replicas":3}}'

# Wait for all pods to be ready
kubectl wait --for=condition=ready pod -l app=payment-service -n demo --timeout=900s
end_time=$(date +%s)
rto=$((end_time - start_time))

echo "RTO: ${rto} seconds"
if [ $rto -le 900 ]; then
  echo "? RTO target met (15 minutes)"
else
  echo "? RTO target exceeded"
fi

Test 4: Istio Resilience Features

Automatic Retry Testing:

# Deploy with fault injection
kubectl apply -f istio/fault-injection.yaml

# Generate requests with chaos header
for i in {1..100}; do
  grpcurl -H "x-chaos-test: true" -plaintext \
    payment-service.demo.svc.cluster.local:8080 \
    PaymentService/ProcessPayment \
    -d '{"amount": 100, "currency": "USD"}'
done

# Check Istio metrics for retry behavior
kubectl exec -n istio-system deployment/istiod -- \
  pilot-agent request GET stats/prometheus | grep retry

Monitoring and Observability

RTO/RPO Considerations

Recovery Time Objective (RTO): Target time to restore service after an outage Recovery Point Objective (RPO): Maximum acceptable data loss

Your service lifecycle design directly impacts these critical business metrics:

package monitoring

import (
    "time"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // RTO-related metrics
    ServiceStartupTime = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name: "service_startup_duration_seconds",
        Help: "Time from pod start to service ready",
        Buckets: []float64{1, 5, 10, 30, 60, 120, 300, 600}, // Up to 10 minutes
    }, []string{"service", "version"})
    
    ServiceRecoveryTime = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name: "service_recovery_duration_seconds", 
        Help: "Time to recover from failure state",
        Buckets: []float64{1, 5, 10, 30, 60, 300, 900}, // Up to 15 minutes
    }, []string{"service", "failure_type"})
    
    // RPO-related metrics
    LastCheckpointAge = promauto.NewGaugeVec(prometheus.GaugeOpts{
        Name: "last_checkpoint_age_seconds",
        Help: "Age of last successful checkpoint",
    }, []string{"service", "checkpoint_type"})
    
    DataConsistencyChecks = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "data_consistency_checks_total",
        Help: "Total number of consistency checks performed",
    }, []string{"service", "check_type", "status"})
    
    InconsistencyDetected = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "data_inconsistencies_detected_total",
        Help: "Total number of data inconsistencies detected",
    }, []string{"service", "inconsistency_type", "severity"})
)

Grafana Dashboard

{
  "dashboard": {
    "title": "Service Lifecycle - Business Impact",
    "panels": [
      {
        "title": "RTO Compliance",
        "description": "Percentage of recoveries meeting RTO target (15 minutes)",
        "targets": [{
          "expr": "100 * (histogram_quantile(0.95, service_recovery_duration_seconds_bucket) <= 900)"
        }],
        "thresholds": [
          {"value": 95, "color": "green"},
          {"value": 90, "color": "yellow"},
          {"value": 0, "color": "red"}
        ]
      },
      {
        "title": "RPO Risk Assessment",
        "description": "Data at risk based on checkpoint age",
        "targets": [{
          "expr": "last_checkpoint_age_seconds / 60"
        }],
        "unit": "minutes"
      },
      {
        "title": "Data Consistency Status",
        "targets": [{
          "expr": "rate(data_inconsistencies_detected_total[5m])"
        }]
      }
    ]
  }
}

Production Readiness Checklist

Before deploying to production, ensure your service meets these criteria:

Application Layer

  • [ ] Implements separate liveness and readiness endpoints
  • [ ] Readiness checks validate all critical dependencies
  • [ ] Graceful shutdown drains in-flight requests
  • [ ] Idempotency for all state-modifying operations
  • [ ] Anti-entropy/reconciliation processes implemented
  • [ ] Circuit breakers for external dependencies
  • [ ] Checkpoint-and-resume for long-running operations
  • [ ] Structured logging with correlation IDs
  • [ ] Metrics for startup, shutdown, and health status

Kubernetes Configuration

  • [ ] Startup probe for slow-initializing services
  • [ ] Distinct liveness and readiness probes
  • [ ] Calculated terminationGracePeriodSeconds based on actual shutdown time
  • [ ] PreStop hooks for load balancer draining
  • [ ] Resource requests and limits defined
  • [ ] PodDisruptionBudget for availability
  • [ ] Anti-affinity rules for high availability

Service Mesh Integration

  • [ ] Istio sidecar lifecycle annotations (holdApplicationUntilProxyStarts)
  • [ ] Istio automatic retry policies configured
  • [ ] Circuit breaker configuration in DestinationRule
  • [ ] Distributed tracing enabled
  • [ ] mTLS for service-to-service communication

Data Integrity & Recovery

  • [ ] RTO/RPO metrics tracked and alerting configured
  • [ ] Reconciliation processes tested with Game Day exercises
  • [ ] Chaos engineering tests validate failure scenarios
  • [ ] Synthetic monitoring for end-to-end business flows
  • [ ] Backup and restore procedures documented and tested

Common Pitfalls and Solutions

1. My service keeps restarting during deployment:

Symptom: Pods enter CrashLoopBackOff during rollout

Common Causes:

  • Liveness probe starts before application is ready
  • Startup time exceeds probe timeout
  • Missing startup probe

Solution:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30  # 30 * 10s = 5 minutes
  periodSeconds: 10

2. Data corruption during pod restarts:

Symptom: Inconsistent database state after deployments

Common Causes:

  • Non-atomic operations
  • Missing idempotency
  • No reconciliation processes

Solution:

// Implement atomic operations with database transactions
tx, err := db.BeginTx(ctx, nil)
if err != nil {
    return err
}
defer tx.Rollback()

// All operations within transaction
if err := processPayment(tx, req); err != nil {
    return err // Automatic rollback
}

return tx.Commit()

3. Service mesh sidecar issues:

Symptom: ECONNREFUSED errors on startup

Common Causes:

  • Application starts before sidecar is ready
  • Sidecar terminates before application

Solution:

annotations:
  sidecar.istio.io/holdApplicationUntilProxyStarts: "true"
  proxy.istio.io/config: |
    proxyMetadata:
      EXIT_ON_ZERO_ACTIVE_CONNECTIONS: "true"

Conclusion

Service lifecycle management is not just about preventing outages—it’s about building systems that are predictable, observable, and resilient to the inevitable failures that occur in distributed systems. This allows:

  • Zero-downtime deployments: Services gracefully handle rollouts without data loss.
  • Improved reliability: Proper health checks prevent cascading failures.
  • Better observability: Clear signals about service state and data consistency.
  • Faster recovery: Services self-heal from transient failures.
  • Data integrity: Idempotency and reconciliation prevent corruption.
  • Compliance readiness: Meet RTO/RPO requirements for disaster recovery.
  • Financial protection: Prevent duplicate transactions and data corruption that could cost millions.

The difference between a service that “works on my machine” and one that thrives in production lies in these details. Whether you’re running on GKE, EKS, or AKS, these patterns form the foundation of production-ready microservices.

Want to test these patterns yourself? The complete code examples and deployment manifests are available on GitHub.

« Newer PostsOlder Posts »

Powered by WordPress