
September 13, 2025

Task Scheduling Algorithms in Distributed Orchestration Systems

Filed under: Computing, Concurrency — admin @ 3:03 pm

Modern distributed systems face a fundamental challenge: how to efficiently schedule and execute thousands of tasks across heterogeneous resources while maximizing throughput, minimizing latency, and ensuring fair resource allocation. This challenge becomes even more complex when dealing with workflows, dependencies, and varying resource requirements. I have previously written about building Formicary, an open-source distributed orchestration engine, and in this post I'll explore task scheduling algorithms for executing background tasks, jobs, and workflows through the lens of Formicary. We'll examine how theoretical scheduling concepts translate into practical implementations in a production-ready system.

Formicary Architecture Overview

Before diving into scheduling algorithms, let’s understand Formicary’s architecture. The system follows a Leader-Follower pattern with two main components:

The Queen (Leader/Server)

  • API & UI Controllers: RESTful APIs and web dashboard
  • Job Scheduler: Leader-elected service that polls for pending jobs
  • Resource Manager: Tracks available ant workers and their capabilities
  • Job Supervisor: Orchestrates job execution as a DAG
  • Task Supervisor: Manages individual task lifecycle

The Ants (Followers/Workers)

  • Executors: Support for Docker, Kubernetes, Shell, HTTP, WebSocket, and custom protocols
  • Registration System: Workers advertise capabilities via tags and methods
  • Artifact Management: Handle dependencies and outputs

Key Features Supporting Advanced Scheduling

Formicary includes several features that enable sophisticated scheduling strategies:

  • Tags & Labels: Route tasks to specific workers based on capabilities
  • Priority Levels: Jobs can have different priority levels for execution order
  • Resource Constraints: Define CPU, memory, and storage requirements
  • Tenant Isolation: Multi-tenant support with quota management
  • Cron Scheduling: Time-based job scheduling
  • Concurrent Limits: Control maximum parallel job execution
  • Dynamic Scaling: Kubernetes-based auto-scaling support

Scheduling Algorithm Decision Flow

Before diving into specific algorithms, let’s visualize how different scheduling strategies route jobs through the system:

Job Execution Lifecycle

Understanding how jobs flow through the system helps illustrate where different scheduling algorithms take effect:

Wait Time Estimation Algorithm

Formicary implements a sophisticated wait time estimation system that helps users understand queue delays and plan accordingly. The algorithm considers multiple factors:

// Simplified wait time calculation (illustrative; findQueuePosition helper omitted)
func CalculateWaitTime(jobRequest *JobRequest, stats *JobStats) time.Duration {
    // 1. Find position in priority-ordered queue
    queuePosition := findQueuePosition(jobRequest.Priority, jobRequest.SubmissionTime)

    // 2. Calculate jobs ahead in queue (70% of executing jobs assumed near completion)
    jobsAhead := queuePosition + int(float64(stats.ExecutingJobs)*0.7)

    // 3. Estimate based on historical average execution time (milliseconds)
    var estimatedWait time.Duration
    if stats.SucceededJobsAverage > 0 && stats.AntsCapacity > 0 {
        estimatedWait = time.Duration(float64(jobsAhead)/float64(stats.AntsCapacity)*
            float64(stats.SucceededJobsAverage)) * time.Millisecond
    }

    // 4. Account for scheduled future execution
    if jobRequest.ScheduledAt.After(time.Now()) {
        if scheduleDiff := time.Until(jobRequest.ScheduledAt); scheduleDiff > estimatedWait {
            estimatedWait = scheduleDiff
        }
    }

    return estimatedWait
}

Formicary uses JobStatsRegistry to track execution patterns:

type JobStats struct {
    SucceededJobsAverage int64  // Average execution time in milliseconds
    ExecutingJobs       int32   // Currently running
    AntsCapacity        int     // Available worker capacity
    AntsAvailable       bool    // Worker availability status
}

It considers worker availability and capacity constraints:

  • Calculates minimum available workers across all required task types
  • Accounts for tag-based routing restrictions
  • Factors in Kubernetes resource quotas
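
As a rough sketch of how that worker-availability factor can be computed (the types and names here are illustrative, not Formicary's actual API), the estimator takes the minimum free capacity across every task type a job needs, since the scarcest worker pool bounds how quickly the queue drains:

package main

import "fmt"

// taskCapacity is a hypothetical per-task-type view of ant worker availability.
type taskCapacity struct {
	TaskType  string
	Capacity  int // registered workers matching the task's tags/method
	Executing int // workers currently busy
}

// minAvailableWorkers returns the smallest number of free workers across all
// task types in a job; the scarcest pool limits overall throughput.
func minAvailableWorkers(caps []taskCapacity) int {
	minFree := -1
	for _, c := range caps {
		free := c.Capacity - c.Executing
		if free < 0 {
			free = 0
		}
		if minFree < 0 || free < minFree {
			minFree = free
		}
	}
	if minFree < 0 {
		return 0
	}
	return minFree
}

func main() {
	caps := []taskCapacity{
		{TaskType: "build", Capacity: 10, Executing: 4},
		{TaskType: "gpu-train", Capacity: 2, Executing: 2}, // bottleneck pool
	}
	fmt.Println(minAvailableWorkers(caps)) // 0: the job has to wait
}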

Formicary orders pending jobs by priority and submission time:

sort.Slice(pendingJobs, func(i, j int) bool {
    a, b := pendingJobs[i], pendingJobs[j]
    if a.Priority == b.Priority {
        return a.CreatedAt.Before(b.CreatedAt)  // FCFS within priority
    }
    return a.Priority > b.Priority  // Higher priority first
})

This estimation helps with:

  • SLA Management: Predict if jobs will meet deadlines
  • Capacity Planning: Identify when to scale worker pools
  • User Experience: Provide realistic wait time expectations
  • Load Balancing: Route jobs to less congested worker pools

Task Scheduling Algorithms in Practice

Now let’s examine how various scheduling algorithms are implemented or can be achieved in Formicary:

1. First-Come First-Serve (FCFS)

FCFS processes tasks in arrival order using a simple FIFO queue. The algorithm maintains fairness by ensuring no task is starved, but suffers from the “convoy effect” where short jobs wait behind long-running ones. Its characteristics include:

  • Average waiting time: (sum of waiting times) / number of jobs
  • Turnaround time: completion_time - arrival_time
  • No preemption – jobs run to completion
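
To make those formulas concrete, here is a small, self-contained sketch (not Formicary code) that replays a FIFO queue on a single worker and reports average waiting and turnaround times, including the convoy effect when a long job arrives first:

package main

import "fmt"

type job struct {
	arrival int // arrival time
	burst   int // execution time
}

func main() {
	jobs := []job{{0, 10}, {1, 2}, {2, 2}} // a long job first causes the convoy effect
	clock, totalWait, totalTurnaround := 0, 0, 0
	for _, j := range jobs {
		if clock < j.arrival {
			clock = j.arrival // worker idles until the next arrival
		}
		wait := clock - j.arrival
		clock += j.burst
		turnaround := clock - j.arrival // completion_time - arrival_time
		totalWait += wait
		totalTurnaround += turnaround
	}
	n := float64(len(jobs))
	fmt.Printf("avg wait=%.1f avg turnaround=%.1f\n",
		float64(totalWait)/n, float64(totalTurnaround)/n)
}

Processing the two short jobs first would cut the average wait by roughly two-thirds here, which is exactly the inefficiency FCFS tolerates in exchange for simplicity.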

Formicary Implementation: This is Formicary’s default behavior. When jobs are submitted, they’re placed in a PENDING state and processed by the Job Scheduler in submission order.

# Job requests are processed in submission order
job_type: data-processing
description: FCFS example - processed in submission order
tasks:
  - task_type: process-data
    method: DOCKER
    container:
      image: python:3.9
    script:
      - python process_data.py

Pros: Simple, predictable, no starvation
Cons: Long-running jobs can block shorter ones, poor average wait times

2. Priority Scheduling

Each job has an assigned priority, with higher priority jobs scheduled first. Priority assignment can be static or dynamic based on various factors. A drawback of this algorithm is starvation of low-priority jobs, but it can be addressed with the following techniques:

  • Aging: Gradually increase priority of waiting jobs
  • Priority Inversion Protection: Temporary priority boost for resource holders
  • Fair Share: Ensure each user/tenant gets minimum resource allocation
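
A minimal sketch of the aging technique (hypothetical fields, not Formicary's scheduler) bumps a job's effective priority by one level for every interval it has spent waiting, so low-priority work eventually overtakes a steady stream of newer high-priority jobs:

package main

import (
	"fmt"
	"time"
)

type pendingJob struct {
	Priority  int
	CreatedAt time.Time
}

// effectivePriority adds one priority level per agingInterval of waiting,
// which bounds how long any job can be starved.
func effectivePriority(j pendingJob, now time.Time, agingInterval time.Duration) int {
	boost := int(now.Sub(j.CreatedAt) / agingInterval)
	return j.Priority + boost
}

func main() {
	j := pendingJob{Priority: 1, CreatedAt: time.Now().Add(-30 * time.Minute)}
	// After 30 minutes with a 10-minute aging interval the job has gained +3.
	fmt.Println(effectivePriority(j, time.Now(), 10*time.Minute)) // 4
}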

Formicary Implementation: Jobs support priority levels, and the scheduler considers priority when selecting the next job to execute.

job_type: critical-analysis
priority: 10  # Higher priority job
description: Critical security analysis
tasks:
  - task_type: vulnerability-scan
    method: KUBERNETES
    container:
      image: security-scanner:latest
# Submitting jobs with different priorities
curl -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"job_type": "critical-analysis", "job_priority": 10}' \
  $SERVER/api/jobs/requests

curl -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"job_type": "routine-backup", "job_priority": 1}' \
  $SERVER/api/jobs/requests

Implementation Details: The Job Scheduler queries pending jobs ordered by priority, ensuring high-priority jobs are scheduled first when resources become available.

3. Multilevel Queues – Tag-Based Routing

This algorithm partitions jobs into separate queues based on characteristics (interactive, batch, system). Each queue can use different scheduling algorithms, with inter-queue scheduling typically using fixed priorities or time slicing. Common queue classification strategies include:

  • Job Type: Interactive, batch, system, real-time
  • Resource Requirements: CPU-intensive, I/O-intensive, memory-intensive
  • Duration: Short, medium, long-running jobs
  • User Class: Premium, standard, background users

Formicary Implementation: Using tags and labels, we can effectively create multilevel queues by routing different job types to specialized worker pools.

# Short-running analysis jobs
job_type: quick-scan
tags: ["fast-worker", "analysis"]
tasks:
  - task_type: preflight-check
    method: DOCKER
    tags: ["cpu-optimized"]
    container:
      image: lightweight-scanner:latest

---
# Medium-duration static analysis
job_type: static-analysis
tags: ["medium-worker", "analysis"] 
tasks:
  - task_type: code-analysis
    method: KUBERNETES
    tags: ["memory-optimized"]
    container:
      image: static-analyzer:latest
      memory_limit: "4Gi"

---
# Long-running dynamic analysis
job_type: dynamic-analysis
tags: ["heavy-worker", "analysis"]
tasks:
  - task_type: device-testing
    method: KUBERNETES
    tags: ["gpu-enabled", "device-farm"]
    container:
      image: dynamic-analyzer:latest
      resources:
        cpu_limit: "8"
        memory_limit: "16Gi"

At a mobile security company, I implemented this pattern with three distinct worker pools:

  • Fast Workers: Preflight analysis (seconds to minutes)
  • Medium Workers: Static analysis (a few minutes)
  • Heavy Workers: Dynamic analysis on device farms (multiple minutes to hours)
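
Under the hood, the routing decision behind these pools is essentially set containment between a task's required tags and a pool's advertised tags; a hedged sketch with illustrative types (not Formicary internals):

package main

import "fmt"

// pool is a hypothetical worker pool keyed by the tags its ants advertise.
type pool struct {
	Name string
	Tags map[string]bool
}

// routeTask returns the first pool that advertises every tag the task
// requires, which is how tag-based routing forms de-facto multilevel queues.
func routeTask(required []string, pools []pool) (string, bool) {
	for _, p := range pools {
		matched := true
		for _, t := range required {
			if !p.Tags[t] {
				matched = false
				break
			}
		}
		if matched {
			return p.Name, true
		}
	}
	return "", false
}

func main() {
	pools := []pool{
		{Name: "fast", Tags: map[string]bool{"fast-worker": true, "cpu-optimized": true}},
		{Name: "heavy", Tags: map[string]bool{"heavy-worker": true, "gpu-enabled": true}},
	}
	fmt.Println(routeTask([]string{"gpu-enabled"}, pools)) // heavy true
}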

4. Resource-Aware Scheduling

This algorithm makes scheduling decisions based on current and predicted resource availability (CPU, memory, storage, network). It considers both resource requirements and system capacity to prevent oversubscription:

  • Multi-dimensional: CPU, memory, storage, network, custom resources
  • Quality of Service: Guaranteed, burstable, best-effort resource classes
  • Affinity/Anti-affinity (e.g., Kubernetes Scheduler): Placement preferences and constraints

Advanced techniques include:

  • Bin Packing: First-fit, best-fit, worst-fit algorithms
  • Resource Fragmentation: Avoid unusable resource leftovers
  • Overcommitment: Schedule based on statistical usage patterns
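
To illustrate the bin-packing idea, here is a first-fit sketch with made-up types (neither the Kubernetes scheduler nor Formicary's Resource Manager): a task lands on the first worker whose remaining CPU and memory can absorb its request, and is rejected if no worker has headroom:

package main

import "fmt"

type resources struct {
	CPU    float64 // cores
	MemGiB float64
}

type worker struct {
	Name     string
	Capacity resources
	Used     resources
}

// firstFit places the task on the first worker with enough headroom and
// records the allocation; it returns false when nothing fits, avoiding oversubscription.
func firstFit(task resources, workers []worker) (string, bool) {
	for i := range workers {
		w := &workers[i]
		if w.Capacity.CPU-w.Used.CPU >= task.CPU &&
			w.Capacity.MemGiB-w.Used.MemGiB >= task.MemGiB {
			w.Used.CPU += task.CPU
			w.Used.MemGiB += task.MemGiB
			return w.Name, true
		}
	}
	return "", false
}

func main() {
	workers := []worker{
		{Name: "node-a", Capacity: resources{CPU: 4, MemGiB: 8}, Used: resources{CPU: 3.5, MemGiB: 6}},
		{Name: "node-b", Capacity: resources{CPU: 16, MemGiB: 64}},
	}
	fmt.Println(firstFit(resources{CPU: 2, MemGiB: 4}, workers)) // node-b true
}

Best-fit and worst-fit differ only in how candidates are ordered: best-fit picks the worker with the least leftover capacity to reduce fragmentation, worst-fit the most to keep large slots open.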

Formicary Implementation: Integration with Kubernetes resource management and custom resource tracking.

job_type: ml-training
description: Resource-aware ML model training
tasks:
  - task_type: train-model
    method: KUBERNETES
    tags: ["gpu-node", "ml-workload"]
    container:
      image: tensorflow/tensorflow:latest-gpu
      cpu_request: "4"
      cpu_limit: "8"
      memory_request: "8Gi"
      memory_limit: "16Gi"
      ephemeral_storage_request: "10Gi"
    node_selector:
      hardware: "gpu-enabled"
      instance-type: "ml-optimized"
    tolerations:
      - key: "gpu-workload"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

The Resource Manager tracks worker capabilities and current load, ensuring tasks are only scheduled when adequate resources are available.

5. Matchmaking Scheduler – Affinity-Based Routing

It matches jobs to workers based on capabilities, data locality, and preferences. Uses constraint satisfaction to find optimal job-worker pairings. Matching algorithms include:

  • Hungarian Algorithm: Optimal assignment for bipartite matching
  • Market-based: Economic models with bids and auctions
  • Constraint Satisfaction (Kubernetes/Apache Spark): Match job requirements to worker capabilities

Common locality considerations include:

  • Data Locality: Schedule jobs where data resides
  • Network Topology: Minimize network hops and bandwidth usage
  • Hardware Affinity: GPU jobs to GPU nodes, FPGA workloads to FPGA nodes
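
A toy version of this matchmaking (the scoring is illustrative, not Formicary's reservation logic) filters workers by hard constraints and then ranks the survivors by soft preferences such as region locality and current load:

package main

import (
	"fmt"
	"sort"
)

type candidate struct {
	Name    string
	Tags    map[string]bool
	Region  string
	LoadPct float64
}

// matchWorker drops candidates missing any required tag, then prefers the
// requested region and, as a tie-breaker, the least-loaded worker.
func matchWorker(required []string, preferredRegion string, cands []candidate) (string, bool) {
	var eligible []candidate
	for _, c := range cands {
		ok := true
		for _, t := range required {
			if !c.Tags[t] {
				ok = false
				break
			}
		}
		if ok {
			eligible = append(eligible, c)
		}
	}
	if len(eligible) == 0 {
		return "", false
	}
	sort.Slice(eligible, func(i, j int) bool {
		a, b := eligible[i], eligible[j]
		if (a.Region == preferredRegion) != (b.Region == preferredRegion) {
			return a.Region == preferredRegion // locality first
		}
		return a.LoadPct < b.LoadPct // then lowest load
	})
	return eligible[0].Name, true
}

func main() {
	cands := []candidate{
		{Name: "us-gpu-1", Tags: map[string]bool{"gpu-enabled": true}, Region: "us-east-1", LoadPct: 0.2},
		{Name: "eu-gpu-1", Tags: map[string]bool{"gpu-enabled": true}, Region: "eu-west-1", LoadPct: 0.6},
	}
	fmt.Println(matchWorker([]string{"gpu-enabled"}, "eu-west-1", cands)) // eu-gpu-1 true
}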

Formicary Implementation: Using tags, labels, and Kubernetes affinity rules to achieve data locality and capability matching.

job_type: geo-distributed-processing
description: Process data close to its source
tasks:
  - task_type: process-eu-data
    method: KUBERNETES
    tags: ["eu-region", "gdpr-compliant"]
    container:
      image: data-processor:latest
    affinity:
      node_affinity:
        required_during_scheduling_ignored_during_execution:
          node_selector_terms:
            - match_expressions:
                - key: "region"
                  operator: In
                  values: ["eu-west-1", "eu-central-1"]
                - key: "compliance"
                  operator: In
                  values: ["gdpr"]
    variables:
      DATA_REGION: "eu"
      COMPLIANCE_MODE: "strict"

At a mobile security analysis company, I used matchmaking scheduling to manage physical device farms where each device has unique characteristics. The system implemented two-phase matchmaking: first reserving devices based on requirements like platform (iOS/Android), OS version ranges, device type (phone/tablet), and capabilities (SMS, camera, GPS), then using affinity rules to route jobs to the specific reserved device.

Pros: Optimal resource matching, data locality, flexibility
Cons: Complex matching logic, potential for suboptimal assignments under constraints

6. Delay Scheduler – Temporal Control

Delay scheduling deliberately postpones task execution until optimal conditions are met, such as data locality, resource availability, or specific timing requirements. The algorithm balances waiting for better conditions against potential starvation, often using configurable timeout thresholds.

Optimal_Delay = min(Max_Wait_Time, Expected_Benefit_Time)
Where:
- Max_Wait_Time = configured upper bound to prevent starvation
- Expected_Benefit_Time = estimated time until optimal conditions
- Locality_Benefit = (Remote_Cost - Local_Cost) / Transfer_Rate
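
Expressed as code (a hedged sketch; the parameters are illustrative), the trade-off caps the wait at a configured maximum so a job never starves while holding out for a better placement:

package main

import (
	"fmt"
	"time"
)

// optimalDelay waits for better conditions, but never longer than maxWait.
// expectedBenefit estimates when the preferred node or data will be ready;
// localityBenefit is the transfer time a local placement would save.
func optimalDelay(maxWait, expectedBenefit, localityBenefit time.Duration) time.Duration {
	if localityBenefit <= 0 {
		return 0 // nothing to gain by waiting
	}
	if expectedBenefit < maxWait {
		return expectedBenefit
	}
	return maxWait
}

func main() {
	fmt.Println(optimalDelay(2*time.Minute, 45*time.Second, 30*time.Second)) // 45s
}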

Common delay strategies include:

  • Data Locality Delay: Wait for data to become available on local nodes
  • Resource Availability Delay: Wait for preferred resource types to become free
  • Temporal Delay: Execute at specific times (off-peak hours, scheduled windows)
  • Condition-Based Delay: Wait for external system states or events

Formicary’s Delay Implementations:

  1. Time-Based Scheduling:
job_type: nightly-etl
cron_trigger: "0 2 * * *"  # 2 AM daily
scheduled_at: "2024-12-25T02:00:00Z"  # One-time future execution
  2. Condition-Based Polling:
job_type: external-dependency-wait
tasks:
  - task_type: wait-for-api
    method: HTTP_GET
    url: https://api.service.com/status
    delay_between_retries: 30s
    retry: 20  # Maximum 10 minutes of polling
    on_exit_code:
      "200": proceed-with-processing    # Service ready
      "404": EXECUTING                  # Keep polling
      "503": EXECUTING                  # Service temporarily unavailable
      "FAILED": abort-job               # Permanent failure
  3. Resource Availability Delay:
job_type: gpu-intensive-training
tasks:
  - task_type: training
    method: KUBERNETES
    tags: ["gpu-v100", "high-memory"]
    timeout: 6h
    # Will delay until specific GPU resources become available

Pros:

  • Improved data locality and reduced network I/O
  • Better resource utilization through temporal load balancing
  • Flexible execution timing for cost optimization
  • Support for external dependency coordination

Cons:

  • Increased job latency and scheduling complexity
  • Risk of starvation without proper timeout mechanisms
  • Difficulty in predicting optimal delay periods
  • Potential for cascading delays in dependent workflows

7. Capacity Scheduler – Resource Quotas

Capacity scheduling partitions cluster resources into hierarchical queues with guaranteed minimum capacities and configurable maximum limits. Each queue can elastically use unused capacity from other queues while respecting absolute limits and priority policies.

Queue_Capacity = (Allocated_Resources / Total_Cluster_Resources) × 100%
Effective_Capacity = min(Max_Capacity, Guaranteed_Capacity + Available_Borrowed_Capacity)
Resource_Utilization = Used_Resources / Effective_Capacity
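
A small sketch of that arithmetic (hypothetical numbers, not a Formicary configuration) shows how a queue's effective capacity grows with borrowed headroom but stays bounded by its configured maximum:

package main

import "fmt"

type queueCfg struct {
	GuaranteedPct float64 // minimum share of cluster resources
	MaxPct        float64 // absolute ceiling, even when borrowing
}

// effectiveCapacity lets a queue borrow idle capacity from sibling queues,
// capped at MaxPct; utilization is then measured against this figure.
func effectiveCapacity(q queueCfg, borrowablePct float64) float64 {
	capPct := q.GuaranteedPct + borrowablePct
	if capPct > q.MaxPct {
		capPct = q.MaxPct
	}
	return capPct
}

func main() {
	analysis := queueCfg{GuaranteedPct: 30, MaxPct: 60}
	fmt.Println(effectiveCapacity(analysis, 50)) // 60: clamped at MaxPct
	fmt.Println(effectiveCapacity(analysis, 10)) // 40: guaranteed + borrowed
}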

Common principles include:

  • Capacity Guarantees: Each queue has minimum guaranteed resources
  • Elastic Sharing: Unused capacity can be borrowed by other queues
  • Preemption: Higher priority queues can reclaim borrowed resources
  • Hierarchical Organization: Nested queues for organizational structure

Common queue management strategies include:

  • FIFO within Queues: Simple first-come-first-served within capacity limits
  • Priority Ordering: High-priority jobs within queues get preference
  • Fair Share: Proportional resource distribution among queue users
  • Preemption Policies: Graceful vs. aggressive resource reclamation

Formicary Implementation Features:

# Organization-level capacity limits
tenant_limits:
  max_concurrent_jobs: 50
  max_cpu_hours_per_day: 200
  max_storage_gb: 500

# Job-level concurrency control  
job_type: batch-processing
max_concurrency: 3  # Limit concurrent instances

Capacity enforcement mechanisms include:

  • Hard Limits: Absolute maximum resource consumption
  • Soft Limits: Warning thresholds with potential throttling
  • Burst Capacity: Temporary exceeding of limits during low contention
  • Quota Reset Periods: Time-based quota renewals (daily, weekly, monthly)

Pros: Predictable resource allocation, multi-tenant isolation, elastic resource sharing, hierarchical management
Cons: Complex configuration, potential resource fragmentation, underutilization during low demand, administrative overhead

8. Fair Scheduler – Multi-Tenant Fairness

It ensures proportional resource sharing among users, groups, or tenants over time. Uses techniques like weighted fair queueing and deficit round-robin to achieve long-term fairness while maintaining efficiency. Common metrics include:

  • Proportional Share (Hadoop Fair Scheduler): Resources allocated based on weights/quotas
  • Max-Min Fairness (Kubernetes): Maximize minimum allocation across users
  • Deadline Fairness: Ensure SLA compliance across tenants

Advanced fair sharing includes:

  • Hierarchical Fair Sharing: Nested user groups and organizations
  • Dominant Resource Fairness: Fair allocation across multiple resource types
  • Lottery Scheduling: Probabilistic fairness using tickets
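
To make dominant resource fairness concrete (a sketch, not Formicary's quota implementation), each tenant's share of every resource type is computed and the next allocation goes to the tenant whose dominant, i.e. largest, share is smallest:

package main

import "fmt"

type usage struct {
	Tenant string
	CPU    float64 // cores in use
	MemGiB float64
}

// dominantShare is the tenant's largest fractional share across resource types.
func dominantShare(u usage, totalCPU, totalMem float64) float64 {
	cpuShare := u.CPU / totalCPU
	memShare := u.MemGiB / totalMem
	if cpuShare > memShare {
		return cpuShare
	}
	return memShare
}

// nextTenant picks the tenant with the smallest dominant share, which is the
// one DRF says should receive the next allocation.
func nextTenant(usages []usage, totalCPU, totalMem float64) string {
	best, bestShare := "", 2.0
	for _, u := range usages {
		if s := dominantShare(u, totalCPU, totalMem); s < bestShare {
			best, bestShare = u.Tenant, s
		}
	}
	return best
}

func main() {
	usages := []usage{
		{Tenant: "org-a", CPU: 20, MemGiB: 16}, // CPU-heavy: dominant share 20/64
		{Tenant: "org-b", CPU: 4, MemGiB: 120}, // memory-heavy: dominant share 120/256
	}
	fmt.Println(nextTenant(usages, 64, 256)) // org-a: 0.31 < 0.47
}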

Formicary Implementation: It implements tenant isolation with quota enforcement.

// Formicary's Fair Scheduling based on actual implementation
type FairScheduler struct {
    jobStatsRegistry *JobStatsRegistry
    serverConfig     *ServerConfig
}

func (fs *FairScheduler) CheckFairSchedulingConstraints(request *JobRequest) error {
    // Multi-level concurrency checking: Organization → User → Job-level
    
    // Level 1: Organization concurrency limits
    userExecuting, orgExecuting := fs.jobStatsRegistry.UserOrgExecuting(request)
    
    if orgExecuting >= fs.getMaxConcurrentOrgJobs(request.OrganizationID) {
        return fs.delayJobForConcurrencyExceeded(request, "organization", orgExecuting)
    }
    
    // Level 2: User concurrency limits  
    if userExecuting >= fs.getMaxConcurrentUserJobs(request.UserID) {
        return fs.delayJobForConcurrencyExceeded(request, "user", userExecuting)
    }
    
    // Level 3: Job-type concurrency limits
    executionCount := fs.jobStatsRegistry.GetExecutionCount(request.GetUserJobTypeKey())
    if executionCount >= request.GetMaxConcurrency() {
        return fs.delayJobForConcurrencyExceeded(request, "job-type", int(executionCount))
    }
    
    return nil
}

func (fs *FairScheduler) delayJobForConcurrencyExceeded(request *JobRequest, limitType string, currentCount int) error {
    // Intelligent delay calculation based on historical data
    avgCompletionTime := fs.jobStatsRegistry.GetAverageCompletionTime(request.JobType)
    
    // Dynamic wait factor: 25% of average completion time, bounded between 15-60 seconds
    waitFactor := min(max(avgCompletionTime/4, 15*time.Second), 60*time.Second)
    
    // Randomized delay to prevent thundering herd
    randomDelay := time.Duration(rand.Intn(int(waitFactor))) + waitFactor
    
    // Reschedule with delay
    request.ScheduledAt = time.Now().Add(randomDelay)
    
    // Logarithmic priority degradation (inspired by mobile security company approach)
    if request.Priority > 0 {
        // Priority degradation: log_e(original_priority), minimum of 1
        newPriority := max(1, int(math.Log(float64(request.Priority))))
        request.Priority = newPriority
        
        // Allow zero-priority jobs to bypass concurrency limits (emergency valve)
        if newPriority <= 0 {
            return nil // Allow execution despite limits
        }
    }
    
    // Update schedule attempts counter with exponential backoff
    request.ScheduleAttempts++
    
    return fmt.Errorf("%s concurrency limit exceeded: %d jobs running, rescheduling with %v delay", 
                     limitType, currentCount, randomDelay)
}

// Enhanced concurrency tracking from mobile security company experience
func (fs *FairScheduler) trackConcurrencyMetrics(request *JobRequest) {
    // Real-time metrics for monitoring fairness
    fs.metricsRegistry.Gauge("org_concurrent_jobs", map[string]string{
        "org_id": request.OrganizationID,
        "job_type": request.JobType,
    })
    
    fs.metricsRegistry.Gauge("user_concurrent_jobs", map[string]string{
        "user_id": request.UserID, 
        "job_type": request.JobType,
    })
}

Pros: Prevents monopolization, guarantees minimum service levels
Cons: May sacrifice efficiency for fairness, complex weight management

9. Earliest Deadline First (EDF)

EDF is a dynamic-priority algorithm that assigns the highest priority to the task with the earliest absolute deadline, using the deadline as the sole scheduling criterion. It is provably optimal for single-processor real-time scheduling when total CPU utilization ≤ 100%, providing maximum schedulability under deadline constraints. The related Shortest Job First (SJF) policy instead selects the job with the smallest estimated execution time, minimizing average waiting time.

Priority(task) = 1 / (Deadline - Current_Time)
Schedulability_Test: Σ(Ci / Ti) ≤ 1, where Ci = execution time and Ti = period for periodic tasks
Laxity = Deadline - Current_Time - Remaining_Execution_Time

Core characteristics include:

  • Dynamic Priority: Priorities change as deadlines approach
  • Work-Conserving: Never idles processor when tasks are ready
  • Deadline-Driven: Scheduling decisions based purely on temporal constraints
  • Optimal Utilization: Achieves 100% processor utilization when schedulable

Failure modes include:

  • Domino Effect: Single deadline miss can cascade to subsequent tasks
  • Thrashing: Excessive context switching under overload conditions
  • Unpredictable Overload: Graceful degradation requires additional mechanisms

While Formicary does not natively support SJF, it can be approximated using separate queues for different job-duration categories:

# Short jobs queue
job_type: quick-validation
tags: ["short-queue"]
estimated_runtime: "5m"

# Long jobs queue  
job_type: full-analysis
tags: ["long-queue"] 
estimated_runtime: "2h"

Deadline assignment strategies include:

Relative_Deadline = Period (for periodic tasks)
Absolute_Deadline = Arrival_Time + Relative_Deadline
Critical_Instant = Simultaneous release of all tasks

EDF itself can be simulated using priority scheduling combined with deadline-aware job submission:

# Simulated EDF using priority and scheduled execution
job_type: time-critical-analysis
priority: {{.UrgencyScore}}  # Calculated based on deadline proximity
scheduled_at: "2024-12-31T23:59:59Z"
timeout: 2h
tasks:
  - task_type: urgent-processing
    method: KUBERNETES
    tags: ["priority-worker"]

Implementation Approach: Calculate priority dynamically from the remaining slack, e.g. (deadline - current_time) / estimated_runtime, so that jobs closer to their deadlines receive higher priority.
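
A hedged sketch of that calculation (the clamp to a 1-10 range is an illustrative assumption, not a Formicary constant) turns remaining slack into an urgency score at submission time:

package main

import (
	"fmt"
	"math"
	"time"
)

// urgencyScore maps deadline proximity to a job priority: the fewer multiples
// of the estimated runtime that fit into the remaining slack, the higher the score.
func urgencyScore(deadline time.Time, estimatedRuntime time.Duration, now time.Time) int {
	slack := deadline.Sub(now)
	if slack <= estimatedRuntime {
		return 10 // already at risk of missing the deadline
	}
	ratio := float64(slack) / float64(estimatedRuntime)
	score := int(math.Ceil(10 / ratio))
	if score < 1 {
		score = 1
	}
	return score
}

func main() {
	deadline := time.Now().Add(3 * time.Hour)
	fmt.Println(urgencyScore(deadline, 2*time.Hour, time.Now())) // 7: little slack left
}

Recomputing the score on every scheduling attempt approximates EDF's dynamic priorities on top of Formicary's static priority field.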

Pros: Optimal schedulability, maximum resource utilization, simple algorithm, responsive to urgent tasks
Cons: Domino effect failures, requires accurate execution time estimates, poor overload behavior, high context switching overhead

10. Speculative Scheduler

It launches multiple instances of slow-running tasks to reduce tail latency. Uses statistical analysis of execution times to detect stragglers and make speculative execution decisions. Balances resource cost against latency improvement. Algorithms include:

  • Progress-based: Monitor task completion percentage
  • Time-based: Tasks running longer than percentile threshold
  • Resource-based: Launch backup only if resources available

Pros: Reduces tail latency, improves user experience, fault tolerance
Cons: Resource waste, coordination overhead, may not help heterogeneous workloads

Formicary Status: Not implemented, but the system provides foundation through:

  • Task execution monitoring
  • Historical performance data collection
  • Resource availability tracking

Conceptual Implementation:

job_type: speculative-execution
tasks:
  - task_type: main-task
    method: KUBERNETES
    timeout: 30m
    speculative_backup:
      enabled: true
      delay_threshold: "150%"  # Start backup if 50% slower than average
      resource_threshold: 0.3  # Only if 30%+ resources available
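
Were speculative execution added, the straggler test could look something like this sketch (illustrative thresholds, not part of Formicary today): compare a task's elapsed time against a percentile of historical durations and launch a backup only when spare capacity exists:

package main

import (
	"fmt"
	"sort"
	"time"
)

// isStraggler reports whether a running task has exceeded slowFactor times the
// given percentile of historical durations, e.g. 1.5x the p75.
func isStraggler(elapsed time.Duration, history []time.Duration, percentile, slowFactor float64) bool {
	if len(history) == 0 {
		return false
	}
	sorted := append([]time.Duration(nil), history...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(percentile * float64(len(sorted)-1))
	threshold := time.Duration(float64(sorted[idx]) * slowFactor)
	return elapsed > threshold
}

func main() {
	history := []time.Duration{8 * time.Minute, 9 * time.Minute, 10 * time.Minute, 12 * time.Minute}
	fmt.Println(isStraggler(18*time.Minute, history, 0.75, 1.5)) // true: past 1.5x the p75
}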

11. Polling and Sensors

Beyond time-based delays, Formicary supports condition-based scheduling through polling sensors that wait for external conditions to be met:

job_type: sensor-job
description: Wait for external conditions before proceeding
tasks:
  - task_type: wait-for-resource
    method: HTTP_GET
    url: https://api.example.com/resource/123
    delay_between_retries: 15s
    retry: 20  # Poll up to 20 times (5 minutes total)
    timeout: 15s
    on_exit_code:
      "200": process-resource    # Success - proceed
      "404": EXECUTING          # Resource not ready - poll again  
      "FAILED": FATAL           # Server error - abort job

  - task_type: process-resource
    method: DOCKER
    container:
      image: data-processor:latest
    script:
      - echo "Resource is now available, processing..."

The key insight is using EXECUTING as an exit code value, which keeps the task in a polling loop rather than completing or failing.

12. Gang Scheduling

Gang scheduling coordinates simultaneous execution of related tasks that need to communicate or synchronize. Instead of scheduling tasks independently, the system reserves resources for all tasks in a “gang” and schedules them together to avoid partial execution and resource deadlocks. Key principles include:

  • All-or-Nothing (MPI Applications): Either all tasks in the gang get scheduled or none do
  • Synchronized Start: Tasks begin execution simultaneously
  • Resource Reservation (Kubernetes Jobs): Pre-allocate resources for the entire task group
  • Communication Optimization: Minimize synchronization delays between related tasks

Gang_Size = max(task_count, critical_path_parallelism)
Resource_Requirement = Σ(individual_task_resources) for all gang members
Schedulability = available_resources >= Resource_Requirement

Formicary’s Gang Scheduling Implementation: Formicary implements gang scheduling at the job level through its Resource Manager. When a job is scheduled, resources are pre-allocated for ALL tasks before execution begins:

// Core gang scheduling logic from ResourceManager
func (rm *ManagerImpl) doReserveJobResources(
	requestID string,
	def *types.JobDefinition,
	dryRun bool) (reservations map[string]*common.AntReservation, err error) {
	
	reservations = make(map[string]*common.AntReservation)
	var alloc *common.AntReservation
	
	// Try to reserve resources for each task
	for _, task := range def.Tasks {
		alloc, err = rm.doReserve(requestID, task.TaskType, task.Method, task.Tags, dryRun)
		if err == nil {
			reservations[task.TaskType] = alloc
		} else {
			if !dryRun {
				// ALL-OR-NOTHING: Release all allocations and fail entire job
				_ = rm.ReleaseJobResources(requestID)
			}
			return nil, err
		}
	}
	return reservations, nil
}

Two-Phase Gang Scheduling Process:

  1. Resource Check Phase (Dry Run):
// Check if all job resources are available without allocating
func (rm *ManagerImpl) CheckJobResources(job *types.JobDefinition) ([]*common.AntReservation, error) {
	reservationsByTask, err := rm.doReserveJobResources("", job, true)
	if err != nil {
		return nil, err // Gang scheduling not possible
	}
	// All tasks can be scheduled - return dry-run reservations and proceed to actual reservation
	reservations := make([]*common.AntReservation, 0, len(reservationsByTask))
	for _, alloc := range reservationsByTask {
		reservations = append(reservations, alloc)
	}
	return reservations, nil
}
  2. Resource Reservation Phase (Actual Allocation):
// Atomically reserve resources for all tasks
func (rm *ManagerImpl) ReserveJobResources(requestID string, def *types.JobDefinition) (map[string]*common.AntReservation, error) {
	return rm.doReserveJobResources(requestID, def, false)
}

Scheduler Integration with Gang Scheduling:

The Job Scheduler uses gang scheduling through a two-step verification process:

// Step 1: Check if gang scheduling is possible
if err = jobStateMachine.CheckAntResourcesAndConcurrencyForJob(); err != nil {
	// Gang scheduling failed - implement backoff strategy
	scheduleAttempts := request.ScheduleAttempts + 1
	scheduleSecs := int(math.Min(maxWait.Seconds(), float64(scheduleAttempts*5)))
	
	// Exponential backoff with priority degradation
	if scheduleAttempts >= 5 && scheduleAttempts%5 == 0 && request.JobPriority > 5 {
		request.JobPriority-- // Degrade priority every 5th attempt
	}
	
	request.ScheduledAt = request.ScheduledAt.Add(time.Duration(scheduleSecs) * time.Second)
	return fmt.Errorf("gang scheduling failed - will retry")
}

// Step 2: Perform actual gang scheduling
if err = jobStateMachine.ReserveJobResources(); err != nil {
	// Even after check, allocation failed - very rare race condition
	return fmt.Errorf("gang allocation failed after successful check")
}

Example Job with Gang Scheduling:

job_type: distributed-ml-training
description: Gang scheduled ML training requiring coordinated execution
tasks:
  - task_type: parameter-server
    method: KUBERNETES
    tags: ["ml-cluster", "coordinator"]
    container:
      image: tensorflow:latest
      cpu_request: "4"
      memory_request: "8Gi"
    
  - task_type: worker-node-1
    method: KUBERNETES  
    tags: ["ml-cluster", "gpu-enabled"]
    container:
      image: tensorflow:gpu
      cpu_request: "8"
      memory_request: "16Gi"
      
  - task_type: worker-node-2
    method: KUBERNETES
    tags: ["ml-cluster", "gpu-enabled"]
    container:
      image: tensorflow:gpu
      cpu_request: "8" 
      memory_request: "16Gi"

  - task_type: aggregator
    method: KUBERNETES
    tags: ["ml-cluster"]
    dependencies: ["parameter-server", "worker-node-1", "worker-node-2"]

Formicary’s Gang Scheduling Features:

  1. Atomic Resource Allocation: All task resources are reserved simultaneously
  2. Automatic Rollback: Failed gang allocation releases all previously reserved resources
  3. Backoff Strategy: Jobs that can’t be gang-scheduled use exponential backoff
  4. Priority Degradation: Long-waiting jobs have priority reduced to prevent resource hogging
  5. Resource Fragmentation Prevention: Avoids partial allocations that waste resources

Formicary tracks gang scheduling effectiveness through metrics:

  • scheduler_failed_total: Jobs that couldn’t be gang-scheduled
  • scheduler_no_more_jobs_total: Scheduler iterations with no schedulable jobs
  • Schedule attempts per job to identify resource contention patterns

Pros:

  • Eliminates partial execution and resource deadlocks
  • Optimal for tightly-coupled distributed workloads
  • Automatic retry with intelligent backoff
  • Priority-based fairness with degradation

Cons:

  • Can lead to resource underutilization
  • Higher latency for large job gangs
  • Complex resource accounting and cleanup
  • May cause convoy effects for large jobs

Integration with Other Algorithms: Gang scheduling in Formicary works alongside:

  • Priority Scheduling: Higher priority gangs get resources first
  • Resource-Aware Scheduling: Considers total gang resource requirements
  • Fair Scheduling: Gang resource consumption counted toward tenant quotas
  • Capacity Scheduling: Gangs compete for available cluster capacity

Advanced Scheduling Patterns

Hybrid Scheduling Strategy

In practice, Formicary often combines multiple scheduling algorithms:

job_type: hybrid-ml-pipeline
priority: 8                    # Priority scheduling
max_concurrency: 3             # Capacity scheduling  
tags: ["gpu-cluster"]          # Matchmaking
cron_trigger: "0 */6 * * *"    # Delay scheduling
tasks:
  - task_type: data-preprocessing
    tags: ["cpu-optimized"]     # Multilevel queues
    method: KUBERNETES
    container:
      cpu_request: "2"          # Resource-aware
      memory_request: "4Gi"
  
  - task_type: model-training
    tags: ["gpu-optimized"]
    method: KUBERNETES
    container:
      image: tensorflow:gpu
    dependencies: ["data-preprocessing"]

Fork-Join Pattern for Parallel Processing

Formicary supports sophisticated parallel execution patterns:

job_type: parallel-video-encoding
description: Parallel processing with fork-join
tasks:
  - task_type: split-video
    method: DOCKER
    container:
      image: ffmpeg:latest
    script:
      - ffmpeg -i input.mp4 -f segment segment_%03d.mp4
    on_completed: fork-encode

  - task_type: fork-encode
    method: FORK_JOB
    fork_job_type: encode-segment
    variables:
      segment_count: 8
    on_completed: await-encoding

  - task_type: await-encoding  
    method: AWAIT_FORKED_JOB
    await_forked_tasks: ["fork-encode"]
    on_completed: merge-video

  - task_type: merge-video
    method: DOCKER
    container:
      image: ffmpeg:latest
    script:
      - ffmpeg -f concat -i segments.txt -c copy output.mp4

Performance Optimizations

Cache-Aware Scheduling

Formicary supports dependency caching to improve scheduling efficiency:

job_type: node-build
tasks:
  - task_type: install-deps
    method: DOCKER
    container:
      image: node:16
    cache:
      key: "node-deps-{{checksum 'package-lock.json'}}"
      paths:
        - node_modules
    script:
      - npm ci

Artifact-Based Dependencies

Smart scheduling based on artifact availability:

job_type: deployment-pipeline
tasks:
  - task_type: build
    artifacts:
      paths: ["dist/"]
    
  - task_type: test
    dependencies: ["build"]  # Waits for build artifacts
    
  - task_type: deploy
    dependencies: ["test"]
    method: KUBERNETES

Monitoring and Observability

Effective scheduling requires comprehensive monitoring:

# Built-in metrics and alerts
metrics:
  - queue_depth_by_priority
  - average_wait_time_by_tag
  - resource_utilization_by_worker
  - job_completion_rate_by_tenant

alerts:
  - name: high_queue_depth
    condition: queue_depth > 1000
    action: scale_workers
    
  - name: poor_resource_utilization  
    condition: cpu_utilization < 30%
    action: consolidate_workers

Real-World Case Study: Mobile Security Analysis Platform

At a mobile security company, I implemented a similar system with three-tier scheduling:

Tier 1: Preflight Analysis (Fast Queue)

  • Duration: 30 seconds – 2 minutes
  • Workers: CPU-optimized containers
  • Algorithm: Priority + FCFS
  • Use Case: Basic file validation, metadata extraction

Tier 2: Static Analysis (Medium Queue)

  • Duration: 5 minutes – 1 hour
  • Workers: Memory-optimized containers
  • Algorithm: Resource-aware + Fair scheduling
  • Use Case: Code analysis, vulnerability scanning

Tier 3: Dynamic Analysis (Heavy Queue)

  • Duration: 1 – 8 hours
  • Workers: GPU-enabled device farm
  • Algorithm: Matchmaking + Capacity scheduling
  • Use Case: Runtime behavior analysis, ML inference

This architecture processed over 100,000 mobile apps daily with 99.9% availability and optimal resource utilization.

Best Practices and Lessons Learned

1. Start Simple, Scale Gradually

Begin with FCFS and basic priority scheduling. Add complexity as your workload characteristics become clear.

2. Observability

Track queue depths, wait times, resource utilization, and job completion rates by different dimensions (tenant, job type, worker pool).

3. Design for Elasticity

Use Kubernetes HPA and custom metrics to automatically scale worker pools based on queue depth and resource utilization.

4. Implement Circuit Breakers

Prevent cascading failures when downstream services are unavailable.
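
A minimal circuit-breaker sketch in Go (a generic pattern, not a Formicary API) trips after a run of consecutive failures and rejects calls until a cool-down elapses, so a failing downstream service cannot stall every task that depends on it:

package main

import (
	"errors"
	"fmt"
	"time"
)

type circuitBreaker struct {
	failures    int
	maxFailures int
	openUntil   time.Time
	cooldown    time.Duration
}

var errOpen = errors.New("circuit open: downstream likely unavailable")

// Call runs fn unless the breaker is open; once failures reach maxFailures the
// breaker opens for cooldown, failing fast instead of piling up retries.
func (cb *circuitBreaker) Call(fn func() error) error {
	if time.Now().Before(cb.openUntil) {
		return errOpen
	}
	if err := fn(); err != nil {
		cb.failures++
		if cb.failures >= cb.maxFailures {
			cb.openUntil = time.Now().Add(cb.cooldown)
			cb.failures = 0
		}
		return err
	}
	cb.failures = 0
	return nil
}

func main() {
	cb := &circuitBreaker{maxFailures: 3, cooldown: 30 * time.Second}
	for i := 0; i < 4; i++ {
		fmt.Println(i, cb.Call(func() error { return errors.New("downstream timeout") }))
	}
	// The fourth call is rejected immediately because the breaker is open.
}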

5. Use Dead Letter Queues

Handle persistently failing tasks gracefully:

error_handling:
  max_retries: 3
  dead_letter_queue: "failed-jobs"
  alert_on_dlq: true
6. Match Algorithms to Workloads

Reach for the more specialized strategies only when the workload calls for them:

  • Earliest Deadline First: For time-sensitive workflows
  • Speculative Execution: For fault tolerance and performance
  • Gang Scheduling: For tightly-coupled parallel jobs

Conclusion

Formicary demonstrates how theoretical scheduling algorithms translate into practical distributed systems. It combines multiple strategies (priority scheduling, resource awareness, fair sharing, and intelligent routing) to handle diverse workloads while maintaining predictable performance. The key insight is that real-world schedulers rarely use a single algorithm. Instead, they combine multiple approaches, leveraging the strengths of each for different aspects of the scheduling problem. Tags and labels provide the flexibility to implement sophisticated routing logic, while Kubernetes integration enables resource-aware scheduling at scale.

Whether you’re building CI/CD pipelines, data processing workflows, or ML training systems, understanding these scheduling patterns and their trade-offs is crucial for designing systems that scale efficiently and reliably.


Formicary is open source and available at github.com/bhatti/formicary. Try it out for your next workflow automation project!
