Killing the State Machine: Declarative AI Coding Agents with an Orchestration System

June 1, 2026

Filed under: Agentic AI — admin @ 7:16 pm

Background

I have built a number of agentic systems over the last year. I built a PII detection system with LangChain and Vertex AI that scans documents and redacts sensitive data without human review. I built an API compatibility guardian using LangGraph that catches breaking changes before they reach production. And I built a production-grade enterprise AI platform on vLLM serving multiple teams and use cases. Most recently I wrote a complete guide to production AI agents with MCP and A2A.

Alongside that work I have been using agentic coding tools heavily, letting AI write code while I own the architecture and design. I documented that approach in AI Writes Code, You Own the Design, which covers how to use skills with structured methodology files to make AI coding agents produce consistent, reviewable, architecturally sound output instead of chaos.

But there’s a deeper layer of context. Over ten years ago, before GitHub Actions and GitLab Runner existed as concepts, I built a distributed orchestration engine for automating heterogeneous tasks with declarative syntax. It used Docker, Kubernetes, shell scripts, and custom worker types to handle diverse workloads. The core insight then is the same insight that applies now: scheduling, fault tolerance, retries, timeouts, observability, and capacity management are solved problems. Your application should not implement them. That engine became Formicary, which I open-sourced. This post shows how I applied Formicary to automated agentic coding workflows and why enterprises keep making the same expensive mistake.

The Problem I Keep Seeing

When teams build AI coding agents, meaning systems that pick up GitHub issues, plan implementations, write code, run tests, and open PRs, they reach for the obvious approach: a coordinator process, a state machine, custom pollers. The initial version works. Then it accumulates. I have seen enterprises build custom solutions with 50K+ lines of TypeScript. Look inside these systems and you find the same failure modes every time:

No per-phase timeouts. If the AI model hangs during implementation, the process runs until a global job timeout kills it — often 90 minutes later, after consuming an expensive model session and blocking other work.
Silent work drop. When the worker pool fills, the system silently skips newly discovered issues instead of queuing them.
Context loss between phases. The planner writes a plan file. The implementer starts a fresh AI session and re-explores the entire codebase from scratch. The planning work gets thrown away.
Custom DAG reinvention. The state machine handles branching: tests fail -> retry, model blocked -> notify human. This is just a DAG with exit-code routing. It’s already solved, and the custom version is always underpowered.
Crummy restarts. Retry a failed issue and the agent reuses the same branch name. Git conflict. Failure. Start over.
Infrastructure lock-in. You can’t run it on a laptop because it’s tangled with Kubernetes pod lifecycle management.
High cost per new feature. Adding a security review phase means new state transitions, new code, and a new deployment, which takes days of engineering time.

The root mistake is treating orchestration as application logic. These teams write scheduling, capacity management, artifact passing, observability, and retry logic inside their agent code. Every one of those concerns is already solved by mature orchestration frameworks. Stop writing that code.

The Declarative Replacement

I worked with a TypeScript agent system of 50K+ lines in an enterprise environment and replaced it with a handful of declarative workflow definitions, including:

ai-gh-issue-picker.yaml   (~100 lines)  — polls GitHub, submits jobs
ai-gh-implement.yaml      (~500 lines)  — plan -> implement -> test -> verify -> PR -> monitor -> learn
ai-gh-cleanup.yaml        (~80 lines)   — stale workspace and branch cleanup

No orchestration code. No state machine. No custom pollers. No retry logic. No timeout management. Formicary handles all of it.

Here is every decision, with the reasoning.

Decision 1: Replace Custom Pollers with a Cron Job

Custom polling processes run continuously, consume resources, and require their own deployment lifecycle. I replaced the GitHub issue poller with a Formicary cron job:

job_type: ai-gh-issue-picker
cron_trigger: "0 * * * * * *"   # every minute (7-field cron)
max_concurrency: 1               # only one picker at a time

skip_if: >-
  {{if ge (CountByJobTypeAndState "ai-gh-implement" "PENDING") 10}} true {{end}}

The skip_if fires at the scheduler level before any worker is allocated, before any task runs. If 10 implement jobs are already pending, Formicary skips the entire picker invocation silently. Zero worker cost.

The gather-issues task fetches GitHub issues labeled ai-ready, moves each label to ai-in-progress, and writes a compact issues.json. I wrote it in Python rather than bash because Python eliminates the jq/base64/subshell-scoping traps that plagued the original version:

import json, os, subprocess

repo = f"{os.environ['GH_ORG']}/{os.environ['GH_REPO']}"

def gh(*args):
    # Run the GitHub CLI and return the CompletedProcess (stdout, returncode).
    return subprocess.run(["gh", *args], capture_output=True, text=True)

# Fetch open issues carrying the pickup label, capped at MAX_PENDING.
r = gh("issue", "list", "-R", repo,
       "--label", os.environ["PICKUP_LABEL"], "--state", "open",
       "--limit", os.environ.get("MAX_PENDING", "10"),
       "--json", "number,title,url")
# A failed gh call yields an empty list, so nothing is submitted this round.
issues = json.loads(r.stdout) if r.returncode == 0 else []

# Move each issue from the pickup label to the in-progress label.
for issue in issues:
    gh("issue", "edit", str(issue["number"]), "-R", repo,
       "--remove-label", os.environ["PICKUP_LABEL"],
       "--add-label", os.environ["INPROGRESS_LABEL"])

# Write compact JSON as an artifact and expose it as the IssuesJSON output.
issues_json = json.dumps(issues, separators=(',', ':'))
with open("issues.json", "w") as f:
    f.write(issues_json + "\n")
print(f"::set-output name=IssuesJSON::{issues_json}")

The submit-jobs task uses SubmitJobsFromJSON, a Formicary template function that submits one implement job per issue directly through the DB. A unique index on user_key (keyed as ai-gh-implement-{org}-{repo}-{number}) rejects duplicate submissions at the constraint level. No pre-flight lookups, no race conditions:

environment:
  SUBMITTED_IDS: >-
    {{if .IssuesJSON}}{{SubmitJobsFromJSON "ai-gh-implement" .IssuesJSON
        (printf "GitHubOrg=%s" .GitHubOrg) (printf "GitHubRepo=%s" .GitHubRepo)}}{{end}}
  PENDING_COUNT: '{{CountByJobTypeAndState "ai-gh-implement" "PENDING"}}'
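
To make the duplicate-rejection idea concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in. The jobs table, its columns, and the submit_job helper are hypothetical and are not Formicary's actual schema or API; only the user_key format comes from the text above.

```python
import sqlite3

def submit_job(conn, org, repo, number):
    """Insert one implement job; return False if it was already submitted."""
    user_key = f"ai-gh-implement-{org}-{repo}-{number}"
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO jobs (job_type, user_key) VALUES (?, ?)",
                ("ai-gh-implement", user_key),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique index rejected a duplicate user_key.
        return False

# Hypothetical schema, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, job_type TEXT, user_key TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_jobs_user_key ON jobs (user_key)")

first = submit_job(conn, "your-org", "your-repo", 42)
second = submit_job(conn, "your-org", "your-repo", 42)
print(first, second)  # True False
```

Because the database enforces uniqueness, two pickers racing on the same issue cannot both succeed, and no lookup-then-insert window exists.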

Decision 2: Replace the State Machine with a DAG

A 12-state custom state machine becomes a named DAG in YAML. The full pipeline runs plan -> implement -> test -> verify -> PR -> monitor -> learn.

Exit-code routing handles every branch. No code required:

- task_type: implement
  on_exit_code:
    COMPLETED: unit-test
    "2": notify-blocked    # model signals blocked
    "3": fix-tests         # tests failing
  on_failed: notify-blocked
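
The routing rule in that block amounts to a lookup table. The sketch below shows the idea in Python; the ROUTES structure and next_task function are illustrative names, and treating exit code 0 as COMPLETED is an assumption about how the orchestrator reports success.

```python
# Routing table mirroring the YAML above (structure is illustrative).
ROUTES = {
    "implement": {
        "on_exit_code": {"COMPLETED": "unit-test", "2": "notify-blocked", "3": "fix-tests"},
        "on_failed": "notify-blocked",
    },
}

def next_task(task_type, exit_code):
    """Return the next task name for a finished task, or None if no route applies."""
    route = ROUTES[task_type]
    # Assumption: exit code 0 is what the orchestrator reports as COMPLETED.
    key = "COMPLETED" if exit_code == 0 else str(exit_code)
    if key in route["on_exit_code"]:
        return route["on_exit_code"][key]
    # Any other non-zero exit falls through to the on_failed route.
    return route.get("on_failed")

print(next_task("implement", 0))  # unit-test
print(next_task("implement", 3))  # fix-tests
```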

The unit-test task verifies that commits exist, shows the diff, then detects and runs the project’s test suite: it checks for Makefile, Cargo.toml, package.json, go.mod, or pytest and runs whichever it finds. If no commits were made, it fails immediately. If tests fail, it routes to fix-tests. The self-verify task runs a separate AI reviewer session that runs tests, checks correctness, checks security, and verifies that the implementation matches the issue. A fresh context catches mistakes the implementer’s context was blind to. If self-verify cannot resolve a problem, create-pr still runs, but the PR body explicitly states what remains unresolved. Silently creating PRs with known failures is a common failure mode in imperative systems, so I designed against it.
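
Test-suite detection of this kind can be sketched as a marker-file lookup. The commands below and the use of pytest.ini as the pytest marker are assumptions for illustration; the actual workflow script may probe differently.

```python
import tempfile
from pathlib import Path

# Marker file -> test command, checked in order.
# The commands and the pytest.ini marker are assumptions, not the workflow's own.
TEST_COMMANDS = [
    ("Makefile", ["make", "test"]),
    ("Cargo.toml", ["cargo", "test"]),
    ("package.json", ["npm", "test"]),
    ("go.mod", ["go", "test", "./..."]),
    ("pytest.ini", ["pytest"]),
]

def detect_test_command(repo_dir):
    """Return the command for the first marker file found, or None."""
    repo = Path(repo_dir)
    for marker, command in TEST_COMMANDS:
        if (repo / marker).is_file():
            return command
    return None

# Demo against throwaway directories.
with tempfile.TemporaryDirectory() as go_repo, tempfile.TemporaryDirectory() as bare_repo:
    (Path(go_repo) / "go.mod").write_text("module example\n")
    detected = detect_test_command(go_repo)
    missing = detect_test_command(bare_repo)

print(detected, missing)  # ['go', 'test', './...'] None
```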

Decision 3: Give Every Phase Its Own Timeout

The biggest operational gap in imperative agents is missing per-phase timeouts. I gave every task its own:

- task_type: plan
  timeout: 15m

- task_type: implement
  timeout: 45m

- task_type: unit-test
  timeout: 10m

- task_type: self-verify
  timeout: 15m

- task_type: cleanup
  always_run: true    # runs even if the job fails
  timeout: 1m

always_run: true on cleanup guarantees Formicary removes the workspace and branch regardless of outcome. Without it, stuck jobs leak temporary directories and dead branches indefinitely.
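
For readers who want the semantics spelled out, the following is a rough in-process analogy in Python: a per-phase timeout plus cleanup that runs no matter what. The run_phase helper and its return values are hypothetical; Formicary enforces these in the orchestrator, not in your code.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

def run_phase(command, timeout_seconds, workspace):
    """Run one phase with its own timeout; always remove the workspace afterwards."""
    try:
        result = subprocess.run(command, timeout=timeout_seconds, cwd=workspace)
        return "COMPLETED" if result.returncode == 0 else "FAILED"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return "TIMED_OUT"
    finally:
        # Same spirit as always_run: true on the cleanup task.
        shutil.rmtree(workspace, ignore_errors=True)

workspace = tempfile.mkdtemp(prefix="ws-")
hang = [sys.executable, "-c", "import time; time.sleep(30)"]  # simulates a hung model call
outcome = run_phase(hang, timeout_seconds=1, workspace=workspace)
print(outcome, Path(workspace).exists())  # TIMED_OUT False
```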

Decision 4: Flow Context Forward Through Artifacts

Imperative bots lose context between phases because each phase is a separate pod with no shared state. The planner’s work gets discarded. I solved this years ago with a shared workspace and an artifact chain.

Each task declares its dependencies, and Formicary downloads the upstream artifacts automatically:

- task_type: self-verify
  dependencies:
    - setup       # downloads meta.env
    - implement   # downloads impl_result.json, impl_conversation.txt, impl_diff.patch
  script:
    - |
      TASK_DIR="$PWD"          # capture executor dir before any cd
      source "$TASK_DIR/meta.env"
      cd "$WS/repo"
      # all artifacts available in $TASK_DIR/

One critical detail: save TASK_DIR="$PWD" before any cd. Artifacts must be written back to the executor’s working directory, not to the repo:

TASK_DIR="$PWD"
source "$TASK_DIR/meta.env"
cd "$WS/repo"
# ... do work ...
jq ... > "$TASK_DIR/result.json"   # write to TASK_DIR, not to repo

The implementer now reads PLAN.md that the planner wrote. Context survives across phases.

Decision 5: Use Nonces to Make Restarts Safe

One problem with the imperative implementation was that when a job retried a failed issue, it reused the same branch name and hit a Git conflict. In the workflow definition, I added a 4-byte random nonce (8 hex characters) to every branch:

NONCE=$(head -c 4 /dev/urandom | xxd -p)
BRANCH="ai/{{.IssueNumber}}-${SLUG}-${NONCE}"
# e.g., ai/42-fix-login-timeout-a3f1c09e

retry: 1 on the implement job submits a fresh attempt with a new nonce -> new branch -> no conflicts. The ai-gh-cleanup job removes stale branches after PR merge.
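
The same naming scheme can be sketched in Python. The slug derivation here is an assumption, since the post does not show how SLUG is computed, and branch_name is a hypothetical helper.

```python
import re
import secrets

def branch_name(issue_number, title, nonce_bytes=4):
    """Build a retry-safe branch name: ai/<issue>-<slug>-<nonce>."""
    # Assumed slug rule: lowercase, non-alphanumerics collapsed to hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    nonce = secrets.token_hex(nonce_bytes)  # 4 bytes -> 8 hex characters
    return f"ai/{issue_number}-{slug}-{nonce}"

first = branch_name(42, "Fix login timeout")
second = branch_name(42, "Fix login timeout")
print(first)
print(second)  # same issue and title, different nonce
```

Each retry calls the function again, so the new attempt lands on a fresh branch instead of colliding with the failed one.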

Decision 6: Stream Output and Extract Structured Status

I need two things simultaneously: real-time visibility of what the agent is doing, and structured status for routing decisions. claude --print streams output through tee, while the prompt instructs Claude to output a JSON status object on its final line:

claude --print --dangerously-skip-permissions --model "$MODEL" --max-turns 100 \
  "$(cat /tmp/impl_prompt.txt)" 2>&1 | tee "$TASK_DIR/impl_conversation.txt"

# Extract the last JSON object with a "status" key
STATUS_JSON=$(grep -oE '\{[^{}]*"status"[^{}]*\}' \
  "$TASK_DIR/impl_conversation.txt" | tail -1)
# Fall back to null when no status object was found; jq prints nothing for empty input
STATUS=$(echo "${STATUS_JSON:-null}" | jq -r '.status // "UNKNOWN"')
[ "$STATUS" = "BLOCKED" ] && exit 2
[ "$STATUS" = "TESTS_FAILING" ] && exit 3

--dangerously-skip-permissions is required. Without it, Claude only produces text describing what it would do: zero file changes, zero commits. With it, Claude actually reads files, writes code, and runs tests. Streaming through tee and parsing the final status line gives me four things at once: real-time streaming to the Formicary dashboard, exit-code routing from the status field, artifact data for downstream tasks, and the full AI conversation captured as a debuggable artifact.
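
The extraction step can also be written in Python, which avoids some shell quoting pitfalls. This is a sketch that reuses the shell version's regular expression; unlike the shell version, it falls back to an earlier match if the last one is not valid JSON, and mapping every other status to exit code 0 is an assumption.

```python
import json
import re

# Statuses with dedicated exit codes, matching the shell snippet above.
EXIT_CODES = {"BLOCKED": 2, "TESTS_FAILING": 3}

def extract_status(conversation):
    """Return the status from the last flat JSON object containing a "status" key."""
    matches = re.findall(r'\{[^{}]*"status"[^{}]*\}', conversation)
    for candidate in reversed(matches):
        try:
            return json.loads(candidate).get("status", "UNKNOWN")
        except json.JSONDecodeError:
            continue  # looked like JSON but was not; try an earlier match
    return "UNKNOWN"

def exit_code_for(status):
    # Assumption: any status without a dedicated code exits 0.
    return EXIT_CODES.get(status, 0)

log = 'exploring repo...\n{"status": "IN_PROGRESS"}\n{"status": "BLOCKED", "reason": "needs creds"}\n'
status = extract_status(log)
print(status, exit_code_for(status))  # BLOCKED 2
```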

Decision 7: Encode Methodology in Skills

I don’t ask Claude to “write some code.” I embed skill instructions that encode engineering discipline into every prompt. I wrote about this approach in depth in AI Writes Code, You Own the Design. The core idea is that freeform prompting produces inconsistent output, while skill-encoded prompting produces output that follows a contract.

claude --print --model opus --max-turns 30 \
  "Use the ygs-wbs skill approach:
   1. Explore the codebase
   2. Decompose into vertical-slice tasks
   3. Write PLANS/{issue-slug}-{number}-plan.md with acceptance criteria"

If you-got-skills is installed on the worker, Claude discovers /ygs-wbs as a slash command automatically. The prompt-embedded version works either way, with no dependency on the skills package being present.

The four skills that shape this pipeline:

Phase	Skill	What it enforces
plan	ygs-wbs	Vertical slices, acceptance criteria, explicit scope
implement	ygs-implement	Atomic commits, tests after each task, scope guardrails
fix-tests	ygs-investigate	Root cause analysis, not symptom masking
self-verify	ygs-code-review	Run tests, check correctness, fix critical issues

Each skill acts as a contract. “Plan vertically, commit atomically, stop when blocked” produces far more consistent and reviewable output than open-ended instructions.

Decision 8: Make the Dashboard Show What’s Happening

Formicary’s job description field accepts markdown. Every submitted implement job carries clickable links to the issue, branch, and PR:

{
  "job_type": "ai-gh-implement",
  "description": "#42: Fix login timeout | [org/repo](https://github.com/org/repo)",
  "params": {
    "IssueLink": "[#42: Fix login timeout](https://github.com/org/repo/issues/42)",
    "BranchLink": "[ai/42-fix-login-a3f1](https://github.com/org/repo/tree/ai/42-fix-login-a3f1)",
    "PRLink": ""
  }
}

The PRLink starts empty and the create-pr task populates it once the PR exists. Every job in the dashboard now shows exactly what it’s working on with one-click navigation to the relevant GitHub page.
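
A small helper can build that request body so the links stay consistent. The job_request function below is hypothetical; it only reproduces the JSON shape shown above and makes no claim about how the picker actually assembles it.

```python
import json

def job_request(org, repo, number, title, branch):
    """Build a job submission body with markdown links for the dashboard."""
    base = f"https://github.com/{org}/{repo}"
    return {
        "job_type": "ai-gh-implement",
        "description": f"#{number}: {title} | [{org}/{repo}]({base})",
        "params": {
            "IssueLink": f"[#{number}: {title}]({base}/issues/{number})",
            "BranchLink": f"[{branch}]({base}/tree/{branch})",
            "PRLink": "",  # filled in later, once the PR exists
        },
    }

payload = job_request("org", "repo", 42, "Fix login timeout", "ai/42-fix-login-a3f1")
print(json.dumps(payload, indent=2))
```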

Decision 9: Capture Everything as Artifacts

Every task uploads artifacts with when: always, so they are captured even on failure. This is what makes debugging possible rather than a guessing game:

Artifact	Contents
`plan_conversation.txt`	Full AI conversation during planning
`plan_result.json`	Status, complexity, task count, summary
`impl_conversation.txt`	Full AI conversation during implementation
`impl_result.json`	Status, files changed, commit count
`impl_diff.patch`	Complete git diff of all changes
`impl_commits.txt`	List of commits made
`test_output.txt`	Test suite output with pass/fail details
`verify_result.json`	Test pass/fail, critical findings, any fixes
`verify_conversation.txt`	Full AI conversation during self-verify

Every task also sets report_stdout: true, so Formicary streams output to the dashboard websocket in real time. Combined with tee, you see the full AI conversation live as it happens. The workspace also persists locally at ~/claude_workspace/{issue}-{nonce} so you can cd into it after a run and inspect exactly what happened.

Decision 10: Monitor PRs and Capture Learnings

Imperative bots typically run a PR comment poller that fires every few minutes, scanning for mentions. I replaced it with a task inside the implement job that lives as long as the PR stays open:

The monitor-pr task:

Polls for new PR review comments every 2 minutes
Feeds each new comment to Claude, applies the change, commits, and pushes
Replies on the PR confirming the fix
Tracks processed comment IDs in $WS/.processed_comments to avoid re-processing
Exits when the PR merges or closes
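
The processed-ID bookkeeping in that list can be sketched as follows. The one-ID-per-line file format and both helper names are assumptions; the real task may store the IDs differently.

```python
import tempfile
from pathlib import Path

def unprocessed(comments, state_file):
    """Return the comments whose IDs are not yet recorded in state_file."""
    path = Path(state_file)
    seen = set(path.read_text().split()) if path.exists() else set()
    return [c for c in comments if str(c["id"]) not in seen]

def mark_processed(comment_id, state_file):
    """Record an ID only after the comment has been handled successfully."""
    with Path(state_file).open("a") as f:
        f.write(f"{comment_id}\n")

with tempfile.TemporaryDirectory() as ws:
    state = Path(ws) / ".processed_comments"  # assumed format: one ID per line
    batch = [{"id": 101, "body": "rename this"}, {"id": 102, "body": "add a test"}]
    first_pass = unprocessed(batch, state)
    for comment in first_pass:
        mark_processed(comment["id"], state)
    # Next poll: the two old comments plus one new one.
    second_pass = unprocessed(batch + [{"id": 103, "body": "nit"}], state)

first_ids = [c["id"] for c in first_pass]
second_ids = [c["id"] for c in second_pass]
print(first_ids, second_ids)  # [101, 102] [103]
```

Marking a comment only after it has been handled means a crash mid-loop causes a retry of that comment rather than a silent skip.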

The learn task runs after the PR closes. It reviews all PR comments, reviewer feedback, and the implementation conversation, then writes a structured learning entry to ~/claude_workspace/learn_context/ using the ygs-learn skill methodology: what went well, what to improve, patterns to remember for this codebase. Over time the agent gets better at this specific repo, not just better in general.

- task_type: monitor-pr
  method: SHELL
  timeout: 24h

- task_type: learn
  method: SHELL
  # reviews PR feedback, writes to ~/claude_workspace/learn_context/

Decision 11: Support Multiple Trackers with Minimal Changes

The pipeline is intentionally tracker-agnostic. Only three tasks touch the issue tracker API: gather-issues in the picker, plus create-pr and monitor-pr in the implement job. Everything else (plan, implement, unit-test, self-verify, learn) works identically regardless of tracker.

To support Jira and Bitbucket, I cloned the YAML files and swapped six commands:

gh issue list -> acli jira search --jql ...
gh issue edit -> acli jira issue update
git clone git@github.com: -> git clone git@bitbucket.org:
gh pr create -> acli bitbucket pr create
gh pr view -> acli bitbucket pr get
gh api .../comments -> acli bitbucket pr comment list

Result: ai-jira-issue-picker.yaml and ai-jira-implement.yaml, the same complete pipeline, different API calls. Both use the Atlassian CLI (acli) configured at ~/.config/acli/config.json.

What Formicary Gives You Without Writing a Line

When I started applying Formicary to agentic coding, I wasn’t sure it had everything needed. It had almost all of it already:

Cron scheduling: 7-field syntax (including seconds)
Per-task timeouts: the feature imperative bots most consistently lack
Exit-code routing (on_exit_code): conditional DAG without custom code
always_run: true: guaranteed cleanup regardless of failure
Artifact passing: between tasks via S3
Encrypted secrets: automatic log redaction
max_concurrency: capacity management declared in YAML
retry + delay_between_retries: automatic backoff
Go template functions: variable substitution in scripts
SHELL executor: runs on a laptop with no Kubernetes
KUBERNETES executor: production-grade pod-per-task isolation
Markdown in job descriptions: visible, clickable in the dashboard

Two additions were made specifically for this use case.

Native Kubernetes secret injection. The naive pattern passes API keys through the orchestrator as template variables, which stores them in the job definition. The new pattern lets the kubelet inject them at pod start time, so the value never touches Formicary:

container:
  image: ghcr.io/formicary-ai/agent-worker:latest
  env_from:
    - secret_ref: claude-bedrock-settings
    - secret_ref: ai-agent-secrets

Or for a single named key:

container:
  env_value_from:
    - name: ANTHROPIC_API_KEY
      secret_name: ai-agent-secrets
      key: anthropic-api-key

Per-task service accounts work the same way for IRSA on AWS or Workload Identity on GCP:

container:
  service_account: ai-agent-irsa-sa

CountByJobTypeAndState template function. The original capacity check made an HTTP API call requiring a token, an available endpoint, and network round-trip time. The new function queries the job database directly at the scheduler level before any worker is allocated:

skip_if: >-
  {{if ge (CountByJobTypeAndState "ai-gh-implement" "PENDING" "EXECUTING") 10}} true {{end}}

If the count hits the threshold, Formicary skips the entire job invocation with zero cost. The script also does a fine-grained check using the configurable MaxPendingJobs variable. Two layers: cheap early termination at the scheduler, tunable limits inside the task.
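
The two layers can be illustrated with a pair of small functions. Both are sketches: should_skip mirrors the ge comparison in skip_if, while fetch_limit is a guess at what the in-task check does with MaxPendingJobs, since the post does not show that script.

```python
def should_skip(counts, states=("PENDING", "EXECUTING"), threshold=10):
    """Scheduler-level gate: skip when jobs in the given states reach the threshold."""
    in_flight = sum(counts.get(state, 0) for state in states)
    return in_flight >= threshold  # same comparison as `ge` in the template

def fetch_limit(counts, max_pending_jobs):
    """Task-level check (assumed behavior): how many more issues may be picked up."""
    return max(0, max_pending_jobs - counts.get("PENDING", 0))

counts = {"PENDING": 6, "EXECUTING": 4, "COMPLETED": 120}
skip = should_skip(counts)
limit = fetch_limit(counts, max_pending_jobs=8)
print(skip, limit)  # True 2
```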

The Numbers

Metric	Imperative Bot	Formicary Declarative
Lines of orchestration code	~50,000 LOC	~700 lines YAML
State machine states	12+	0 (implicit in DAG)
Custom pollers	Multiple	0
Per-phase timeouts	None	Yes, per-task
Context between phases	Lost (new pod, new session)	Preserved via artifact chain
Runs locally without K8s	No	Yes (SHELL executor)
K8s isolation in production	Pod-per-job	Pod-per-task
Time to add a new phase	Hours to days	Minutes (copy task block, change prompt)
Restart safety	Branch conflicts	Nonce-based, no conflicts
Real-time output	Text logs only	Dashboard streaming + tee
Diagnostics on failure	Text logs	Full AI conversations + diffs as artifacts
Capacity check cost	HTTP API call	DB query at scheduler level
Verification	Limited	unit-test + self-verify (separate AI session)
Multi-tracker	One tracker hardcoded	Clone YAML, swap 6 commands
Continuous learning	None	learn task after every PR close
Secret injection	Env vars on host	Native Kubernetes env_from / env_value_from

Getting Started

Option A: SHELL executor (local dev, fastest path)

This is where to start. The SHELL executor runs scripts directly on the host and inherits ~/.claude/settings.json, gh auth login, and all other host credentials automatically, so no secrets configuration is needed.

# 1. Prerequisites (one-time)
npm install -g @anthropic-ai/claude-code
gh auth login

# 2. Start Formicary (queen + embedded ant worker)
docker pull plexobject/formicary
docker run -p 7777:7777 plexobject/formicary   # publish the port used in steps 4 and 6

# 3. Deploy workflow definitions
git clone https://github.com/bhatti/formicary.git
cd formicary/docs/examples
./deploy-ai-workflows.sh --mode shell --repo your-org/your-repo --setup-labels

# 4. Set org config so the picker knows where to look
curl -X POST http://localhost:7777/api/orgs/default/configs \
  -H 'Content-Type: application/json' \
  -d '{"name":"GitHubOrg","value":"your-org"}'
curl -X POST http://localhost:7777/api/orgs/default/configs \
  -H 'Content-Type: application/json' \
  -d '{"name":"GitHubRepo","value":"your-repo"}'

# 5. Label an issue — the picker fires within 1 minute
gh issue edit 1 --repo your-org/your-repo --add-label "ai-ready"

# 6. Watch it run
open http://localhost:7777

Option B: Kubernetes with Bedrock via Tailscale

Pods can’t resolve Tailscale hostnames by name, but they can reach the IP. Resolve it once:

TAILSCALE_IP=$(python3 -c "import socket; print(socket.gethostbyname('ai'))")

kubectl create namespace formicary-ai

kubectl create secret generic claude-bedrock-settings \
  --namespace=formicary-ai \
  --from-literal=ANTHROPIC_BEDROCK_BASE_URL=http://${TAILSCALE_IP}/bedrock \
  --from-literal=CLAUDE_CODE_USE_BEDROCK=1 \
  --from-literal=CLAUDE_CODE_SKIP_BEDROCK_AUTH=1 \
  --from-literal=ANTHROPIC_DEFAULT_OPUS_MODEL=us.anthropic.claude-opus-4-6-v1 \
  --from-literal=ANTHROPIC_DEFAULT_SONNET_MODEL=us.anthropic.claude-sonnet-4-6 \
  --from-literal=ANTHROPIC_DEFAULT_HAIKU_MODEL=us.anthropic.claude-haiku-4-5-20251001-v1:0

kubectl create secret generic ai-agent-secrets \
  --namespace=formicary-ai \
  --from-literal=github-token=$(gh auth token)

If the Tailscale IP changes, regenerate the secret with --dry-run=client -o yaml | kubectl apply -f -.

Option C: Standard Anthropic API key

kubectl create secret generic ai-agent-secrets \
  --from-literal=anthropic-api-key=sk-ant-... \
  --from-literal=github-token=$(gh auth token)

Job YAMLs reference it with env_value_from, so the key is injected by the kubelet and never passes through Formicary.

Ten Lessons

Timeouts are not optional. AI models hang. Give every phase its own timeout. A global job timeout is not a substitute: when the plan phase hangs, you want to retry that phase, not restart the whole job from scratch.
Structured JSON output unlocks routing. Ask the AI to output {"status": "DONE|BLOCKED|TESTS_FAILING", ...} on its final line. Route on that field. Extract metadata for dashboards.
Flow context forward. If planning and implementation run in separate sessions with no shared artifacts, the implementer re-explores the entire codebase and discards all planning work. Pass PLAN.md as an artifact. Cost and quality both improve.
Use nonces for idempotency. Branch names, workspace paths, and artifact names all need a per-run nonce. Never reuse a name across retry attempts.
Guarantee cleanup. Set always_run: true on cleanup tasks. Workspaces and branches accumulate fast. One stuck job should not leave garbage forever.
Let the orchestrator manage capacity. Set max_concurrency on the job and use skip_if with a scheduler-level DB query. Don’t write custom capacity management code; it will be wrong.
Skills are the real leverage. The quality gap between freeform prompting and methodology-encoded prompting is large. Invest in skill definitions. The skill is a contract: “plan vertically, commit atomically, stop when blocked.” Consistent contracts produce consistent, reviewable output. I covered this in depth in AI Writes Code, You Own the Design.
Declarative wins operationally. Adding a security review phase to the declarative version takes minutes: copy a task block, write a prompt, add an on_completed route. The same change to an imperative system takes days. The asymmetry grows with every phase you add.
Capture everything on failure. Upload artifacts with when: always. When something fails, you want the full AI conversation, the git diff, and the test output — not just “job failed.”
Build a feedback loop. Most AI coding systems run, merge, and forget. The learn task after every PR close gives the agent a memory of what works and what doesn’t in this specific codebase. Over time, that compounds.

References

The job definitions described in this post are in docs/examples/ in the Formicary repository. See docs/ai-agents.md for the full setup guide.

Shahzad Bhatti Welcome to my ramblings and rants!
