Background
I have built a number of agentic systems over the last year. I built a PII detection system with LangChain and Vertex AI that scans documents and redacts sensitive data without human review. I built an API compatibility guardian using LangGraph that catches breaking changes before they reach production. And I built a production-grade enterprise AI platform on vLLM serving multiple teams and use cases. Most recently I wrote a complete guide to production AI agents with MCP and A2A.
Alongside that work I have been using agentic coding tools heavily by letting AI write code while I own the architecture and design. I documented that approach in AI Writes Code, You Own the Design, which covers how to use skills with structured methodology files to make AI coding agents produce consistent, reviewable, architecturally sound output instead of chaos.
But there’s a deeper layer of context. Over ten years ago, before GitHub Actions and GitLab Runner existed as concepts, I built a distributed orchestration engine for automating heterogeneous tasks with declarative syntax. It used Docker, Kubernetes, shell scripts, and custom worker types to handle diverse workloads. The core insight then is the same insight that applies now: scheduling, fault tolerance, retries, timeouts, observability, and capacity management are solved problems. Your application should not implement them. That engine became Formicary, which I open-sourced. This post shows how I applied Formicary to automated agentic coding workflows and why enterprises keep making the same expensive mistake.
The Problem I Keep Seeing
When teams build AI coding agents, meaning systems that pick up GitHub issues, plan implementations, write code, run tests, and open PRs, they reach for the obvious approach: a coordinator process, a state machine, custom pollers. The initial version works. Then it accumulates. I have seen enterprises build custom solutions with 50K+ lines of TypeScript. Look inside these systems and you find the same failure modes every time:
- No per-phase timeouts. If the AI model hangs during implementation, the process runs until a global job timeout kills it — often 90 minutes later, after consuming an expensive model session and blocking other work.
- Silent work drop. When the worker pool fills, the system silently skips newly discovered issues instead of queuing them.
- Context loss between phases. The planner writes a plan file. The implementer starts a fresh AI session and re-explores the entire codebase from scratch. The planning work gets thrown away.
- Custom DAG reinvention. The state machine handles branching: tests fail -> retry, model blocked -> notify human. This is just a DAG with exit-code routing. It’s already solved, and the custom version is always underpowered.
- Crummy restarts. Retry a failed issue and the agent reuses the same branch name. Git conflict. Failure. Start over.
- Infrastructure lock-in. You can’t run it on a laptop because it’s tangled with Kubernetes pod lifecycle management.
- High cost per new feature. Adding a security review phase means new state transitions, new code, and a new deployment, which takes days of engineering time.
The root mistake is treating orchestration as application logic. These teams write scheduling, capacity management, artifact passing, observability, and retry logic inside their agent code. Every one of those concerns is already solved by mature orchestration frameworks. Stop writing that code.
The Declarative Replacement
I worked with a 50K+ line TypeScript agent system in an enterprise environment and replaced it with a few declarative workflow definitions:
- ai-gh-issue-picker.yaml (~100 lines): polls GitHub, submits jobs
- ai-gh-implement.yaml (~500 lines): plan -> implement -> test -> verify -> PR -> monitor -> learn
- ai-gh-cleanup.yaml (~80 lines): stale workspace and branch cleanup
No orchestration code. No state machine. No custom pollers. No retry logic. No timeout management. Formicary handles all of it.
Here is every decision, with the reasoning.
Decision 1: Replace Custom Pollers with a Cron Job
Custom polling processes run continuously, consume resources, and require their own deployment lifecycle. I replaced the GitHub issue poller with a Formicary cron job:

job_type: ai-gh-issue-picker
cron_trigger: "0 * * * * * *" # every minute (7-field cron)
max_concurrency: 1 # only one picker at a time
skip_if: >-
  {{if ge (CountByJobTypeAndState "ai-gh-implement" "PENDING") 10}} true {{end}}
The skip_if fires at the scheduler level, before any worker is allocated and before any task runs. If 10 implement jobs are already pending, Formicary skips the entire picker invocation silently. Zero worker cost.
The gather-issues task fetches GitHub issues labeled ai-ready, moves each label to ai-in-progress, and writes a compact issues.json. I wrote it in Python rather than bash because Python eliminates the jq/base64/subshell-scoping traps that plagued the original version:
import json, os, subprocess
repo = f"{os.environ['GH_ORG']}/{os.environ['GH_REPO']}"
def gh(*args):
    r = subprocess.run(["gh"] + list(args), capture_output=True, text=True)
    return r
r = gh("issue", "list", "-R", repo,
       "--label", os.environ["PICKUP_LABEL"], "--state", "open",
       "--limit", os.environ.get("MAX_PENDING", "10"),
       "--json", "number,title,url")
issues = json.loads(r.stdout) if r.returncode == 0 else []
for issue in issues:
    gh("issue", "edit", str(issue["number"]), "-R", repo,
       "--remove-label", os.environ["PICKUP_LABEL"],
       "--add-label", os.environ["INPROGRESS_LABEL"])
issues_json = json.dumps(issues, separators=(',', ':'))
with open("issues.json", "w") as f:
    f.write(issues_json + "\n")
print(f"::set-output name=IssuesJSON::{issues_json}")
The submit-jobs task uses SubmitJobsFromJSON, a Formicary template function that submits one implement job per issue directly through the DB. A unique index on user_key (keyed as ai-gh-implement-{org}-{repo}-{number}) rejects duplicate submissions at the constraint level. No pre-flight lookups, no race conditions:
environment:
  SUBMITTED_IDS: >-
    {{if .IssuesJSON}}{{SubmitJobsFromJSON "ai-gh-implement" .IssuesJSON
    (printf "GitHubOrg=%s" .GitHubOrg) (printf "GitHubRepo=%s" .GitHubRepo)}}{{end}}
  PENDING_COUNT: '{{CountByJobTypeAndState "ai-gh-implement" "PENDING"}}'
Decision 2: Replace the State Machine with a DAG
A 12-state custom state machine becomes a named DAG in YAML. The full pipeline looks like this:

Exit-code routing handles every branch. No code required:
- task_type: implement
  on_exit_code:
    COMPLETED: unit-test
    "2": notify-blocked # model signals blocked
    "3": fix-tests # tests failing
  on_failed: notify-blocked
The unit-test task verifies that commits exist, shows the diff, then detects and runs the project’s test suite: it checks for Makefile, Cargo.toml, package.json, go.mod, or pytest and runs whichever it finds. If no commits were made, it fails immediately. If tests fail, it routes to fix-tests.
The self-verify task runs a separate AI reviewer session that runs tests, checks correctness, checks security, and verifies the implementation matches the issue. A fresh context catches mistakes the implementer’s context was blind to. If self-verify cannot resolve a problem, create-pr still runs but the PR body explicitly states what remains unresolved. Silently creating PRs with known failures is a common failure mode in imperative systems, and I designed against it.
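The detection step in unit-test amounts to a marker-file lookup. The sketch below is only an illustration of that idea, not the workflow's actual script: the function name detect_test_command, the command chosen for each marker, and the use of pytest.ini as the pytest marker are all assumptions.

```python
from pathlib import Path
from typing import Optional

# Marker file -> test command, checked in this order.
# The commands are illustrative assumptions; a real workflow may use other targets.
TEST_COMMANDS = [
    ("Makefile", ["make", "test"]),
    ("Cargo.toml", ["cargo", "test"]),
    ("package.json", ["npm", "test"]),
    ("go.mod", ["go", "test", "./..."]),
    ("pytest.ini", ["pytest"]),  # assumed marker for pytest projects
]


def detect_test_command(repo_dir: str) -> Optional[list]:
    """Return the command for the first marker file found, or None."""
    root = Path(repo_dir)
    for marker, command in TEST_COMMANDS:
        if (root / marker).is_file():
            return command
    return None
```

Returning None rather than guessing lets the caller decide whether a repository without a recognizable test suite should fail the task or pass through.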
Decision 3: Give Every Phase Its Own Timeout
The biggest operational gap in imperative agents is missing per-phase timeouts. I gave every task its own:
- task_type: plan
  timeout: 15m
- task_type: implement
  timeout: 45m
- task_type: unit-test
  timeout: 10m
- task_type: self-verify
  timeout: 15m
- task_type: cleanup
  always_run: true # runs even if the job fails
  timeout: 1m
always_run: true on cleanup guarantees Formicary removes the workspace and branch regardless of outcome. Without it, stuck jobs leak temporary directories and dead branches indefinitely.
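To make the semantics concrete, here is a minimal Python sketch of what per-phase timeouts plus guaranteed cleanup mean. It is not how Formicary implements them; run_phases and its argument shape are hypothetical names used only for illustration.

```python
import subprocess


def run_phases(phases, cleanup):
    """Run (name, argv, timeout_seconds) phases in order; always call cleanup.

    Returns the name of the phase that failed or timed out, or None if all passed.
    """
    failed = None
    try:
        for name, argv, timeout_s in phases:
            try:
                # subprocess.run kills the child and raises if the budget is exceeded
                result = subprocess.run(argv, timeout=timeout_s)
            except subprocess.TimeoutExpired:
                failed = name  # only this phase's budget was spent
                break
            if result.returncode != 0:
                failed = name
                break
    finally:
        cleanup()  # plays the role of always_run: true
    return failed
```

Because each phase carries its own budget, a hung implement phase is stopped after its own limit rather than after a global job limit, and the caller learns which phase to retry.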
Decision 4: Flow Context Forward Through Artifacts
Imperative bots lose context between phases because each phase is a separate pod with no shared state. The planner’s work gets discarded. I solved this years ago with a shared workspace and an artifact chain:

Each task declares its dependencies and Formicary downloads the upstream artifacts automatically:
- task_type: self-verify
  dependencies:
    - setup # downloads meta.env
    - implement # downloads impl_result.json, impl_conversation.txt, impl_diff.patch
  script:
    - |
      TASK_DIR="$PWD" # capture executor dir before any cd
      source "$TASK_DIR/meta.env"
      cd "$WS/repo"
      # all artifacts available in $TASK_DIR/
One critical detail: save TASK_DIR="$PWD" before any cd. Artifacts must be written back to the executor’s working directory, not to the repo:
TASK_DIR="$PWD"
source "$TASK_DIR/meta.env"
cd "$WS/repo"
# ... do work ...
jq ... > "$TASK_DIR/result.json" # write to TASK_DIR, not to repo
The implementer now reads PLAN.md that the planner wrote. Context survives across phases.
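The same pitfall applies if a task script is written in Python rather than shell. The sketch below assumes a hypothetical helper, run_in_repo, and shows the pattern only: capture the task directory first, work inside the repository, and write the artifact back to the captured directory.

```python
import json
import os
from pathlib import Path


def run_in_repo(repo_dir: str, result: dict) -> Path:
    """Capture the task dir, work inside the repo, write the artifact back."""
    task_dir = Path.cwd()   # equivalent of TASK_DIR="$PWD"
    os.chdir(repo_dir)      # equivalent of cd "$WS/repo"
    try:
        # ... do work inside the repository ...
        out = task_dir / "result.json"  # artifact goes to the task dir, not the repo
        out.write_text(json.dumps(result))
        return out
    finally:
        os.chdir(task_dir)  # restore the original working directory
```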
Decision 5: Use Nonces to Make Restarts Safe
One issue with imperative implementation was that when a job retried a failed issue, it reused the same branch name. Git conflict. In the workflow definition, I added a 4-byte random hex nonce to every branch:
NONCE=$(head -c 4 /dev/urandom | xxd -p)
BRANCH="ai/{{.IssueNumber}}-${SLUG}-${NONCE}"
# e.g., ai/42-fix-login-timeout-a3f1
retry: 1 on the implement job submits a fresh attempt with a new nonce -> new branch -> no conflicts. The ai-gh-cleanup job removes stale branches after PR merge.
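For readers who script this step in Python, a minimal equivalent might look like the following. The slug rule is an assumption (the post does not show how SLUG is derived), and make_branch is a hypothetical name. Note that 4 random bytes render as eight hex characters.

```python
import re
import secrets


def make_branch(issue_number: int, title: str, max_slug_len: int = 30) -> str:
    """Build ai/<issue>-<slug>-<nonce>; the nonce differs on every call."""
    # Assumed slug rule: lowercase, non-alphanumerics collapsed to single hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    slug = slug[:max_slug_len].strip("-")
    nonce = secrets.token_hex(4)  # 4 random bytes -> 8 hex characters
    return f"ai/{issue_number}-{slug}-{nonce}"
```

Calling make_branch twice for the same issue yields two different branch names, which is what makes a retry safe.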
Decision 6: Stream Output and Extract Structured Status
I need two things simultaneously: real-time visibility of what the agent is doing, and structured status for routing decisions. claude --print streams output through tee, while the prompt instructs Claude to output a JSON status object on its final line:
claude --print --dangerously-skip-permissions --model "$MODEL" --max-turns 100 \
"$(cat /tmp/impl_prompt.txt)" 2>&1 | tee "$TASK_DIR/impl_conversation.txt"
# Extract the last JSON object with a "status" key
STATUS_JSON=$(grep -oE '\{[^{}]*"status"[^{}]*\}' \
"$TASK_DIR/impl_conversation.txt" | tail -1)
STATUS=$(echo "$STATUS_JSON" | jq -r '.status // "UNKNOWN"')
[ "$STATUS" = "BLOCKED" ] && exit 2
[ "$STATUS" = "TESTS_FAILING" ] && exit 3
--dangerously-skip-permissions is required. Without it, Claude only produces text describing what it would do: zero file changes, zero commits. With it, Claude actually reads files, writes code, and runs tests. This gives me four things at once: real-time streaming to the Formicary dashboard, exit-code routing from the status field, artifact data for downstream tasks, and the full AI conversation captured as a debuggable artifact.
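The grep and jq steps can also be expressed in Python. The sketch below reuses the same regular expression and the same exit codes (2 for BLOCKED, 3 for TESTS_FAILING). One deliberate difference, which is my assumption rather than the workflow's behavior: if the last match is not valid JSON, it falls back to earlier matches instead of giving up. The function names are hypothetical.

```python
import json
import re

EXIT_CODES = {"BLOCKED": 2, "TESTS_FAILING": 3}
# Same pattern as the grep: a flat {...} object that contains a "status" key
STATUS_OBJECT = re.compile(r'\{[^{}]*"status"[^{}]*\}')


def extract_status(conversation: str) -> str:
    """Return the status from the last valid flat JSON object with "status"."""
    for candidate in reversed(STATUS_OBJECT.findall(conversation)):
        try:
            value = json.loads(candidate).get("status")
        except json.JSONDecodeError:
            continue  # looked like JSON but was not; try an earlier match
        if isinstance(value, str):
            return value
    return "UNKNOWN"


def exit_code_for(status: str) -> int:
    """Map a status to the exit code used for routing; anything else is 0."""
    return EXIT_CODES.get(status, 0)
```

Like the grep pattern, this only matches flat objects, so the status object the model prints should not contain nested braces.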
Decision 7: Encode Methodology in Skills
I don’t ask Claude to “write some code.” I embed skill instructions that encode engineering discipline into every prompt. I wrote about this approach in depth in AI Writes Code, You Own the Design. The core idea is that freeform prompting produces inconsistent output, while skill-encoded prompting produces output that follows a contract.
claude --print --model opus --max-turns 30 \
"Use the ygs-wbs skill approach:
1. Explore the codebase
2. Decompose into vertical-slice tasks
3. Write PLANS/{issue-slug}-{number}-plan.md with acceptance criteria"
If you-got-skills is installed on the worker, Claude discovers /ygs-wbs as a slash command automatically. The prompt-embedded version works either way, with no dependency on the skills package being present.
The four skills that shape this pipeline:
| Phase | Skill | What it enforces |
|---|---|---|
| plan | ygs-wbs | Vertical slices, acceptance criteria, explicit scope |
| implement | ygs-implement | Atomic commits, tests after each task, scope guardrails |
| fix-tests | ygs-investigate | Root cause analysis, not symptom masking |
| self-verify | ygs-code-review | Run tests, check correctness, fix critical issues |
Each skill acts as a contract. “Plan vertically, commit atomically, stop when blocked” produces far more consistent and reviewable output than open-ended instructions.
Decision 8: Make the Dashboard Show What’s Happening
Formicary’s job description field accepts markdown. Every submitted implement job carries clickable links to the issue, branch, and PR:
{
"job_type": "ai-gh-implement",
"description": "#42: Fix login timeout | [org/repo](https://github.com/org/repo)",
"params": {
"IssueLink": "[#42: Fix login timeout](https://github.com/org/repo/issues/42)",
"BranchLink": "[ai/42-fix-login-a3f1](https://github.com/org/repo/tree/ai/42-fix-login-a3f1)",
"PRLink": ""
}
}
The PRLink starts empty and the create-pr task populates it once the PR exists. Every job in the dashboard now shows exactly what it’s working on with one-click navigation to the relevant GitHub page.
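A small helper can assemble that payload consistently. This is a sketch under assumptions: build_job_request is a hypothetical name, the field names are copied from the JSON above, and titles containing markdown special characters such as ] are not escaped.

```python
def build_job_request(org: str, repo: str, number: int, title: str, branch: str) -> dict:
    """Build the implement-job payload with markdown links for the dashboard."""
    base = f"https://github.com/{org}/{repo}"
    return {
        "job_type": "ai-gh-implement",
        "description": f"#{number}: {title} | [{org}/{repo}]({base})",
        "params": {
            "IssueLink": f"[#{number}: {title}]({base}/issues/{number})",
            "BranchLink": f"[{branch}]({base}/tree/{branch})",
            "PRLink": "",  # stays empty until create-pr fills it in
        },
    }
```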
Decision 9: Capture Everything as Artifacts
Every task uploads artifacts with when: always, including on failure. This is what makes debugging possible rather than a guessing game:
| Artifact | Contents |
|---|---|
| plan_conversation.txt | Full AI conversation during planning |
| plan_result.json | Status, complexity, task count, summary |
| impl_conversation.txt | Full AI conversation during implementation |
| impl_result.json | Status, files changed, commit count |
| impl_diff.patch | Complete git diff of all changes |
| impl_commits.txt | List of commits made |
| test_output.txt | Test suite output with pass/fail details |
| verify_result.json | Test pass/fail, critical findings, any fixes |
| verify_conversation.txt | Full AI conversation during self-verify |
Every task also sets report_stdout: true, so Formicary streams output to the dashboard websocket in real time. Combined with tee, you see the full AI conversation live as it happens. The workspace also persists locally at ~/claude_workspace/{issue}-{nonce} so you can cd into it after a run and inspect exactly what happened.
Decision 10: Monitor PRs and Capture Learnings
Imperative bots typically run a PR comment poller that fires every few minutes, scanning for mentions. I replaced it with a task inside the implement job that lives as long as the PR stays open:

The monitor-pr task:
- Polls for new PR review comments every 2 minutes
- Feeds each new comment to Claude, applies the change, commits, and pushes
- Replies on the PR confirming the fix
- Tracks processed comment IDs in $WS/.processed_comments to avoid re-processing
- Exits when the PR merges or closes
The learn task runs after the PR closes. It reviews all PR comments, reviewer feedback, and the implementation conversation, then writes a structured learning entry to ~/claude_workspace/learn_context/ using the ygs-learn skill methodology: what went well, what to improve, patterns to remember for this codebase. Over time the agent gets better at this specific repo, not just better in general.
- task_type: monitor-pr
  method: SHELL
  timeout: 24h
- task_type: learn
  method: SHELL
  # reviews PR feedback, writes to ~/claude_workspace/learn_context/
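The processed-comment tracking in monitor-pr is the piece that keeps a long-lived polling loop idempotent. A minimal sketch follows, assuming one comment ID per line in the tracking file (the actual file format is not shown in this post) and hypothetical function names.

```python
from pathlib import Path


def load_processed(path: Path) -> set:
    """Read one comment ID per line; a missing file means nothing processed yet."""
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def new_comments(comments: list, path: Path) -> list:
    """Return the comments whose id is not yet recorded in the tracking file."""
    seen = load_processed(path)
    return [c for c in comments if str(c["id"]) not in seen]


def mark_processed(comment_id, path: Path) -> None:
    """Append the ID once the fix is pushed, so a crash before that retries it."""
    with path.open("a") as f:
        f.write(f"{comment_id}\n")
```

Recording the ID only after the fix is pushed means a crash mid-comment leads to a retry on the next poll rather than a silently skipped review comment.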
Decision 11: Support Multiple Trackers with Minimal Changes
The pipeline is intentionally tracker-agnostic. Only two places touch the issue tracker API: gather-issues in the picker, and create-pr plus monitor-pr in the implement job. Everything else (plan, implement, unit-test, self-verify, learn) works identically regardless of tracker.
To support Jira and Bitbucket, I cloned the YAML files and swapped six commands:
- gh issue list -> acli jira search --jql ...
- gh issue edit -> acli jira issue update
- git clone git@github.com: -> git clone git@bitbucket.org:
- gh pr create -> acli bitbucket pr create
- gh pr view -> acli bitbucket pr get
- gh api .../comments -> acli bitbucket pr comment list
Result: ai-jira-issue-picker.yaml and ai-jira-implement.yaml, the same complete pipeline, different API calls. Both use the Atlassian CLI (acli) configured at ~/.config/acli/config.json.
What Formicary Gives You Without Writing a Line
When I started applying Formicary to agentic coding, I wasn’t sure it had everything needed. It had almost all of it already:
- Cron scheduling with 7-field syntax (including seconds)
- Per-task timeouts: the feature imperative bots most consistently lack
- Exit-code routing (on_exit_code): conditional DAG without custom code
- always_run: true: guaranteed cleanup regardless of failure
- Artifact passing between tasks via S3
- Encrypted secrets with automatic log redaction
- max_concurrency: capacity management declared in YAML
- retry + delay_between_retries: automatic backoff
- Go template functions: variable substitution in scripts
- SHELL executor: runs on a laptop with no Kubernetes
- KUBERNETES executor: production-grade pod-per-task isolation
- Markdown in job descriptions: visible, clickable in the dashboard
Two additions were made specifically for this use case.
Native Kubernetes secret injection. The naive pattern passes API keys through the orchestrator as template variables, which stores them in the job definition. The new pattern lets the kubelet inject them at pod start time, so the value never touches Formicary:
container:
  image: ghcr.io/formicary-ai/agent-worker:latest
  env_from:
    - secret_ref: claude-bedrock-settings
    - secret_ref: ai-agent-secrets
Or for a single named key:
container:
  env_value_from:
    - name: ANTHROPIC_API_KEY
      secret_name: ai-agent-secrets
      key: anthropic-api-key
Per-task service accounts work the same way for IRSA on AWS or Workload Identity on GCP:
container:
  service_account: ai-agent-irsa-sa
CountByJobTypeAndState template function. The original capacity check made an HTTP API call requiring a token, an available endpoint, and network round-trip time. The new function queries the job database directly at the scheduler level before any worker is allocated:
skip_if: >-
  {{if ge (CountByJobTypeAndState "ai-gh-implement" "PENDING" "EXECUTING") 10}} true {{end}}
If the count hits the threshold, Formicary skips the entire job invocation with zero cost. The script also does a fine-grained check using the configurable MaxPendingJobs variable. Two layers: cheap early termination at the scheduler, tunable limits inside the task.
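To illustrate why a direct count is cheaper than an HTTP round trip, here is a sketch of the query such a function might run. The jobs table, its columns, and both function names are hypothetical; Formicary's real schema is not shown in this post. SQLite stands in for the job database.

```python
import sqlite3


def count_by_type_and_state(conn, job_type, *states):
    """Count jobs of one type that are in any of the given states."""
    placeholders = ",".join("?" for _ in states)
    row = conn.execute(
        f"SELECT COUNT(*) FROM jobs WHERE job_type = ? AND state IN ({placeholders})",
        (job_type, *states),
    ).fetchone()
    return row[0]


def should_skip(conn, limit=10):
    """Mirror the skip_if expression: skip once pending + executing hits the limit."""
    active = count_by_type_and_state(conn, "ai-gh-implement", "PENDING", "EXECUTING")
    return active >= limit
```

The check is one indexed count with no token, endpoint, or network hop, which is why it can run before any worker is allocated.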
The Numbers
| Metric | Imperative Bot | Formicary Declarative |
|---|---|---|
| Lines of orchestration code | ~50,000 LOC | ~700 lines YAML |
| State machine states | 12+ | 0 (implicit in DAG) |
| Custom pollers | Multiple | 0 |
| Per-phase timeouts | None | Yes, per-task |
| Context between phases | Lost (new pod, new session) | Preserved via artifact chain |
| Runs locally without K8s | No | Yes (SHELL executor) |
| K8s isolation in production | Pod-per-job | Pod-per-task |
| Time to add a new phase | Hours to days | Minutes (copy task block, change prompt) |
| Restart safety | Branch conflicts | Nonce-based, no conflicts |
| Real-time output | Text logs only | Dashboard streaming + tee |
| Diagnostics on failure | Text logs | Full AI conversations + diffs as artifacts |
| Capacity check cost | HTTP API call | DB query at scheduler level |
| Verification | Limited | unit-test + self-verify (separate AI session) |
| Multi-tracker | One tracker hardcoded | Clone YAML, swap 6 commands |
| Continuous learning | None | learn task after every PR close |
| Secret injection | Env vars on host | Native Kubernetes env_from / env_value_from |
Getting Started
Option A: SHELL executor (local dev, fastest path)
This is where to start. The SHELL executor runs scripts directly on the host and inherits ~/.claude/settings.json, gh auth login, and all other host credentials automatically, so no secrets configuration is needed.
# 1. Prerequisites (one-time)
npm install -g @anthropic-ai/claude-code
gh auth login
# 2. Start Formicary (queen + embedded ant worker)
docker pull plexobject/formicary
docker run plexobject/formicary
# 3. Deploy workflow definitions
git clone https://github.com/bhatti/formicary.git
cd formicary/docs/examples
./deploy-ai-workflows.sh --mode shell --repo your-org/your-repo --setup-labels
# 4. Set org config so the picker knows where to look
curl -X POST http://localhost:7777/api/orgs/default/configs \
-H 'Content-Type: application/json' \
-d '{"name":"GitHubOrg","value":"your-org"}'
curl -X POST http://localhost:7777/api/orgs/default/configs \
-H 'Content-Type: application/json' \
-d '{"name":"GitHubRepo","value":"your-repo"}'
# 5. Label an issue — the picker fires within 1 minute
gh issue edit 1 --repo your-org/your-repo --add-label "ai-ready"
# 6. Watch it run
open http://localhost:7777
Option B: Kubernetes with Bedrock via Tailscale
Pods can’t resolve Tailscale hostnames by name, but they can reach the IP. Resolve it once:
TAILSCALE_IP=$(python3 -c "import socket; print(socket.gethostbyname('ai'))")
kubectl create namespace formicary-ai
kubectl create secret generic claude-bedrock-settings \
--namespace=formicary-ai \
--from-literal=ANTHROPIC_BEDROCK_BASE_URL=http://${TAILSCALE_IP}/bedrock \
--from-literal=CLAUDE_CODE_USE_BEDROCK=1 \
--from-literal=CLAUDE_CODE_SKIP_BEDROCK_AUTH=1 \
--from-literal=ANTHROPIC_DEFAULT_OPUS_MODEL=us.anthropic.claude-opus-4-6-v1 \
--from-literal=ANTHROPIC_DEFAULT_SONNET_MODEL=us.anthropic.claude-sonnet-4-6 \
--from-literal=ANTHROPIC_DEFAULT_HAIKU_MODEL=us.anthropic.claude-haiku-4-5-20251001-v1:0
kubectl create secret generic ai-agent-secrets \
--namespace=formicary-ai \
--from-literal=github-token=$(gh auth token)
If the Tailscale IP changes, regenerate the secret with --dry-run=client -o yaml | kubectl apply -f -.
Option C: Standard Anthropic API key
kubectl create secret generic ai-agent-secrets \
  --from-literal=anthropic-api-key=sk-ant-... \
  --from-literal=github-token=$(gh auth token)
Job YAMLs reference it with env_value_from, so the key is injected by the kubelet and never passes through Formicary.
Ten Lessons
- Timeouts are not optional. AI models hang. Give every phase its own timeout. A global job timeout is not a substitute: when the plan phase hangs, you want to retry that phase, not restart the whole job from scratch.
- Structured JSON output unlocks routing. Ask the AI to output {"status": "DONE|BLOCKED|TESTS_FAILING", ...} on its final line. Route on that field. Extract metadata for dashboards.
- Flow context forward. If planning and implementation run in separate sessions with no shared artifacts, the implementer re-explores the entire codebase and discards all planning work. Pass PLAN.md as an artifact. Cost and quality both improve.
- Use nonces for idempotency. Branch names, workspace paths, and artifact names all need a per-run nonce. Never reuse a name across retry attempts.
- Guarantee cleanup. Set always_run: true on cleanup tasks. Workspaces and branches accumulate fast. One stuck job should not leave garbage forever.
- Let the orchestrator manage capacity. Set max_concurrency on the job and use skip_if with a scheduler-level DB query. Don’t write custom capacity management code; it will be wrong.
- Skills are the real leverage. The quality gap between freeform prompting and methodology-encoded prompting is large. Invest in skill definitions. The skill is a contract: “plan vertically, commit atomically, stop when blocked.” Consistent contracts produce consistent, reviewable output. I covered this in depth in AI Writes Code, You Own the Design.
- Declarative wins operationally. Adding a security review phase to the declarative version takes minutes: copy a task block, write a prompt, add an on_completed route. The same change to an imperative system takes days. The asymmetry grows with every phase you add.
- Capture everything on failure. Upload artifacts with when: always. When something fails, you want the full AI conversation, the git diff, and the test output, not just “job failed.”
- Build a feedback loop. Most AI coding systems run, merge, and forget. The learn task after every PR close gives the agent a memory of what works and what doesn’t in this specific codebase. Over time, that compounds.
References
- Building Production-Grade AI Agents with MCP and A2A
- Building a Production-Grade Enterprise AI Platform with vLLM
- Agentic AI for Personal Productivity: Building a Daily Minutes Assistant with RAG, MCP, and ReAct
- Agentic AI for Automated PII Detection with LangChain and Vertex AI
- Agentic AI for API Compatibility with LangChain and LangGraph
- AI Writes Code, You Own the Design
- Building a Distributed Orchestration and Graph Processing System (Formicary)
The job definitions described in this post are in docs/examples/ in the Formicary repository. See docs/ai-agents.md for the full setup guide.