Shahzad Bhatti Welcome to my ramblings and rants!

October 15, 2025

When Copying Kills Innovation: My Journey Through Software’s Cargo Cult Problem

Filed under: Computing — admin @ 11:35 am

Back in 1974, physicist Richard Feynman gave a graduation speech at Caltech about something he called “cargo cult science.” He told a story about islanders in the South Pacific who, after World War II, built fake airstrips and control towers out of bamboo. They’d seen cargo planes land during the war and figured if they recreated what they saw—runways, headsets, wooden antennas—the planes would come back with supplies. They copied the appearance but missed the substance. The planes never came. Feynman used this to describe bad research—studies that look scientific on the surface but lack real rigor. Researchers going through the motions without understanding what makes science actually work.

Software engineering does the exact same thing. I’ve been doing this long enough to see the pattern repeat everywhere: teams adopt tools and practices because that’s what successful companies use, without asking if it makes sense for them. Google uses monorepos? We need a monorepo. Amazon uses microservices? Time to split our monolith. Kubernetes is what “real” companies use? Better start writing YAML.

In my previous post, I wrote about how layers of abstraction have made software too complex. This post is about a related problem: we’re not just dealing with necessary complexity—we’re making things worse by cargo culting what other companies do. We build the bamboo control towers and wonder why the planes don’t land. This is cargo cult software development, and here’s what I’ve learned.

Executive Stack Envy

Executives suffer from massive stack envy. An executive reads about Kafka’s scalability, so suddenly we need Kafka. Never mind that we already have RabbitMQ and IBM MQSeries running just fine. Then another executive decides Google Pub/Sub is “more cloud native.” Now we have four message queues. Nobody provides guidance on how to use any of them. I watched teams struggle with poison messages for weeks. They’d never heard of dead letter queues.

On the database side, it’s the same pattern. In the early 2000s, everyone rushed to adopt object-oriented databases like Versant and ObjectStore, but they proved short-lived. At one company, leadership bet everything on a graph database. When customers came, scalability collapsed. We spent the next six years migrating away—not because migration was inherently hard, but because engineers built an overly complex migration architecture. Classic pattern: complexity for promotion, not for solving problems.

Meanwhile, at another company: we already had CloudSQL. Some teams moved to AlloyDB. Then an executive discovered Google Spanner. Now we have three databases. Nobody can explain why. Nobody knows which service uses which. At one company, we spent five years upgrading everything to gRPC. Created 500+ services. Nobody performance tested any of it until a large customer signed up. That’s when we discovered the overhead—gRPC serialization, microservice hops, network calls—it all compounded.

The Sales Fiction

Sales promised four nines availability, sub-100ms latency, multi-region DR. “Netflix-like reliability.” Reality? Some teams couldn’t properly scale within a single region. The DR plan was a wiki page nobody tested. Nobody understood the dependencies.

The Complexity Tax

Every service needs monitoring, logging, deployment pipelines, load balancing, service mesh config. Every network call adds latency and failure modes. Every distributed transaction risks inconsistency [How Abstraction is Killing Software: A 30-Year Journey Through Complexity].

The Monorepo That Ate Our Productivity

At one company, leadership decided we needed a monorepo “because Google uses one.” They’d read about how Google Chrome’s massive codebase benefited from having all dependencies in one place. What they missed was that Google has hundreds of engineers dedicated solely to tooling support.

Our reality? All services—different languages, different teams—got crammed into one repository. The promise was better code sharing. The result was forced dependency alignment that broke builds constantly. A simple package update in one service would cascade failures across unrelated services. Build times ballooned to over an hour and engineers spent endless hours fighting the build system.

The real kicker: most of our services only needed to communicate through APIs. We could have used service interfaces, but instead we created compile-time dependencies where none should have existed. During my time at Amazon, we handled shared code with live version dependencies that triggered builds only when a change actually affected a consumer. There are alternatives—we just didn’t explore them.

Blaze Builds and the Complexity Tax

The same organization then adopted Bazel (Google’s open-sourced version of Blaze). Again, the reasoning was “Google uses it, so it must be good.” Nobody asked whether our small engineering team needed the same build system as Google’s tens of thousands of engineers. Nobody calculated the learning curve cost. Nobody questioned whether our relatively simple microservices needed this level of build sophistication. The complexity tax was immediate and brutal. New engineers took weeks to understand the build system. Simple tasks became complicated. Debugging build failures required specialized knowledge that only a few people possessed. We’d traded a problem we didn’t have for a problem we couldn’t solve.

The Agile Cargo Cult

I’ve watched dozens of companies claim they’re “doing Agile” while missing every principle that makes Agile work. They hold standups, run sprints, track velocity—all the visible rituals. The results? Same problems as before, now with more meetings.

Standups That Aren’t

At one company, “daily standups” lasted 30 minutes. Each developer gave a detailed status report to their manager. Everyone else mentally checked out waiting their turn. Nobody coordinated. It was a status meeting wearing an Agile costume.

The Velocity Obsession

Another place tracked velocity religiously. Management expected consistent story points every sprint. When velocity dropped, teams faced uncomfortable questions about “productivity.” Solution? Inflate estimates. Break large stories into tiny ones. The velocity chart looked great. The actual delivery? Garbage. Research shows teams game metrics when measured on internal numbers instead of customer value.

Product Owners Who Aren’t

I’ve seen “Product Owners” who were actually project managers in disguise. They translated business requirements into user stories. Never talked to customers. Couldn’t make product decisions. Spent their time tracking progress and managing stakeholders. Without real product ownership, teams build features nobody needs. The Agile ceremony continues, the product fails.

Copying Without Understanding

The pattern is always the same: read about Spotify’s squads and tribes, implement the structure, wonder why it doesn’t work. They copied the org chart but missed the culture of autonomy, the customer focus, the experimental mindset. Or they send everyone to a two-day Scrum certification. Teams return with a checklist of activities—sprint planning, retrospectives, story points—but no understanding of why these matter. They know the mechanics, not the principles.

Why It Fails

The academic research identified the problem: teams follow practices without understanding the underlying principles. They cancel meetings when the Scrum Master is absent (because they’re used to managers running meetings). They bring irrelevant information to standups (because they think it’s about reporting, not coordinating). They wait for task assignments instead of self-organizing (because autonomy is scary). Leadership mandates “Agile transformation” without changing how they make decisions or interact with teams. They want faster delivery and better predictability—the outcomes—without the cultural changes that enable those outcomes.

The Real Problem

True Agile requires empowering teams to make decisions. Most organizations aren’t ready for that. They create pseudo-empowerment: teams can choose how to implement predetermined requirements. They can organize their work as long as they hit the deadlines. They can self-manage within tightly controlled boundaries.

Platform Engineering and the Infrastructure Complexity Trap

Docker and Kubernetes are powerful tools. They solve real problems. But here’s what nobody talks about: they add massive complexity, and most organizations don’t have the expertise to handle it. I watched a small startup adopt Kubernetes. They could have run their services directly on EC2 instances. Instead, they had a three-node cluster, service mesh, ingress controllers, the whole nine yards.

Platform Teams That Made Things Worse

Platform engineering was supposed to make developers’ lives easier. Instead, I’ve watched platform teams split by technology—the Kubernetes team, the Terraform team, the CI/CD team—each making things harder. The pattern was consistent: they’d either expose raw complexity or build leaky abstractions that constrained without simplifying. One platform team exposed raw Kubernetes YAML to developers, expecting them to become Kubernetes experts overnight.

The fundamental problem? Everyone had to understand Kubernetes, Istio, Terraform, and whatever else the platform team used. The abstractions leaked everywhere. And the platform teams didn’t understand what the application teams were actually building—they’d never worked with the gRPC services they were supposed to support. The result was bizarre workarounds. One team found Istio was killing their long-running database queries during deployments. Their solution? Set terminationDrainDuration to 2 hours. They weren’t experts in Istio, so instead of fixing the real problem—properly implementing graceful shutdown with query cancellation—they just cranked the timeout to an absurd value.

When something broke, nobody could tell if it was the app or the platform. Teams burned days or weeks debugging through countless layers of abstraction.

The Microservices Cargo Cult

Every company wants microservices now. It’s modern, it’s scalable, it’s what Amazon does. I’ve watched this pattern repeat across multiple companies. They split monoliths into microservices and get all the complexity without any of the benefits. Let me tell you what I’ve seen go wrong.

Idempotency? Never Heard of It

At one company, many services didn’t check for duplicate requests, resulting in double charges or incorrect balances. Classic non-atomic check-then-act: check if a transaction exists, then create it—two separate database calls. A race condition waiting to happen. Two requests hit simultaneously, both check, both see nothing, both charge the customer. Same pattern everywhere I looked. I wrote about these antipatterns in How Duplicate Detection Became the Dangerous Impostor of True Idempotency.
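The atomic fix is to let the database enforce uniqueness instead of checking first. Here’s a minimal sketch in Python with SQLite; the table and key names are hypothetical, but a real payment service would lean on its production database’s unique constraints the same way.

```python
import sqlite3

# Illustrative schema: the idempotency key is enforced by the database,
# not by application-level check-then-act.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        idempotency_key TEXT PRIMARY KEY,
        amount_cents    INTEGER NOT NULL
    )
""")

def charge(idempotency_key: str, amount_cents: int) -> bool:
    """Returns True if the charge was applied, False if it was a duplicate."""
    try:
        # One atomic statement: two racing requests cannot both succeed,
        # because the database rejects the second insert.
        conn.execute(
            "INSERT INTO transactions (idempotency_key, amount_cents) VALUES (?, ?)",
            (idempotency_key, amount_cents),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate request: the customer is not charged twice

print(charge("req-123", 999))  # True  -> charge applied
print(charge("req-123", 999))  # False -> duplicate suppressed
```

The same pattern works with Postgres unique indexes or conditional writes in DynamoDB; what matters is that the check and the write are one operation.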

The Pub/Sub Disaster

At another place, Google Pub/Sub had an outage. Publishers timed out, retried their events. When Pub/Sub recovered, both original and retry got delivered—with different event IDs. Duplicate events everywhere. Customer updates applied twice. Transactions processed multiple times. The Events Service was built for speed, not deduplication. Each team handled duplicates their own way. Many didn’t handle them at all. We spent days manually finding data drift and fixing it. No automated reconciliation, no detection—just manual cleanup after the fact.
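One mitigation is consumer-side deduplication keyed on the operation itself rather than the broker-assigned event ID. A rough sketch with made-up field names; a production version would persist the keys durably and include a client-generated idempotency token, since a pure content hash would also collapse two genuinely distinct but identical-looking operations.

```python
import hashlib
import json

processed: set[str] = set()  # in production: a durable store with a TTL

def business_key(event: dict) -> str:
    # Derive the key from fields that identify the operation itself,
    # NOT the broker-assigned event_id (which differs on publisher retry).
    material = json.dumps(
        {"account": event["account_id"], "op": event["op"], "amount": event["amount"]},
        sort_keys=True,
    )
    return hashlib.sha256(material.encode()).hexdigest()

def handle(event: dict) -> bool:
    """Returns True if the event was applied, False if it was a duplicate."""
    key = business_key(event)
    if key in processed:
        return False
    processed.add(key)
    # ... apply the customer update exactly once ...
    return True

original = {"event_id": "a1", "account_id": "42", "op": "credit", "amount": 100}
retry    = {"event_id": "b7", "account_id": "42", "op": "credit", "amount": 100}
print(handle(original))  # True
print(handle(retry))     # False: same operation despite a different event_id
```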

No Transaction Boundaries

Simple database joins became seven network calls across services. Create order -> charge payment -> allocate inventory -> update customer -> send notification. Each call a potential failure point. Something fails midway? Partial state scattered across services. No distributed transactions, no sagas, just hope. I explained proper implementation of transaction boundaries in Transaction Boundaries: The Foundation of Reliable Systems.
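A saga doesn’t make those seven calls atomic, but it at least defines what happens when a step fails midway. A minimal sketch of the idea, with toy step names standing in for real service calls:

```python
# Minimal saga sketch: run steps in order; on failure, run the compensations
# for the steps that already succeeded, in reverse order.
def run_saga(steps):
    """steps: list of (name, action, compensation) tuples."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append((name, compensate))
        except Exception:
            # Undo partial state instead of leaving it scattered across services.
            for _, undo in reversed(completed):
                undo()
            return False, [n for n, _ in completed]
    return True, [n for n, _ in completed]

log = []
steps = [
    ("create_order",   lambda: log.append("order+"),  lambda: log.append("order-")),
    ("charge_payment", lambda: log.append("charge+"), lambda: log.append("charge-")),
    ("allocate_stock", lambda: (_ for _ in ()).throw(RuntimeError("out of stock")),
                       lambda: None),
]
ok, done = run_saga(steps)
print(ok)   # False
print(log)  # ['order+', 'charge+', 'charge-', 'order-']
```

Real sagas also need compensations to be idempotent and retried on failure; this only shows the control flow.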

Missing the Basics

But the real problem was simpler than that. I’ve seen services deployed without:

  • Proper health checks. Teams reused the same shallow check for liveness and readiness. Kubernetes routed traffic to pods that weren’t ready.
  • Monitoring and alerts. Services ran in production with no alarms. We’d find out about issues from customer complaints.
  • Dependency testing. Nobody load tested their dependencies. Scaling up meant overwhelming downstream services that couldn’t handle the traffic.
  • Circuit breakers. One slow service took down everything calling it. No timeouts, no fallbacks.
  • Graceful shutdown. Deployments dropped requests because nobody coordinated shutdown timeouts between application, Istio, and Kubernetes.
  • Distributed tracing. Logs scattered across services with no correlation IDs. Debugging meant manually piecing together what happened from nine different log sources.
  • Backup and recovery. Nobody tested their disaster recovery until disaster struck.
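Of the items above, circuit breakers are the one I’ve seen skipped most often, so here’s a toy sketch of the pattern. The thresholds are arbitrary, and a real service would reach for a library such as resilience4j or pybreaker rather than hand-rolling this.

```python
import time

class CircuitBreaker:
    """Tiny illustrative circuit breaker: after max_failures consecutive
    failures, fail fast for reset_after seconds instead of piling callers
    onto a struggling dependency."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # open: don't even try the dependency
            self.opened_at = None      # half-open: allow one probe call
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

breaker = CircuitBreaker(max_failures=2)

def flaky():
    raise TimeoutError("downstream is slow")

for _ in range(3):
    print(breaker.call(flaky, fallback=lambda: "cached default"))
# After two failures the breaker opens; later calls skip the dependency entirely.
```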

The gRPC Disaster Nobody Talks About

Another organization went all-in on gRPC for microservices. The pitch was compelling: better performance, strongly typed interfaces, streaming support. What could go wrong? Engineers copied gRPC examples without understanding connection management. Nobody grasped how gRPC’s HTTP/2 persistent connections work or the purpose of connection pooling. Services would start before the Istio sidecar was ready. The application would try an outbound gRPC call—ECONNREFUSED. Pod crashes, Kubernetes restarts it, repeat. The fix was one annotation nobody added: sidecar.istio.io/holdApplicationUntilProxyStarts: "true".

Shutdown was worse. Kubernetes sends SIGTERM, Istio sidecar shuts down immediately, application still draining requests. Dropped connections everywhere. The fix required three perfectly coordinated timeout values:

  • Application shutdown: 40s
  • Istio drain: 45s
  • Kubernetes grace period: 65s
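For what it’s worth, the coordination looks roughly like this in a Deployment spec, assuming Istio sidecar injection; the annotation keys are Istio’s, but the names and values here are illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-grpc-service   # illustrative name
spec:
  template:
    metadata:
      annotations:
        # Don't start the app until the Envoy sidecar can carry traffic.
        sidecar.istio.io/holdApplicationUntilProxyStarts: "true"
        # Keep the sidecar draining longer than the app's own shutdown (40s).
        proxy.istio.io/config: |
          terminationDrainDuration: 45s
    spec:
      # Must exceed app shutdown + sidecar drain, or Kubernetes
      # SIGKILLs the pod mid-drain.
      terminationGracePeriodSeconds: 65
      containers:
        - name: app
          image: example/app:latest
```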

Load balancing was a disaster. HTTP/2 creates one persistent connection and multiplexes all requests through it. Kubernetes’ round-robin load balancing works at the connection level. Result? All traffic to whichever pod got the first connection. Health checks were pure theater. Teams copied the same probe definition for both liveness and readiness. Even distinct probes were “shallow”—a database ping that doesn’t validate the service can actually function. Services marked “ready” that immediately 500’d on real traffic.

The HTTP-to-gRPC proxy layer? Headers weren’t properly mapped between protocols. Auth tokens got lost in translation. Customer-facing errors were cryptic gRPC status codes instead of meaningful messages. I ended up writing detailed guides on gRPC load balancing in Kubernetes, header mapping, and error handling. These should have been understood before adoption, not discovered through production failures.

The Caching Silver Bullet That Shot Us in the Foot

“Just add caching” became the answer to every performance problem. Database slow? Add Redis. API slow? Add CDN. At one company, platform engineering initially didn’t support Redis. So application teams spun up their own clusters. No standards. No coordination. Just dozens of Redis instances scattered across environments, each configured differently. Eventually, platform engineering released Terraform modules for Redis. Problem solved, right? Wrong. They provided the infrastructure with almost no guidance on how to use it properly. Teams treated it as a magic performance button.

What Actually Happened

Teams started caching without writing fault-tolerant code. One service had Redis connection timeouts set to 30 seconds. When Redis became unavailable, every request waited 30 seconds to fail. The cascading failures took down the entire application. Another team cached massive objects—full customer balances, assets, events, transactions, etc. Their cache hydration on startup took 10 minutes. Every deploy meant 10 minutes of degraded performance while the cache warmed up. Auto-scaling was useless because new pods weren’t ready to serve traffic. Nobody calculated cache invalidation complexity. Nobody considered memory costs. Nobody thought about cache coherency across regions.
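The fault-tolerant version is not complicated: tight timeouts, and every cache error degrades to the source instead of failing the request. A sketch with a fake in-memory client standing in for Redis, since the point is the error handling rather than the client API:

```python
# FakeCache stands in for a Redis client so the sketch is self-contained.
class FakeCache:
    def __init__(self):
        self.data = {}
        self.down = False
    def get(self, key):
        if self.down:
            raise TimeoutError("cache unavailable")
        return self.data.get(key)
    def set(self, key, value):
        if self.down:
            raise TimeoutError("cache unavailable")
        self.data[key] = value

cache = FakeCache()
db_reads = []

def load_from_db(key):
    db_reads.append(key)  # stand-in for the real query
    return f"value-for-{key}"

def get_value(key):
    try:
        hit = cache.get(key)   # real clients: socket timeouts in the
        if hit is not None:    # 10-100ms range, never 30 seconds
            return hit
    except TimeoutError:
        pass                   # degrade to the source instead of cascading
    value = load_from_db(key)
    try:
        cache.set(key, value)
    except TimeoutError:
        pass                   # a failed cache write must not fail the request
    return value

print(get_value("k1"))  # miss -> served from the database, then cached
cache.down = True
print(get_value("k1"))  # cache down -> still served from the database
```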

Bimodal Hell

The worst part? Bimodal logic. Cache hit? Fast. Cache miss? Slow. Cold cache? Everything’s slow until it warms up. This obscured real problems—race conditions, database failures—because performance was unpredictable. Was it slow because of a cache miss or because the database was dying? Nobody knew. I’ve documented more of these war stories—cache poisoning, thundering herds, memory leaks, security issues with unencrypted credentials. The pattern was always the same: reach for caching before understanding the actual problem.

Infrastructure as Code: The Code That Wasn’t

“We do infrastructure as code” was the proud claim at multiple companies I’ve worked at. The reality? Terraform or AWS CloudFormation templates existed, sure. But some of the infrastructure was still being created through the admin console, modified through scripts, and updated through a mix of manual processes and half-automated pipelines. The worst part was the configuration drift. Each environment—dev, staging, production—was supposedly identical. In reality, they’d diverged so much that bugs would appear in production that were impossible to reproduce in staging. The CI/CD pipelines for application code ran smoothly, but infrastructure changes were often applied manually or through separate automation. Database migrations lived completely outside the deployment pipeline, making rollbacks impossible. One failed migration meant hours of manual recovery.

The Platform Engineering “Solution” That Made Everything Worse

At one platform engineering org, they provided reusable Terraform modules but required each application team to maintain their own configs for every environment. The modules covered maybe 50% of what teams actually needed, so teams built custom solutions and created snowflakes. The whole point—consistency and maintainability—was lost.

The brilliant solution? A manager built a UI to abstract away Terraform entirely. Just click some buttons! It was a masterclass in leaky abstractions. You couldn’t do anything sophisticated, but when it broke, you had to understand both the UI’s logic AND the generated Terraform to debug it. The UI became a lowest-common-denominator wrapper inadequate for actual needs. I’ve seen AWS CDK provide excellent abstraction over CloudFormation—real programming language power with the ability to drop down to raw resources when needed. That’s proper abstraction: empowering developers, not constraining them. This UI understood nothing about developer needs. It was cargo cult thinking: “Google has internal tools, so we should build internal tools!” I’ve learned that engineers prefer CLI or API approaches to tooling. It’s scriptable, automatable, and fits into workflows. But executives see broken tooling and think the solution is slapping a UI on it—lipstick on a pig. It never works.

The Config Drift Nightmare

We claimed to practice “config as code.” Reality? Our config was scattered across:

  • Git repos (three different ones)
  • AWS Parameter Store
  • Environment variables set manually
  • Hardcoded in Docker images
  • Some in a random database table
  • Feature flags in LaunchDarkly
  • Secrets in three different secret managers

Dev environment had different configs than staging, which was different from production. Not by design—by entropy. Each environment had been hand-tweaked over years by different engineers solving different problems. Infrastructure changes were applied manually to environments through separate processes, completely bypassing synchronization with application code. Database migrations lived in four different directory structures across services, no standard anywhere.

Feature flags were even worse. Some teams used LaunchDarkly, others ZooKeeper, none integrated with CI/CD. Instead of templating configs or inheriting from a base, we maintained duplicate configs for every single environment. Copy-paste errors meant production regularly went down from missing or wrong values.

Feature Flags: When the Safety Net Becomes a Trap

I have seen companies buy expensive solutions like LaunchDarkly but fail to provide proper governance and standards. Google’s outage showed exactly what happens: a new code path protected by a feature flag went untested. When enabled, a nil pointer exception took down their entire service globally. The code had no error handling. The flag defaulted to ON. Nobody tested the actual conditions that would trigger the new path. I’ve seen the same pattern repeatedly. Teams deploy code behind flags, flip them on in production, and discover the code crashes. The flag was supposed to be the safety mechanism—it became the detonator. Here are a few common issues with feature flags that I’ve observed:

No Integration

Flag changes weren’t integrated with our deployment pipeline. We treated them as configuration, not code. When problems hit, we couldn’t roll back cleanly. We’d deploy old code with new flag states, creating entirely new failure modes. No canary releases for flags. Teams would flip a flag for 100% of traffic instantly. No phased rollout. No monitoring the impact first. Just flip it and hope.
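A phased rollout doesn’t need vendor support; deterministic hashing is enough to ramp a flag from 5% to 100% while keeping each user in the same bucket the whole way. A sketch, with a hypothetical flag name:

```python
import hashlib

def in_rollout(flag_name: str, user_id: str, percent: int) -> bool:
    """Deterministic percentage bucketing: the same user always lands in the
    same bucket, so the 5% canary cohort is a subset of the 10% cohort."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percent

users = [f"user-{i}" for i in range(1000)]
canary = [u for u in users if in_rollout("new-checkout", u, 5)]
print(len(canary))  # roughly 50 of 1000 users
print(all(in_rollout("new-checkout", u, 100) for u in users))  # True at 100%
```

Monitoring the canary cohort before each ramp step is the part that actually prevents the “flip to 100% and hope” failure mode.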

Misuse Everywhere

Teams used flags for everything: API endpoints, timeout values, customer tier logic. The flag system became a distributed configuration database. Nobody planned for LaunchDarkly being unavailable.

I’ve documented these antipatterns extensively—inadequate testing, no peer review, missing monitoring, zombie flags that never get removed. The pattern is always the same: treat flags as toggles instead of critical infrastructure that needs the same rigor as code.

The Observability Theater

At one company, they had a dedicated observability team monitoring hundreds of services across tens of thousands of endpoints. Sounds like the right approach, doesn’t it? The reality was they couldn’t actually monitor at that scale, so they defaulted to basic liveness checks. Is the service responding with 200 OK? Great, it’s “monitored.” We didn’t have synthetic health probes, so customers found these issues before the monitoring did. Support tickets were our most reliable monitoring system.

Each service needed specific SLOs, custom metrics, detailed endpoint monitoring. Instead, we got generic dashboards and alerts that fired based on a single health check for all operations of a service. The solution was obvious: delegate monitoring ownership to service teams while the platform team provides tools and standards.

The Security Theater Performance

We had SOC2 compliance, which sales loved to tout. Reality? Internal ops and support had full access to customer data—SSNs, DOBs, government IDs—with zero guardrails and no auditing. I saw list APIs return everything including SSNs, dates of birth, driver’s license numbers—all in the response. No field-level authorization. Teams didn’t understand authentication vs authorization. OAuth? Refresh tokens? “Too complicated.” They’d issue JWT tokens with 12-24 hour expiration. Session hijacking waiting to happen. Some teams built custom authorization solutions. Added 500ms latency to every request because they weren’t properly integrated with data sources. Overly complex permission systems that nobody understood. When they inevitably broke, services went down.
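Field-level authorization can start as something this simple: an allowlist of roles that may see PII, applied at the serialization boundary. A toy sketch with made-up role and field names:

```python
# Fields that must never leave the service for general roles.
SENSITIVE_FIELDS = {"ssn", "date_of_birth", "drivers_license"}

# Deliberately small allowlist; illustrative role name.
ROLES_WITH_PII_ACCESS = {"compliance_officer"}

def redact(record: dict, role: str) -> dict:
    """Strip sensitive fields unless the caller's role is allowlisted."""
    if role in ROLES_WITH_PII_ACCESS:
        return dict(record)
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

customer = {
    "id": "c-1",
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "date_of_birth": "1990-01-01",
}
print(redact(customer, "support_agent"))
# {'id': 'c-1', 'name': 'Jane Doe'}  -- PII removed for general roles
```

A real system would drive this from a policy engine and audit every PII access, but even this crude filter would have kept SSNs out of those list API responses.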

The Chicken Game

Most companies play security chicken. They bet on luck rather than investment. “We haven’t been breached yet, so we must be fine.” Until they’re not. The principle of least privilege? Never heard of it. Everyone on DevOps teams got admin access because it was easier than managing permissions properly.

AI Makes It Worse

With AI, security got even sloppier. I’ve seen agentic AI code that completely bypasses authorization. The AI has credentials, the AI can do anything. No concept of user context or permissions. The Salesloft breach showed exactly what happens: their AI chatbot stored authentication tokens for hundreds of services—Salesforce, Slack, Google Workspace, AWS, Azure, OpenAI. Attackers stole them all. One breach, access to everything. Standards like MCP (Model Context Protocol) aren’t designed with security in mind. They give companies a false sense of security while creating massive attack surfaces. AI agents with broad access, minimal auditing, no principle of least privilege.

Training vs Reality

But we had mandatory security training! Eight hours of videos about not clicking phishing links. Nothing about secure coding, secret management, access control, or proper authentication. Nothing about OAuth flows, token rotation, or session management. We’d pass audits because we had the right documents. Incident response plans nobody tested. Encryption “at rest” that was just AWS defaults we never configured.

The On-Call Horror Show

Let me tell you about the most broken on-call setup I’ve seen. The PagerDuty escalation went: Engineer -> Head of Engineering. That’s it. No team lead, no manager, just straight from IC to executive.

The Escalation Disaster

New managers? Not in the escalation chain. Senior engineers? Excluded. Other teams skipped layers entirely—engineer to director, bypassing everyone in between. When reorganizations happened, escalation paths didn’t get updated. People left, new people joined, and PagerDuty kept paging people who’d moved to different teams or left the company entirely. Nobody had proper governance. No automated compliance checks. Escalation policies drifted until they bore no resemblance to the org chart.

Missing the Basics

Many services had inadequate SLOs and alerts defined. Teams would discover outages from customer complaints because there was no monitoring. The services that did have alerts? Engineers ignored them. Lower environment alerts went to Slack channels nobody read. Critical errors showed up in staging logs, but no one looked. The same errors would hit production weeks later, and everyone acted surprised. “This never happened before!” It did. In dev. In staging. Nobody checked.

Runbooks and Shadowing

I’ve seen many teams fail to keep runbooks up to date. New engineers got added to on-call rotations without shadowing experienced people. One person knew how to handle each class of incident. When they were unavailable, everyone else fumbled through it.

We had the tool the “best” companies used, so we thought we must be doing it right.

The Remote Work Hypocrisy

I’ve been working remotely since 2015, long before COVID made it mainstream. When everyone went remote in 2020, I thought finally companies understood that location doesn’t determine productivity. Then came the RTO (Return to Office) mandates. CEOs talked about “collaboration” and “culture” while most team members were distributed across offices anyway. Having 2 out of 10 team members in the same office doesn’t create collaboration—it creates resentment.

I watched talented engineers leave rather than relocate. Companies used RTO as voluntary layoffs, losing their best people who had options. The cargo cult here? Copying each other’s RTO policies without examining their own situations.

Startups with twenty people and no proper office facilities demanded RTO because big tech was doing it. They had no data on productivity impact, no plan for making office time valuable, just blind imitation of companies with completely different contexts.

The AI Gold Rush

The latest cargo cult is AI adoption. CEOs mandate “AI integration” without thinking through actual use cases. I’ve watched this play out repeatedly.

The Numbers Don’t Lie

95% of AI pilots fail at large companies. McKinsey found 42% of companies using generative AI abandoned projects with “no significant bottom line impact.” But executives already got their stock bumps and bonuses before anyone noticed.

What Actually Fails

I’ve seen companies roll out AI tools with zero training. No prompt engineering guidance. No standardized tools—just a chaotic mess of ChatGPT, Claude, Copilot, whatever people found online. No policies. No metrics. Result? People tried it, got mediocre results, concluded AI was overhyped. The technology wasn’t the problem—the deployment was. Budget allocation is backwards. Companies spend 50%+ on flashy sales and marketing AI while back-office automation delivers the highest ROI. Why? Investors notice the flashy stuff.

The Code Quality Disaster

Here’s what nobody talks about: AI is producing mountains of shitty code. Most teams haven’t updated their SDLC to account for AI-generated code. Senior engineers succeed with AI; junior engineers don’t. Why? Because writing code was never the bottleneck—design and architecture are. You need skill to write proper prompts and critically review output. I’ve used Copilot since before ChatGPT, then Claude, Cursor, and a dozen others. They all have the same problems: limited context windows mean they ignore existing code. They produce syntactically correct code that’s architecturally wrong.

I’ve been using Claude Code extensively. Even with detailed plans and design docs, long sessions lose track of what was discussed. Claude thinks something is already implemented when it isn’t. Or ignores requirements from earlier in the conversation. The context window limitation is fundamental.

Cargo Cult Adoption

I’ve worked at companies where the CEO mandated AI adoption without defining problems to solve. People got promoted for claiming “AI adoption” with useless demos. Hackathon demos are great for learning—actual production integration is completely different. Teams write poor abstractions instead of using battle-tested frameworks like LangChain and LangGraph. They forget to sanitize inputs when using CrewAI. They deploy agents without proper context engineering, memory architecture, or governance.

At one company I worked at, we deployed AI agents without proper permission boundaries—no safeguards to ensure different users got different answers based on their access levels. The Salesloft breach showed what happens when you skip this step. Companies were reusing the same auth tokens in AI prompts and real service calls. No separation between what the AI could access and what the user should see.

The 5% That Work

The organizations that succeed do it differently:

  • Buy rather than build (67% success rate vs 33%)
  • Start narrow and deep—one specific problem done well
  • Focus on workflow integration, not flashy features
  • Actually train people on how to use the tools
  • Define metrics before deployment

The Productivity Theater

Companies announce layoffs and credit AI, but the details rarely add up. IBM’s CEO claimed AI replaced HR workers—viral posts said 8,000 jobs. Reality? About 200 people, and IBM’s total headcount actually increased. Klarna was more honest. Their CEO publicly stated AI helped shrink their workforce 40%—from 5,527 to 3,422 employees. But here’s the twist: they’re now hiring humans back because AI-driven customer service quality tanked. Builder.ai became a $1.5 billion unicorn claiming their AI “Natasha” automated coding. Turned out it was 700 Indian developers manually writing code while pretending to be AI. The company filed for bankruptcy in May 2025 after exposing not just the fake AI, but $220 million in fake revenue through accounting fraud. Founders had already stepped down.

Why This Is Dangerous

Unlike previous tech hype, AI actually works for narrow tasks. That success gets extrapolated into capabilities that don’t exist. As ACM notes about cargo cult AI, we’re mistaking correlation for causation, statistical patterns for understanding. AI can’t establish causality. It can’t reason from first principles. It can’t ask “why.” These aren’t bugs—they’re fundamental limitations of current approaches. The most successful AI deployments treat it as a tool requiring proper infrastructure: context management, semantic layers, memory architecture, governance. The 95% that fail skip all of this and wonder why their chatbot doesn’t work.

Breaking Free from the Cult

After years of watching this pattern, I’ve learned to recognize the warning signs:

  • The Name Drop: “Google/Amazon/Netflix does it this way”
  • The Presentation: Slick slides, no substance
  • The Resistance: Questioning is discouraged
  • The Metrics: Activity over outcomes
  • The Evangelists: True believers who’ve never seen it fail

The antidote is simple but not easy:

  1. Ask Why: Not just why others do it, but why you should
  2. Start Small: Pilot programs reveal problems before they metastasize
  3. Measure Impact: Real metrics, not vanity metrics
  4. Listen to Skeptics: They often see what evangelists miss
  5. Accept Failure: Admitting mistakes early is cheaper than denying them

The Truth About Cargo Cult Culture

After living through all this, I’ve realized cargo cult software engineering isn’t random. It’s systematic. It starts at the top with executives who believe that imitating success is the same as achieving it. They hire from big tech not for expertise, but for credibility. “We have ex-Google engineers!” becomes the pitch, even if those engineers were junior PMs who never touched the systems they’re now supposed to recreate.

These executives enable sales and marketing to sell fiction. “Fake it till you make it” becomes company culture. Engineering bears the burden of making lies true, burning out in the process. The engineers who point out that the emperor has no clothes get labeled as “not team players.” The saddest part? Some of these companies could have been successful with honest, appropriate technology choices. But they chose cosplay over reality, form over function, complexity over simplicity.

The Way Out

I’ve learned to spot these situations in interviews now. When they brag about their tech stack before mentioning what problem they solve, I run. When they name-drop companies instead of explaining their architecture, I run. When they say “we’re the Uber of X” or “we’re building the next Google,” I run fast.

The antidote isn’t just asking “why” – it’s demanding proof. Show me the metrics that prove Kubernetes saves you money. Demonstrate that microservices made you faster. Prove that your observability actually prevents outages. Most can’t, because they never measured before and after. They just assumed newer meant better, complex meant sophisticated, and copying meant competing.

Your context is not Google’s context. Your problems are not Amazon’s problems. And that’s okay. Solve your actual problems with boring, appropriate technology. Your customers don’t care if you use Kubernetes or Kafka or whatever this week’s hot technology is. They care if your shit works. Stop building bamboo airports. Start shipping working software.
