The Hidden Cost of Building AI Agents with LangChain, CrewAI, and AutoGen
You found LangChain six months ago and built something impressive in a weekend. Or maybe it was CrewAI, or AutoGen, or the OpenAI Assistants API. The demo worked. Your team was excited. Then you tried to put it in production.
This is the story of almost every engineering team that has tried to take a framework-built AI agent into production.
What the frameworks are actually good at
Let's be honest about what LangChain, CrewAI, and AutoGen do well:
LangChain is excellent for rapid prototyping. Its chain abstractions make it easy to wire LLM calls together, and its massive library of integrations means you can connect to almost anything. If you need to build a proof of concept quickly and demonstrate it to stakeholders, LangChain is fast.
CrewAI makes multi-agent orchestration approachable. The "role, goal, backstory" pattern for defining agents is intuitive, and seeing multiple agents collaborate on a task is genuinely compelling in a demo.
AutoGen (and its community fork, AG2) is strong for experimental multi-agent research and complex reasoning chains. Microsoft's investment in the framework shows — it handles sophisticated conversational patterns that simpler frameworks can't.
None of this praise is misplaced. The frameworks work. The problem is what happens after the demo.
The production gap
Building an AI agent that works in a demo and building an AI agent that works reliably in production for real business users are two very different problems.
Here is what the frameworks don't handle:
Human-in-the-loop approvals — when the agent decides to send an email, book a meeting, update a CRM record, or execute any other real-world action, how does a human review and approve that action? The frameworks don't have a first-class approval primitive. You build it yourself. This ends up being a significant project — you need a UI, a notification system, a state machine for tracking pending approvals, and a way to get the approval decision back to the agent.
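To make the scope concrete, here is a minimal sketch in plain Python of just the data model and state machine for a pending approval; all names are hypothetical, and persistence, notifications, the review UI, and resuming the paused agent run all still sit on top of it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from uuid import uuid4


class ApprovalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXPIRED = "expired"


@dataclass
class ApprovalRequest:
    """One real-world action the agent wants to take, awaiting a human decision."""
    action: str        # e.g. "send_email", "update_crm_record"
    payload: dict      # the exact arguments the agent intends to execute with
    requested_by: str  # agent or run identifier, so the decision can be routed back
    id: str = field(default_factory=lambda: uuid4().hex)
    status: ApprovalStatus = ApprovalStatus.PENDING
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def decide(self, approved: bool) -> None:
        # Guard against double-resolution: a request can only be decided once.
        if self.status is not ApprovalStatus.PENDING:
            raise ValueError(f"approval {self.id} already resolved as {self.status.value}")
        self.status = ApprovalStatus.APPROVED if approved else ApprovalStatus.REJECTED
```

Even this toy version raises the awkward questions: where pending requests live while the agent is suspended, who gets notified, and what happens when nobody answers before the request expires.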
Credential management — your agents need access to real integrations: Gmail, GitHub, HubSpot, Slack. Managing OAuth tokens, rotating credentials, handling token refresh, and ensuring tokens are scoped correctly is plumbing work. Lots of it.
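As a rough illustration of that plumbing, the sketch below shows the smallest possible OAuth2 refresh flow for one integration; the token endpoint, client credentials, and in-memory token object are assumptions, and a real system still needs encrypted storage, per-tenant scoping, and handling for providers that rotate refresh tokens.

```python
import time

import requests  # third-party HTTP client: pip install requests


class OAuthToken:
    """Minimal holder for one integration's OAuth2 access/refresh token pair."""

    def __init__(self, access_token: str, refresh_token: str, expires_at: float):
        self.access_token = access_token
        self.refresh_token = refresh_token
        self.expires_at = expires_at  # unix timestamp

    def is_expired(self, skew_seconds: int = 60) -> bool:
        # Refresh slightly early so in-flight requests don't race the expiry.
        return time.time() >= self.expires_at - skew_seconds


def get_valid_access_token(token: OAuthToken, token_url: str,
                           client_id: str, client_secret: str) -> str:
    """Return a usable access token, refreshing against the provider if needed."""
    if not token.is_expired():
        return token.access_token
    resp = requests.post(token_url, data={
        "grant_type": "refresh_token",
        "refresh_token": token.refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    }, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    token.access_token = data["access_token"]
    token.expires_at = time.time() + data.get("expires_in", 3600)
    # Some providers rotate the refresh token on every use; keep the new one if present.
    token.refresh_token = data.get("refresh_token", token.refresh_token)
    return token.access_token
```

Multiply this by every integration, every tenant, and every scope combination, and the plumbing adds up.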
Observability — when an agent fails, what do you look at? LangSmith (LangChain's tracing tool) helps with chain-level visibility, but it's another service to manage, another cost to track, and it only covers LangChain. If you're using multiple frameworks, you're stitching together multiple observability tools.
Tenant isolation — if you're building a product where multiple organizations each get their own agent, you need to ensure data never crosses tenant boundaries. Every integration call, every memory store read, every tool invocation needs to be scoped to the right tenant. Implementing this correctly is not conceptually hard, but it requires discipline, and discipline is hard to maintain as the codebase grows.
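A minimal sketch of what that discipline looks like, using an illustrative in-memory store: the tenant id is part of every key, so cross-tenant reads are impossible by construction rather than by code-review vigilance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantContext:
    """Passed through every tool call so nothing can act outside its tenant."""
    tenant_id: str


class MemoryStore:
    """Illustrative key-value memory store that is always keyed by tenant."""

    def __init__(self) -> None:
        self._data: dict[tuple[str, str], str] = {}

    def put(self, ctx: TenantContext, key: str, value: str) -> None:
        self._data[(ctx.tenant_id, key)] = value

    def get(self, ctx: TenantContext, key: str) -> str | None:
        # The tenant id is baked into the lookup key; reading another tenant's data
        # would require constructing another tenant's context.
        return self._data.get((ctx.tenant_id, key))
```

In a real deployment this pattern is usually pushed down into the database (row-level security, per-tenant schemas) rather than enforced in application code alone.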
Reliability and retries — LLMs fail. APIs rate-limit. Network calls time out. Production agent systems need retry logic, fallback models, graceful degradation, and clear error propagation. The frameworks give you primitives; the operational logic is yours to build.
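As a sketch of the operational logic that ends up on your plate, the helper below retries a primary model with exponential backoff and then degrades to a fallback model; the callables are stand-ins for whatever LLM client you actually use, and real code would catch only rate-limit and timeout errors rather than bare Exception.

```python
import time
from typing import Callable


def call_with_fallback(primary: Callable[[str], str], fallback: Callable[[str], str],
                       prompt: str, max_attempts: int = 3, base_delay: float = 1.0) -> str:
    """Try the primary model with exponential backoff, then degrade to the fallback."""
    last_error: Exception | None = None
    for attempt in range(max_attempts):
        try:
            return primary(prompt)
        except Exception as exc:  # narrow to rate-limit/timeout errors in real code
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    try:
        return fallback(prompt)
    except Exception:
        # Propagate the original failure so callers can see why the primary path degraded.
        raise RuntimeError("primary and fallback models both failed") from last_error
```

The hard part is not this function; it is deciding, per tool and per tenant, what graceful degradation actually means for the user on the other end.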
Versioning and rollback — when an agent behavior changes because you updated a prompt or added a tool, how do you roll back? How do you test changes before they affect production users? The frameworks treat agents as code, not as deployable services with versioning semantics.
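One common answer, sketched below under the assumption that an agent's behavior is fully captured by its prompt and tool set, is to treat each change as an immutable published version so a rollback is just a pointer move; none of the frameworks above model agents this way out of the box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentVersion:
    """Immutable snapshot of everything that defines the agent's behavior."""
    version: int
    system_prompt: str
    tool_names: tuple[str, ...]


class AgentRegistry:
    """Keeps every published version so deploy and rollback are pointer moves."""

    def __init__(self) -> None:
        self._versions: list[AgentVersion] = []
        self._live_index: int | None = None

    def publish(self, system_prompt: str, tool_names: tuple[str, ...]) -> AgentVersion:
        version = AgentVersion(len(self._versions) + 1, system_prompt, tool_names)
        self._versions.append(version)
        self._live_index = len(self._versions) - 1
        return version

    def rollback(self) -> AgentVersion:
        if self._live_index is None or self._live_index == 0:
            raise RuntimeError("no earlier version to roll back to")
        self._live_index -= 1
        return self._versions[self._live_index]

    @property
    def live(self) -> AgentVersion:
        if self._live_index is None:
            raise RuntimeError("nothing published yet")
        return self._versions[self._live_index]
```

Staging a candidate version and only flipping the live pointer after it passes evaluation is the "test before it reaches production users" half of the same problem.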
The maintenance tax
The frameworks move fast. LangChain alone has shipped multiple rounds of breaking changes from one release to the next. If your production system is pinned to an old version, you miss security fixes and new model support. If you upgrade, you often discover that APIs changed in ways that break your code.
A team at a B2B SaaS company recently shared their experience: they had three engineers who each understood different parts of their LangChain-based pipeline. When one left, the institutional knowledge of why certain workarounds existed disappeared with them. The codebase had accumulated months of patches for LangChain version incompatibilities, half-implemented retry logic, and a custom approval UI that worked well enough to not get fixed.
This is not an unusual story. It is the median outcome for teams that choose to own their agent infrastructure.
The build-vs-buy calculation
Before choosing to build with a framework, run the honest calculation:
What you're building:
- The agent itself (the valuable part — domain logic, prompts, outputs)
- The approval workflow
- The integration connectors and credential management
- The observability stack
- The tenant isolation layer
- The reliability and retry infrastructure
- The versioning and deployment system
- The HITL notification system
What you're maintaining indefinitely:
- Framework version upgrades
- Integration connector updates as APIs change
- Prompt engineering as model behaviors shift
- Infrastructure that grows as usage grows
For most teams, the non-agent infrastructure accounts for 60–80% of the engineering effort over the first year. The actual business logic — the part that creates value — is a fraction of the work.
The alternative
A purpose-built agent platform handles the infrastructure layer so your team builds only the valuable part.
Ariftly gives you:
- Human-in-the-loop approvals built into the protocol — no custom approval UI to build
- Pre-built integrations (GitHub, Gmail, Slack, Jira, HubSpot) with managed credentials
- Complete observability with event sourcing — every state change logged and replayable
- Tenant isolation by default — scoped to your organization
- Reliability infrastructure — retries, fallbacks, graceful degradation
- Versioned, deployable agents with rollback capability
If your use case is AI Readiness compliance or B2B sales outreach, you can deploy the vertical agent — built, tested, and running in production — in 10 minutes. No framework setup, no infrastructure plumbing, no approval workflow to build.
If your use case is something custom, the Remote Agent Protocol lets you build an agent in any language and register it on the platform. You write the domain logic; the platform handles everything else.
When to use a framework
Frameworks are the right choice when:
- You're building something experimental that doesn't need to be reliable or multi-tenant
- Your use case is so unusual that no existing agent covers it, and you're willing to invest in the infrastructure
- You're a research team that values flexibility over operational stability
- You want to contribute to the open-source ecosystem
Frameworks are the wrong choice when:
- You need this in production in weeks, not quarters
- You need multiple people to approve AI actions before they execute
- You're a small team with limited infrastructure bandwidth
- Your domain is one of the verticals that a purpose-built agent already covers
The framework vs. platform decision is fundamentally about where you want to spend your engineering time. Building on a framework is a bet that the infrastructure problems are worth solving yourself. Choosing a platform is a bet that the business logic — the thing that actually differentiates you — is where your time is better spent.
For most teams building real-world AI agents in 2026, that bet is the platform.