Your Agents Need a Contract

A spec-driven architecture for enterprise AI agents using LangGraph as the execution engine. Declarative YAML schemas (Agent, AgentWorkflow, KnowledgeBase) as Kubernetes-style CRDs with conformance profiles, policy-as-code enforcement, and GitOps deployment. Includes a nine-framework ADK comparison (LangGraph vs Google ADK vs OpenAI Agents SDK vs AutoGen vs Semantic Kernel vs Bedrock vs Claude SDK vs CrewAI vs Agno), a production scorecard, and downloadable spec templates.
Nidhi Vichare · April 1, 2026
19 min read
The Inference · LangGraph · LangChain · AI Agents · Enterprise AI · Data Governance · Platform Architecture · CDO · AI Strategy · GitOps

TL;DR

The spec is the product. A declarative specification layer—Kubernetes-style YAML schemas—turns the definition of an AI agent into a version-controlled, auditable, deployable asset. Three schema types (Agent, AgentWorkflow, KnowledgeBase) are the canonical contract.

LangGraph is the execution engine. Agents as state machines, not prompt chains. Five structural differences matter: typed state schemas with reducers, nodes as pure functions, conditional edges with explicit routing, automatic checkpointing at every step, and a cross-thread memory store. No other framework ships all five.

The platform play: The platform that owns the spec layer, the translation layer, and the execution layer wins the enterprise.


Your agents need a contract — declarative spec layer over LangGraph

Why spec-driven architecture

The current approach to building agents is imperative. You write Python (or C#, or TypeScript), you instantiate classes, you wire tools in code, you deploy a container. The agent's definition is entangled with its implementation. This creates six problems that compound at scale: no standardization across teams, poor governance and auditing, configuration drift between environments, high operational complexity per agent, barriers to collaboration between prompt engineers and platform teams, and incompatibility with GitOps.

Every agent becomes a snowflake. Every migration becomes a rewrite.

The enterprise problem is not "pick a framework." It is: how do you govern, audit, and migrate a fleet of agents that each define their own structure?

The alternative is to treat the definition of an AI system as a first-class, version-controlled asset, separate from its runtime. Model it after the pattern that already won infrastructure: Kubernetes-style declarative schemas. Instead of writing code that does things, you write a specification that declares things. What the agent's role is. What model it uses. What tools it has access to. How it routes between steps. Where its state is persisted. What security policies apply.

The three schema types

Three schema types capture the full surface area of an enterprise AI system:

Agent

Declares a single AI agent: identity, role, goal, system prompt, LLM config, bound tools, knowledge base connections, and security guardrails. The contract for one unit of AI capability.

AgentWorkflow

Orchestrates multiple agents into a multi-step process: participants, topology (sequential, parallel, network), routing protocol, and step definitions. The contract for how agents collaborate.

KnowledgeBase

Defines a data source for RAG: data connection, ingestion pipeline, embedding strategy, retrieval config (including topK), and data security policies. The contract for grounded knowledge.

Each schema has three tiers of requirements: minimal (required to deploy anywhere), conditional (required when you opt into specific capabilities like MCP tools or A2A protocols), and optional (quality-of-life fields for cataloging and discovery).

What a spec looks like

apiVersion: agents.platform.io/v1
kind: Agent
metadata:
  name: order-fulfillment-agent
  team: commerce-platform
  version: 2.1.0
spec:
  role: "Order fulfillment coordinator"
  goal: "Validate payment, check inventory, dispatch shipping"
  systemPrompt: "You are an order fulfillment agent..."
  llm:
    model: gpt-4o
    gatewayRef: llm-gateway-prod
  tools:
    - name: payment_verify
      mcp:
        serverRef: payments-mcp
    - name: inventory_check
      type: function
  knowledgeBases:
    - ref: product-catalog-kb
  stateSchema:
    order_id: { type: string }
    payment_status: { type: string }
    inventory_result: { type: object }
    error_log: { type: array, reducer: append }
  checkpointer:
    type: postgres
    config: { pool_size: 5 }
  context:
    environment: production

A CDO reads this and understands: what model, what tools, what data, what team owns it, what environment it runs in. A platform engineer reads this and knows exactly how to deploy it. A compliance officer reads this and can assess the security posture. No code. No black boxes. One artifact that every stakeholder can read.


Why LangGraph as the execution engine

The spec layer needs an execution engine underneath. We chose LangGraph because it is built on a fundamentally different architectural idea: the agent is a state machine, not a prompt chain. Here are the five structural elements.

The five structural elements of a LangGraph workflow: state schema, nodes, edges, checkpointer, and entry config

1. Typed state schemas with reducers

Every LangGraph workflow starts with a state definition. Not a loose dictionary. A typed schema where each field has a declared type, a default value, and a reducer function that defines how concurrent updates merge. The reducer is the critical piece: replace overwrites, append concatenates lists, add sums numbers. Without reducers, parallel agent execution is a race condition. With them, it is deterministic.

from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

class WorkflowState(TypedDict):
    order_id: str
    payment_status: str
    inventory_result: dict
    shipping_label: str
    messages: Annotated[list, add_messages]  # reducer
    error_log: Annotated[list, lambda a, b: a + b]
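To make the reducer semantics concrete, here is a framework-free sketch of the merge step (plain Python, not LangGraph internals; the `merge` function and reducer map are illustrative). Fields with a reducer fold updates in; everything else is replaced:

```python
# Illustrative model of reducer-based state merging, not LangGraph's
# actual implementation. Each field merges via its reducer; fields
# without one get the default "replace" behavior.
def merge(state, update, reducers):
    out = dict(state)
    for key, value in update.items():
        if key in reducers:
            out[key] = reducers[key](out.get(key, []), value)
        else:
            out[key] = value  # default: replace
    return out

reducers = {"error_log": lambda a, b: a + b}  # append semantics

state = {"payment_status": "pending", "error_log": []}
# Two parallel branches each return a partial update:
state = merge(state, {"payment_status": "approved"}, reducers)
state = merge(state, {"error_log": ["inventory API slow"]}, reducers)

print(state)
# {'payment_status': 'approved', 'error_log': ['inventory API slow']}
```

With the reducer declared once in the schema, the same two updates produce the same merged state regardless of which branch finishes first.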

2. Nodes as computation units

Each node is a function that receives current state, does something (calls an LLM, invokes a tool, runs business logic), and returns a partial state update. Nodes do not know about each other. They only know about state. Five node types: agent (LLM-powered), tool_node (executes tool calls), function (arbitrary code), subgraph (nested workflow), human (pauses for human input).

def validate_payment(state: WorkflowState) -> dict:
    result = payment_api.verify(state["order_id"])
    return {"payment_status": result.status}

def check_inventory(state: WorkflowState) -> dict:
    result = inventory_api.check(state["order_id"])
    return {"inventory_result": result}

def dispatch_shipping(state: WorkflowState) -> dict:
    label = shipping_api.create_label(state["order_id"])
    return {"shipping_label": label}

3. Edges with conditional routing

Static edges always go A to B. Conditional edges evaluate a routing function against current state and decide which node runs next. This is explicit, testable code. Not an LLM deciding what to do. The mapping dict makes the routing logic completely transparent.

def route_on_payment(state) -> str:
    if state["payment_status"] == "approved":
        return "check_inventory"
    return "exception_review"

def route_on_inventory(state) -> str:
    if state["inventory_result"].get("in_stock"):
        return "dispatch_shipping"
    return "backorder_queue"

4. Checkpointing as a first-class feature

Every time a node completes, LangGraph saves a checkpoint: a complete snapshot of the graph state. This is automatic, not opt-in. Four capabilities flow from this: human-in-the-loop (pause and resume at any node), crash recovery (resume from last checkpoint, not step 1), time-travel debugging (replay from any historical checkpoint), and fault tolerance (successful nodes' outputs preserved in parallel fan-outs).
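The crash-recovery behavior can be modeled in a few lines. This is a toy sketch of the semantics, not LangGraph's implementation: a snapshot is saved after every node, and a restarted run with the same thread ID skips nodes that already checkpointed:

```python
# Toy model of checkpoint-and-resume semantics (illustrative only).
checkpoints = {}  # thread_id -> list of (node_name, state) snapshots

def run(thread_id, nodes, initial_state):
    saved = checkpoints.setdefault(thread_id, [])
    state = saved[-1][1] if saved else initial_state  # resume point
    for name, fn in nodes[len(saved):]:               # skip completed nodes
        state = {**state, **fn(state)}
        saved.append((name, dict(state)))
    return state

calls = {"n": 0}
def flaky_dispatch(state):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("shipping API timeout")  # simulated 2 AM crash
    return {"shipping_label": "LBL-1"}

nodes = [
    ("validate_payment", lambda s: {"payment_status": "approved"}),
    ("dispatch_shipping", flaky_dispatch),
]

try:
    run("order-1", nodes, {"order_id": "ORD-1"})
except RuntimeError:
    pass  # crashed at step 2; step 1's checkpoint survives

# Restart with the same thread_id: resumes after validate_payment.
final = run("order-1", nodes, {"order_id": "ORD-1"})
```

The second `run` call never re-executes `validate_payment`; it picks up from the last saved snapshot, which is exactly the guarantee the thread ID gives you in the real system.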

5. Entry point and configuration

graph_type is either state_graph (typed state dict flows through every node) or message_graph (chat-style). recursion_limit prevents runaway loops. entry_point is the first node to execute after __start__.

See it run: order fulfillment workflow

This diagram shows an order fulfillment spec executing as a LangGraph state machine. The happy path flows through payment validation, inventory check, shipping dispatch, and order confirmation. The failed path routes through human-in-the-loop exception review. Every node auto-saves a checkpoint to Postgres.

Order fulfillment workflow as a LangGraph state machine with conditional routing and checkpoint indicators

Happy path: start → validate_payment → check_inventory → dispatch_shipping → confirm_order → end (5 checkpoints saved)

Failed + HITL path: start → validate_payment → exception_review (HITL pause) → end (3 checkpoints saved, workflow pauses for human approval at the review node)

The conditional diamond after validate_payment evaluates route_on_payment(state). If the payment clears, the workflow continues through the top path. If the payment is flagged, it routes to the human review node, which is registered with interrupt_before, pausing the entire workflow until an operator approves or rejects.

Building it: from state to compiled app

from langgraph.graph import StateGraph, END

# confirm_order, exception_review, and backorder_queue follow the
# same node-function pattern as the handlers defined above.
graph = StateGraph(WorkflowState)
graph.add_node("validate_payment", validate_payment)
graph.add_node("check_inventory", check_inventory)
graph.add_node("dispatch_shipping", dispatch_shipping)
graph.add_node("confirm_order", confirm_order)
graph.add_node("exception_review", exception_review)
graph.add_node("backorder_queue", backorder_queue)
graph.set_entry_point("validate_payment")
graph.add_conditional_edges(
    "validate_payment", route_on_payment,
    {"check_inventory": "check_inventory",
     "exception_review": "exception_review"})
graph.add_conditional_edges(
    "check_inventory", route_on_inventory,
    {"dispatch_shipping": "dispatch_shipping",
     "backorder_queue": "backorder_queue"})
graph.add_edge("dispatch_shipping", "confirm_order")
graph.add_edge("confirm_order", END)
graph.add_edge("exception_review", END)
graph.add_edge("backorder_queue", END)

Compile with a Postgres checkpointer for persistence. Every invocation needs a thread_id. That ID is your handle for crash recovery, time travel, and HITL.

from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

DB_URI = "postgresql://...:5432/workflows-prod"

# from_conn_string is an async context manager in recent versions
async with AsyncPostgresSaver.from_conn_string(DB_URI) as checkpointer:
    await checkpointer.setup()  # create checkpoint tables on first use

    app = graph.compile(
        checkpointer=checkpointer,
        interrupt_before=["exception_review"])  # HITL pause point

    config = {"configurable": {
        "thread_id": "order-ORD-88421"}}

    result = await app.ainvoke(
        {"order_id": "ORD-88421"},
        config=config)

# Crash at dispatch_shipping? Restart with same thread_id.
# Resumes from last checkpoint, not step 1.

Where LangGraph sits in the landscape

The agentic AI framework space has exploded. By March 2026, every major AI company ships an Agent Development Kit. They all let you define tools, give the LLM a reasoning loop, and maintain some form of memory. They differ on one question: what happens when your agent fails at step 7 of a 12-step workflow at 2 AM?

1. LangChain AI (independent, VC-backed) — MIT

LangGraph v1.0.10 GA. Python + JS/TS. Graph-based state machine for agentic workflows. Auto checkpointing, human-in-the-loop, time travel debug, crash recovery, model agnostic, config-driven. Also offers LangChain (chains/RAG), LangSmith (observability), LangGraph Platform (managed deploy). Fully vendor-neutral. The only ADK with production-grade state persistence across all dimensions.

2. Google — Apache 2.0

Google Agent Development Kit (ADK) v1.26.0. Python. Hierarchical agent tree optimized for Gemini + Google Cloud. Best multimodal story (bidirectional audio/video streaming). Native A2A protocol, native MCP, Vertex AI deploy. Session-level state only, no checkpointing. Powers Google's internal Agentspace. Newest major ADK, still maturing.

3. OpenAI — MIT

OpenAI Agents SDK v0.10.2. Python. Lightweight handoff-based agent delegation. Lowest learning curve, built-in tracing, guardrails, 100+ models. No checkpointing, ephemeral state. Replaced the experimental Swarm framework. Fastest path from zero to working agent. Falls apart for long-running processes or anything requiring persistence.

4. Microsoft — MIT

AutoGen: rewritten as AG2. Python. Conversational multi-agent debate. Model agnostic but non-deterministic and hard to debug. Semantic Kernel: Stable. C#, Python, Java. Enterprise planner + plugins. Broadest connectors, Azure native, but verbose with a high learning curve. Two complementary ADKs. Neither has automatic checkpointing.

5. Amazon Web Services — Managed + Apache 2.0

Bedrock Agents: Managed service via console + CloudFormation. Zero ops, AWS compliance, but Bedrock models only and vendor lock-in. Strands Agents: New (Mar 2026). Python. Code-first, open-source, model agnostic, low learning curve, but very new. Both lack developer-controlled checkpointing.

6. Anthropic — MIT

Claude Agent SDK v0.1.48. Python. MCP-native tool-use chains with safety-first design. Anthropic invented MCP (Model Context Protocol). Tightest MCP integration, in-process server model, lifecycle hooks. Computer use capabilities. Locked to Claude models, no checkpointing.

7. Independent / community — Open source

CrewAI v1.10.1. Python. Role-based multi-agent crews. 44.6k stars, MCP + A2A, lowest learning curve, but limited checkpointing. Agno (ex-Phidata) v2. Python. Speed-first agent SDK. Fastest execution, model agnostic, but no checkpointing and no HITL. Neither has production-grade state management.

Production capabilities scorecard

Production capabilities scorecard comparing nine ADKs across ten enterprise dimensions

The color coding: green is strong native support, amber is partial or platform-specific, red is weak or missing, gray is not available. Read the LangGraph column top to bottom and you see a wall of green. No other ADK comes close across all dimensions. Individual frameworks win individual rows: Google ADK on multimodal, the Claude SDK on MCP, Bedrock on zero-ops. But nobody else fills the state-and-persistence category, which is the one that matters at 2 AM when your agent is stuck at step 7.

The observability gap nobody talks about

The comparison shows observability as a dimension, but the depth difference deserves its own callout because it compounds every other gap.

LangGraph has native, zero-config integration with LangSmith. Set one environment variable and every node execution, tool call, state transition, and checkpoint is traced automatically with full graph-aware context. No callbacks to write, no spans to configure. On top of that, LangGraph has well-supported integrations with Langfuse (via callbacks) and Arize Phoenix (via OpenTelemetry), giving teams a choice between proprietary and open-source observability without writing adapter code.
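In practice, enabling tracing also requires an API key alongside the tracing flag. A minimal sketch, using the long-standing LangChain environment variable names (the key value and project name here are placeholders):

```shell
# Enable LangSmith tracing for a LangGraph app (names per LangChain's
# documented convention; the key is a placeholder, not a real value).
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="<your-langsmith-key>"
export LANGCHAIN_PROJECT="order-fulfillment-prod"  # optional grouping
```

No callback wiring in application code: once these are set, node executions, tool calls, and checkpoints are traced automatically.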

Agno has observability, but the story is thinner. Agno connects to Langfuse and Arize through generic OpenTelemetry bridges (OpenLIT, OpenInference), which means more manual setup, less granular agent-specific tracing, and no proprietary observability platform of its own. AgentOps also supports Agno. The traces work, but they do not capture the structural richness that a graph-aware system like LangSmith extracts from LangGraph natively.

To be clear: Agno is not built on LangGraph. They are completely independent frameworks with different architectures. Agno's philosophy is "pure Python, no graphs, no chains," which gives it extraordinary instantiation speed and a low memory footprint. But that independence comes at a cost. Every observability platform, every evaluation tool, every tracing standard has to build and maintain a separate integration for Agno. And because LangGraph has the larger installed base in production, it gets those integrations first, with more depth, and with more community testing. Agno is always one step behind on the observability toolchain, not because its team is slow, but because it is a third-party framework competing for integration priority against a framework that owns its own observability platform.

This is the catch-up dynamic that affects every independent framework, not just Agno. CrewAI, AutoGen, and the OpenAI Agents SDK all face the same challenge: they depend on third-party observability tools to provide what LangGraph gets natively from LangSmith. When LangSmith ships a new feature (say, time-travel replay of checkpointed state), LangGraph users get it immediately. Everyone else waits for Langfuse or Arize to build an equivalent, and even then, the integration may not be as deep because the underlying framework does not expose the same structural data.

For enterprises evaluating ADKs, observability is not a checkbox. It is the layer that determines how quickly you can debug a failed workflow at 2 AM, how confidently you can audit agent behavior for compliance, and how much operational overhead each agent adds to your platform team. LangGraph's observability advantage is not just a feature comparison. It is a structural advantage that gets wider over time.

The bottom line

If you need... → Use:
Complex workflows with HITL and audit trails → LangGraph
Fastest path to a working agent → OpenAI SDK or CrewAI
Multimodal agents on Google Cloud → Google ADK
Microsoft/.NET enterprise integration → Semantic Kernel
Zero infrastructure management → AWS Bedrock
Safety-first with MCP-native tools → Anthropic Claude SDK
A production platform that runs for years → LangGraph (nothing else comes close)

The spec library

The Agent spec above is one of three schema types. This post ships four companion spec files you can use as starting templates. Each one is fully commented, production-realistic, and follows the conformance model described in this article:

  • Agent spec (spec-agent.yaml): single-agent declaration with model config, MCP tools, state schema with reducers, checkpointer, security guardrails, and operational limits.
  • Workflow spec (spec-workflow.yaml): sequential multi-step workflow with participants, conditional routing, a human-in-the-loop step, retry policies, and a Postgres checkpointer.
  • Team of Agents spec (spec-team-of-agents.yaml): network-topology team where four agents discover and invoke each other through the A2A protocol, with explicit dependency wiring and shared variables.
  • KnowledgeBase spec (spec-knowledgebase.yaml): RAG knowledge base with ingestion pipelines, chunking strategy, embedding config, retrieval with reranking, and security controls.

These are not toy examples. They represent the level of detail a production platform needs to validate, compile, and deploy an AI system from a single declarative file. Fork them, adapt them to your domain, and use them as the starting point for your own spec library.

Sample Agent spec — a fully declared AI agent in YAML
Sample AgentWorkflow spec — sequential multi-step orchestration
Sample Team of Agents spec — network topology with A2A protocol

From spec to runtime

The specification is runtime-independent. The same spec can be executed through three different pathways depending on the team's infrastructure maturity:

PATHWAY A

Runtime Interpretation

A generic containerized runtime reads the spec at startup and bootstraps itself. Same binary + different spec = different agent. No code generation needed.

Lowest friction path

PATHWAY B

Build-Time Transpilation

CI/CD compiles spec → framework source code. Generates a LangGraph StateGraph (or CrewAI crew, or Agno agent) as a build artifact. Spec stays source of truth.

Maximum code control

PATHWAY C

K8s-Native Reconciliation

Schemas become CRDs. An operator manages the agent's lifecycle: scaling, health checks, secret injection, environment promotion. The spec is the deployment manifest.

Cloud-native ops

All three pathways enforce the same contract. The spec is the invariant. The runtime is the variable.

Framework-agnostic adapters

This works because the spec maps cleanly to every major framework's primitives. systemPrompt maps to LangChain's SystemMessagePromptTemplate, to CrewAI's Role/Goal, to AutoGen's system_message. llm.* maps to any SDK's model client. tools[].mcp maps to MCP tool runners or HTTP shims. stateManagement maps to Redis or Postgres session stores. The mapping is mechanical, not creative.
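A sketch of what such an adapter looks like in practice. The function below takes a parsed Agent spec (field names from the YAML example earlier) and produces the constructor arguments a target framework would need; the specific kwarg names are illustrative of CrewAI's role/goal/backstory shape, and no real SDK is invoked:

```python
# Sketch of a spec-to-framework adapter (illustrative, no SDK calls).
# The mapping is mechanical: spec fields map 1:1 to framework kwargs.
def to_crewai_kwargs(spec):
    s = spec["spec"]
    return {
        "role": s["role"],
        "goal": s["goal"],
        "backstory": s["systemPrompt"],  # CrewAI's prompt slot
        "llm": s["llm"]["model"],
    }

spec = {
    "kind": "Agent",
    "spec": {
        "role": "Order fulfillment coordinator",
        "goal": "Validate payment, check inventory, dispatch shipping",
        "systemPrompt": "You are an order fulfillment agent...",
        "llm": {"model": "gpt-4o"},
    },
}

kwargs = to_crewai_kwargs(spec)
print(kwargs["role"])  # Order fulfillment coordinator
```

An equivalent `to_autogen_kwargs` or `to_langgraph_graph` function differs only in the target key names, which is the point: the spec stays fixed while the adapters absorb framework churn.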

Ship a lightweight adapter for each framework: one for LangChain, one for CrewAI, one for AutoGen, one for OpenAI, one for Google ADK, one for Agno. Each adapter takes a spec and returns a runnable agent in that framework's native primitives. Teams keep their favorite SDKs. The spec is the only contract that matters for deployment and operations.


Governing at enterprise scale

Conformance profiles: bronze, silver, gold

Not every agent needs the same level of governance. A proof-of-concept in a sandbox has different requirements than a production agent processing regulated data. Conformance profiles formalize this gradient:

🥉 Bronze — Minimal

Owner, role, goal, system prompt, model, environment tag. The floor — every agent must meet it.

🥈 Silver — Integrations

Gateway refs resolve. MCP tool refs resolve. KB bindings resolve. Cross-references validated. Catches wiring errors before prod.

🥇 Gold — Governed

URN/UUID for catalog. RBAC + egress allowlists + encryption. Telemetry redaction for PII/PHI. Mandatory for regulated data.

CI selects a profile per environment and enforces it as a gate. Admission controllers enforce it again at deploy time. The profile rises as the agent promotes from dev to staging to production.

Policy-as-code enforcement

The spec enables enforcement without a UI, without a portal, without a human reviewer in the loop. Two gates, both automated:

GATE 1

CI Gate

Schema validation + conformance tests. Owner present? Gateway resolves? MCP servers exist? Security fields for gold workloads? Build fails if spec is invalid.

GATE 2

Admission Gate

Deploy-time re-validation via OPA/Kyverno. Environment set? Gateways exist in namespace? MCP refs resolve? Deploy blocked if non-conformant.

Between the two gates, the spec is validated twice: once at build time against the source of truth, once at deploy time against the actual environment. No agent reaches production without meeting its conformance profile.
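The CI gate reduces to a small validation function. Here is a minimal sketch; the field paths follow the Agent spec shown earlier, but the profile contents are illustrative, not a normative bronze definition:

```python
# Minimal sketch of a CI conformance gate (illustrative profile).
BRONZE = [
    "metadata.team", "spec.role", "spec.goal",
    "spec.systemPrompt", "spec.llm.model", "spec.context.environment",
]

def get(doc, dotted):
    # Walk a dotted path like "spec.llm.model" through nested dicts.
    for part in dotted.split("."):
        if not isinstance(doc, dict) or part not in doc:
            return None
        doc = doc[part]
    return doc

def violations(spec, profile):
    return [field for field in profile if get(spec, field) is None]

spec = {
    "metadata": {"team": "commerce-platform"},
    "spec": {
        "role": "Order fulfillment coordinator",
        "goal": "Validate payment, check inventory, dispatch shipping",
        "systemPrompt": "You are an order fulfillment agent...",
        "llm": {"model": "gpt-4o"},
        "context": {"environment": "production"},
    },
}

assert violations(spec, BRONZE) == []         # CI gate passes
assert "spec.role" in violations({}, BRONZE)  # empty spec fails the build
```

Silver and gold extend the same pattern with cross-reference resolution (does `gatewayRef` exist?) and security-field checks, which is why the gate stays cheap enough to run on every commit.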

The GitOps golden path

The spec becomes the atomic unit of the agent's lifecycle. The repository layout is simple:

/agents/          # One Agent spec per file
/workflows/       # One AgentWorkflow spec per file
/knowledge/       # One KnowledgeBase spec per file
/overlays/{env}/  # Environment-specific patches (model SKUs, rate limits, secrets)

ArgoCD (or equivalent) syncs each environment namespace. Promoting an agent from staging to production means merging a PR that moves the spec into the production overlay. The spec diff is the deployment diff. The Git history is the audit trail. The approval workflow is the governance process. No click-ops. No manual configuration. No configuration drift.
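An environment overlay is just a partial spec that overrides environment-specific fields. A hedged sketch of what a production patch for the earlier agent might look like (kustomize-style; the file path and field values are illustrative):

```yaml
# overlays/prod/order-fulfillment-agent.patch.yaml (illustrative)
# Only environment-specific fields are overridden here; the base spec
# under /agents/ remains the single source of truth.
apiVersion: agents.platform.io/v1
kind: Agent
metadata:
  name: order-fulfillment-agent
spec:
  llm:
    gatewayRef: llm-gateway-prod   # prod gateway, not the dev one
  context:
    environment: production
```

Promoting to production is then a PR that touches this patch, and the resulting Git diff is exactly the deployment diff the paragraph above describes.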


Migration and the road ahead

This is where the spec layer pays for itself. When the directive comes down to consolidate from four frameworks to one, the migration path is mechanical rather than heroic:

For each existing agent, decompose it into its structural primitives: what state does it maintain, what nodes does it execute, how does it route, where does it persist. Write (or generate) the spec. Validate it against the appropriate conformance profile. Deploy it through the standard GitOps pipeline. The original framework code can be retired at the team's own pace because the spec, not the code, is now the source of truth.

Move 50 agents by moving 50 spec files. Not by rewriting 50 Python projects.

The convergence thesis

The frameworks will keep multiplying. Google will ship ADK 2.0. OpenAI will add persistence. Anthropic will open Claude SDK to third-party models. New entrants will appear. The framework layer is a commodity that fragments.

The specs will keep converging. Whether it is Kubernetes CRDs, YAML manifests, JSON schemas, or some future standard, the declarative description of an AI system is converging toward a common shape: identity, model, tools, state, routing, persistence, security. The vocabulary is stabilizing even as the implementations diverge.

The platform that owns the spec layer, the translation layer, and the execution layer wins the enterprise.

The spec layer captures intent. The translation layer compiles intent into runnable code. The execution layer runs it with checkpointing, observability, and governance. Each layer can evolve independently. Each layer can be swapped. But the platform that integrates all three, and enforces conformance across them, is the one that scales from one agent to a thousand.

LangGraph is the strongest execution layer available today. The spec layer is the moat.


Further reading


This post is part of "the inference," a series on enterprise AI strategy and architecture.

Build with conviction. Govern with discipline.

Nidhi Vichare