Category
blog
Date
Feb 25, 2026
Author
Yaron Schneider

Checkpoints Are Not Durable Execution

Why LangGraph, CrewAI, Google ADK, and Others Fall Short for Production Agent Workflows

The AI agent ecosystem is moving fast. Frameworks like LangGraph, CrewAI, Google ADK, Strands, and others have made it remarkably easy to prototype multi-step agent workflows - chain LLM calls, invoke tools, route between agents, and build impressive demos in an afternoon.

But there's a fundamental gap between "demo" and "production," and it shows up the moment something goes wrong.

All of these frameworks offer some form of checkpointing and resumability. LangGraph has checkpointers and thread_id. CrewAI has @persist and task replay. Google ADK has SessionService and invocation_id-based resume. On the surface, these features look like they solve the durability problem. They don't.

What they actually give you is a **save point** - a snapshot of state that you, the developer, are responsible for detecting the need to use, manually triggering, and coordinating at scale to avoid duplicate work. That's a far cry from production-grade durability, where the runtime itself guarantees that your workflow runs to completion, automatically recovers from failures, and never loses progress, without you writing a single line of complex detection, recovery and remediation code.

In this post I’ll walk through three frameworks and explain how their "durability" features actually work, and why they break down under real production conditions.

---

LangGraph: Checkpointers That Require You to Be the Orchestrator

How It Works

LangGraph's persistence story revolves around **checkpointers**. When you compile a graph with a checkpointer (e.g., PostgresSaver, SqliteSaver), the runtime automatically saves a snapshot of the graph state at every superstep - each tick of the graph where one or more nodes execute.

Each checkpoint is scoped to a thread_id. To resume a failed graph, you call graph.invoke() with the same thread_id, and LangGraph picks up from the last successful superstep:

config = {"configurable": {"thread_id": "workflow-123"}}

try:
    result = graph.invoke({"query": "process this"}, config=config)
except Exception:
    # You detected the failure. Now you resume manually.
    result = graph.invoke(None, config=config)

There's also a RetryPolicy you can attach to individual nodes for automatic retries with exponential backoff:

builder.add_node(
    "call_api",
    call_api_fn,
    retry_policy=RetryPolicy(max_attempts=5, backoff_factor=2.0),
)

When retries are exhausted, the exception is raised. There is no built-in fallback routing, no dead-letter queue, no notification system. The error is persisted in the checkpoint, and that's it.

Where It Breaks Down

The fundamental problem is that **LangGraph makes you the orchestrator**. The checkpointer saves state, but:

  1. **No automatic failure detection.** If your process crashes, no one knows. There is no supervisor, no watchdog, no heartbeat mechanism. Your workflow is simply dead until something external notices. And by something external I mean you, the user.
  2. **No automatic resumption.** Once you detect the failure (how?), you need to call `invoke(None, config)` with the correct `thread_id`. At scale, hundreds or thousands of concurrent workflows mean building an entire failure-detection-and-retry infrastructure yourself.
  3. **No duplicate execution prevention.** If two processes try to resume the same `thread_id` simultaneously (entirely possible in a distributed system recovering from a partial failure), LangGraph has no built-in coordination to prevent both from executing. You're now responsible for distributed locking and lease coordination, among other things.
  4. **Single-process execution.** The open-source library runs in a single process. There is no distributed execution, no task queue, no worker pool. If that process dies, everything it was running dies with it.
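To make the gap concrete, here is a minimal, framework-agnostic sketch of the recovery infrastructure that points 1-3 leave you to build. The `ThreadRegistry` and `LockManager` classes are hypothetical in-memory stand-ins; in production each would have to be a durable, distributed service you build and operate yourself.

```python
# Sketch of the failure-detection-and-resume loop LangGraph leaves to you.
# The registry and lock below are deliberately minimal in-memory stand-ins
# for what would need to be durable, distributed services.

class ThreadRegistry:
    """Tracks which thread_ids have not yet run to completion."""
    def __init__(self, thread_ids):
        self.incomplete = set(thread_ids)

    def list_incomplete(self):
        return list(self.incomplete)

    def mark_complete(self, thread_id):
        self.incomplete.discard(thread_id)

class LockManager:
    """Prevents two processes from resuming the same thread at once."""
    def __init__(self):
        self.held = set()

    def try_acquire(self, thread_id):
        if thread_id in self.held:
            return False
        self.held.add(thread_id)
        return True

    def release(self, thread_id):
        self.held.discard(thread_id)

def recover_stalled_workflows(graph, registry, locks):
    """One sweep: resume every incomplete thread we can lock."""
    for thread_id in registry.list_incomplete():
        if not locks.try_acquire(thread_id):
            continue  # another recovery process got there first
        try:
            config = {"configurable": {"thread_id": thread_id}}
            graph.invoke(None, config=config)  # None = resume from checkpoint
            registry.mark_complete(thread_id)
        finally:
            locks.release(thread_id)
```

And this sweep itself must run somewhere, survive crashes, and not race with other instances of itself - which is exactly the durability problem you started with.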

Checkpointing is not production-grade durability. It's a low-level building block that shifts the hard problems onto you.

---

CrewAI: Replay and Persist - But Only for Non-Autonomous Agents, and Only If You Build the Recovery Yourself

How It Works

CrewAI has two separate persistence mechanisms, neither of which provides true durable execution.

**Task Replay** lets you re-run tasks from the most recent crew kickoff. Each crew.kickoff() saves task outputs locally, and you can replay from a specific task:

crewai replay -t <task_id>

**Flow Persistence** uses the @persist decorator to save flow state to SQLite after each successful method execution:

@persist
class MyFlow(Flow[MyState]):
    @start()
    def step_one(self):
        self.state.counter = 1

    @listen(step_one)
    def step_two(self):
        self.state.intermediate_results["step2"] = "result"

For human-in-the-loop scenarios, there's a from_pending() / resume() pattern:

flow = MyFlow.from_pending(flow_id="abc123")
result = flow.resume(feedback="approved")

Agent-level retries exist via max_retry_limit (default 2), max_iter (default 20), and max_execution_time, but these only handle LLM-level errors within a running process.
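For reference, here is where those knobs live on the agent. Only the three retry-related parameters come from the CrewAI docs; the `role`, `goal`, and `backstory` values are illustrative placeholders.

```python
from crewai import Agent

researcher = Agent(
    role="researcher",              # illustrative
    goal="find pricing data",       # illustrative
    backstory="...",                # illustrative
    max_retry_limit=2,       # retries on LLM-level errors (default 2)
    max_iter=20,             # cap on ReAct iterations (default 20)
    max_execution_time=300,  # seconds before the agent is stopped
)
```

All of these operate inside a live process; none of them survive the process itself dying.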

Where It Breaks Down 

CrewAI's persistence story has significant gaps at every level:

**Task replay only retains the last kickoff.** There is no historical record. If you're running hundreds of crews and need to identify which ones failed and replay them, you have no built-in mechanism to do so.

**`@persist` does not auto-resume.** It saves state after each successful step, but if your process crashes, nobody restarts the flow. You must detect the failure, find the flow's persisted ID, instantiate a new flow object, and manually route execution past the completed steps. The framework does not skip completed steps automatically — you have to add conditional logic to every method:

@listen(step_one)
def step_two(self):
    if self.state.step_completed >= 2:
        return  # You built this skip logic yourself
    # ... actual work ...

**No distributed execution.** CrewAI runs in a single process. There is no task queue, no worker pool, no placement logic. A crashed process means all in-flight crews are gone.

**No coordination for concurrent recovery.** If multiple processes attempt to resume the same flow, there's no locking or deduplication. You can easily end up with duplicate executions performing the same work — or worse, corrupting shared state.

CrewAI persistence/replay works at workflow boundaries (Flow methods / Crew tasks). It does not persist and resume the agent’s internal ReAct execution cursor (e.g., midway through iterative tool selection/execution) unless you explicitly model each tool/action as its own persisted step/task. This means CrewAI has no durability story for fully autonomous ReAct agents.
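What "model each tool/action as its own persisted step" means in practice can be sketched framework-free: you persist a cursor after every tool call and skip past completed actions on restart. The in-memory `store` dict below is a stand-in for a real database, and the whole function is illustrative, not CrewAI API.

```python
# Framework-free sketch of making an agent's inner loop resumable
# yourself: checkpoint a cursor after every tool call and skip past
# completed actions when the run is restarted.

def run_agent_loop(actions, store, run_id):
    """Execute tool actions in order, persisting progress after each one."""
    state = store.setdefault(run_id, {"cursor": 0, "results": []})
    for i, action in enumerate(actions):
        if i < state["cursor"]:
            continue  # already done in a previous (crashed) run
        state["results"].append(action())  # execute the tool
        state["cursor"] = i + 1            # "persist" the new cursor
    return state["results"]
```

Every tool becomes an explicit step with its own checkpoint - which is precisely the restructuring that defeats the point of an autonomous ReAct loop.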

CrewAI Enterprise adds execution monitoring, but even on the enterprise tier there is no automatic restart of failed crew executions. The monitoring tells you something failed; you still have to fix it yourself, and "fixing it" means writing extremely complex failure-detection and recovery mechanisms that themselves need to be durable.

---

Google ADK: Event Sourcing Without the Orchestrator

How It Works

Google ADK's session management is the most architecturally sophisticated of the three, built on an event-sourcing model. Every interaction - user messages, agent responses, tool calls, and state changes - is an immutable `Event` appended to the session history.

Persistence backends include InMemorySessionService (ephemeral), DatabaseSessionService (SQLite/PostgreSQL/MySQL), and VertexAiSessionService (managed Google Cloud).

Since version 1.14.0, ADK has explicit **resumability support** via ResumabilityConfig:

from google.adk.app import App
from google.adk.runtime import ResumabilityConfig

app = App(
    name='my_resumable_agent',
    root_agent=root_agent,
    resumability_config=ResumabilityConfig(is_resumable=True),
)

To resume a failed workflow, you re-invoke with the original invocation_id:

runner.run_async(
    user_id='u_123',
    session_id='s_abc',
    invocation_id='invocation-123'
)

The framework replays completed tool results from the event history and re-executes only the failed step. Different agent types handle resumes differently - SequentialAgent uses saved current_sub_agent, LoopAgent tracks times_looped, and ParallelAgent identifies completed sub-agents and runs only the unfinished ones.

ADK also provides a ReflectAndRetryToolPlugin (v1.16.0+) that intercepts tool failures, sends the error back to the LLM for reflection, and retries with corrected parameters.

Where It Breaks Down

Despite the more sophisticated architecture, the same fundamental gaps exist:

  1. **No automatic failure detection.** The caller must detect that a workflow was interrupted. There is no watchdog, no heartbeat, no health check built into the framework. The ADK documentation is explicit about this - the caller is responsible for detecting failure and re-invoking.
  2. **No automatic restart.** Once you detect the failure, you need to retrieve the correct invocation_id, construct the right API call or run_async invocation, and execute it. At scale, this means building your own retry infrastructure.
  3. **Tool failures can crash entire workflows.** This is a known and documented issue where an unhandled exception in a tool can propagate up and terminate the entire multi-agent workflow. The recommended mitigation is to return error dicts from tools instead of raising exceptions, but this is a convention, not a guarantee.
  4. **No distributed orchestration.** ADK itself has no built-in worker pool, task queue, or distributed scheduling. You can deploy on Vertex AI Agent Engine or GKE for infrastructure-level resilience, but the framework itself doesn't coordinate across instances.
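The mitigation in point 3 - returning error dicts instead of raising - can be sketched with a hypothetical wrapper. Nothing in ADK enforces this convention; `safe_tool` and `lookup_price` are illustrative names, and one forgotten wrapper means an exception can still take down the whole workflow.

```python
# Sketch of the "return error dicts, don't raise" convention the ADK
# docs recommend for tools. It is a convention, not a guarantee.

def safe_tool(fn):
    """Wrap a tool so failures become data the LLM can see, not crashes."""
    def wrapper(*args, **kwargs):
        try:
            return {"status": "ok", "result": fn(*args, **kwargs)}
        except Exception as e:
            return {"status": "error", "message": str(e)}
    return wrapper

@safe_tool
def lookup_price(sku):
    prices = {"abc": 42}
    if sku not in prices:
        raise KeyError(f"unknown sku: {sku}")
    return prices[sku]
```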

Once again, you are left to create and maintain extremely complex infrastructure to take care of the hard problems that ADK's resumability feature leaves to you.

---

What Production Actually Requires: Durable Execution

The pattern across all major frameworks is the same: they give you **persistence primitives** and call it a day. The hard problems - failure detection, automatic recovery, guaranteed execution, distributed coordination - are all left to you.

This is the difference between checkpointing and durable execution.

**Checkpointing** says: "I saved your state. You take it from here."

**Durable execution** says: "Your agent workflows will run to completion. Period. I handle everything."

In a durable execution model like Dapr Workflows, the runtime provides fundamentally different guarantees:

  • **Automatic state persistence.** Every await or yield point in your workflow is automatically a checkpoint. No explicit save calls, no decorator configuration, no choosing which steps to persist.
  • **Automatic failure recovery.** Before executing any workflow step, the runtime creates a durable reminder. If the process, Dapr, or even the entire cluster crashes, the reminder automatically reactivates the workflow and retries **indefinitely, without any human or external system intervention**. There is no failure to "detect" because recovery is built into the infrastructure.
  • **Replay-based resumption.** On recovery, the workflow function replays from the beginning, but completed activities return their stored results from the event log instead of re-executing. The workflow picks up exactly where it left off, with all local variables restored, as if nothing happened.
  • **Distributed execution.** Workflows and activities are distributed across cluster nodes via an actor placement service with consistent hashing. A single workflow can fan out activities across every node in the cluster. Node failures trigger automatic rebalancing.
  • **No recovery code.** You write your workflow as straightforward, linear code. The runtime handles all persistence, recovery, and coordination transparently. There is no try/catch/resume pattern, no conditional skip logic, no manual thread_id management.

def order_processing_workflow(ctx, order):
    # Each yield point is automatically a durable checkpoint
    inventory = yield ctx.call_activity(check_inventory, input=order)
    payment = yield ctx.call_activity(process_payment, input=order)
    shipment = yield ctx.call_activity(arrange_shipping, input=order)
    return shipment

If this workflow crashes after check_inventory completes but before process_payment finishes, the runtime automatically restarts the workflow. On replay, check_inventory returns its cached result instantly, and process_payment re-executes. No manual intervention, no external monitoring, no duplicate inventory checks.
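The replay mechanics can be illustrated with a toy, framework-free model. The `ReplayContext` class below is illustrative, not the Dapr API: it re-runs the workflow function from the top, returning cached results for activities already recorded in the event log.

```python
# Toy model of replay-based resumption: the workflow function re-runs
# from the beginning, but activities whose results are already in the
# event log return instantly from history instead of re-executing.

class ReplayContext:
    def __init__(self, event_log):
        self.event_log = event_log  # results of completed activities
        self.step = 0
        self.executions = []        # which activities actually ran

    def call_activity(self, fn, input=None):
        key = self.step
        self.step += 1
        if key in self.event_log:        # completed before the crash
            return self.event_log[key]   # replayed from history
        result = fn(input)               # first execution
        self.event_log[key] = result
        self.executions.append(fn.__name__)
        return result

def check_inventory(order):
    return f"reserved:{order}"

def process_payment(order):
    return f"charged:{order}"

def order_workflow(ctx, order):
    inventory = ctx.call_activity(check_inventory, input=order)
    payment = ctx.call_activity(process_payment, input=order)
    return {"inventory": inventory, "payment": payment}
```

Run the workflow once, then "recover" by replaying it against the same event log: the second run produces the same result while executing nothing.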

How do you use Durable Execution with agent frameworks?

Using a durable execution solution like Dapr Workflows brings durability guarantees and automatic recovery to any general-purpose business logic, but it requires tight integration with the underlying agent frameworks to replace their brittle checkpointing mechanisms in a way that's transparent to the user. Don't fall for superficially technical posts that show you different ways to take a simple LLM client, wrap it in a for loop, put the methods in workflow activities, and call the result an agentic workflow. That isn't agentic durability; it's a DIY marketing honeytrap that won't work if you're looking to use LangGraph, CrewAI, Strands, Google ADK, OpenAI AgentKit, Microsoft Agent Framework, or any other popular framework.

To that end, we are working on some groundbreaking technical advancements to be able to provide automatic recovery and guaranteed execution at scale to any agent framework with little to no code changes involved. You should also check out Dapr Agents, an agentic framework with built-in durable execution that is co-maintained with NVIDIA.

---

The Gap Is Structural, Not Incremental

This isn't about feature maturity. LangGraph, CrewAI, Google ADK and others are not on a trajectory toward durable execution. They would need to fundamentally rearchitect their runtimes to provide it. Adding a better checkpointer or a fancier retry policy doesn't close the gap. The gap is **between saving state and guaranteeing completion**, and it requires a runtime that takes ownership of the workflow lifecycle, not one that hands you a snapshot and wishes you well.

For prototyping and demos, checkpointing is fine. For production workloads where agent workflows process orders, manage infrastructure, handle customer requests, or coordinate across services, where a silently failed workflow means lost revenue, corrupted data, or broken SLAs - you need durable execution.

The agent frameworks got the developer experience right. Now the industry needs to get reliability right.
