Category: Blog
Date: Mar 2, 2026
Author: Yaron Schneider

Still Not Durable:

How Microsoft Agent Framework and Strands Agents Repeat the Same Mistakes

In the first part of this series, I walked through how LangGraph, CrewAI, and Google ADK all offer checkpointing and resumability features that fall far short of durable execution. The core problem: they save state, but leave failure detection, automatic recovery, and duplicate prevention entirely to you.

Now I'm completing my durability deep dive series with Microsoft's Agent Framework (the Semantic Kernel + AutoGen successor) and AWS's Strands Agents. Both are backed by major cloud providers. Both have persistence stories. And both repeat the same fundamental mistake.

Let's look at each.

Microsoft Agent Framework: Superstep Checkpoints That Still Need You to Push the Button

How It Works

Microsoft's Agent Framework — the unification of Semantic Kernel and AutoGen, currently in public preview with GA expected March 2026 — has the most explicitly designed checkpointing system of any agent framework I've reviewed.

Workflows execute in supersteps: the same Pregel computation model that LangGraph uses under the hood. At the end of each superstep, the runtime snapshots the entire workflow state into a WorkflowCheckpoint:

@dataclass
class WorkflowCheckpoint:
    workflow_name: str
    graph_signature_hash: str       # SHA-256 of the workflow topology
    checkpoint_id: str
    previous_checkpoint_id: str     # Checkpoints form a chain
    messages: dict[str, list]       # In-flight messages between executors
    state: dict[str, Any]           # All executor states + shared state
    pending_request_info_events: dict  # Human-in-the-loop pending requests
    iteration_count: int

Each executor (agent, sub-workflow, custom logic) can save and restore its own state via lifecycle hooks:

class MyExecutor(Executor):
    async def on_checkpoint_save(self) -> dict[str, Any]:
        return {"messages": self._messages, "cache": self._cache}

    async def on_checkpoint_restore(self, state: dict[str, Any]) -> None:
        self._messages = state.get("messages", [])
        self._cache = state.get("cache", [])

Storage backends include InMemoryCheckpointStorage (development) and FileCheckpointStorage (disk-based, with atomic writes via temp-file + os.replace). To resume, you pass a checkpoint_id to the workflow's run() or run_stream() method:

# Resume from a specific checkpoint
async for event in workflow.run_stream(
    checkpoint_id=saved_checkpoint.checkpoint_id
):
    ...
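The atomic-write pattern FileCheckpointStorage relies on is simple enough to sketch. This standalone version (not the framework's actual code) writes to a temp file and swaps it into place, so a crash mid-write never leaves a reader with a truncated checkpoint:

```python
import json
import os
import tempfile

def atomic_write_json(path: str, payload: dict) -> None:
    """Temp-file + os.replace: readers see either the old file or the new one,
    never a half-written file. Illustrative sketch, not the framework's code."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, path)  # atomic rename within one filesystem on POSIX
```

Note that atomicity here protects a single write, nothing more; it says nothing about two processes writing the same path concurrently.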

The framework even validates that the workflow graph topology hasn't changed since the checkpoint was created, rejecting mismatches via the graph_signature_hash.
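The validation step can be approximated in a few lines: hash a canonical form of the topology and refuse to restore on a mismatch. This is a sketch of the idea, not the framework's actual signature algorithm:

```python
import hashlib
import json

def graph_signature(nodes: list[str], edges: list[tuple[str, str]]) -> str:
    # Canonical form: sorted node names plus sorted directed edges, so that
    # declaration order does not change the hash.
    canonical = json.dumps({"nodes": sorted(nodes), "edges": sorted(edges)})
    return hashlib.sha256(canonical.encode()).hexdigest()

def check_topology(saved_hash: str, nodes: list[str],
                   edges: list[tuple[str, str]]) -> None:
    """Reject a checkpoint whose workflow graph no longer matches."""
    if graph_signature(nodes, edges) != saved_hash:
        raise ValueError("workflow topology changed since checkpoint was created")
```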

On paper, this is well-engineered. In practice, it has the same gap as every other framework.

Where It Breaks Down

Resume is entirely manual. In the actual source code, the restore_from_checkpoint method in the Runner class requires the caller to provide an explicit checkpoint_id. There is no supervisor, no scheduler, no automatic restart. If your process crashes mid-workflow, the checkpoint sits in storage until something external decides to use it. At scale — hundreds of concurrent workflows — you have to build the entire detection-and-retry infrastructure yourself.

No automatic failure detection. The workflow runner has no heartbeat, no lease mechanism, no watchdog. A crashed workflow is indistinguishable from a running one unless you build monitoring on top. The Runner class tracks _resumed_from_checkpoint as a boolean, but only to avoid creating a duplicate initial checkpoint — not to detect or recover from failures.

No duplicate execution prevention. Two processes can resume the same checkpoint_id simultaneously. There is no distributed lock, no lease, no fencing token. The InMemoryCheckpointStorage is a simple dict; the FileCheckpointStorage uses atomic file writes but has no concurrency control across processes.
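A minimal illustration of what's missing: even a crude single-machine lease via exclusive file creation would stop two local processes from resuming the same checkpoint. A real fix needs TTLs and fencing tokens in a shared store (a database row lock, a conditional write); this sketch only shows the shape of the hole:

```python
import os

def acquire_lease(checkpoint_id: str, owner: str, lease_dir: str) -> bool:
    """Best-effort single-owner lease via atomic exclusive create (O_EXCL).

    Illustrative only: this works on one machine with a shared filesystem,
    has no TTL, and no fencing token. Nothing like it exists in the
    framework's storage backends.
    """
    os.makedirs(lease_dir, exist_ok=True)
    path = os.path.join(lease_dir, f"{checkpoint_id}.lock")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another process already owns this checkpoint
    with os.fdopen(fd, "w") as f:
        f.write(owner)  # record the owner for debugging
    return True
```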

Checkpoint granularity is superstep-level, not activity-level. If a superstep contains three executors running in parallel and two succeed before the third fails, all three re-execute on resume. The checkpoint captures the state at the previous superstep boundary. Work completed within the current superstep is lost.

The "Durable Task Extension" requires Azure. Microsoft does offer a durable task extension that provides true crash recovery, distributed failover, and automatic restart. It's technically in the open-source repo. But it depends on durabletask-azuremanaged — which requires the Azure Durable Task Scheduler as a backend. There is no self-hosted, cloud-agnostic option. If you're not on Azure, the production-grade durability story doesn't apply to you.

No built-in retry for tool failures. Individual tool call failures surface as exceptions from RunAsync. There is no RetryPolicy, no automatic backoff, no fallback routing. You implement retry via middleware or try/catch patterns — which means your retry logic itself isn't durable.
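For example, a hand-rolled retry wrapper looks like this, and it illustrates exactly why the approach isn't durable: the attempt counter lives in process memory, so a crash between attempts erases all retry progress.

```python
import asyncio
import random

async def call_with_retry(tool, *args, attempts: int = 3, base_delay: float = 0.5):
    """Naive retry with exponential backoff and jitter.

    The loop state (which attempt we're on) exists only in this process;
    if the process dies mid-backoff, nothing resumes the retry. That is
    the durability gap a workflow runtime would close.
    """
    for attempt in range(attempts):
        try:
            return await tool(*args)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            await asyncio.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```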

The checkpoint system is well-designed as a building block. But a building block is not a runtime guarantee. You still need to detect failures, coordinate retries, prevent duplicates, and handle partial superstep work — all at scale, all durably. Microsoft is at least honest about this: the production answer is their Azure Durable Task Extension, which is essentially Durable Functions for agents. But that's an Azure service, not a framework capability, and for many teams it carries a significant cost.

Strands Agents: Per-Message Persistence Without a Safety Net

How It Works

Strands Agents, AWS's open-source Python SDK (used internally by Amazon Q Developer), takes a different approach from checkpoint-based frameworks. Instead of snapshotting state at discrete intervals, it persists every message as it happens.

The SessionManager registers itself as a hook listener on the agent's lifecycle events:

def register_hooks(self, registry: HookRegistry) -> None:
    registry.add_callback(AgentInitializedEvent,
        lambda event: self.initialize(event.agent))
    registry.add_callback(MessageAddedEvent,
        lambda event: self.append_message(event.message, event.agent))
    registry.add_callback(MessageAddedEvent,
        lambda event: self.sync_agent(event.agent))
    registry.add_callback(AfterInvocationEvent,
        lambda event: self.sync_agent(event.agent))

Every time the agent produces a response or receives a tool result, a MessageAddedEvent fires, which triggers both append_message (persist the message) and sync_agent (persist the full agent state). If an agent loop involves five tool calls, that's ten messages persisted — five model responses and five tool results — plus ten agent state syncs.

Storage backends include FileSessionManager (local disk with atomic writes), S3SessionManager (production), and a pluggable SessionRepository interface for custom backends.

To resume a session, you create a new agent with the same session_id:

# After failure — new process, same session_id
agent = Agent(
    agent_id="assistant",
    session_manager=FileSessionManager(
        session_id="user-abc-123",
        storage_dir="./sessions"
    ),
)
# Agent automatically loads full conversation history from storage
agent("Continue where we left off")

For multi-agent patterns (Graph, Swarm, Workflow), the orchestrator state is persisted after every node call via AfterNodeCallEvent hooks. Graph state includes completed nodes, failed nodes, results, and the next nodes to execute.

Where It Breaks Down

Conversation restore is not execution resume. This is the critical distinction. When a crashed agent is re-initialized with the same session_id, Strands restores the full message history — but the event loop starts from scratch. If the agent was on step 5 of a 10-step tool-calling sequence when it crashed, steps 1–4's messages are in the conversation context, but the agent doesn't know it was mid-execution. It sees the conversation history and starts a new inference cycle. The LLM might repeat work, skip steps, or take an entirely different path. There is no replay of tool executions — it's conversational context, not execution state.
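To see what execution resume would actually require, consider a hypothetical idempotency ledger keyed by (session, step): a re-run agent could then skip side effects that already completed before the crash. Strands persists messages, not anything like this completion record. All names here are illustrative:

```python
async def run_step(ledger: dict, session_id: str, step: str, action):
    """Execute `action` at most once per (session, step).

    Hypothetical sketch of the missing piece: a durable record of *completed*
    work, not just of messages exchanged. A real ledger would live in a
    database with the result written transactionally, not in a dict.
    """
    key = (session_id, step)
    if key in ledger:
        return ledger[key]   # step already completed before the crash: skip it
    result = await action()
    ledger[key] = result     # record completion before moving on
    return result
```

With this in place, replaying steps 1 through 4 after a crash becomes a series of cache hits instead of repeated tool calls.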

Graph failures reset to the beginning. The deserialize_state code in the Graph implementation shows that if a graph execution failed (as opposed to being interrupted by the user), the framework resets everything:

def deserialize_state(self, payload: dict[str, Any]) -> None:
    if not payload.get("next_nodes_to_execute"):
        # FAILED with no path forward: reset everything
        for node in self.nodes.values():
            node.reset_executor_state()
        self.state = GraphState()
        self._resume_from_session = False
        return

Failed graphs don't resume from the failure point. They start over. All completed node work within that execution is discarded.

No retry at any level of the stack. Graph execution is fail-fast — any node failure stops the entire graph and re-raises the exception. Swarm execution is the same: a failed node sets Status.FAILED and breaks the loop. The only built-in retry is ModelRetryStrategy, which handles only ModelThrottledException (HTTP 429s from the model API). Tool failures, network errors, downstream service outages — none of these trigger any automatic retry:

# From _retry.py — only throttling gets retried
if not isinstance(event.exception, ModelThrottledException):
    return  # No retry for any other exception type

No automatic failure detection or restart. There is no supervisor, no health check, no lease mechanism. The community knows this — Issue #1138 ("Agent State Management — Snapshot, Pause, and Resume") explicitly calls out: "Long-running workflows cannot be paused and resumed (requiring continuous execution), agent state cannot be preserved across system restarts or deployments, and there is no ability to create checkpoints during critical operations."

Session state doesn't scale. Per-message persistence means the stored state grows linearly with conversation length. Issue #1230 reports 7-second load times for sessions with 500+ messages and 2GB memory pressure from deserializing full histories. The proposal is to adopt a LangGraph-style checkpoint model — which would be an improvement in efficiency, but wouldn't solve any of the durability problems.

FileSessionManager is not thread-safe. Concurrent writes to the same session_id can corrupt session data. Production deployments require S3 or a custom database backend — and even then, there's no distributed locking to prevent duplicate recovery attempts.

Strands' per-message persistence is useful for maintaining conversational context across sessions. But conversational context is not execution state. Knowing what the agent said is very different from knowing what it did — and guaranteeing it finishes.

The Pattern, Unfortunately, Holds

Five frameworks. Five variations on the same theme.

| Framework          | Persistence Model              | Auto Failure Detection | Auto Resume | Distributed Execution |
|--------------------|--------------------------------|------------------------|-------------|-----------------------|
| LangGraph          | Superstep checkpoints (Pregel) | No                     | No          | No (OSS)              |
| CrewAI             | Task replay + @persist         | No                     | No          | No                    |
| Google ADK         | Event-sourced sessions         | No                     | No          | No                    |
| MS Agent Framework | Superstep checkpoints (Pregel) | No                     | No          | No (Azure only)       |
| Strands Agents     | Per-message sessions           | No                     | No          | No                    |

Every framework saves state. None of them guarantee completion.

Microsoft comes closest to acknowledging the gap — their Durable Task Extension provides real durable execution semantics. But it's an Azure service, not a framework primitive. And it still requires you to opt into a specific cloud platform and runtime model, which can become very costly at scale with considerable latency implications.

The fundamental problem remains: checkpointing is a storage operation, not a reliability guarantee. Writing state to disk (or S3, or Cosmos DB, or PostgreSQL) is the easy part. The hard part is the orchestration layer that detects failures, automatically restarts workflows from the right point, prevents duplicate executions, and does all of this at scale without you writing a single line of recovery code.

That's what durable execution provides. That's what Dapr Workflows provides. And that's what these agent frameworks are missing. The main blocker with using existing workflow engines like Dapr is that they don't have the necessary deep integration with the agent framework's internal engine and lifecycle events — but that's about to change.

The agent frameworks got the developer experience right. Now the industry needs to get reliability right — and it won't get there by adding better checkpointers.
