What Happens When Your Strands Agent Crashes in Production?

AWS Strands + Diagrid

AWS Strands Agents persists every message as it happens through lifecycle event hooks and supports storage backends like S3SessionManager. But conversation history is not execution state. When a crashed agent is re-initialized with the same session_id, Strands restores the full message history but the event loop starts from scratch. The LLM may repeat work, skip steps, or diverge entirely from its previous execution path. Diagrid adds true durable execution so your Strands agents resume exactly where they left off, built on the open-source Dapr project and portable across any cloud.

True execution resume
Cloud-agnostic deployment
Built on open-source Dapr

Production Gap Analysis

Why Strands Session Persistence Isn't Enough

Strands persists conversation context through per-message lifecycle hooks like MessageAddedEvent and AfterNodeCallEvent, with pluggable backends including FileSessionManager and S3SessionManager. But knowing what an agent said is fundamentally different from guaranteeing what it did. Production reliability requires execution state management, not just conversational context persistence.

Conversation restore is not execution resume

When a crashed Strands agent is re-initialized with the same session_id, the full message history is restored. But the event loop starts from scratch. The LLM sees prior messages and may repeat completed work, skip steps, or take an entirely different path. Restoring context is not the same as resuming execution.
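A toy model may make the gap concrete. The sketch below is plain Python, not Strands code: `run_agent`, `send_email`, and the fixed three-step `PLAN` are hypothetical stand-ins. The saved messages survive the simulated crash, but nothing in them records how far the loop had progressed, so the restarted loop repeats a side effect. Here the repeat is guaranteed because the plan is fixed; with a real LLM it is only a possibility.

```python
# Toy model of "history restored, loop restarted". Not Strands code.
sent_emails = []  # stands in for an external side effect


def send_email(body):
    sent_emails.append(body)
    return f"sent: {body}"


# Hypothetical three-step plan the agent is working through.
PLAN = [
    ("fetch", lambda: "q4 data"),
    ("email", lambda: send_email("Q4 summary")),
    ("report", lambda: "report written"),
]


def run_agent(messages, crash_after=None):
    """Run the plan from the top, appending one message per step.

    `messages` plays the role of the persisted conversation history.
    Nothing in it records which step the loop had reached.
    """
    for index, (name, step) in enumerate(PLAN):
        messages.append({"step": name, "result": step()})
        if crash_after == index:
            raise RuntimeError("simulated crash")
    return messages


history = []
try:
    run_agent(history, crash_after=1)  # crash right after the email is sent
except RuntimeError:
    pass

# "Recovery": the same history, but a brand-new loop.
restored = list(history)
run_agent(restored)

print(len(sent_emails))  # 2: the email went out twice
```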

Failed graph executions reset completely

If a Strands graph execution fails, the framework's deserialize_state resets everything and discards all completed node work within that execution. There is no way to resume from the last successful node. The entire graph starts over.

No automatic retry for real failures

The only built-in retry mechanism handles ModelThrottledException for HTTP 429 rate limits. Tool failures, network errors, and service outages trigger no automatic recovery. Both graph and swarm execution exhibit fail-fast behavior.
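For comparison, a general-purpose retry wrapper for tool calls might look like the sketch below. It is a generic illustration, not part of Strands or Diagrid: `ToolError`, `call_with_retry`, and the default delays are assumptions chosen for the example.

```python
import time


class ToolError(Exception):
    """Hypothetical error type for a failed tool call."""


def call_with_retry(tool, *, max_attempts=3, first_delay=0.5, backoff=2.0,
                    sleep=time.sleep):
    """Call `tool()` until it succeeds or `max_attempts` is exhausted."""
    delay = first_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except ToolError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            sleep(delay)
            delay *= backoff  # exponential backoff between attempts


# Usage: a flaky tool that fails twice, then succeeds.
calls = {"count": 0}


def flaky_search():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ToolError("upstream timeout")
    return "results"


delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky_search, sleep=delays.append)
print(result, calls["count"], delays)  # results 3 [0.5, 1.0]
```

Injecting `sleep` keeps the example instant to run; a real caller would leave the default in place.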

No failure detection or automatic restart

There is no supervisor, no health checks, and no lease mechanism. Community issues explicitly acknowledge that long-running workflows cannot be paused and resumed, and there is no ability to create checkpoints during critical operations.
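To show what a lease mechanism contributes, here is a minimal heartbeat-and-lease sketch. It is a simplified stand-in, not how Dapr or Diagrid implement failure detection; `LeaseTable`, the 10-second TTL, and the worker names are all hypothetical.

```python
import time


class LeaseTable:
    """Tracks which worker holds each workflow and when it last checked in."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.leases = {}  # workflow_id -> (worker_id, last_heartbeat)

    def heartbeat(self, workflow_id, worker_id):
        self.leases[workflow_id] = (worker_id, self.clock())

    def expired(self):
        """Return workflow ids whose holder has missed its heartbeat window."""
        now = self.clock()
        return sorted(
            wf for wf, (_, seen) in self.leases.items() if now - seen > self.ttl
        )

    def reassign(self, workflow_id, new_worker):
        """Hand an expired workflow to a healthy worker."""
        self.heartbeat(workflow_id, new_worker)


# Usage with a fake clock so the example runs instantly.
now = [0.0]
table = LeaseTable(ttl_seconds=10, clock=lambda: now[0])
table.heartbeat("wf-1", "worker-a")
table.heartbeat("wf-2", "worker-b")

now[0] = 8.0
table.heartbeat("wf-2", "worker-b")  # worker-b is alive; worker-a went silent

now[0] = 12.0
stale = table.expired()
for wf in stale:
    table.reassign(wf, "worker-b")
print(stale)  # ['wf-1']
```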

Session state grows without bound

Per-message persistence makes session state grow linearly with conversation length. Sessions with 500 or more messages show 7-second load times and 2 GB of memory pressure. FileSessionManager also lacks thread safety, so concurrent writes risk data corruption.

AWS-centric with no cross-cloud portability

Strands is built for the AWS ecosystem with tight Bedrock integration and S3 storage. Running agents across multiple clouds or on-premises requires replacing the entire storage and model layer.

Integration

From Message Replay to Execution Resume

Wrap your Strands agent with DaprWorkflowAgentRunner. Instead of replaying conversation history and hoping the LLM follows the same path, each tool invocation becomes a durable Dapr workflow activity with deterministic recovery. Runs on any cloud, powered by the same Dapr runtime trusted by companies like NASA, Grafana Labs, and HSBC.

AWS Strands alone

from strands import Agent
from strands.models import BedrockModel

model = BedrockModel(
    model_id="anthropic.claude-sonnet-4-20250514",
    region_name="us-east-1",
)

# search_tool and analysis_tool are tools defined elsewhere in your application
agent = Agent(
    model=model,
    tools=[search_tool, analysis_tool],
)

# Persists messages, but crash recovery replays
# the event loop from scratch
response = agent("Analyze Q4 market trends")

AWS Strands + Diagrid (Durable)

from strands import Agent
from strands.models import BedrockModel
from diagrid.agent.strands import DaprWorkflowAgentRunner

model = BedrockModel(
    model_id="anthropic.claude-sonnet-4-20250514",
    region_name="us-east-1",
)

# search_tool and analysis_tool are tools defined elsewhere in your application
agent = Agent(
    model=model,
    tools=[search_tool, analysis_tool],
)

# True durable execution with Dapr Workflows
# Runs on any cloud, automatic recovery
runner = DaprWorkflowAgentRunner(
    name="market-research",
    agent=agent,
    max_iterations=10,
)
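
One way to picture what "each tool invocation becomes a durable activity" means is a result journal: every completed step is written to storage before the next one starts, and a restarted run reads recorded results back instead of calling the tool again. The sketch below is a conceptual model in plain Python, not the Diagrid or Dapr implementation; `Journal`, `run_activity`, and the JSON file are assumptions made for illustration.

```python
import json
import os
import tempfile


class Journal:
    """Record of completed activity results, stored as a JSON file."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        if os.path.exists(path):
            with open(path) as fh:
                self.results = json.load(fh)

    def run_activity(self, key, fn):
        """Return the recorded result for `key`, or run `fn` and record it."""
        if key in self.results:
            return self.results[key]  # replay: no re-execution
        result = fn()
        self.results[key] = result
        with open(self.path, "w") as fh:
            json.dump(self.results, fh)  # persist before moving on
        return result


executions = []  # which tools actually ran, across both attempts


def tool(name):
    def call():
        executions.append(name)
        return f"{name} done"
    return call


def workflow(journal, crash_before=None):
    out = []
    for step in ["search", "analysis", "summary"]:
        if step == crash_before:
            raise RuntimeError("simulated crash")
        out.append(journal.run_activity(step, tool(step)))
    return out


path = os.path.join(tempfile.mkdtemp(), "journal.json")
try:
    workflow(Journal(path), crash_before="summary")
except RuntimeError:
    pass

# A restarted process reloads the journal from disk and reruns the workflow.
final = workflow(Journal(path))
print(final)       # ['search done', 'analysis done', 'summary done']
print(executions)  # ['search', 'analysis', 'summary']: each tool ran once
```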

Comparison

From Prototype to Production

What changes when you add Diagrid to your AWS Strands agents.

| Capability | AWS Strands alone | + Diagrid |
| --- | --- | --- |
| Crash recovery | Message history restored, but the event loop restarts | Deterministic resume from the last completed step |
| Failure detection | None; no supervisor or health checks | Built-in supervisor with heartbeats |
| Graph failure handling | Failed executions reset and discard all node work | Resumes from the last successful node |
| Retry logic | Only for HTTP 429 rate limits | Configurable retry for all failure types |
| Cloud portability | AWS-centric (Bedrock, S3) | Any cloud or on-premises |
| State scalability | Linear growth, 7-second loads at 500+ messages | Efficient execution state management |
| Open-source foundation | Open-source SDK (Apache 2.0) maintained by AWS | Built on the CNCF Dapr project |

Enterprise-Grade

Enterprise Infrastructure for AWS Strands

Everything your team needs to run AWS Strands agents in production. Built on Dapr, the CNCF project trusted by thousands of enterprises.

Security & Compliance

Zero-Trust Security

Every agent gets a SPIFFE-based cryptographic identity through Dapr's built-in security model. All communication is encrypted with automatic mTLS. Fine-grained policies control which agents can access which tools.

Platform Engineering

End-to-End Observability

Distributed tracing for every workflow execution with per-step input and output inspection. Built on OpenTelemetry, so traces integrate with the tools your team already uses.

Infrastructure

Multi-Region Failover

Deploy across regions with active-passive failover. If a region goes down, Dapr Workflows automatically resume in the standby region from their last checkpoint.

Developers

Durable State Store

Dapr Workflows persist state to a remote store after every activity. Survives process crashes, OOM kills, deployments, and infrastructure failures. Use any supported database as the backend.

Platform Engineering

Multi-Instance Coordination

Dapr's actor placement service ensures each workflow is processed by exactly one instance. Scale horizontally without duplicate executions or race conditions.
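The single-owner idea can be sketched with rendezvous hashing: every instance evaluates the same deterministic function and claims only the workflows it wins, so no workflow is claimed twice. This is a simplified stand-in chosen for illustration, not a description of Dapr's placement algorithm; the `owner` function and the instance names are hypothetical.

```python
import hashlib


def owner(workflow_id, instances):
    """Pick exactly one instance for a workflow using rendezvous hashing."""
    def score(instance):
        digest = hashlib.sha256(f"{instance}:{workflow_id}".encode()).hexdigest()
        return int(digest, 16)
    return max(instances, key=score)


instances = ["agent-0", "agent-1", "agent-2"]
workflow_ids = [f"wf-{i}" for i in range(100)]

# Each instance computes ownership independently and claims only its share.
claims = {
    inst: {wf for wf in workflow_ids if owner(wf, instances) == inst}
    for inst in instances
}

total_claimed = sum(len(c) for c in claims.values())
print(total_claimed)  # 100: every workflow has exactly one owner
```

A useful property of this scheme is that removing an instance only moves the workflows that instance owned; every other assignment stays put.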

Compliance & Ops

Full Execution History

Complete audit trail for every workflow with deterministic replay. Re-run any past execution for debugging, compliance, or analysis. All built on the open-source Dapr project.

How It Works

Three Steps to Production

Keep your existing AWS Strands code. Add production reliability in minutes.

01. Build with AWS Strands

Define your agent, tools, and logic using AWS Strands exactly as you normally would. No special patterns or abstractions required.

02. Wrap with Diagrid

Add one import and wrap your agent with DaprWorkflowAgentRunner (or DaprWorkflowGraphRunner for LangGraph). Each tool call becomes a durable Dapr workflow activity.

03. Deploy to production

Run with Dapr Workflows handling crash recovery, state persistence, distributed coordination, security, and observability. Your agent code runs locally or in the cloud.

FAQ

Frequently Asked Questions

Doesn't Strands already persist state? Why do I need Diagrid?

Strands persists conversation messages through lifecycle hooks like MessageAddedEvent and supports backends like S3SessionManager. But this is conversational context persistence, not execution state management. When a crashed agent is re-initialized, the message history is restored but the event loop starts from scratch. The LLM may repeat work, skip steps, or diverge. Diagrid provides true durable execution where each tool call is persisted as a workflow activity and recovery is deterministic, not probabilistic.

What happens to Strands graph executions when they fail?

When a Strands graph execution fails, the framework's deserialize_state resets everything and discards all completed node work within that execution. There is no way to resume from the last successful node. With Diagrid, each node becomes a durable Dapr workflow activity. If node 3 of 5 fails, the runtime replays the results of nodes 1 and 2 and resumes from node 3.
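
The "node 3 of 5" scenario can be walked through with a small checkpointing executor. This is an illustrative model only, not Strands or Diagrid code: `run_graph`, the in-memory `checkpoint` dict (a durable store in a real system), and the five linear nodes are assumptions.

```python
def run_graph(nodes, checkpoint):
    """Run nodes in order, skipping any whose output is already checkpointed.

    `nodes` is a list of (name, fn) pairs; each fn receives the previous
    node's output. `checkpoint` maps node name -> stored output.
    """
    ran = []
    previous = None
    for name, fn in nodes:
        if name in checkpoint:
            previous = checkpoint[name]  # reuse the stored output
            continue
        previous = fn(previous)
        checkpoint[name] = previous  # record only after success
        ran.append(name)
    return previous, ran


outage = {"active": True}


def node3(value):
    if outage["active"]:
        raise ConnectionError("downstream service unavailable")
    return value + ["n3"]


nodes = [
    ("n1", lambda _: ["n1"]),
    ("n2", lambda v: v + ["n2"]),
    ("n3", node3),
    ("n4", lambda v: v + ["n4"]),
    ("n5", lambda v: v + ["n5"]),
]

checkpoint = {}
try:
    run_graph(nodes, checkpoint)  # fails at n3
except ConnectionError:
    pass
after_failure = sorted(checkpoint)

outage["active"] = False
result, ran_on_retry = run_graph(nodes, checkpoint)
print(after_failure)  # ['n1', 'n2']
print(ran_on_retry)   # ['n3', 'n4', 'n5']
print(result)         # ['n1', 'n2', 'n3', 'n4', 'n5']
```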

Can I run Strands agents outside of AWS with Diagrid?

Yes. While Strands itself is built around Bedrock models and S3 storage, Diagrid adds a cloud-agnostic execution layer built on Dapr. Your agent code stays the same, but workflow orchestration and state management run on any cloud or on-premises.

What is Dapr and why does it matter for Strands?

Dapr is a Cloud Native Computing Foundation (CNCF) project used in production by thousands of enterprises including NASA, Grafana Labs, and HSBC. Its workflow engine provides automatic failure detection, durable state persistence, and distributed coordination. Diagrid builds on this vendor-neutral foundation to add true execution durability to Strands agents.

How does Diagrid handle the state growth problem in Strands?

Strands' per-message persistence creates linear state growth that degrades performance at scale, with documented 7-second load times for sessions over 500 messages. Diagrid manages execution state through the Dapr workflow engine, which stores only the durable state needed for recovery, not the full conversation history. This keeps state compact and recovery fast regardless of conversation length.

Deploy Strands Agents to Production Today

Add automatic failure detection, cross-cloud durability, and enterprise security to your AWS Strands agents. Built on open-source Dapr. Start free.