What Happens When Your Strands Agent Crashes in Production?
AWS Strands Agents persists every message as it happens through lifecycle event hooks and supports storage backends like S3SessionManager. But conversation history is not execution state. When a crashed agent is re-initialized with the same session_id, Strands restores the full message history but the event loop starts from scratch. The LLM may repeat work, skip steps, or diverge entirely from its previous execution path. Diagrid adds true durable execution so your Strands agents resume exactly where they left off, built on the open-source Dapr project and portable across any cloud.
Production Gap Analysis
Why Strands Session Persistence Isn't Enough
Strands persists conversation context through per-message lifecycle hooks like MessageAddedEvent and AfterNodeCallEvent, with pluggable backends including FileSessionManager and S3SessionManager. But knowing what an agent said is fundamentally different from guaranteeing what it did. Production reliability requires execution state management, not just conversational context persistence.
Conversation restore is not execution resume
When a crashed Strands agent is re-initialized with the same session_id, the full message history is restored. But the event loop starts from scratch. The LLM sees prior messages and may repeat completed work, skip steps, or take an entirely different path. Restoring context is not the same as resuming execution.
Failed graph executions reset completely
If a Strands graph execution fails, the framework's deserialize_state resets everything and discards all completed node work within that execution. There is no way to resume from the last successful node. The entire graph starts over.
No automatic retry for real failures
The only built-in retry mechanism handles ModelThrottledException for HTTP 429 rate limits. Tool failures, network errors, and service outages trigger no automatic recovery. Both graph and swarm execution exhibit fail-fast behavior.
No failure detection or automatic restart
There is no supervisor, no health checks, and no lease mechanism. Community issues explicitly acknowledge that long-running workflows cannot be paused and resumed, and there is no ability to create checkpoints during critical operations.
Session state grows without bound
Per-message persistence creates linear state growth. Sessions with 500 or more messages show 7 second load times and 2GB memory pressure. FileSessionManager also lacks thread safety, so concurrent writes risk data corruption.
AWS-centric with no cross-cloud portability
Strands is built for the AWS ecosystem with tight Bedrock integration and S3 storage. Running agents across multiple clouds or on-premises requires replacing the entire storage and model layer.
Integration
From Message Replay to Execution Resume
Wrap your Strands agent with DaprWorkflowAgentRunner. Instead of replaying conversation history and hoping the LLM follows the same path, each tool invocation becomes a durable Dapr workflow activity with deterministic recovery. Runs on any cloud, powered by the same Dapr runtime trusted by companies like NASA, Grafana Labs, and HSBC.
from strands import Agentfrom strands.models import BedrockModelmodel = BedrockModel( model_id="anthropic.claude-sonnet-4-20250514", region_name="us-east-1",)agent = Agent( model=model, tools=[search_tool, analysis_tool],)# Persists messages, but crash recovery replays# the event loop from scratchresponse = agent("Analyze Q4 market trends")from strands import Agentfrom strands.models import BedrockModelfrom diagrid.agent.strands import DaprWorkflowAgentRunnermodel = BedrockModel( model_id="anthropic.claude-sonnet-4-20250514", region_name="us-east-1",)agent = Agent( model=model, tools=[search_tool, analysis_tool],)# True durable execution with Dapr Workflows# Runs on any cloud, automatic recoveryrunner = DaprWorkflowAgentRunner( name="market-research", agent=agent, max_iterations=10,)Comparison
From Prototype to Production
What changes when you add Diagrid to your AWS Strands agents.
Enterprise-Grade
Enterprise Infrastructure for AWS Strands
Everything your team needs to run AWS Strands agents in production. Built on Dapr, the CNCF project trusted by thousands of enterprises.
Zero-Trust Security
Every agent gets a SPIFFE-based cryptographic identity through Dapr's built-in security model. All communication is encrypted with automatic mTLS. Fine-grained policies control which agents can access which tools.
End-to-End Observability
Distributed tracing for every workflow execution with per-step input and output inspection. Built on OpenTelemetry, so traces integrate with the tools your team already uses.
Multi-Region Failover
Deploy across regions with active-passive failover. If a region goes down, Dapr Workflows automatically resume in the standby region from their last checkpoint.
Durable State Store
Dapr Workflows persist state to a remote store after every activity. Survives process crashes, OOM kills, deployments, and infrastructure failures. Use any supported database as the backend.
Multi-Instance Coordination
Dapr's actor placement service ensures each workflow is processed by exactly one instance. Scale horizontally without duplicate executions or race conditions.
Full Execution History
Complete audit trail for every workflow with deterministic replay. Re-run any past execution for debugging, compliance, or analysis. All built on the open-source Dapr project.
How It Works
Three Steps to Production
Keep your existing AWS Strands code. Add production reliability in minutes.
Build with AWS Strands
Define your agent, tools, and logic using AWS Strands exactly as you normally would. No special patterns or abstractions required.
Wrap with Diagrid
Add one import and wrap your agent with DaprWorkflowAgentRunner (or DaprWorkflowGraphRunner for LangGraph). Each tool call becomes a durable Dapr workflow activity.
Deploy to production
Run with Dapr Workflows handling crash recovery, state persistence, distributed coordination, security, and observability. Your agent code runs locally or in the cloud.
FAQ
Frequently Asked Questions
Doesn't Strands already persist state? Why do I need Diagrid?
Strands persists conversation messages through lifecycle hooks like MessageAddedEvent and supports backends like S3SessionManager. But this is conversational context persistence, not execution state management. When a crashed agent is re-initialized, the message history is restored but the event loop starts from scratch. The LLM may repeat work, skip steps, or diverge. Diagrid provides true durable execution where each tool call is persisted as a workflow activity and recovery is deterministic, not probabilistic.
What happens to Strands graph executions when they fail?
When a Strands graph execution fails, the framework's deserialize_state resets everything and discards all completed node work within that execution. There is no way to resume from the last successful node. With Diagrid, each node becomes a durable Dapr workflow activity. If node 3 of 5 fails, the runtime replays the results of nodes 1 and 2 and resumes from node 3.
Can I run Strands agents outside of AWS with Diagrid?
Yes. While Strands itself is built around Bedrock models and S3 storage, Diagrid adds a cloud-agnostic execution layer built on Dapr. Your agent code stays the same, but workflow orchestration and state management run on any cloud or on-premises.
What is Dapr and why does it matter for Strands?
Dapr is a Cloud Native Computing Foundation (CNCF) project used in production by thousands of enterprises including NASA, Grafana Labs, and HSBC. Its workflow engine provides automatic failure detection, durable state persistence, and distributed coordination. Diagrid builds on this vendor-neutral foundation to add true execution durability to Strands agents.
How does Diagrid handle the state growth problem in Strands?
Strands per-message persistence creates linear state growth that causes performance degradation at scale, with documented 7 second load times for sessions over 500 messages. Diagrid manages execution state through the Dapr workflow engine, which stores only the durable state needed for recovery, not the full conversation history. This keeps state compact and recovery fast regardless of conversation length.
Deploy Strands Agents to Production Today
Add automatic failure detection, cross-cloud durability, and enterprise security to your AWS Strands agents. Built on open-source Dapr. Start free.
