What Happens When Your Google ADK Agent Crashes in Production?
Google ADK uses an event-sourced architecture and supports resumability. But the framework requires the caller to detect interruptions and re-invoke with the correct invocation_id. When your agent crashes, it waits for someone to notice. Diagrid adds true durable execution with automatic failure detection and recovery, built on the open-source Dapr project and portable across any cloud.
Production Gap Analysis
Why Google ADK's Event Sourcing Isn't Enough
Google ADK provides an event-sourced model that stores all interactions as immutable events and supports resumability. But the responsibility for detecting failures and triggering recovery still falls entirely on you.
Resumability without automatic recovery
ADK's ResumabilityConfig enables replay-based resumption, and completed tool results can be returned from the event history. But the caller must detect the interruption and re-invoke with the original invocation_id. There is no automatic recovery.
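The caller-side burden described above can be sketched roughly as follows. This is a minimal simulation, not ADK's API: `run_invocation`, `AgentCrashed`, and the in-memory `event_log` are hypothetical stand-ins for an agent run and its event history, assumed purely for illustration.

```python
class AgentCrashed(Exception):
    """Hypothetical stand-in for a process crash in the middle of an invocation."""


# Hypothetical event history: invocation_id -> results of completed tool calls.
event_log: dict[str, list[str]] = {}


def run_invocation(invocation_id, tools, crash_after=None):
    """Run tools in order, skipping any already recorded for this invocation_id."""
    done = event_log.setdefault(invocation_id, [])
    for tool in tools[len(done):]:
        if crash_after is not None and len(done) >= crash_after:
            raise AgentCrashed(invocation_id)
        done.append(f"{tool}:ok")  # record the completed tool result
    return done


def call_with_manual_recovery(invocation_id, tools):
    """The caller has to notice the failure and re-invoke with the same ID."""
    try:
        return run_invocation(invocation_id, tools, crash_after=1)
    except AgentCrashed:
        # Nothing restarts the agent automatically; the caller does it here.
        return run_invocation(invocation_id, tools)


result = call_with_manual_recovery("inv-1", ["search", "analyze"])
```

In this sketch the second call reuses the recorded `search` result and only runs `analyze`, but only because the calling code caught the failure and retried with the same ID.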
No failure detection
If an ADK agent crashes, there is no supervisor or watchdog to notice. The documentation explicitly states that the caller must detect workflow interruption. In production, that means building your own monitoring and recovery system.
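As a rough idea of what "building your own monitoring" involves, here is a minimal heartbeat watchdog sketch. `HeartbeatMonitor` and the worker names are hypothetical and not part of ADK or Dapr; a production version would also need persistence, alerting, and a way to trigger the re-invocation.

```python
import time


class HeartbeatMonitor:
    """Hypothetical watchdog: flags workers whose last heartbeat is too old."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # worker_id -> timestamp of the last heartbeat

    def beat(self, worker_id, now=None):
        """Record a heartbeat; `now` can be injected to make tests deterministic."""
        self.last_seen[worker_id] = time.monotonic() if now is None else now

    def stalled(self, now=None):
        """Return the workers that have been silent for longer than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            worker for worker, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.beat("agent-a", now=0.0)
monitor.beat("agent-b", now=25.0)
stalled_workers = monitor.stalled(now=40.0)  # agent-a has been silent for 40s
```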
Google Cloud-centric
ADK is designed primarily for the Google Cloud ecosystem with deployment on Vertex AI. Running agents across multiple clouds or on-premises requires significant custom infrastructure.
No multi-instance coordination
ADK provides no coordination layer for agents scaled out to multiple replicas. Without external locking, duplicate agent executions are possible when multiple processes try to handle the same workflow.
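The kind of external locking this implies can be sketched as a claim-before-run check. The example below is an in-process illustration only: `WorkflowClaims` is a hypothetical stand-in for a shared lock or lease store, and threads stand in for replicas. Across real processes the claim would have to be an atomic operation in a shared store.

```python
import threading


class WorkflowClaims:
    """Hypothetical in-process stand-in for an external lock or lease store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owners = {}  # workflow_id -> replica_id that claimed it

    def try_claim(self, workflow_id, replica_id):
        """Atomically claim a workflow; only the first caller succeeds."""
        with self._lock:
            if workflow_id in self._owners:
                return False
            self._owners[workflow_id] = replica_id
            return True


claims = WorkflowClaims()
winners = []


def replica(replica_id):
    # Each replica tries to pick up the same workflow.
    if claims.try_claim("wf-42", replica_id):
        winners.append(replica_id)


threads = [threading.Thread(target=replica, args=(f"replica-{i}",)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Exactly one replica ends up in `winners`; without the claim step, all five would run the workflow.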
Limited cross-cloud tracing
ADK integrates with Google Cloud logging, but lacks cross-cloud distributed tracing with per-tool-call granularity and a vendor-neutral execution history.
No zero-trust security model
ADK has no built-in mTLS between agent components, no SPIFFE-based workload identity, and no fine-grained policy enforcement for tool access that works independently of Google IAM.
Integration
Add Durable Execution to ADK Agents
Wrap your Google ADK agent with DaprWorkflowAgentRunner. Each tool call becomes a durable Dapr workflow activity with automatic failure detection and recovery. Runs on any cloud, powered by the same Dapr runtime trusted by thousands of enterprises.
    from google.adk import Agent
    from google.adk.tools import FunctionTool

    agent = Agent(
        name="research-analyst",
        model="gemini-2.0-flash",
        tools=[search_tool, analysis_tool],
        instruction="You are a research analyst",
    )

    # Event sourcing saves state, but caller must
    # detect failures and re-invoke manually
    session = agent.create_session()
    response = agent.run(session, "Analyze Q4 trends")

    from google.adk import Agent
    from google.adk.tools import FunctionTool
    from diagrid.agent.google_adk import DaprWorkflowAgentRunner

    agent = Agent(
        name="research-analyst",
        model="gemini-2.0-flash",
        tools=[search_tool, analysis_tool],
        instruction="You are a research analyst",
    )

    # True durable execution with Dapr Workflows
    # Runs on any cloud, automatic recovery
    runner = DaprWorkflowAgentRunner(
        name="market-research",
        agent=agent,
        max_iterations=10,
    )

Comparison
From Prototype to Production
What changes when you add Diagrid to your Google ADK agents.
Enterprise-Grade
Enterprise Infrastructure for Google ADK
Everything your team needs to run Google ADK agents in production. Built on Dapr, the CNCF project trusted by thousands of enterprises.
Zero-Trust Security
Every agent gets a SPIFFE-based cryptographic identity through Dapr's built-in security model. All communication is encrypted with automatic mTLS. Fine-grained policies control which agents can access which tools.
End-to-End Observability
Distributed tracing for every workflow execution with per-step input and output inspection. Built on OpenTelemetry, so traces integrate with the tools your team already uses.
Multi-Region Failover
Deploy across regions with active-passive failover. If a region goes down, Dapr Workflows automatically resume in the standby region from their last checkpoint.
Durable State Store
Dapr Workflows persist state to a remote store after every activity. Survives process crashes, OOM kills, deployments, and infrastructure failures. Use any supported database as the backend.
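The checkpoint-after-every-activity idea can be illustrated with a small sketch. This is not Dapr's implementation: `run_durable` is a hypothetical helper, and a local JSON file stands in for the remote state store. It only shows why persisting each result lets a restarted run skip work that was already completed.

```python
import json
import os
import tempfile


def run_durable(steps, state_path):
    """Run named steps in order, persisting each result before moving on."""
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)  # results saved by an earlier, interrupted run
    for name, fn in steps:
        if name in state:
            continue  # completed before the crash; reuse the stored result
        state[name] = fn()
        with open(state_path, "w") as f:
            json.dump(state, f)  # checkpoint after every activity
    return state


calls = []


def search():
    calls.append("search")
    return "3 sources"


def analyze():
    calls.append("analyze")
    if calls.count("analyze") == 1:
        raise RuntimeError("simulated crash")  # fail on the first attempt only
    return "Q4 up"


path = os.path.join(tempfile.mkdtemp(), "workflow.json")
steps = [("search", search), ("analyze", analyze)]
try:
    run_durable(steps, path)  # crashes inside analyze, after search is saved
except RuntimeError:
    pass
final_state = run_durable(steps, path)  # resumes; search is not re-run
```

Here `search` runs once even though the workflow was started twice, because its result was checkpointed before the simulated crash.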
Multi-Instance Coordination
Dapr's actor placement service ensures each workflow is processed by exactly one instance. Scale horizontally without duplicate executions or race conditions.
Full Execution History
Complete audit trail for every workflow with deterministic replay. Re-run any past execution for debugging, compliance, or analysis. All built on the open-source Dapr project.
How It Works
Three Steps to Production
Keep your existing Google ADK code. Add production reliability in minutes.
Build with Google ADK
Define your agent, tools, and logic using Google ADK exactly as you normally would. No special patterns or abstractions required.
Wrap with Diagrid
Add one import and wrap your agent with DaprWorkflowAgentRunner (or DaprWorkflowGraphRunner for LangGraph). Each tool call becomes a durable Dapr workflow activity.
Deploy to production
Run with Dapr Workflows handling crash recovery, state persistence, distributed coordination, security, and observability. Your agent code runs locally or in the cloud.
FAQ
Frequently Asked Questions
Doesn't Google ADK already support resumability? Why do I need Diagrid?
ADK's event-sourced model and ResumabilityConfig allow replaying completed tool results from history. But the caller must detect the interruption and re-invoke with the original invocation_id. There is no automatic failure detection, no supervisor process, and no distributed coordination. Diagrid provides all of this through the Dapr workflow engine.
Can I run Google ADK agents outside of Google Cloud with Diagrid?
Yes. Diagrid adds a cloud-agnostic durability layer built on Dapr. Your ADK agent code stays the same, but workflow state is managed by the Dapr runtime, which runs on any cloud, on-premises, or in hybrid environments.
Do I need to change my Google ADK agent code?
No. Your Agent definition, tools, and model configuration stay exactly the same. You only wrap the agent with DaprWorkflowAgentRunner; the agent logic itself is unchanged.
What is Dapr and why does it matter for Google ADK?
Dapr is a Cloud Native Computing Foundation (CNCF) project used in production by thousands of enterprises. Its workflow engine provides automatic failure detection, durable state persistence, and distributed coordination. Diagrid builds on this proven, vendor-neutral foundation to add production infrastructure to ADK agents.
What security does Diagrid add to Google ADK agents?
Diagrid adds SPIFFE-based cryptographic workload identity, automatic mTLS between all agent components, and fine-grained policy enforcement. This provides zero-trust security that operates independently of any single cloud provider's IAM system.
Deploy Google ADK Agents to Production Today
Add automatic failure detection, cross-cloud durability, and enterprise security to your Google ADK agents. Built on open-source Dapr. Start free.
