
What is AI Orchestration?

AI agents are only as reliable as the infrastructure orchestrating them. This guide explains how workflow orchestration works across deterministic pipelines, autonomous agents, and multi-agent systems, and why durable execution is the difference between a demo and a production system.

AI orchestration is how agents get work done

An AI agent is a system that uses an LLM to reason and take actions through tool calls. But reasoning alone does not accomplish tasks. The agent needs something to coordinate the sequence of actions: deciding what to do first, passing data between steps, handling failures, and knowing when the job is done.

That coordination layer is orchestration. It determines the control flow of your agent system. There are three fundamentally different patterns, each with different reliability characteristics.

Three Types of Agent Workflows

Each pattern has different failure modes and different requirements for production reliability.

Pattern 1

Deterministic Workflows

The execution path is defined at build time. You know exactly which steps will run and in what order. Think of it as a directed acyclic graph (DAG) where the nodes are tool calls or LLM invocations and the edges are fixed.

LangGraph state graphs, simple CrewAI task sequences, and traditional workflow engines all use this pattern. The LLM might generate content at each step, but the control flow is predetermined.

Predictable execution path
Easy to test and validate
No runtime adaptability
Example DAG nodes: Fetch Data, LLM: Analyze, Save Results, LLM: Generate Report.

Every step is known at build time. The path never changes.
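A minimal sketch of this pattern in plain Python, with hypothetical stand-in functions (`fetch_data`, `analyze`, `generate_report`) in place of real API and LLM calls:

```python
# Deterministic pipeline: the step order is fixed at build time;
# only the content each step produces varies at runtime.

def fetch_data(topic: str) -> dict:
    return {"topic": topic, "rows": [1, 2, 3]}  # stand-in for an API call

def analyze(data: dict) -> dict:
    # stand-in for an LLM invocation
    return {"summary": f"{len(data['rows'])} rows about {data['topic']}"}

def generate_report(analysis: dict) -> str:
    # stand-in for an LLM invocation
    return f"Report: {analysis['summary']}"

def run_pipeline(topic: str) -> str:
    # The control flow below never changes, regardless of LLM output.
    data = fetch_data(topic)
    analysis = analyze(data)
    return generate_report(analysis)

print(run_pipeline("sales"))  # -> Report: 3 rows about sales
```

Because the path is fixed, each run can be tested against an expected shape of output, which is exactly what makes this pattern easy to validate.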

Pattern 2

Autonomous Agents (Agentic Loops)

The LLM decides what to do at runtime. It observes the current state, reasons about the next action, and picks a tool to call. Then it observes the result and decides again. This agentic loop — sometimes called a ReAct loop — is the core pattern behind most modern agent frameworks.

CrewAI autonomous agents, PydanticAI agents, OpenAI Agents SDK, and Strands all use this pattern. The execution path is different every time, even for the same input.

Adapts to any situation
Handles novel tasks
Unpredictable execution length
Recovery is non-trivial
LLM: Observe + Reason → Tool Call (decided at runtime) → LLM: Evaluate Result → Done? If no, loop back; if yes, return the result.

The LLM decides each step. No two runs follow the same path.
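The loop's control-flow shape can be sketched in a few lines of Python. The `decide` function below is a scripted stand-in for the LLM's observe-and-reason step, and the tool names and step cap are illustrative, not any framework's real API:

```python
# Agentic (ReAct-style) loop: at each iteration the "LLM" picks the
# next tool or signals completion; the path is chosen at runtime.

TOOLS = {
    "search": lambda q: f"results for {q}",
    "summarize": lambda text: f"summary of ({text})",
}

def decide(history):
    # Stand-in for "LLM: Observe + Reason". A real LLM may choose
    # differently on every run; this version is scripted.
    if not history:
        return ("search", "durable execution")
    if history[-1][0] == "search":
        return ("summarize", history[-1][1])
    return ("done", history[-1][1])

def agent_loop(max_steps=10):
    history = []
    for _ in range(max_steps):  # cap steps: loop length is unpredictable
        action, arg = decide(history)
        if action == "done":
            return arg                    # "return result"
        result = TOOLS[action](arg)       # tool call decided at runtime
        history.append((action, result))  # observation for the next turn
    raise RuntimeError("step budget exhausted")

print(agent_loop())
```

Note that a crash mid-loop loses `history`, which is exactly why recovery is non-trivial: replaying from the start hands the LLM a blank history and a fresh chance to choose a different path.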

Pattern 3

Multi-Agent Orchestration

Multiple agents collaborate on a task, each with their own tools and expertise. One agent might research while another analyzes and a third writes the report. Agents hand off to each other, delegate subtasks, or run in parallel.

OpenAI agent handoffs, CrewAI crews, LangGraph multi-agent graphs, and Google ADK sub-agents all use this pattern. Each agent is itself running an agentic loop, making the entire system deeply non-deterministic.

Specialized agents for complex tasks
Parallel execution possible
Failure in one agent affects all
Coordination state is complex
Coordinator Agent delegates to: Research Agent (3 tools), Analysis Agent (2 tools), Writer Agent (1 tool).

Each agent runs its own agentic loop. A crash in one affects all.
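A coordinator-with-specialists topology can be sketched as follows. Each agent function here is a hypothetical stand-in for a full agentic loop like the one above:

```python
# Multi-agent orchestration: a coordinator hands work to specialist
# agents in sequence. Each function stands in for an agent that would
# run its own agentic loop with its own tools.

def research_agent(topic: str) -> str:
    return f"findings on {topic}"

def analysis_agent(findings: str) -> str:
    return f"analysis of {findings}"

def writer_agent(analysis: str) -> str:
    return f"report based on {analysis}"

def coordinator(topic: str) -> str:
    # Sequential handoffs; real systems may also run agents in parallel.
    # If analysis_agent crashes here, writer_agent never runs and the
    # handoff state is lost unless the orchestrator persisted it.
    findings = research_agent(topic)
    analysis = analysis_agent(findings)
    return writer_agent(analysis)

print(coordinator("AI orchestration"))
```

Even in this tiny sketch the coupling is visible: every downstream agent depends on state produced upstream, so one crash strands the whole chain.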

What Happens When an Agent Crashes?

Without durable orchestration, a single failure creates a chain reaction that wastes time, money, and trust.

Cascading Failure in a 15-Step Agent Workflow

Steps 1 through 12: completed (lost). Step 13: crash point. Steps 14 and 15: never reached.

You pay for every LLM call again

12 completed steps means 12 LLM calls you already paid for. Without durable state, recovery means re-running all of them from scratch. For models like GPT-4 or Claude, each call costs real money. Multiply by every failure across your fleet.

Replay produces different results

LLMs are non-deterministic. When you re-run a failed agent from the start, the LLM may choose different tools, skip steps, or reason differently. You do not get a replay of the original execution. You get a completely new one.

Failures cascade across agents

In multi-agent systems, one crashed agent can stall the entire pipeline. Downstream agents wait for data that never arrives. Upstream agents hold state for a handoff that never completes. One failure becomes a system-wide outage.

Without Durable Orchestration

Crash at step 13 means re-running steps 1 through 12

Every LLM call is re-billed at full price

LLM non-determinism means different results on retry

No one knows it crashed until someone checks

Downstream agents stall waiting for handoffs that never arrive

With Durable Orchestration

Crash at step 13 resumes from step 13. Steps 1 through 12 are replayed from saved state.

Only the failed step is re-executed. No duplicate LLM charges.

Prior results are deterministically replayed from durable state, not re-generated

Built-in supervisor detects the failure and triggers automatic recovery

Handoff state is persisted. Downstream agents resume seamlessly.

AI-Native Workflow Orchestration

Traditional workflow engines were built for microservices. AI agents need an orchestration layer designed for non-deterministic, long-running, LLM-driven workloads.

Durable execution for every tool call

Each tool call in an agentic loop becomes a durable activity. The result is persisted immediately after completion. On failure, the orchestrator replays saved results for completed steps and resumes execution from the exact point of failure.
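The checkpoint-and-replay mechanic can be illustrated with a toy store. This is a conceptual sketch, not the Dapr Workflows API; `DurableRun` and its dict-backed store are inventions for illustration:

```python
# Durable execution sketch: each step's result is persisted the moment
# it completes. After a crash, completed steps are replayed from the
# store instead of being re-executed (and re-billed).

class DurableRun:
    def __init__(self, store: dict):
        self.store = store  # survives crashes (a plain dict for brevity)
        self.calls = 0      # counts actual (non-replayed) executions

    def activity(self, step_id, fn):
        if step_id in self.store:        # completed before the crash:
            return self.store[step_id]   # replay saved result, no re-bill
        result = fn()                    # execute the step...
        self.store[step_id] = result     # ...and checkpoint immediately
        self.calls += 1
        return result

store = {}
run = DurableRun(store)
run.activity("step-1", lambda: "a")
run.activity("step-2", lambda: "b")
# -- process crashes here; the durable store survives --

resumed = DurableRun(store)              # new process, same saved state
resumed.activity("step-1", lambda: "a")  # replayed, not re-executed
resumed.activity("step-2", lambda: "b")  # replayed, not re-executed
resumed.activity("step-3", lambda: "c")  # only the new step runs
print(resumed.calls)  # -> 1: only step-3 actually executed
```

Replaying from durable state is also what sidesteps LLM non-determinism on retry: the orchestrator returns the recorded result rather than asking the model to produce it again.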

Automatic failure detection

A supervisor process monitors agent health through heartbeats and lease management. When an agent crashes, recovery starts automatically. No manual intervention, no silent failures, no engineers discovering outages hours later.

Multi-agent coordination

Distributed locking prevents duplicate executions when multiple instances are running. Handoff state between agents is persisted as part of the workflow. If any agent in the chain fails, recovery is coordinated across the entire system.

Built for long-running workloads

Deep agent chains can run for minutes or hours. AI-native orchestration handles long execution windows with timer-based scheduling, human-in-the-loop waiting, and state that persists across process restarts and deployments.

Open Source

Dapr Workflows: AI-Native Orchestration from the CNCF

Dapr is a Cloud Native Computing Foundation (CNCF) project used in production by thousands of organizations including NASA, Grafana Labs, and HSBC. Its workflow engine provides the durable execution primitives that AI agent frameworks are missing.

Instead of building your own recovery logic, retry mechanisms, and state management, Dapr Workflows provides these as infrastructure. Your agent code stays the same. Each tool call becomes a durable workflow activity automatically.

Vendor-neutral: runs on any cloud or on-premises
Supports Python, JavaScript, .NET, Java, and Go
Native integrations with LangGraph, CrewAI, PydanticAI, Google ADK, Strands, OpenAI Agents, and more
Automatic mTLS with SPIFFE workload identity
Dapr Workflows: CNCF Graduated Project

License: Apache 2.0
GitHub Stars: 25,000+
Production Users: Thousands of enterprises
Agent Frameworks: 7+ supported
Languages: Python, JS, .NET, Java, Go

See How It Works for Your Framework

Diagrid provides a managed Dapr Workflows experience with framework-specific integrations. Pick your framework to see the exact code.

Stop rebuilding what crashed. Start recovering from where it failed.

Add durable execution to your AI agents in minutes. Start free, no credit card required.