
Bilgin Ibryam | December 5, 2024

Fault tolerant microservices made easy with Dapr resiliency

In this post, you'll learn how to make your applications resilient without writing custom code or becoming a YAML expert—thanks to Conductor's new resiliency builder.

Why Distributed Systems Are Different

Remember the good old days when your application ran on a single machine? Life was simpler then. Today, everyone is building distributed systems whether they like it or not. Even a basic cloud application involves multiple services talking to each other across network boundaries. When your application spans multiple services, you're not just dealing with code anymore - you're dealing with the laws of physics. Network calls take time, services can disappear, and what worked perfectly on your machine can fail spectacularly in production. These failures have a direct impact on user experience and business outcomes, leading to slow page loads or errors that frustrate users and drive them away. 

Consider a typical microservice architecture:

  • Service A calls Service B
  • Service B writes to a database
  • There's a message queue in there somewhere
  • Oh, and don't forget the third-party APIs

Each of these network hops is a potential point of failure. And unlike in-process method calls that either work or throw an exception, network calls can fail in maddeningly creative ways:

  • They might time out without telling you what happened.
  • They might have succeeded but the response was lost.
  • They might fail halfway through.
  • They might just... hang.
  • Or my personal favorite: they might succeed but take so long that the caller has given up.

Whether you're building microservices on Kubernetes or another distributed system platform, you need a systematic approach to handle these network failures. One that works consistently across all your services, regardless of their implementation language or framework.

Resiliency techniques for Daprized applications

This post explores essential resilience strategies with Dapr that can protect your services from network failures, and shows how Diagrid Conductor makes it easy to get these configurations right.

Timeouts: Your First Line of Defense

Network calls can fail for many reasons. A service might be overloaded, a network connection might be congested, or a process might be stuck. Without timeouts, your application could hang indefinitely waiting for responses that may never come. By setting a timeout duration, developers can cut off unresponsive services, freeing up resources to handle new requests and preventing requests from lingering in limbo. But the key to setting appropriate timeouts starts with understanding your service's real-world behavior. By measuring actual response times in production (for example through Conductor’s application latency graphs), you can use the 95th percentile as a baseline to find a good timeout value that ensures the system fails fast when things go wrong, instead of masking issues by allowing extremely slow requests to complete.

gRPC call latency shown in Conductor

In addition, remember that your service likely doesn't operate in isolation. When setting timeouts, you need to account for your entire dependency chain, including infrastructure calls. A common mistake is to set a timeout below the cumulative response time of the downstream calls it covers, which guarantees failures. Conductor's Apps Graph helps here by letting you isolate a single service and see all of its call dependencies and their latencies.

Service dependencies view in Conductor

A great resource on the topic of timeouts and resiliency in general is Sam Newman's Building Resilient Distributed Systems book.

Timeouts with Resiliency Policies in Dapr

One of Dapr's key strengths is how it unifies configurations across different programming languages and frameworks. Instead of dealing with language-specific implementations or wrestling with multiple timeout types (connection, socket, response, connection pool lock, etc.), Dapr provides a single, consistent way to configure resiliency policies that works whether you're using Go, Java, Python, or any other supported language. Here's the core structure of a Dapr resiliency policy:

Dapr resiliency policy structure
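In case the diagram isn't handy, here is a minimal textual sketch of that structure (the section names follow the Dapr Resiliency resource; the placeholder names and values are illustrative):

apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: my-resiliency          # policy name (and optionally a namespace)
scopes: [my-app]               # app IDs that load this policy
spec:
  policies:                    # named resiliency behaviors
    timeouts: {}
    retries: {}
    circuitBreakers: {}
  targets:                     # where those behaviors apply
    apps: {}                   # calls to other Dapr apps
    components: {}             # state stores, pub/sub, bindings
    actors: {}                 # actor invocations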

The configuration for Dapr resiliency policies is composed of three main parts: metadata and scopes defining which applications the policy applies to; policies specifying the named resiliency behaviors (like timeouts, retries, or circuit breakers); and targets determining which interactions these policies act on - whether that's communication with other services, infrastructure components, or actors. Once defined, place this configuration in your local Dapr components directory as dapr-resiliency.yaml, or apply it to your Kubernetes cluster using:

kubectl apply -f dapr-resiliency.yaml

The following sections explore complete resiliency configurations for common scenarios.

Retries: Recovering from Transient Failures

A request might fail because it hit a service that was being upgraded, encountered network congestion, or was routed to an overloaded instance. These are transient errors, and in such cases, sending the same request to a different instance, or again after the condition has cleared, can turn a failure into a success. However, not every failure should trigger a retry, and the type of error often tells you whether retrying makes sense. While there are business considerations too, HTTP status codes are a good first indicator of whether a call should be retried:

  • 404 Not Found: Don't retry - the resource doesn't exist
  • 400 Bad Request: Don't retry - your request is invalid
  • 401 Unauthorized: Don't retry - get new credentials first
  • 503 Service Unavailable: Do retry - service might recover
  • 504 Gateway Timeout: Do retry - temporary network issue

Similar considerations apply to gRPC status codes:

  • Code 1 CANCELLED: Don't retry
  • Code 3 INVALID_ARGUMENT: Don't retry
  • Code 4 DEADLINE_EXCEEDED: Retry with backoff
  • Code 5 NOT_FOUND: Don't retry
  • Code 8 RESOURCE_EXHAUSTED: Retry with backoff

Retry Configuration with Dapr

Retries and timeouts naturally complement each other - timeouts ensure your system fails fast when needed, while retries help it recover from temporary glitches. Retries are configured in Dapr the same way as the standalone timeouts shown earlier. Here's an example policy that combines them:

apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: order-service-resiliency
  namespace: production
scopes: [order-service]         # apps that load this policy
spec:
  policies:
    timeouts:
      shipment-timeout: 500ms   # cut off long requests - the first line of defense
    retries:
      shipment-retry:
        policy: exponential     # retry failed requests with an increasing delay
        maxInterval: 5s         # maximum interval between retries
        maxRetries: 10          # give up after 10 attempts
  targets:
    apps:
      shipment-service:
        timeout: shipment-timeout
        retry: shipment-retry   # apply both policies to this service

With this configuration, if a request times out after 500ms, the retry policy kicks in with exponential backoff, starting at 500ms (default) and increasing up to 5s max between request attempts. After 10 failed attempts, if the operation is still failing, Dapr will stop retrying. Dapr automatically adds jitter (small random variations) to exponential retry delays, which helps prevent retry storms by spreading out retries from multiple failing clients across a wider time window, giving services a better chance to recover.

Starting in Dapr 1.15, you'll also be able to specify which error codes should trigger retries (tracking via GitHub issue #6683), giving you even finer control over your retry behavior.
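Based on the direction of that issue, the retry policy fragment will likely look something like the sketch below. The matching field and its exact names are an assumption here, so verify against the 1.15 release notes and documentation before relying on them:

retries:
  shipment-retry:
    policy: exponential
    maxInterval: 5s
    maxRetries: 10
    matching:                          # assumed field name, per issue #6683; verify in the 1.15 docs
      httpStatusCodes: "429,500-599"   # retry only throttling and server errors
      gRPCStatusCodes: "4,8,14"        # DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED, UNAVAILABLE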

When Retrying is Not Safe

Not all operations can be safely retried. Some operations are naturally idempotent, meaning the request can be repeated without changing the final result. Examples include read operations like getting a customer's profile or querying an order status. These operations can be retried safely because multiple attempts do not alter the outcome. However, other requests like processing a payment, submitting an order, or incrementing a counter can have unintended side effects if retried without caution. To make operations safe for retry, you often need to add idempotency mechanisms. For instance, when processing a payment, you can generate a unique idempotency key for each transaction. Before processing, the system checks if that key was already used, preventing double-charging even if the original request succeeded but the response was lost. While you can implement such mechanisms yourself, Dapr could soon offer built-in support for idempotency (tracked via GitHub issue #7334), making it easier to build retry-safe distributed applications.

Circuit Breakers: Preventing Cascading Failures

While retries can help clients recover from transient failures, they can also make a bad situation worse. Take a service that is struggling under high load, causing requests to time out. Each client retries these failed requests, creating even more load on the already overwhelmed service. Soon, you're in a death spiral - the more requests fail, the more retries are generated, making the service even less likely to recover. What you need is a way to reduce load when a service is struggling. This is where circuit breakers come in.

Circuit breaker status shown in Conductor

Inspired by their electrical namesakes, circuit breakers in distributed systems "trip" or open to prevent cascading failures. After a certain number of requests fail, the circuit breaker opens, causing subsequent requests to fail immediately without even trying to reach the struggling service. This gives the service time to recover by preventing resources from being tied up in new requests that are likely to fail. A circuit breaker can also enter a “half-open” state whereby a few requests are let in to gauge whether the service has recovered or not. This is the mechanism that allows it to go back into a “closed” state, letting all requests through to the service.
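In Dapr's resiliency format, a hand-written circuit breaker policy looks roughly like the sketch below (the app names and thresholds are illustrative; tune them to your own recovery patterns):

apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: shipment-circuit-breaker
scopes: [order-service]
spec:
  policies:
    circuitBreakers:
      shipment-cb:
        trip: consecutiveFailures >= 5   # open the circuit after 5 consecutive failures
        timeout: 30s                     # stay open for 30s before moving to half-open
        maxRequests: 1                   # allow 1 trial request through while half-open
        interval: 8s                     # window after which failure counts reset
  targets:
    apps:
      shipment-service:
        circuitBreaker: shipment-cb      # apply the breaker to calls to this app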

Easy Circuit Breakers Configuration with Conductor

Creating circuit breakers via Dapr resiliency policies follows a similar format to timeouts and retries, but involves more settings to configure. To simplify this process, use the resiliency builder feature in Diagrid Conductor.

The Resiliency Policy Builder complements the recently launched Component Builder, providing a simple, visual way to craft Dapr YAML specifications without hassle. It saves time and prevents common errors through:

  • Parameter validation to prevent mistakes.
  • Auto-completion for values including namespaces, apps, components, and actor names.
  • Guidance on building field dependencies and complex policy combinations.
  • In-app documentation for every field, including formats and allowed values.
  • Easy overrides for Dapr's built-in default retry policies.

With this wizard, you can quickly create Dapr resiliency policies, copy or download them, and apply them to your local app or production environment. 

Rate Limiting: Protecting Services Under Load

Sometimes you don't have control over client-side behavior and need to protect your services from overload. Even successful requests, if numerous enough, can overwhelm a service. Consider a scenario where a client application has a bug causing it to make too many API calls, or when a sudden spike in legitimate traffic exceeds your service's capacity to handle requests. Without rate limiting, your service could become unresponsive or crash, affecting all users. Server-side rate limiting techniques can also protect your application from Denial of Service (DoS) attacks. Such traffic floods aren't always malicious - sometimes they come from bugs in your own software, like a misconfigured retry loop or an overly aggressive polling consumer app. Dapr offers two approaches to rate limiting to protect the server side:

  1. Global Concurrency Control
    The dapr.io/app-max-concurrency annotation limits how many requests can be in flight at any point, regardless of the source - whether they're direct service calls, pub/sub message processing, or input bindings. You can also cap the maximum HTTP header and request body sizes to prevent memory exhaustion from large payloads. Learn more in the concurrency control and request size documentation.
  2. HTTP Rate Limiting Middleware
    For finer-grained control over HTTP traffic, Dapr provides a rate limit middleware that can restrict requests per second per client IP, preventing any single client from overwhelming your service. Both approaches are sketched below.
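Here is a minimal sketch of both approaches, assuming a Kubernetes deployment for an app called order-service (the names and limits are illustrative):

# 1. Global concurrency control via sidecar annotations (in the app's pod template)
annotations:
  dapr.io/enabled: "true"
  dapr.io/app-id: "order-service"
  dapr.io/app-max-concurrency: "50"   # at most 50 requests in flight at once
---
# 2. HTTP rate limiting middleware: define the component...
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: ratelimit
spec:
  type: middleware.http.ratelimit
  version: v1
  metadata:
  - name: maxRequestsPerSecond
    value: "10"                       # per-second limit per client IP
---
# ...and reference it in the Dapr Configuration's HTTP pipeline
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: appconfig
spec:
  httpPipeline:
    handlers:
    - name: ratelimit
      type: middleware.http.ratelimit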

Diagrid Conductor offers a comprehensive overview of every Daprized application, enabling you to easily inspect all Dapr configurations and annotations.

Application configuration overview in Conductor

These server-side protections, combined with client-side resiliency policies offer a stronger foundation against various types of service overload and misbehaving clients.

Taking Action: Your Resilience Checklist

While timeouts, retries, and circuit breakers form the foundation of network resiliency, getting your applications to a production-ready state requires a holistic approach. Diagrid Conductor helps you fortify your Dapr installations through continuous, comprehensive checks that produce observability, performance, security, and resiliency advisories. While this post focuses on resiliency, keep in mind that all these aspects work together to help you create robust distributed systems. You can perform a quick production readiness check of your Dapr cluster by connecting it to Conductor. Conductor is completely free to use, and signing up takes just seconds.

Here are the steps to make your Dapr applications more resilient with Conductor:

  1. Identify Critical Paths. Use Conductor's Apps Graph to map interaction patterns across apps and infrastructure components and discover slow interactions and potential bottlenecks.
  2. Configure Sensible Timeouts. Set timeouts based on measured service latencies, not arbitrary values. Use p95 latencies as a baseline, avoiding optimization for outliers.
  3. Analyze Error Patterns. Use Conductor's detailed metrics to understand failures per protocol and endpoint. Enable increasedCardinality in Dapr if needed - Conductor will notify you when it is (see the configuration sketch after this list).
  4. Implement Safe Retries. Add retries with appropriate backoff values only for idempotent operations. Remember: infinite retries are as dangerous as no retries.
  5. Set Up Circuit Breakers. Configure self-healing thresholds based on your service recovery patterns to prevent cascade failures.
  6. Create Resilience Policies. Use Conductor's Policy Builder to avoid configuration mistakes and time wasted reading docs. Get guided assistance for complex policy combinations.
  7. Monitor Policy Effectiveness. Track policy activations in Conductor to discover silently recovering errors that might indicate larger issues.
  8. Control Service Load. Implement rate limiting based on Conductor's resource usage graphs and protect services from overload.
  9. Optimize Resources. Use Conductor resource recommendations to match resource allocation with actual demand and act on cost-saving opportunities.
  10. Stay Current. Review your Dapr installation quarterly for best practices with Conductor Advisor. Receive alerts for optimization opportunities as you upgrade Dapr.
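For step 3, high-cardinality HTTP metrics are toggled in the Dapr Configuration resource. A minimal sketch is below; the resource name is illustrative, and the default for this flag has changed across Dapr versions, so check the docs for your version:

apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: appconfig
spec:
  metrics:
    enabled: true
    http:
      increasedCardinality: true   # emit per-path/per-method HTTP metrics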

Are you ready to make your Dapr applications more resilient? Start by signing up for Conductor and build your first resiliency policy in seconds. Connect a cluster and get advisories to fine-tune it.