Resilience Patterns with Polly: Circuit Breakers, Retries, and Timeouts

TL;DR Polly patterns for production: when to retry, circuit breaker configuration, timeout strategies, and combining policies for fault-tolerant ASP.NET Core applications.

Your application works perfectly in development. In production, the database hiccups, the third-party API returns 503s, and the network drops packets.

Resilience patterns are the difference between a service that recovers gracefully and one that cascades into a complete outage.

Common questions this answers

  • When should I use retry vs circuit breaker vs both?
  • How do I configure exponential backoff with jitter?
  • What are the circuit breaker states and how do I tune thresholds?
  • How do I integrate Polly with HttpClientFactory?
  • When should I NOT retry an operation?

Definition (what this means in practice)

Resilience patterns are strategies for handling transient failures in distributed systems. They enable applications to recover from temporary issues (retry), protect against cascading failures (circuit breaker), and prevent resource exhaustion (timeout, bulkhead).

In practice, this means configuring policies that determine how your application responds when external dependencies fail, slow down, or become unavailable.

Terms used

  • Transient failure: a temporary failure that resolves itself (network blip, temporary overload).
  • Circuit breaker: a pattern that stops calling a failing service, giving it time to recover.
  • Exponential backoff: increasing wait times between retries (2s, 4s, 8s, 16s...).
  • Jitter: randomness added to backoff to prevent thundering herd when many clients retry simultaneously.
  • Bulkhead: isolating failures by limiting concurrent operations to a resource.
  • Hedging: sending parallel requests to reduce latency by using the fastest response.

Reader contract

This article is for:

  • Engineers building services that call external APIs or databases.
  • Teams implementing fault tolerance in distributed systems.

You will leave with:

  • A decision framework for choosing resilience patterns.
  • Production-ready configurations for retry, circuit breaker, and timeout.
  • HttpClientFactory integration patterns.

This is not for:

  • Saga pattern or distributed transactions (different problem domain).
  • Distributed circuit breakers across multiple service instances.

Quick start (10 minutes)

If you do nothing else, do this:

Verified on: ASP.NET Core (.NET 10).

The modern approach uses Microsoft.Extensions.Http.Resilience, which replaces the older Microsoft.Extensions.Http.Polly package.

  1. Add the resilience package:
dotnet add package Microsoft.Extensions.Http.Resilience
  2. Add the standard resilience handler to your HttpClient:
// Program.cs
builder.Services.AddHttpClient<IPaymentService, PaymentService>(client =>
{
    client.BaseAddress = new Uri("https://api.payments.example.com");
})
.AddStandardResilienceHandler();

This single line adds:

  • Rate limiter (1,000 permits)
  • Total timeout (30 seconds)
  • Retry (3 attempts, exponential backoff with jitter)
  • Circuit breaker (10% failure ratio threshold)
  • Per-attempt timeout (10 seconds)

For most HTTP client scenarios, the standard handler is production-ready out of the box.

When to use each pattern

Pattern         | Use when                                    | Example
----------------|---------------------------------------------|--------------------------------
Retry           | Transient failures likely to resolve        | Network timeout, HTTP 503
Circuit Breaker | Service may be down for an extended period  | Database outage, API rate limit
Timeout         | Operation might hang indefinitely           | Slow third-party API
Bulkhead        | Need to isolate failures                    | Multiple downstream services
Fallback        | Have alternative behavior                   | Return cached data on failure
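
The first four patterns are demonstrated in the sections below; fallback is not, so here is a minimal sketch of an HTTP fallback that returns an empty payload when the call fails. The client type and the empty-array payload are illustrative assumptions, not part of the standard handler:

builder.Services.AddHttpClient<ICatalogService, CatalogService>()
    .AddResilienceHandler("catalog-fallback", builder =>
    {
        builder.AddFallback(new FallbackStrategyOptions<HttpResponseMessage>
        {
            // Fall back on exceptions and non-success status codes
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(response => !response.IsSuccessStatusCode),

            // Return a synthetic response (cached or empty data) instead of failing
            FallbackAction = static _ =>
                Outcome.FromResultAsValueTask(new HttpResponseMessage(HttpStatusCode.OK)
                {
                    Content = new StringContent("[]")
                })
        });
    });

In a combined pipeline, place fallback first (outermost) so it can catch failures surfaced by every other strategy, including BrokenCircuitException and TimeoutRejectedException.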

Retry patterns

Retry handles transient failures by re-executing the operation after a delay.

Exponential backoff

Each retry waits longer than the previous one, giving the failing service time to recover:

builder.Services.AddHttpClient<IOrderService, OrderService>()
    .AddResilienceHandler("retry-pipeline", builder =>
    {
        builder.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            Delay = TimeSpan.FromSeconds(2),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true
        });
    });

With these settings:

  • Attempt 1: immediate
  • Retry 1: ~2 seconds (with jitter)
  • Retry 2: ~4 seconds (with jitter)
  • Retry 3: ~8 seconds (with jitter)

Why jitter matters

Without jitter, if 1,000 clients fail at the same moment, they all retry at exactly the same times. This thundering herd can overwhelm the recovering service.

Jitter adds randomness to the delay, spreading retries across time:

builder.AddRetry(new HttpRetryStrategyOptions
{
    BackoffType = DelayBackoffType.Exponential,
    UseJitter = true  // Adds randomness to each delay so retries don't line up
});

Always enable jitter in production.

Handle specific failures

By default, the HTTP retry strategy handles:

  • HTTP 500+ status codes
  • HTTP 408 (Request Timeout)
  • HTTP 429 (Too Many Requests)
  • HttpRequestException
  • TimeoutRejectedException

To customize which failures trigger retries:

builder.AddRetry(new HttpRetryStrategyOptions
{
    ShouldHandle = static args =>
    {
        return ValueTask.FromResult(args is
        {
            Outcome.Result.StatusCode:
                HttpStatusCode.RequestTimeout or
                HttpStatusCode.TooManyRequests or
                HttpStatusCode.ServiceUnavailable
        });
    }
});

Disable retries for non-idempotent operations

POST and PATCH requests are generally not idempotent, and even PUT and DELETE are only safe to retry if the server implements them idempotently. If the server processed the original request but the response was lost, retrying a non-idempotent operation creates duplicates.

builder.Services.AddHttpClient<IOrderService, OrderService>()
    .AddStandardResilienceHandler(options =>
    {
        // Only retry safe HTTP methods (GET, HEAD, OPTIONS)
        options.Retry.DisableForUnsafeHttpMethods();
    });

Or disable for specific methods:

options.Retry.DisableFor(HttpMethod.Post, HttpMethod.Delete);

Circuit breaker patterns

The circuit breaker prevents your application from repeatedly calling a failing service. It has three states:

Circuit breaker states

Closed (normal operation): Requests pass through. The breaker monitors failure rate.

Open (circuit broken): Requests fail immediately with BrokenCircuitException. No calls reach the downstream service.

Half-Open (testing recovery): A limited number of requests pass through. If they succeed, the circuit closes. If they fail, it opens again.

Configuration

builder.Services.AddHttpClient<IInventoryService, InventoryService>()
    .AddResilienceHandler("circuit-breaker-pipeline", builder =>
    {
        builder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            // Open circuit after 10% of requests fail
            FailureRatio = 0.1,

            // Minimum requests before failure ratio is calculated
            MinimumThroughput = 100,

            // Time window for calculating failure ratio
            SamplingDuration = TimeSpan.FromSeconds(30),

            // How long circuit stays open before testing
            BreakDuration = TimeSpan.FromSeconds(30)
        });
    });

Handle BrokenCircuitException

When the circuit is open, calls throw BrokenCircuitException. Handle this gracefully:

public async Task<IActionResult> GetInventory(int productId)
{
    try
    {
        var inventory = await _inventoryService.GetAsync(productId);
        return Ok(inventory);
    }
    catch (BrokenCircuitException)
    {
        // Circuit is open - service is known to be down
        _logger.LogWarning("Inventory service circuit is open");
        return StatusCode(503, new ProblemDetails
        {
            Title = "Service temporarily unavailable",
            Detail = "Inventory service is currently unavailable. Please retry later."
        });
    }
}

Tune thresholds for your traffic

The default settings assume moderate traffic. Adjust based on your load:

Traffic level              | MinimumThroughput | SamplingDuration | BreakDuration
---------------------------|-------------------|------------------|--------------
Low (< 100 req/min)        | 10                | 60s              | 30s
Medium (100-1,000 req/min) | 100               | 30s              | 30s
High (> 1,000 req/min)     | 500               | 10s              | 15s

Low-traffic services need lower thresholds to detect failures. High-traffic services can use shorter sampling windows.
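
For example, the low-traffic row applied to the standard handler might look like this sketch (the client type is illustrative; treat the numbers as starting points and adjust them against real traffic):

builder.Services.AddHttpClient<IReportingService, ReportingService>()
    .AddStandardResilienceHandler(options =>
    {
        // Low traffic (< 100 req/min): fewer samples, longer sampling window
        options.CircuitBreaker.MinimumThroughput = 10;
        options.CircuitBreaker.SamplingDuration = TimeSpan.FromSeconds(60);
        options.CircuitBreaker.BreakDuration = TimeSpan.FromSeconds(30);
    });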

Timeout patterns

Timeouts prevent operations from hanging indefinitely.

Total timeout vs attempt timeout

The standard resilience handler uses two timeouts:

builder.Services.AddHttpClient<ISearchService, SearchService>()
    .AddStandardResilienceHandler(options =>
    {
        // Maximum time for entire operation including retries
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(30);

        // Maximum time for each individual attempt
        options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(10);
    });

With 3 retries and 10-second attempt timeout:

  • Each attempt can take up to 10 seconds
  • Total operation fails after 30 seconds regardless of retry state

Standalone timeout

For non-HTTP operations:

builder.Services.AddResiliencePipeline("database-timeout", builder =>
{
    builder.AddTimeout(TimeSpan.FromSeconds(5));
});

// Usage
var pipeline = provider.GetRequiredService<ResiliencePipelineProvider<string>>()
    .GetPipeline("database-timeout");

await pipeline.ExecuteAsync(async ct =>
{
    await _database.QueryAsync(sql, ct);
});
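
When the 5-second budget is exceeded, the pipeline throws TimeoutRejectedException, which the caller should handle. A minimal sketch (the logger is assumed to be injected):

try
{
    await pipeline.ExecuteAsync(async ct =>
    {
        await _database.QueryAsync(sql, ct);
    });
}
catch (TimeoutRejectedException)
{
    // The 5-second timeout fired before the query completed
    _logger.LogWarning("Database query timed out");
    throw;
}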

Combining patterns

Real production systems combine multiple patterns. Order matters.

Standard resilience handler order

The standard handler applies strategies in this order:

  1. Rate limiter - Prevents overwhelming the service
  2. Total timeout - Caps entire operation duration
  3. Retry - Retries failed attempts
  4. Circuit breaker - Prevents calls when service is down
  5. Attempt timeout - Caps individual attempt duration

This order ensures:

  • Rate limiting happens before any work
  • Total timeout captures the entire retry sequence
  • Circuit breaker prevents retries when service is known to be down
  • Each attempt has its own timeout

Custom pipeline

For fine-grained control:

builder.Services.AddHttpClient<ICriticalService, CriticalService>()
    .AddResilienceHandler("critical-service", builder =>
    {
        // 1. Total timeout for entire operation
        builder.AddTimeout(new TimeoutStrategyOptions
        {
            Timeout = TimeSpan.FromSeconds(60),
            Name = "TotalTimeout"
        });

        // 2. Retry with exponential backoff
        builder.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            Delay = TimeSpan.FromSeconds(1)
        });

        // 3. Circuit breaker
        builder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = 0.1,
            MinimumThroughput = 50,
            SamplingDuration = TimeSpan.FromSeconds(30),
            BreakDuration = TimeSpan.FromSeconds(30)
        });

        // 4. Per-attempt timeout
        builder.AddTimeout(new TimeoutStrategyOptions
        {
            Timeout = TimeSpan.FromSeconds(10),
            Name = "AttemptTimeout"
        });
    });

HttpClientFactory integration

All examples above use HttpClientFactory, which is the recommended pattern.

Named clients

builder.Services.AddHttpClient("PaymentApi", client =>
{
    client.BaseAddress = new Uri("https://api.payments.example.com");
    client.DefaultRequestHeaders.Add("Api-Key", builder.Configuration["PaymentApiKey"]);
})
.AddStandardResilienceHandler();

// Usage
public class PaymentService(IHttpClientFactory httpClientFactory)
{
    public async Task<PaymentResult> ProcessAsync(Payment payment)
    {
        var client = httpClientFactory.CreateClient("PaymentApi");
        var response = await client.PostAsJsonAsync("/v1/payments", payment);
        return await response.Content.ReadFromJsonAsync<PaymentResult>();
    }
}

Typed clients

builder.Services.AddHttpClient<IOrderService, OrderService>(client =>
{
    client.BaseAddress = new Uri("https://api.orders.example.com");
})
.AddStandardResilienceHandler();

// Usage
public class OrderService(HttpClient httpClient) : IOrderService
{
    public async Task<Order> GetAsync(int id)
    {
        return await httpClient.GetFromJsonAsync<Order>($"/orders/{id}");
    }
}

Default resilience for all clients

Apply resilience to all HttpClients by default:

builder.Services.ConfigureHttpClientDefaults(clientBuilder =>
{
    clientBuilder.AddStandardResilienceHandler();
});

// Individual clients can override
builder.Services.AddHttpClient("NoResilience")
    .RemoveAllResilienceHandlers();

When NOT to retry

Retrying is not always the right answer.

Do not retry these

Scenario                                | Why
----------------------------------------|-----------------------------------------------------
HTTP 400 Bad Request                    | Client error; retrying won't help
HTTP 401 Unauthorized / 403 Forbidden   | Authentication or authorization issue, not transient
HTTP 404 Not Found                      | Resource doesn't exist
Non-idempotent POST without safeguards  | May create duplicates
Business logic failures                 | Application error, not infrastructure

Idempotency is required for safe retries

An operation is idempotent if calling it multiple times produces the same result as calling it once.

Safe to retry:

  • GET /orders/123
  • PUT /orders/123 (replaces entire resource)
  • DELETE /orders/123 (deleting twice = still deleted)

Dangerous to retry without safeguards:

  • POST /orders (creates new order each time)
  • PATCH /orders/123 (may apply partial update twice)

If you must retry non-idempotent operations, implement idempotency keys:

public async Task<Order> CreateOrderAsync(CreateOrderRequest request)
{
    // Client generates unique idempotency key
    var idempotencyKey = request.IdempotencyKey ?? Guid.NewGuid().ToString();

    var httpRequest = new HttpRequestMessage(HttpMethod.Post, "/orders")
    {
        Content = JsonContent.Create(request)
    };
    httpRequest.Headers.Add("Idempotency-Key", idempotencyKey);

    var response = await _httpClient.SendAsync(httpRequest);
    return await response.Content.ReadFromJsonAsync<Order>();
}

The server must check the idempotency key and return the same response for duplicate requests.
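
A minimal sketch of that server-side check, assuming a hypothetical IIdempotencyStore backed by a database table or distributed cache:

[HttpPost("/orders")]
public async Task<IActionResult> CreateOrder(
    [FromHeader(Name = "Idempotency-Key")] string idempotencyKey,
    CreateOrderRequest request)
{
    // Seen this key before? Return the stored response instead of reprocessing.
    var existing = await _idempotencyStore.GetAsync(idempotencyKey);
    if (existing is not null)
    {
        return Ok(existing);
    }

    var order = await _orderService.CreateAsync(request);

    // Store the result so a retried request gets the same response
    await _idempotencyStore.SaveAsync(idempotencyKey, order);
    return Ok(order);
}

A production implementation also has to handle two concurrent requests carrying the same key, typically with a unique constraint on the key column.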

Decision framework

Use this framework to choose patterns:

Question                                  | If yes                      | If no
------------------------------------------|-----------------------------|-----------------------------------------------
Is the failure transient?                 | Use Retry                   | Don't retry
Is the operation idempotent?              | Retry all methods           | Retry GET only, or implement idempotency keys
Could the service be down for a while?    | Add Circuit Breaker         | Retry alone may suffice
Could the operation hang?                 | Add Timeout                 | May not need timeout
Do you call multiple downstream services? | Consider Bulkhead isolation | Combined pipeline sufficient
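
Bulkhead isolation is not shown elsewhere in this article; in Polly v8 it is expressed as a concurrency limiter. A sketch that gives one dependency its own small pool so it cannot exhaust resources shared with other calls (the client type and limits are illustrative):

builder.Services.AddHttpClient<IRecommendationService, RecommendationService>()
    .AddResilienceHandler("recommendations-bulkhead", builder =>
    {
        // At most 20 concurrent calls to this dependency; queue up to 10 more
        builder.AddConcurrencyLimiter(20, 10);
    });

When the queue is full, calls fail fast with RateLimiterRejectedException instead of piling up and starving other downstream calls.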

Copy/paste artifact: production resilience configuration

// Program.cs - Production resilience configuration
using Microsoft.Extensions.Http.Resilience;
using Polly;

// Standard resilience for most HTTP clients
builder.Services.AddHttpClient<IExternalApiClient, ExternalApiClient>(client =>
{
    client.BaseAddress = new Uri(builder.Configuration["ExternalApi:BaseUrl"]!);
})
.AddStandardResilienceHandler(options =>
{
    // Total timeout: 30 seconds for entire operation
    options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(30);

    // Retry: 3 attempts with exponential backoff and jitter
    options.Retry.MaxRetryAttempts = 3;
    options.Retry.Delay = TimeSpan.FromSeconds(1);
    options.Retry.BackoffType = DelayBackoffType.Exponential;
    options.Retry.UseJitter = true;

    // Disable retry for non-idempotent methods
    options.Retry.DisableForUnsafeHttpMethods();

    // Circuit breaker: open after 10% failures
    options.CircuitBreaker.FailureRatio = 0.1;
    options.CircuitBreaker.MinimumThroughput = 100;
    options.CircuitBreaker.SamplingDuration = TimeSpan.FromSeconds(30);
    options.CircuitBreaker.BreakDuration = TimeSpan.FromSeconds(30);

    // Per-attempt timeout: 10 seconds
    options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(10);
});

Copy/paste artifact: resilience checklist

Resilience Configuration Checklist

1. Retry configuration
   - [ ] Exponential backoff enabled
   - [ ] Jitter enabled to prevent thundering herd
   - [ ] Max retries appropriate for SLA (typically 2-5)
   - [ ] Non-idempotent methods excluded or have idempotency keys

2. Circuit breaker configuration
   - [ ] Failure ratio tuned for traffic level
   - [ ] MinimumThroughput prevents false positives on low traffic
   - [ ] BreakDuration gives downstream time to recover
   - [ ] BrokenCircuitException handled gracefully in calling code

3. Timeout configuration
   - [ ] Total timeout covers entire operation including retries
   - [ ] Per-attempt timeout prevents single slow request consuming budget
   - [ ] Timeouts aligned with SLA requirements

4. Error handling
   - [ ] 4xx errors not retried (except 408, 429)
   - [ ] BrokenCircuitException returns appropriate error to caller
   - [ ] TimeoutRejectedException handled gracefully

5. Observability
   - [ ] Retry attempts logged
   - [ ] Circuit state changes logged
   - [ ] Metrics exposed for monitoring
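
For item 5, pipelines registered through Microsoft.Extensions.Http.Resilience emit logs and metrics by default; if you also want explicit application-level logging, the strategy callbacks are the hook. A sketch (the client type and the Console.WriteLine destination are illustrative; use your logger in practice):

builder.Services.AddHttpClient<IInventoryService, InventoryService>()
    .AddResilienceHandler("observable-pipeline", builder =>
    {
        builder.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            OnRetry = args =>
            {
                // Log every retry with its attempt number and delay
                Console.WriteLine(
                    $"Retry {args.AttemptNumber} after {args.RetryDelay.TotalSeconds:0.0}s");
                return ValueTask.CompletedTask;
            }
        });

        builder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            OnOpened = _ =>
            {
                Console.WriteLine("Inventory circuit opened");
                return ValueTask.CompletedTask;
            },
            OnClosed = _ =>
            {
                Console.WriteLine("Inventory circuit closed");
                return ValueTask.CompletedTask;
            }
        });
    });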

Common failure modes

  1. Retry storm: retrying without jitter causes all clients to retry simultaneously, overwhelming the recovering service.
  2. Circuit never opens: MinimumThroughput too high for actual traffic, so failure ratio never calculated.
  3. Circuit never closes: BreakDuration too short, half-open tests fail, circuit reopens immediately.
  4. Timeout too aggressive: per-attempt timeout shorter than normal response time causes constant failures.
  5. Retrying non-idempotent operations: creates duplicate orders, payments, or other side effects.

Checklist

  • Standard resilience handler added to HTTP clients.
  • Jitter enabled on retry policies.
  • Non-idempotent operations excluded from retry or use idempotency keys.
  • Circuit breaker thresholds tuned for traffic level.
  • BrokenCircuitException handled in calling code.
  • Timeouts configured for both total operation and per-attempt.

FAQ

Should I use Microsoft.Extensions.Http.Polly or Microsoft.Extensions.Http.Resilience?

Use Microsoft.Extensions.Http.Resilience. It is built on Polly v8, integrates better with .NET's resilience infrastructure, and supersedes the older Microsoft.Extensions.Http.Polly package, which is no longer the recommended approach for new code.

What is the difference between Polly v7 and v8?

Polly v8 introduced a new API based on ResiliencePipeline instead of Policy. The new API is more composable and integrates with Microsoft.Extensions.Resilience. The concepts (retry, circuit breaker, timeout) remain the same.

How do I know if my circuit breaker thresholds are correct?

Monitor your circuit breaker state transitions. If the circuit opens too frequently on minor issues, increase MinimumThroughput or FailureRatio. If it never opens during actual outages, decrease thresholds.

Should I retry database operations?

Yes, for transient failures like connection timeouts. EF Core has built-in retry with EnableRetryOnFailure(). For raw ADO.NET, wrap in a Polly retry policy. Ensure operations are idempotent or use transactions.
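
With EF Core and the SQL Server provider, for example, the built-in retrying execution strategy is enabled when configuring the context (the DbContext type and connection string name are illustrative):

builder.Services.AddDbContext<OrdersDbContext>(options =>
    options.UseSqlServer(
        builder.Configuration.GetConnectionString("Orders"),
        // Retry transient SQL errors up to 3 times
        sqlOptions => sqlOptions.EnableRetryOnFailure(maxRetryCount: 3)));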

How do I test resilience patterns?

Use chaos engineering approaches. Inject failures in test environments using tools like Simmy (Polly's chaos engineering extension) or configure test doubles that fail intermittently. Verify that retries happen, circuits open, and timeouts trigger as expected.
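
One lightweight test double is a DelegatingHandler that fails a configurable fraction of requests; a sketch you can register beneath the resilience handler in a test host (the class name and failure rate are illustrative):

public sealed class FlakyHandler : DelegatingHandler
{
    private readonly double _failureRate;
    private readonly Random _random = new();

    public FlakyHandler(double failureRate) => _failureRate = failureRate;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Fail a fraction of requests with 503 to exercise retry and circuit breaker
        if (_random.NextDouble() < _failureRate)
        {
            return Task.FromResult(
                new HttpResponseMessage(HttpStatusCode.ServiceUnavailable));
        }

        return base.SendAsync(request, cancellationToken);
    }
}

// In the test's service registration: the resilience handler wraps the flaky
// handler, so retries and the circuit breaker see its injected failures
services.AddHttpClient<IOrderService, OrderService>()
    .AddStandardResilienceHandler()
    .AddHttpMessageHandler(() => new FlakyHandler(0.3));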

What about distributed circuit breakers?

The patterns in this article are per-instance. For distributed circuit breakers (shared state across multiple service instances), consider external state stores like Redis or purpose-built solutions. This adds complexity and is often unnecessary for most applications.

What to do next

Add Microsoft.Extensions.Http.Resilience to your project and apply .AddStandardResilienceHandler() to your HTTP clients. Review any existing retry logic for idempotency concerns.

For more on building production-quality ASP.NET Core applications, read Async/Await Pitfalls: The Deadlocks That Ship to Production.

If you want help implementing resilience patterns in your architecture, reach out via Contact.

Author notes

Decisions:

  • Focus on Microsoft.Extensions.Http.Resilience over raw Polly. Rationale: it's the modern, supported approach with better HttpClientFactory integration.
  • Recommend standard resilience handler as default. Rationale: production-ready defaults, less configuration burden.
  • Emphasize idempotency for retry safety. Rationale: retrying non-idempotent operations is a common production bug.

Observations:

  • Teams often add retry without circuit breaker, causing retry storms during outages.
  • Circuit breaker thresholds copied from tutorials without adjusting for actual traffic patterns.
  • Non-idempotent POST operations retried, causing duplicate records.
  • Timeouts set without considering retry delays, causing unexpected total wait times.