Distributed systems fail. The question is not if, but how gracefully. The anti-patterns in this article are resilience code that compiles, looks correct in review, and then makes outages worse instead of better.
Common questions this answers
- Why do my retries make downstream services slower?
- How do circuit breakers cause outages instead of preventing them?
- What timeout settings actually work in production?
- When do retries help vs. hurt?
- How do I build fallbacks that maintain user experience?
Definition (what this means in practice)
Resilience anti-patterns are fault-handling strategies that fail to achieve their goals or make failures worse. They often look harmless during normal operation but cause cascading failures during partial outages.
In practice, this means reviewing retry policies, timeout configurations, circuit breaker thresholds, and fallback strategies before they're tested by production incidents.
Terms used
- Transient failure: a temporary condition that clears on its own (network glitch, service restart)
- Retry storm: exponential growth of retry requests overwhelming a recovering service
- Circuit breaker: stops calling a failing service to give it time to recover
- Fallback: alternative behavior when primary operation fails
- Bulkhead: isolation pattern preventing one failure from affecting others
Reader contract
This article is for:
- Engineers building services that call external dependencies
- Teams experiencing cascading failures during partial outages
You will leave with:
- Recognition of 6 resilience anti-patterns
- Polly policy configurations that actually work
- Detection strategies for broken resilience patterns
This is not for:
- Basic HTTP client usage (assumes familiarity with HttpClient)
- Polly introduction (assumes understanding of Polly concepts)
Quick start (10 minutes)
If you do nothing else, review your HTTP client configurations:
- Check that retries have exponential backoff with jitter
- Verify timeouts are set at both HTTP client and policy levels
- Confirm circuit breakers have realistic thresholds
- Ensure retries only happen for idempotent operations
- Look for missing fallbacks on critical paths
// Minimum viable resilience configuration
builder.Services.AddHttpClient<IExternalService, ExternalService>()
.AddStandardResilienceHandler(); // .NET 8+ built-in
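    // (AddStandardResilienceHandler ships in the Microsoft.Extensions.Http.Resilience NuGet package)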
// Or explicit Polly configuration
builder.Services.AddHttpClient<IExternalService, ExternalService>()
.AddPolicyHandler(GetRetryPolicy())
.AddPolicyHandler(GetCircuitBreakerPolicy());
Anti-pattern 1: Retry without backoff
Immediate retries hammer a struggling service, making recovery harder.
The problem
// BAD: Immediate retries
builder.Services.AddHttpClient<IOrderService, OrderService>()
.AddTransientHttpErrorPolicy(p => p.RetryAsync(3));
Why it fails
When a service is struggling under load:
- Request fails
- Immediate retry adds more load
- Service becomes more overloaded
- More requests fail
- More immediate retries
- Service collapses
With 3 retries, each client sends 4 requests instead of 1; three clients turn 3 requests into 12, all landing on a service that is already struggling.
The fix
Use exponential backoff with jitter:
// GOOD: Exponential backoff with jitter
static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
return HttpPolicyExtensions
.HandleTransientHttpError()
.WaitAndRetryAsync(
retryCount: 3,
sleepDurationProvider: retryAttempt =>
TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)) // 2, 4, 8 seconds
+ TimeSpan.FromMilliseconds(Random.Shared.Next(0, 1000)), // Jitter
onRetry: (outcome, timespan, retryAttempt, context) =>
{
// Log retry attempts
});
}
Detection
Search for RetryAsync without WaitAndRetryAsync:
grep -rn "RetryAsync" --include="*.cs" | grep -v "WaitAndRetry"
Anti-pattern 2: Retry non-idempotent operations
Retrying operations that aren't idempotent can cause duplicate side effects.
The problem
// BAD: Retrying POST that may not be idempotent
builder.Services.AddHttpClient<IPaymentService, PaymentService>()
.AddTransientHttpErrorPolicy(p =>
p.WaitAndRetryAsync(3, _ => TimeSpan.FromSeconds(2)));
// PaymentService
public async Task<PaymentResult> ChargeAsync(PaymentRequest request)
{
// If first request succeeds but response is lost,
// retry charges customer twice!
var response = await _httpClient.PostAsJsonAsync("/charge", request);
return await response.Content.ReadFromJsonAsync<PaymentResult>();
}
Why it fails
Consider this sequence:
- Client sends payment request
- Server processes payment, charges card
- Network timeout before response reaches client
- Client retries
- Server charges card again
Customer is charged twice. The operation succeeded, but the client didn't know.
The fix
Use idempotency keys and conditional retries:
// GOOD: Idempotency key prevents duplicate processing
public async Task<PaymentResult> ChargeAsync(PaymentRequest request)
{
// Generate or use provided idempotency key
var idempotencyKey = request.IdempotencyKey ?? Guid.NewGuid().ToString();
var httpRequest = new HttpRequestMessage(HttpMethod.Post, "/charge")
{
Content = JsonContent.Create(request)
};
httpRequest.Headers.Add("Idempotency-Key", idempotencyKey);
var response = await _httpClient.SendAsync(httpRequest);
return await response.Content.ReadFromJsonAsync<PaymentResult>();
}
// Only retry safe methods automatically
static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
return Policy<HttpResponseMessage>
.HandleResult(r => !r.IsSuccessStatusCode &&
r.StatusCode != HttpStatusCode.BadRequest && // Don't retry 400
r.StatusCode != HttpStatusCode.Conflict) // Don't retry 409
.WaitAndRetryAsync(
retryCount: 3,
sleepDurationProvider: retryAttempt =>
TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
onRetryAsync: async (outcome, timespan, retryAttempt, context) =>
{
// Only allow retry if operation is idempotent
if (context.TryGetValue("IsIdempotent", out var isIdempotent)
&& isIdempotent is false)
{
throw new InvalidOperationException(
"Cannot retry non-idempotent operation");
}
});
}
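The Idempotency-Key header only helps if the server honors it. A minimal server-side sketch, assuming hypothetical IIdempotencyStore and IPaymentProcessor abstractions (not part of the article's code), that replays the stored result instead of charging twice:
// Server side: replay the stored result when the same Idempotency-Key arrives again
public class ChargeEndpoint(IIdempotencyStore store, IPaymentProcessor processor)
{
    public async Task<PaymentResult> HandleAsync(PaymentRequest request, string idempotencyKey)
    {
        // Already processed this key? Return the original result without charging again.
        var existing = await store.GetResultAsync(idempotencyKey);
        if (existing is not null)
            return existing;

        // First time we see this key: charge the card and record the outcome.
        // Production code needs an atomic check-and-set here to handle concurrent duplicates.
        var result = await processor.ChargeAsync(request);
        await store.SaveResultAsync(idempotencyKey, result);
        return result;
    }
}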
Detection
Review retry policies for POST/PUT/PATCH operations:
// Red flags:
// - Global retry policy on HttpClient used for mutations
// - No idempotency headers on retry-able requests
// - Retrying 500 errors on payment/order endpoints
Anti-pattern 3: Circuit breaker with wrong thresholds
Circuit breakers with incorrect thresholds either never open or open too easily.
The problem
// BAD: Opens after one failure
var circuitBreaker = Policy
.HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
.CircuitBreakerAsync(
handledEventsAllowedBeforeBreaking: 1,
durationOfBreak: TimeSpan.FromMinutes(1));
// BAD: Opens too late to help
var circuitBreaker = Policy
.HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
.CircuitBreakerAsync(
handledEventsAllowedBeforeBreaking: 1000,
durationOfBreak: TimeSpan.FromSeconds(5));
Why it fails
Circuit breaker opens after 1 failure:
- A single network glitch opens the circuit
- Service appears "down" when it's fine
- False positives cause unnecessary fallback activations
Circuit breaker opens after 1000 failures:
- By the time it opens, the downstream service is overwhelmed
- Not enough protection during actual outages
- The circuit never opens during realistic failure scenarios
The fix
Use failure rate thresholds appropriate for your traffic:
// GOOD: Percentage-based with minimum throughput
static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
{
return HttpPolicyExtensions
.HandleTransientHttpError()
.AdvancedCircuitBreakerAsync(
failureThreshold: 0.5, // 50% failure rate
samplingDuration: TimeSpan.FromSeconds(30),
minimumThroughput: 10, // Need 10 requests minimum
durationOfBreak: TimeSpan.FromSeconds(30),
onBreak: (outcome, breakDelay) =>
{
// Log circuit opened
},
onReset: () =>
{
// Log circuit closed
});
}
Configuration guidance
| Traffic Level | Failure Threshold | Sampling Duration | Min Throughput |
|---|---|---|---|
| Low (< 10 req/min) | 0.25 - 0.5 | 60s | 5-10 |
| Medium (10-100/min) | 0.3 - 0.5 | 30s | 10-20 |
| High (> 100/min) | 0.1 - 0.3 | 10-30s | 20-50 |
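For example, a low-traffic internal service (a few requests per minute) could sit at the bottom of that range. An illustrative configuration using the same AdvancedCircuitBreakerAsync call shown above, with values taken from the first table row:
// Low-traffic service: sample over a longer window and require a minimum sample size
var lowTrafficBreaker = HttpPolicyExtensions
    .HandleTransientHttpError()
    .AdvancedCircuitBreakerAsync(
        failureThreshold: 0.5,                      // 50% of sampled requests failing
        samplingDuration: TimeSpan.FromSeconds(60), // Longer window for sparse traffic
        minimumThroughput: 5,                       // Don't judge on fewer than 5 requests
        durationOfBreak: TimeSpan.FromSeconds(30));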
Detection
Check circuit breaker configurations:
// Red flags:
// - handledEventsAllowedBeforeBreaking: 1 or 2
// - handledEventsAllowedBeforeBreaking: > 100
// - No minimum throughput on low-traffic services
// - durationOfBreak: less than 10 seconds
Anti-pattern 4: Missing or incorrect timeouts
Without proper timeouts, stuck requests consume resources indefinitely.
The problem
// BAD: No timeout - request can hang forever
builder.Services.AddHttpClient<ISlowService, SlowService>();
// BAD: Timeout longer than retry total
builder.Services.AddHttpClient<ISlowService, SlowService>(client =>
{
client.Timeout = TimeSpan.FromMinutes(5); // Way too long
});
// BAD: Timeout only at one level
builder.Services.AddHttpClient<ISlowService, SlowService>(client =>
{
client.Timeout = TimeSpan.FromSeconds(30);
})
.AddPolicyHandler(Policy.TimeoutAsync<HttpResponseMessage>(
TimeSpan.FromMinutes(2))); // Policy timeout longer than client!
Why it fails
- No timeout: threads/connections accumulate during outage
- Timeout too long: slow requests block request threads
- Mismatched timeouts: unpredictable behavior
The default HttpClient.Timeout is 100 seconds, which is far too long for most API calls.
The fix
Layer timeouts appropriately:
// GOOD: Layered timeouts with proper ordering
builder.Services.AddHttpClient<IExternalService, ExternalService>(client =>
{
client.Timeout = TimeSpan.FromSeconds(30); // Overall client timeout
})
.AddPolicyHandler(GetRetryPolicy())          // Retries with backoff (outermost handler)
.AddPolicyHandler(GetCircuitBreakerPolicy()) // Fails fast when the circuit is open
.AddPolicyHandler(Policy.TimeoutAsync<HttpResponseMessage>(
    TimeSpan.FromSeconds(10)));              // Per-attempt timeout (innermost handler)
// Timeout layering (outer to inner):
// 1. HttpClient.Timeout - caps the whole operation, retries included (30s)
// 2. Retry policy - wraps every attempt and its backoff
// 3. Circuit breaker - fails fast when open
// 4. Per-attempt timeout - caps each individual attempt (10s)
// Handlers registered first sit outermost, so the per-attempt timeout is added last.
Timeout math
Worst-case total = per_attempt_timeout * (retries + 1) + sum of backoff delays
With 3 retries, a 10s per-attempt timeout, and exponential backoff (2, 4, 8s):
Worst case = 10 + 2 + 10 + 4 + 10 + 8 + 10 = 54 seconds
Set HttpClient.Timeout slightly above the worst case, or trim retries and per-attempt timeouts until the worst case fits under it. A 30-second client timeout with the settings above would cut the later attempts short.
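A quick way to sanity-check those numbers in review is to compute them; a small helper, not from the article, that makes the worst case explicit:
// Worst case: every attempt runs to its timeout, plus every backoff delay in between
static TimeSpan WorstCaseTotal(TimeSpan perAttemptTimeout, IReadOnlyList<TimeSpan> backoffs)
{
    var attempts = backoffs.Count + 1; // initial attempt plus one per backoff
    var total = TimeSpan.FromTicks(perAttemptTimeout.Ticks * attempts);
    foreach (var delay in backoffs)
        total += delay;
    return total;
}

// Example from above: 10s per attempt with 2, 4, 8 second backoffs => 54 seconds
var worstCase = WorstCaseTotal(
    TimeSpan.FromSeconds(10),
    new[] { TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(4), TimeSpan.FromSeconds(8) });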
Detection
grep -rn "AddHttpClient" --include="*.cs"
# Then check if Timeout is configured
Anti-pattern 5: No fallback for critical paths
When downstream services fail, users see errors instead of degraded functionality.
The problem
// BAD: No fallback - failure bubbles to user
public async Task<ProductDetails> GetProductAsync(int productId)
{
var response = await _httpClient.GetAsync($"/products/{productId}");
response.EnsureSuccessStatusCode();
return await response.Content.ReadFromJsonAsync<ProductDetails>();
}
Why it fails
When the product service is down:
- User sees "Service Unavailable" error
- They can't complete their purchase
- Revenue lost during outage
The fix
Implement fallbacks for critical paths:
// GOOD: Fallback maintains user experience
public async Task<ProductDetails?> GetProductAsync(int productId)
{
var policy = Policy<ProductDetails?>
.Handle<HttpRequestException>()
.OrResult(r => r == null)
.FallbackAsync(
fallbackAction: async (context, ct) => await GetCachedProductAsync(productId),
onFallbackAsync: async (outcome, context) =>
{
_logger.LogWarning(
"Product service unavailable, using cache for {ProductId}",
productId);
});
return await policy.ExecuteAsync(async () =>
{
var response = await _httpClient.GetAsync($"/products/{productId}");
if (!response.IsSuccessStatusCode)
return null;
return await response.Content.ReadFromJsonAsync<ProductDetails>();
});
}
private async Task<ProductDetails?> GetCachedProductAsync(int productId)
{
// Return cached version, even if stale
return await _cache.GetAsync<ProductDetails>($"product:{productId}");
}
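The fallback only has something to serve if the cache gets populated. A sketch of one approach, assuming the same hypothetical _cache abstraction also exposes a SetAsync(key, value, ttl) method: refresh the cache on every successful read.
// Write-through on success so the fallback has data to serve during an outage
private async Task<ProductDetails?> FetchAndCacheProductAsync(int productId)
{
    var response = await _httpClient.GetAsync($"/products/{productId}");
    if (!response.IsSuccessStatusCode)
        return null;

    var product = await response.Content.ReadFromJsonAsync<ProductDetails>();
    if (product is not null)
    {
        // Long TTL: stale data is acceptable here, which is the point of the fallback
        await _cache.SetAsync($"product:{productId}", product, TimeSpan.FromHours(24));
    }
    return product;
}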
Fallback strategies
| Scenario | Fallback Strategy |
|---|---|
| Product catalog | Return cached data (stale is better than error) |
| Search results | Return empty results with "search unavailable" message |
| Recommendations | Return popular/static recommendations |
| User profile | Return minimal profile with defaults |
| Payment | Queue for later processing, confirm tentatively |
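As an example of the empty-results row, a sketch using hypothetical SearchResults and SearchItem types (and a _searchClient that is not part of the article) to turn a search outage into a degraded but working page:
// Search fallback: empty result set plus a flag the UI can render as a banner
var searchFallback = Policy<SearchResults>
    .Handle<HttpRequestException>()
    .FallbackAsync(new SearchResults
    {
        Items = Array.Empty<SearchItem>(),
        Degraded = true // UI shows "search is temporarily unavailable"
    });

var results = await searchFallback.ExecuteAsync(() => _searchClient.SearchAsync(query));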
Detection
Review error handling for external calls:
// Red flags:
// - EnsureSuccessStatusCode() without fallback
// - Exceptions propagating to controller
// - No cached alternative for read operations
Anti-pattern 6: Retry storms
Multiple layers retrying calls to the same failing downstream service multiply the load exponentially.
The problem
Service A (3 retries) -> Service B (3 retries) -> Service C (down)
One request to A generates: 1 + 3 = 4 attempts to B
Each attempt to B generates: 1 + 3 = 4 attempts to C
Total attempts to C: 4 * 4 = 16
With 10 concurrent users: 160 requests hammering C
Why it fails
Each service layer multiplies retry attempts. A temporary failure becomes overwhelming load when the service tries to recover.
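The amplification factor is worth computing before it bites; a tiny helper, not from the article, that makes the multiplication explicit:
// Attempts reaching the bottom service = (retries per hop + 1) ^ number of hops
static double RetryAmplification(int retriesPerHop, int hops) =>
    Math.Pow(retriesPerHop + 1, hops);

// A -> B -> C with 3 retries per hop: 4^2 = 16 attempts reach C
// Add another hop (A -> B -> C -> D): 4^3 = 64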
The fix
Limit total retry time across the chain:
// GOOD: Request-scoped retry budget
public class RetryBudget
{
public TimeSpan RemainingBudget { get; private set; }
public int RemainingAttempts { get; private set; }
public RetryBudget(TimeSpan totalBudget, int maxAttempts)
{
RemainingBudget = totalBudget;
RemainingAttempts = maxAttempts;
}
public bool CanRetry(TimeSpan attemptDuration)
{
RemainingBudget -= attemptDuration;
RemainingAttempts--;
return RemainingBudget > TimeSpan.Zero && RemainingAttempts > 0;
}
}
// Propagate budget through call chain
public class OrderService(HttpClient httpClient)
{
public async Task<Order> GetOrderAsync(int id, RetryBudget budget)
{
// Check budget before each retry
var policy = Policy<Order?>
.Handle<HttpRequestException>()
.RetryAsync(
retryCount: 3,
onRetry: (outcome, attempt, context) =>
{
if (!budget.CanRetry(TimeSpan.FromSeconds(2)))
{
throw new TimeoutException("Retry budget exhausted");
}
});
return await policy.ExecuteAsync(async () =>
{
var response = await httpClient.GetAsync($"/orders/{id}");
response.EnsureSuccessStatusCode();
return await response.Content.ReadFromJsonAsync<Order>();
});
}
}
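A usage sketch for the budget above (the values are illustrative): create it once at the edge and pass the same instance down the call chain so all hops share it.
// At the API edge: one budget for the whole request, shared by every downstream call
var budget = new RetryBudget(totalBudget: TimeSpan.FromSeconds(10), maxAttempts: 3);
var order = await orderService.GetOrderAsync(id, budget);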
Better approach: use hedging
// GOOD: Hedging sends parallel requests, uses first success
builder.Services.AddHttpClient<IOrderService, OrderService>()
.AddStandardHedgingHandler()
.Configure(options =>
{
    // Hedged attempts run in parallel, so only hedge idempotent requests
    options.Hedging.MaxHedgedAttempts = 2;
    options.Hedging.Delay = TimeSpan.FromMilliseconds(200); // Start a second request if the first is slow
});
Detection
Review service-to-service retry configurations:
// Red flag: Multiple layers each with aggressive retries
// A -> B -> C -> D with 3 retries each: every hop multiplies attempts by 4, so 4 x 4 x 4 = 64x amplification at D
Copy/paste artifact: production-ready resilience configuration
// Program.cs
using Microsoft.Extensions.Http.Resilience;
using Polly;                   // Policy, IAsyncPolicy, DelayBackoffType
using Polly.Extensions.Http;   // HttpPolicyExtensions
// .NET 8+ standard resilience (recommended)
builder.Services.AddHttpClient<IExternalService, ExternalService>()
.AddStandardResilienceHandler(options =>
{
// Retry: exponential backoff with jitter
options.Retry.MaxRetryAttempts = 3;
options.Retry.Delay = TimeSpan.FromSeconds(1);
options.Retry.UseJitter = true;
options.Retry.BackoffType = DelayBackoffType.Exponential;
// Circuit breaker: 50% failure rate over 30s
options.CircuitBreaker.FailureRatio = 0.5;
options.CircuitBreaker.SamplingDuration = TimeSpan.FromSeconds(30);
options.CircuitBreaker.MinimumThroughput = 10;
options.CircuitBreaker.BreakDuration = TimeSpan.FromSeconds(30);
// Timeout: 30s total, 10s per attempt
options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(30);
options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(10);
});
// Or explicit Polly configuration
builder.Services.AddHttpClient<IOrderService, OrderService>(client =>
{
client.Timeout = TimeSpan.FromSeconds(60); // Above the ~55s worst case (4 attempts x 10s + backoff + jitter)
})
.AddPolicyHandler(GetRetryPolicy())
.AddPolicyHandler(GetCircuitBreakerPolicy())
.AddPolicyHandler(Policy.TimeoutAsync<HttpResponseMessage>(
TimeSpan.FromSeconds(10)));
static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
return HttpPolicyExtensions
.HandleTransientHttpError()
.WaitAndRetryAsync(
retryCount: 3,
sleepDurationProvider: retryAttempt =>
TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))
+ TimeSpan.FromMilliseconds(Random.Shared.Next(0, 1000)));
}
static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
{
return HttpPolicyExtensions
.HandleTransientHttpError()
.AdvancedCircuitBreakerAsync(
failureThreshold: 0.5,
samplingDuration: TimeSpan.FromSeconds(30),
minimumThroughput: 10,
durationOfBreak: TimeSpan.FromSeconds(30));
}
Copy/paste artifact: resilience code review checklist
Resilience Code Review Checklist
1. Retry policies
- [ ] Exponential backoff with jitter (not immediate retries)
- [ ] Only idempotent operations retry automatically
- [ ] Non-idempotent operations use idempotency keys
- [ ] Retry count is reasonable (2-5)
2. Circuit breakers
- [ ] Failure threshold is percentage-based
- [ ] Minimum throughput prevents false positives
- [ ] Break duration allows recovery (15-60s)
- [ ] Logging on break/reset
3. Timeouts
- [ ] HttpClient.Timeout is set explicitly
- [ ] Per-attempt timeout < total timeout
- [ ] Timeout math accounts for retries + backoff
4. Fallbacks
- [ ] Critical paths have fallback behavior
- [ ] Fallbacks are logged
- [ ] Cached data used when fresh data unavailable
5. Retry storms
- [ ] Multi-hop calls have retry budget
- [ ] Total amplification factor is acceptable
- [ ] Hedging used instead of retries where appropriate
Common failure modes
- Retry storm: Multiple layers retrying create 10-100x load amplification
- Timeout cascade: Missing timeouts cause thread pool exhaustion
- Circuit never opens: Thresholds too high for actual traffic
- False circuit opens: Thresholds too low, service "flaps"
- Silent degradation: No fallbacks, no logging, users just see errors
Checklist
- All HTTP clients have explicit timeouts
- Retries use exponential backoff with jitter
- Circuit breakers use percentage-based thresholds
- Non-idempotent operations don't retry automatically
- Critical paths have fallback behavior
- Retry amplification is calculated for multi-hop calls
FAQ
Should I always retry on 5xx errors?
No. 500 Internal Server Error might indicate a bug that will fail every time. 503 Service Unavailable and 429 Too Many Requests are typically safe to retry. 502/504 gateway errors are often transient.
How long should circuit breaker stay open?
Start with 30 seconds. This gives the downstream service time to recover without keeping the circuit open so long that recovery is delayed.
What's the difference between retry and hedging?
Retry waits for failure, then tries again. Hedging starts parallel requests proactively (before knowing if the first will fail) and uses whichever responds first.
Should every HTTP client have resilience policies?
Yes, with appropriate configuration. Internal services might need different settings than external APIs. But every external call should have at least a timeout.
How do I test resilience policies?
Use chaos engineering tools (Polly's Simmy, or inject faults via middleware). Test: single failures, sustained failures, slow responses, partial failures.
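One lightweight way to inject faults via middleware is a chaos DelegatingHandler registered only in test environments; a sketch (the handler name and fault rate are made up):
// Fails a fraction of outgoing requests so resilience policies actually get exercised
public class ChaosHandler : DelegatingHandler
{
    private readonly double _faultRate;

    public ChaosHandler(double faultRate = 0.2) => _faultRate = faultRate;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (Random.Shared.NextDouble() < _faultRate)
        {
            // Simulate a transient downstream failure without touching the network
            return new HttpResponseMessage(HttpStatusCode.ServiceUnavailable)
            {
                RequestMessage = request
            };
        }
        return await base.SendAsync(request, cancellationToken);
    }
}

// Registered after the resilience handler, so injected faults flow up through the policies:
// var clientBuilder = builder.Services.AddHttpClient<IExternalService, ExternalService>();
// clientBuilder.AddStandardResilienceHandler();
// clientBuilder.AddHttpMessageHandler(() => new ChaosHandler(faultRate: 0.3));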
What to do next
Review your HTTP client configurations today. Verify every external call has an explicit timeout and appropriate retry policy.
For more on building production-quality ASP.NET Core applications, read Health Checks & Graceful Shutdown.
If you want help reviewing resilience patterns in your codebase, reach out via Contact.
References
- Build resilient HTTP apps
- HTTP resiliency (Microsoft.Extensions.Http.Resilience)
- Polly documentation
- Implement HTTP call retries with exponential backoff
- Implement the Circuit Breaker pattern
- Microsoft.Extensions.Http.Resilience
Author notes
Decisions:
- Recommend the .NET 8+ AddStandardResilienceHandler as the primary approach. Rationale: built-in, well-tested defaults.
- Always include jitter in retry backoff. Rationale: prevents a synchronized retry thundering herd.
- Use percentage-based circuit breakers. Rationale: adapts to traffic volume automatically.
Observations:
- Teams often add retries after first timeout incident, creating retry storms.
- Circuit breaker thresholds copied from examples without adjusting for traffic.
- Fallbacks added during outages, not during design.
- Timeout issues often misdiagnosed as "slow services" instead of missing timeouts.