Distributed systems fail. The question is not if, but how gracefully. The anti-patterns in this article are resilience code that compiles, looks correct in review, and then makes outages worse instead of better.
Common questions this answers
- Why do my retries make downstream services slower?
- How do circuit breakers cause outages instead of preventing them?
- What timeout settings actually work in production?
- When do retries help vs. hurt?
- How do I build fallbacks that maintain user experience?
Definition (what this means in practice)
Resilience anti-patterns are fault-handling strategies that fail to achieve their goals or make failures worse. They often look harmless during normal operation but cause cascading failures during partial outages.
In practice, this means reviewing retry policies, timeout configurations, circuit breaker thresholds, and fallback strategies before they're tested by production incidents.
Terms used
- Transient failure: a temporary condition that clears on its own (network glitch, service restart)
- Retry storm: exponential growth of retry requests overwhelming a recovering service
- Circuit breaker: stops calling a failing service to give it time to recover
- Fallback: alternative behavior when primary operation fails
- Bulkhead: isolation pattern preventing one failure from affecting others
Reader contract
This article is for:
- Engineers building services that call external dependencies
- Teams experiencing cascading failures during partial outages
You will leave with:
- Recognition of 6 resilience anti-patterns
- Polly policy configurations that actually work
- Detection strategies for broken resilience patterns
This is not for:
- Basic HTTP client usage (assumes familiarity with HttpClient)
- Polly introduction (assumes understanding of Polly concepts)
Quick start (10 minutes)
If you do nothing else, review your HTTP client configurations:
- Check that retries have exponential backoff with jitter
- Verify timeouts are set at both HTTP client and policy levels
- Confirm circuit breakers have realistic thresholds
- Ensure retries only happen for idempotent operations
- Look for missing fallbacks on critical paths
// Minimum viable resilience configuration
builder.Services.AddHttpClient<IExternalService, ExternalService>()
.AddStandardResilienceHandler(); // .NET 8+ built-in
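    // (AddStandardResilienceHandler ships in the Microsoft.Extensions.Http.Resilience NuGet package)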
// Or explicit Polly configuration
builder.Services.AddHttpClient<IExternalService, ExternalService>()
.AddPolicyHandler(GetRetryPolicy())
.AddPolicyHandler(GetCircuitBreakerPolicy());
Anti-pattern 1: Retry without backoff
Immediate retries hammer a struggling service, making recovery harder.
The problem
// BAD: Immediate retries
builder.Services.AddHttpClient<IOrderService, OrderService>()
.AddTransientHttpErrorPolicy(p => p.RetryAsync(3));
Why it fails
When a service is struggling under load:
- Request fails
- Immediate retry adds more load
- Service becomes more overloaded
- More requests fail
- More immediate retries
- Service collapses
With 3 retries, each client sends 4 requests instead of 1; three clients turn 3 requests into 12, all landing on a service that is already struggling.
The fix
Use exponential backoff with jitter:
// GOOD: Exponential backoff with jitter
static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
return HttpPolicyExtensions
.HandleTransientHttpError()
.WaitAndRetryAsync(
retryCount: 3,
sleepDurationProvider: retryAttempt =>
TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)) // 2, 4, 8 seconds
+ TimeSpan.FromMilliseconds(Random.Shared.Next(0, 1000)), // Jitter
onRetry: (outcome, timespan, retryAttempt, context) =>
{
// Log retry attempts
});
}
Detection
Search for RetryAsync without WaitAndRetryAsync:
grep -rn "RetryAsync" --include="*.cs" | grep -v "WaitAndRetry"
Anti-pattern 2: Retry non-idempotent operations
Retrying operations that aren't idempotent can cause duplicate side effects.
The problem
// BAD: Retrying POST that may not be idempotent
builder.Services.AddHttpClient<IPaymentService, PaymentService>()
.AddTransientHttpErrorPolicy(p =>
p.WaitAndRetryAsync(3, _ => TimeSpan.FromSeconds(2)));
// PaymentService
public async Task<PaymentResult> ChargeAsync(PaymentRequest request)
{
// If first request succeeds but response is lost,
// retry charges customer twice!
var response = await _httpClient.PostAsJsonAsync("/charge", request);
return await response.Content.ReadFromJsonAsync<PaymentResult>();
}
Why it fails
Consider this sequence:
- Client sends payment request
- Server processes payment, charges card
- Network timeout before response reaches client
- Client retries
- Server charges card again
Customer is charged twice. The operation succeeded, but the client didn't know.
The fix
Use idempotency keys and conditional retries:
// GOOD: Idempotency key prevents duplicate processing
public async Task<PaymentResult> ChargeAsync(PaymentRequest request)
{
// Generate or use provided idempotency key
var idempotencyKey = request.IdempotencyKey ?? Guid.NewGuid().ToString();
var httpRequest = new HttpRequestMessage(HttpMethod.Post, "/charge")
{
Content = JsonContent.Create(request)
};
httpRequest.Headers.Add("Idempotency-Key", idempotencyKey);
var response = await _httpClient.SendAsync(httpRequest);
return await response.Content.ReadFromJsonAsync<PaymentResult>();
}
// Only retry safe methods automatically
static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
return Policy<HttpResponseMessage>
.HandleResult(r => !r.IsSuccessStatusCode &&
r.StatusCode != HttpStatusCode.BadRequest && // Don't retry 400
r.StatusCode != HttpStatusCode.Conflict) // Don't retry 409
.WaitAndRetryAsync(
retryCount: 3,
sleepDurationProvider: retryAttempt =>
TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
onRetryAsync: async (outcome, timespan, retryAttempt, context) =>
{
// Only allow retry if operation is idempotent
if (context.TryGetValue("IsIdempotent", out var isIdempotent)
&& isIdempotent is false)
{
throw new InvalidOperationException(
"Cannot retry non-idempotent operation");
}
});
}
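The Idempotency-Key header only helps if the server honors it. A minimal server-side sketch, assuming hypothetical IIdempotencyStore and IPaymentProcessor abstractions (not part of the article's code), that replays the stored result instead of charging twice:
// Server side: replay the stored result when the same Idempotency-Key arrives again
public class ChargeEndpoint(IIdempotencyStore store, IPaymentProcessor processor)
{
    public async Task<PaymentResult> HandleAsync(PaymentRequest request, string idempotencyKey)
    {
        // Already processed this key? Return the original result without charging again.
        var existing = await store.GetResultAsync(idempotencyKey);
        if (existing is not null)
            return existing;

        // First time we see this key: charge the card and record the outcome.
        // Production code needs an atomic check-and-set here to handle concurrent duplicates.
        var result = await processor.ChargeAsync(request);
        await store.SaveResultAsync(idempotencyKey, result);
        return result;
    }
}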
Detection
Review retry policies for POST/PUT/PATCH operations:
// Red flags:
// - Global retry policy on HttpClient used for mutations
// - No idempotency headers on retry-able requests
// - Retrying 500 errors on payment/order endpoints
Anti-pattern 3: Circuit breaker with wrong thresholds
Circuit breakers with incorrect thresholds either never open or open too easily.
The problem
// BAD: Opens after one failure
var circuitBreaker = Policy
.HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
.CircuitBreakerAsync(
handledEventsAllowedBeforeBreaking: 1,
durationOfBreak: TimeSpan.FromMinutes(1));
// BAD: Opens too late to help
var circuitBreaker = Policy
.HandleResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
.CircuitBreakerAsync(
handledEventsAllowedBeforeBreaking: 1000,
durationOfBreak: TimeSpan.FromSeconds(5));
Why it fails
Circuit breaker opens after 1 failure:
- A single network glitch opens the circuit
- Service appears "down" when it's fine
- False positives cause unnecessary fallback activations
Circuit breaker opens after 1000 failures:
- By the time it opens, the downstream service is overwhelmed
- Not enough protection during actual outages
- The circuit never opens during realistic failure scenarios
The fix
Use failure rate thresholds appropriate for your traffic:
// GOOD: Percentage-based with minimum throughput
static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
{
return HttpPolicyExtensions
.HandleTransientHttpError()
.AdvancedCircuitBreakerAsync(
failureThreshold: 0.5, // 50% failure rate
samplingDuration: TimeSpan.FromSeconds(30),
minimumThroughput: 10, // Need 10 requests minimum
durationOfBreak: TimeSpan.FromSeconds(30),
onBreak: (outcome, breakDelay) =>
{
// Log circuit opened
},
onReset: () =>
{
// Log circuit closed
});
}
Configuration guidance
| Traffic Level | Failure Threshold | Sampling Duration | Min Throughput |
|---|---|---|---|
| Low (< 10 req/min) | 0.25 - 0.5 | 60s | 5-10 |
| Medium (10-100/min) | 0.3 - 0.5 | 30s | 10-20 |
| High (> 100/min) | 0.1 - 0.3 | 10-30s | 20-50 |
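For example, a low-traffic internal service (a few requests per minute) could sit at the bottom of that range. An illustrative configuration using the same AdvancedCircuitBreakerAsync call shown above, with values taken from the first table row:
// Low-traffic service: sample over a longer window and require a minimum sample size
var lowTrafficBreaker = HttpPolicyExtensions
    .HandleTransientHttpError()
    .AdvancedCircuitBreakerAsync(
        failureThreshold: 0.5,                      // 50% of sampled requests failing
        samplingDuration: TimeSpan.FromSeconds(60), // Longer window for sparse traffic
        minimumThroughput: 5,                       // Don't judge on fewer than 5 requests
        durationOfBreak: TimeSpan.FromSeconds(30));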
Detection
Check circuit breaker configurations:
// Red flags:
// - handledEventsAllowedBeforeBreaking: 1 or 2
// - handledEventsAllowedBeforeBreaking: > 100
// - No minimum throughput on low-traffic services
// - durationOfBreak: less than 10 seconds
Anti-pattern 4: Missing or incorrect timeouts
Without proper timeouts, stuck requests consume resources indefinitely.
The problem
// BAD: No timeout - request can hang forever
builder.Services.AddHttpClient<ISlowService, SlowService>();
// BAD: Timeout longer than retry total
builder.Services.AddHttpClient<ISlowService, SlowService>(client =>
{
client.Timeout = TimeSpan.FromMinutes(5); // Way too long
});
// BAD: Timeout only at one level
builder.Services.AddHttpClient<ISlowService, SlowService>(client =>
{
client.Timeout = TimeSpan.FromSeconds(30);
})
.AddPolicyHandler(Policy.TimeoutAsync<HttpResponseMessage>(
TimeSpan.FromMinutes(2))); // Policy timeout longer than client!
Why it fails
- No timeout: threads/connections accumulate during outage
- Timeout too long: slow requests block request threads
- Mismatched timeouts: unpredictable behavior
The default HttpClient.Timeout is 100 seconds, which is far too long for most API calls.
The fix
Layer timeouts appropriately:
// GOOD: Layered timeouts with proper ordering
builder.Services.AddHttpClient<IExternalService, ExternalService>(client =>
{
client.Timeout = TimeSpan.FromSeconds(30); // Overall client timeout
})
.AddPolicyHandler(GetRetryPolicy())          // Retries with backoff (outermost handler)
.AddPolicyHandler(GetCircuitBreakerPolicy()) // Fails fast when the circuit is open
.AddPolicyHandler(Policy.TimeoutAsync<HttpResponseMessage>(
    TimeSpan.FromSeconds(10)));              // Per-attempt timeout (innermost handler)
// Timeout layering (outer to inner):
// 1. HttpClient.Timeout - caps the whole operation, retries included (30s)
// 2. Retry policy - wraps every attempt and its backoff
// 3. Circuit breaker - fails fast when open
// 4. Per-attempt timeout - caps each individual attempt (10s)
// Handlers registered first sit outermost, so the per-attempt timeout is added last.
Timeout math
Worst-case total = per_attempt_timeout * (retries + 1) + sum of backoff delays
With 3 retries, a 10s per-attempt timeout, and exponential backoff (2, 4, 8s):
Worst case = 10 + 2 + 10 + 4 + 10 + 8 + 10 = 54 seconds
Set HttpClient.Timeout slightly above the worst case, or trim retries and per-attempt timeouts until the worst case fits under it. A 30-second client timeout with the settings above would cut the later attempts short.
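A quick way to sanity-check those numbers in review is to compute them; a small helper, not from the article, that makes the worst case explicit:
// Worst case: every attempt runs to its timeout, plus every backoff delay in between
static TimeSpan WorstCaseTotal(TimeSpan perAttemptTimeout, IReadOnlyList<TimeSpan> backoffs)
{
    var attempts = backoffs.Count + 1; // initial attempt plus one per backoff
    var total = TimeSpan.FromTicks(perAttemptTimeout.Ticks * attempts);
    foreach (var delay in backoffs)
        total += delay;
    return total;
}

// Example from above: 10s per attempt with 2, 4, 8 second backoffs => 54 seconds
var worstCase = WorstCaseTotal(
    TimeSpan.FromSeconds(10),
    new[] { TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(4), TimeSpan.FromSeconds(8) });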
Detection
grep -rn "AddHttpClient" --include="*.cs"
# Then check if Timeout is configured
Anti-pattern 5: No fallback for critical paths
When downstream services fail, users see errors instead of degraded functionality.
The problem
// BAD: No fallback - failure bubbles to user
public async Task<ProductDetails> GetProductAsync(int productId)
{
var response = await _httpClient.GetAsync($"/products/{productId}");
response.EnsureSuccessStatusCode();
return await response.Content.ReadFromJsonAsync<ProductDetails>();
}
Why it fails
When the product service is down:
- User sees "Service Unavailable" error
- They can't complete their purchase
- Revenue lost during outage
The fix
Implement fallbacks for critical paths:
// GOOD: Fallback maintains user experience
public async Task<ProductDetails?> GetProductAsync(int productId)
{
var policy = Policy<ProductDetails?>
.Handle<HttpRequestException>()
.OrResult(r => r == null)
.FallbackAsync(
fallbackAction: async (context, ct) => await GetCachedProductAsync(productId),
onFallbackAsync: async (outcome, context) =>
{
_logger.LogWarning(
"Product service unavailable, using cache for {ProductId}",
productId);
});
return await policy.ExecuteAsync(async () =>
{
var response = await _httpClient.GetAsync($"/products/{productId}");
if (!response.IsSuccessStatusCode)
return null;
return await response.Content.ReadFromJsonAsync<ProductDetails>();
});
}
private async Task<ProductDetails?> GetCachedProductAsync(int productId)
{
// Return cached version, even if stale
return await _cache.GetAsync<ProductDetails>($"product:{productId}");
}
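The fallback only has something to serve if the cache gets populated. A sketch of one approach, assuming the same hypothetical _cache abstraction also exposes a SetAsync(key, value, ttl) method: refresh the cache on every successful read.
// Write-through on success so the fallback has data to serve during an outage
private async Task<ProductDetails?> FetchAndCacheProductAsync(int productId)
{
    var response = await _httpClient.GetAsync($"/products/{productId}");
    if (!response.IsSuccessStatusCode)
        return null;

    var product = await response.Content.ReadFromJsonAsync<ProductDetails>();
    if (product is not null)
    {
        // Long TTL: stale data is acceptable here, which is the point of the fallback
        await _cache.SetAsync($"product:{productId}", product, TimeSpan.FromHours(24));
    }
    return product;
}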
Fallback strategies
| Scenario | Fallback Strategy |
|---|---|
| Product catalog | Return cached data (stale is better than error) |
| Search results | Return empty results with "search unavailable" message |
| Recommendations | Return popular/static recommendations |
| User profile | Return minimal profile with defaults |
| Payment | Queue for later processing, confirm tentatively |
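As an example of the empty-results row, a sketch using hypothetical SearchResults and SearchItem types (and a _searchClient that is not part of the article) to turn a search outage into a degraded but working page:
// Search fallback: empty result set plus a flag the UI can render as a banner
var searchFallback = Policy<SearchResults>
    .Handle<HttpRequestException>()
    .FallbackAsync(new SearchResults
    {
        Items = Array.Empty<SearchItem>(),
        Degraded = true // UI shows "search is temporarily unavailable"
    });

var results = await searchFallback.ExecuteAsync(() => _searchClient.SearchAsync(query));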
Detection
Review error handling for external calls:
// Red flags:
// - EnsureSuccessStatusCode() without fallback
// - Exceptions propagating to controller
// - No cached alternative for read operations
Anti-pattern 6: Retry storms
Multiple layers retrying calls to the same failing downstream service multiply the load exponentially.
The problem
Service A (3 retries) -> Service B (3 retries) -> Service C (down)
One request to A generates: 1 + 3 = 4 attempts to B
Each attempt to B generates: 1 + 3 = 4 attempts to C
Total attempts to C: 4 * 4 = 16
With 10 concurrent users: 160 requests hammering C
Why it fails
Each service layer multiplies retry attempts. A temporary failure becomes overwhelming load when the service tries to recover.
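The amplification factor is worth computing before it bites; a tiny helper, not from the article, that makes the multiplication explicit:
// Attempts reaching the bottom service = (retries per hop + 1) ^ number of hops
static double RetryAmplification(int retriesPerHop, int hops) =>
    Math.Pow(retriesPerHop + 1, hops);

// A -> B -> C with 3 retries per hop: 4^2 = 16 attempts reach C
// Add another hop (A -> B -> C -> D): 4^3 = 64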
The fix
Limit total retry time across the chain:
// GOOD: Request-scoped retry budget
public class RetryBudget
{
public TimeSpan RemainingBudget { get; private set; }
public int RemainingAttempts { get; private set; }
public RetryBudget(TimeSpan totalBudget, int maxAttempts)
{
RemainingBudget = totalBudget;
RemainingAttempts = maxAttempts;
}
public bool CanRetry(TimeSpan attemptDuration)
{
RemainingBudget -= attemptDuration;
RemainingAttempts--;
return RemainingBudget > TimeSpan.Zero && RemainingAttempts > 0;
}
}
// Propagate budget through call chain
public class OrderService(HttpClient httpClient)
{
public async Task<Order> GetOrderAsync(int id, RetryBudget budget)
{
// Check budget before each retry
var policy = Policy<Order?>
.Handle<HttpRequestException>()
.RetryAsync(
retryCount: 3,
onRetry: (outcome, attempt, context) =>
{
if (!budget.CanRetry(TimeSpan.FromSeconds(2)))
{
throw new TimeoutException("Retry budget exhausted");
}
});
return await policy.ExecuteAsync(async () =>
{
var response = await httpClient.GetAsync($"/orders/{id}");
response.EnsureSuccessStatusCode();
return await response.Content.ReadFromJsonAsync<Order>();
});
}
}
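A usage sketch for the budget above (the values are illustrative): create it once at the edge and pass the same instance down the call chain so all hops share it.
// At the API edge: one budget for the whole request, shared by every downstream call
var budget = new RetryBudget(totalBudget: TimeSpan.FromSeconds(10), maxAttempts: 3);
var order = await orderService.GetOrderAsync(id, budget);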
Better approach: use hedging
// GOOD: Hedging sends parallel requests, uses first success
builder.Services.AddHttpClient<IOrderService, OrderService>()
.AddStandardHedgingHandler()
.Configure(options =>
{
    // Hedged attempts run in parallel, so only hedge idempotent requests
    options.Hedging.MaxHedgedAttempts = 2;
    options.Hedging.Delay = TimeSpan.FromMilliseconds(200); // Start a second request if the first is slow
});
Detection
Review service-to-service retry configurations:
// Red flag: Multiple layers each with aggressive retries
// A -> B -> C -> D with 3 retries each: every hop multiplies attempts by 4, so 4 x 4 x 4 = 64x amplification at D
Copy/paste artifact: production-ready resilience configuration
// Program.cs
using Microsoft.Extensions.Http.Resilience;
using Polly;                   // Policy, IAsyncPolicy, DelayBackoffType
using Polly.Extensions.Http;   // HttpPolicyExtensions
// .NET 8+ standard resilience (recommended)
builder.Services.AddHttpClient<IExternalService, ExternalService>()
.AddStandardResilienceHandler(options =>
{
// Retry: exponential backoff with jitter
options.Retry.MaxRetryAttempts = 3;
options.Retry.Delay = TimeSpan.FromSeconds(1);
options.Retry.UseJitter = true;
options.Retry.BackoffType = DelayBackoffType.Exponential;
// Circuit breaker: 50% failure rate over 30s
options.CircuitBreaker.FailureRatio = 0.5;
options.CircuitBreaker.SamplingDuration = TimeSpan.FromSeconds(30);
options.CircuitBreaker.MinimumThroughput = 10;
options.CircuitBreaker.BreakDuration = TimeSpan.FromSeconds(30);
// Timeout: 30s total, 10s per attempt
options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(30);
options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(10);
});
// Or explicit Polly configuration
builder.Services.AddHttpClient<IOrderService, OrderService>(client =>
{
client.Timeout = TimeSpan.FromSeconds(60); // Above the ~55s worst case (4 attempts x 10s + backoff + jitter)
})
.AddPolicyHandler(GetRetryPolicy())
.AddPolicyHandler(GetCircuitBreakerPolicy())
.AddPolicyHandler(Policy.TimeoutAsync<HttpResponseMessage>(
TimeSpan.FromSeconds(10)));
static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
return HttpPolicyExtensions
.HandleTransientHttpError()
.WaitAndRetryAsync(
retryCount: 3,
sleepDurationProvider: retryAttempt =>
TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))
+ TimeSpan.FromMilliseconds(Random.Shared.Next(0, 1000)));
}
static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
{
return HttpPolicyExtensions
.HandleTransientHttpError()
.AdvancedCircuitBreakerAsync(
failureThreshold: 0.5,
samplingDuration: TimeSpan.FromSeconds(30),
minimumThroughput: 10,
durationOfBreak: TimeSpan.FromSeconds(30));
}
Copy/paste artifact: resilience code review checklist
Resilience Code Review Checklist
1. Retry policies
- [ ] Exponential backoff with jitter (not immediate retries)
- [ ] Only idempotent operations retry automatically
- [ ] Non-idempotent operations use idempotency keys
- [ ] Retry count is reasonable (2-5)
2. Circuit breakers
- [ ] Failure threshold is percentage-based
- [ ] Minimum throughput prevents false positives
- [ ] Break duration allows recovery (15-60s)
- [ ] Logging on break/reset
3. Timeouts
- [ ] HttpClient.Timeout is set explicitly
- [ ] Per-attempt timeout < total timeout
- [ ] Timeout math accounts for retries + backoff
4. Fallbacks
- [ ] Critical paths have fallback behavior
- [ ] Fallbacks are logged
- [ ] Cached data used when fresh data unavailable
5. Retry storms
- [ ] Multi-hop calls have retry budget
- [ ] Total amplification factor is acceptable
- [ ] Hedging used instead of retries where appropriate
Common failure modes
- Retry storm: Multiple layers retrying create 10-100x load amplification
- Timeout cascade: Missing timeouts cause thread pool exhaustion
- Circuit never opens: Thresholds too high for actual traffic
- False circuit opens: Thresholds too low, service "flaps"
- Silent degradation: No fallbacks, no logging, users just see errors
Checklist
- All HTTP clients have explicit timeouts
- Retries use exponential backoff with jitter
- Circuit breakers use percentage-based thresholds
- Non-idempotent operations don't retry automatically
- Critical paths have fallback behavior
- Retry amplification is calculated for multi-hop calls
FAQ
Should I always retry on 5xx errors?
No. 500 Internal Server Error might indicate a bug that will fail every time. 503 Service Unavailable and 429 Too Many Requests are typically safe to retry. 502/504 gateway errors are often transient.
How long should circuit breaker stay open?
Start with 30 seconds. This gives the downstream service time to recover without keeping the circuit open so long that recovery is delayed.
What's the difference between retry and hedging?
Retry waits for failure, then tries again. Hedging starts parallel requests proactively (before knowing if the first will fail) and uses whichever responds first.
Should every HTTP client have resilience policies?
Yes, with appropriate configuration. Internal services might need different settings than external APIs. But every external call should have at least a timeout.
How do I test resilience policies?
Use chaos engineering tools (Polly's Simmy, or inject faults via middleware). Test: single failures, sustained failures, slow responses, partial failures.
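One lightweight way to inject faults via middleware is a chaos DelegatingHandler registered only in test environments; a sketch (the handler name and fault rate are made up):
// Fails a fraction of outgoing requests so resilience policies actually get exercised
public class ChaosHandler : DelegatingHandler
{
    private readonly double _faultRate;

    public ChaosHandler(double faultRate = 0.2) => _faultRate = faultRate;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (Random.Shared.NextDouble() < _faultRate)
        {
            // Simulate a transient downstream failure without touching the network
            return new HttpResponseMessage(HttpStatusCode.ServiceUnavailable)
            {
                RequestMessage = request
            };
        }
        return await base.SendAsync(request, cancellationToken);
    }
}

// Registered after the resilience handler, so injected faults flow up through the policies:
// var clientBuilder = builder.Services.AddHttpClient<IExternalService, ExternalService>();
// clientBuilder.AddStandardResilienceHandler();
// clientBuilder.AddHttpMessageHandler(() => new ChaosHandler(faultRate: 0.3));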
What to do next
Review your HTTP client configurations today. Verify every external call has an explicit timeout and appropriate retry policy.
For more on building production-quality ASP.NET Core applications, read Health Checks & Graceful Shutdown.
If you want help reviewing resilience patterns in your codebase, reach out via Contact.
References
- Build resilient HTTP apps
- HTTP resiliency (Microsoft.Extensions.Http.Resilience)
- Polly documentation
- Implement HTTP call retries with exponential backoff
- Implement the Circuit Breaker pattern
- Microsoft.Extensions.Http.Resilience
Author notes
Decisions:
- Recommend the .NET 8+ AddStandardResilienceHandler as the primary approach. Rationale: built-in, well-tested defaults.
- Always include jitter in retry backoff. Rationale: prevents a synchronized retry thundering herd.
- Use percentage-based circuit breakers. Rationale: adapts to traffic volume automatically.
Observations:
- Teams often add retries after first timeout incident, creating retry storms.
- Circuit breaker thresholds copied from examples without adjusting for traffic.
- Fallbacks added during outages, not during design.
- Timeout issues often misdiagnosed as "slow services" instead of missing timeouts.