Netflix has shared a new look at how it protects its global streaming platform from unexpected failures, revealing the latest evolution of its reliability strategy. At QCon San Francisco 2025, Netflix engineers described how the company redesigned its load-shedding approach to keep streams stable even when traffic surges far beyond predicted levels.
Why Netflix's reliability strategy needed a new direction
Traffic spikes are one of Netflix’s biggest challenges. When a major series or film premieres, millions of users press “play” at the same time. Autoscaling helps, but it cannot always react fast enough. Building massive capacity “just in case” is also inefficient and extremely expensive.
Because of this, Netflix realized it needed more than scaling. It needed a system that could make smart decisions under pressure. That led to a new model that revolves around two internal buffers:
- Success buffer: extra capacity available before performance begins to drop.
- Failure buffer: controlled space used to reject some requests so the service can remain stable.
The new model focuses on using the failure buffer strategically, shedding load in a way that protects the user experience instead of degrading everything at once.
How request priority improves Netflix's reliability strategy
Earlier versions of load shedding treated all requests the same. The system dropped traffic randomly when overwhelmed. However, Netflix identified a key insight: not all requests are equally important.
During overload, background or low-value operations should be sacrificed first. The system must keep user-initiated playback alive at all costs.
Netflix now assigns each request a clear priority:
- High priority: user presses “play,” or critical write operations
- Low priority: prefetching, background tasks, nonessential reads
This prioritization means the platform can drop low-value traffic quickly, preserving the experience for viewers and reducing the risk of widespread failure.
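The idea of shedding low-value traffic first can be sketched in a few lines. This is a minimal illustration, not Netflix's implementation: the two-tier `Priority` enum, the request kinds in `HIGH_PRIORITY_KINDS`, and the `admit` function are all hypothetical names chosen to mirror the list above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Lower value means more important. Two tiers mirror the article;
    # a real system may well use more.
    HIGH = 0
    LOW = 1


@dataclass(frozen=True)
class Request:
    kind: str
    priority: Priority


# Hypothetical request kinds, loosely based on the examples in the text.
HIGH_PRIORITY_KINDS = {"play", "critical_write"}


def classify(kind: str) -> Priority:
    """Assign a priority from the request kind (assumed classification rule)."""
    return Priority.HIGH if kind in HIGH_PRIORITY_KINDS else Priority.LOW


def admit(requests: list[Request], capacity: int) -> tuple[list[Request], list[Request]]:
    """Keep at most `capacity` requests, shedding the lowest priority first.

    sorted() is stable, so arrival order is preserved within each tier.
    """
    ranked = sorted(requests, key=lambda r: r.priority)
    return ranked[:capacity], ranked[capacity:]


incoming = [
    Request(kind, classify(kind))
    for kind in ["prefetch", "play", "background_sync", "critical_write"]
]
kept, shed = admit(incoming, capacity=2)
```

With room for only two requests, the playback and critical-write requests are kept while the prefetch and background sync are shed, regardless of arrival order.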
Why Netflix shifted load-shedding decisions to services
A major shift in the Netflix reliability strategy came when the company moved shedding decisions from the API gateway to individual services. This allowed application instances to repurpose capacity internally. High-priority work can temporarily “borrow” resources from lower-priority tasks within the same service.
This approach also protects backend-to-backend calls and batch jobs, which do not pass through the gateway. As a result, resilience improves across the entire architecture, not just at the edge.
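One way to picture service-level shedding is a concurrency limiter inside each instance that stops admitting low-priority work well before the instance is full, leaving the remaining slots for high-priority requests. The sketch below assumes that design; `PriorityLimiter`, `total_slots`, and `low_priority_limit` are hypothetical names, and the talk as summarized here does not specify the actual mechanism.

```python
import threading


class PriorityLimiter:
    """Concurrency limiter in which low-priority work yields slots under load."""

    def __init__(self, total_slots: int, low_priority_limit: int):
        # Assumption: low-priority requests are admitted only while fewer than
        # `low_priority_limit` requests are in flight overall, so the slots
        # above that mark are effectively reserved for high-priority work.
        self.total_slots = total_slots
        self.low_priority_limit = low_priority_limit
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, high_priority: bool) -> bool:
        """Return True if the request may run, False if it should be shed."""
        limit = self.total_slots if high_priority else self.low_priority_limit
        with self._lock:
            if self.in_flight >= limit:
                return False
            self.in_flight += 1
            return True

    def release(self) -> None:
        """Free a slot once a previously admitted request completes."""
        with self._lock:
            self.in_flight -= 1


limiter = PriorityLimiter(total_slots=4, low_priority_limit=2)
results = [
    limiter.try_acquire(high_priority=False),  # admitted, 1 in flight
    limiter.try_acquire(high_priority=False),  # admitted, 2 in flight
    limiter.try_acquire(high_priority=False),  # shed: low-priority limit reached
    limiter.try_acquire(high_priority=True),   # admitted, 3 in flight
    limiter.try_acquire(high_priority=True),   # admitted, 4 in flight
    limiter.try_acquire(high_priority=True),   # shed: instance is full
]
```

Because the check runs inside the service, it applies equally to backend-to-backend calls and batch jobs that never reach the gateway.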
Automation as the backbone of the Netflix reliability strategy
To make this work across hundreds of microservices, Netflix built an automated platform based on three principles:
1. Priority classification
Priority is assigned early and carried through the entire call chain. The system prevents services from raising their own priority but allows them to lower it when appropriate.
2. Central configuration
Each cluster receives an automatically generated load-shedding function that maps CPU, latency, or concurrency levels to rejection probabilities. Non-critical traffic may be shed at moderate utilization, while critical traffic is shed only as the service approaches saturation.
3. Continuous validation
Netflix uses CHAP (Chaos Automation Platform) and failure injection testing to validate configurations before major content launches. This ensures each service has enough success and failure buffer to handle real-world stress.
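The first two principles can be illustrated with a short sketch. It assumes a numeric priority where a larger number means less important, a linear ramp from zero to full rejection, and CPU utilization as the input signal. The thresholds in `SHED_CURVES` are invented for illustration and are not Netflix's values, and the function names are hypothetical.

```python
import random


def propagate_priority(inherited: int, requested: int) -> int:
    """Priority to attach to a downstream call.

    Larger numbers are less important, so taking the max lets a service
    lower the priority it inherited but never raise it.
    """
    return max(inherited, requested)


def rejection_probability(utilization: float, start: float, full: float) -> float:
    """Linear ramp: 0 at or below `start`, 1 at or above `full`."""
    if utilization <= start:
        return 0.0
    if utilization >= full:
        return 1.0
    return (utilization - start) / (full - start)


# Hypothetical per-tier curves as (start, full) CPU utilization fractions.
SHED_CURVES = {
    "non_critical": (0.60, 0.80),  # begins shedding at moderate utilization
    "critical": (0.90, 1.00),      # begins shedding only near saturation
}


def should_shed(tier: str, utilization: float, rng=random.random) -> bool:
    """Randomly reject a request with the probability given by its tier's curve."""
    start, full = SHED_CURVES[tier]
    return rng() < rejection_probability(utilization, start, full)
```

At 70% CPU under these assumed curves, roughly half of non-critical requests would be rejected while critical requests are untouched. A generated configuration would presumably tune the curve per cluster rather than hard-code it.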
Preventing retry storms that worsen outages
One of the biggest dangers during overload is a retry storm. When clients repeatedly retry failed requests, they unintentionally amplify the overload.
Netflix introduced prioritized retry logic that adapts during heavy shedding:
- Non-critical retries pause completely
- High-priority retries remain allowed but are rate-limited
- Retries resume gradually as load stabilizes
This prevents cascading failures and ensures critical requests still have a path to success once the system recovers.
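The three retry rules above might look like the following on the client side. This is a sketch under stated assumptions: `shed_rate` is a hypothetical signal for the fraction of requests currently being rejected, `HEAVY_SHEDDING` is an invented threshold, and the per-window budget stands in for whatever rate limiter a real client would use.

```python
import random

# Hypothetical threshold: at or above this rejection rate, shedding counts as heavy.
HEAVY_SHEDDING = 0.5


class RetryGate:
    """Decides whether a failed request may be retried, based on its priority."""

    def __init__(self, critical_retries_per_window: int, rng=random.random):
        self.critical_retries_per_window = critical_retries_per_window
        self.critical_retries_used = 0
        self.rng = rng

    def new_window(self) -> None:
        """Reset the critical retry budget, e.g. once per second."""
        self.critical_retries_used = 0

    def allow_retry(self, critical: bool, shed_rate: float) -> bool:
        heavy = shed_rate >= HEAVY_SHEDDING
        if critical:
            if not heavy:
                return True
            # Under heavy shedding, critical retries are allowed but rate-limited.
            if self.critical_retries_used < self.critical_retries_per_window:
                self.critical_retries_used += 1
                return True
            return False
        if heavy:
            # Non-critical retries pause completely.
            return False
        # Gradual resume: the chance of retrying grows as the shed rate falls.
        return self.rng() < 1.0 - shed_rate / HEAVY_SHEDDING


# A fixed rng makes the example deterministic.
gate = RetryGate(critical_retries_per_window=2, rng=lambda: 0.5)
during_overload = [
    gate.allow_retry(critical=False, shed_rate=0.8),  # paused
    gate.allow_retry(critical=True, shed_rate=0.8),   # within budget
    gate.allow_retry(critical=True, shed_rate=0.8),   # within budget
    gate.allow_retry(critical=True, shed_rate=0.8),   # budget exhausted
]
```

The probabilistic resume is one simple way to avoid all clients retrying again at the same instant once the service recovers.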
The broader impact of Netflix’s reliability innovations
At QCon, engineers highlighted several key lessons:
- Load shedding is a safety mechanism, not a failure—it keeps the system running.
- Prioritization protects the viewing experience, ensuring users can still watch content during partial overload.
- Automation allows reliability to scale, even as the platform grows more complex.
Netflix’s new approach demonstrates how modern distributed systems can remain resilient without overprovisioning or sacrificing performance.
Conclusion
The evolution of the Netflix reliability strategy reflects the company’s ongoing commitment to availability at massive scale. By blending priority-based decisions, automated configuration, and continuous experimentation, Netflix has built a system capable of surviving sudden traffic surges without compromising the viewing experience. As the platform grows, this strategy will remain central to keeping streams smooth for millions of users worldwide.