AWS US-EAST-1 Outage: How a Race Condition in DynamoDB’s DNS Workflow Triggered a Major Failure

Ethan Cole

The AWS US-EAST-1 outage exposed a critical race condition inside DynamoDB’s DNS automation, disrupting multiple services across the region. The incident, which unfolded on October 19–20, 2025, caused widespread failures and ignited new discussions about cloud reliability, multi-region design, and dependency chains inside AWS.

What Caused the AWS US-EAST-1 Outage

AWS engineering teams revealed that the outage started when a latent race condition inside the DynamoDB DNS management workflow produced an empty DNS record for the regional endpoint. Because of this error, services attempting to reach DynamoDB immediately began to fail. As a result, major AWS components depending on the database—including EC2, Lambda, and Fargate—experienced extended disruption.

AWS stated: “The root cause was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record that the automation failed to repair.”
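To see how an automation race can end with an empty record rather than a merely stale one, consider the heavily simplified sketch below. It is illustrative only: the plan structure, function names, and cleanup logic are invented and do not mirror AWS’s actual DNS Planner and Enactor code, but they show how a delayed writer combined with an aggressive cleanup step can leave an endpoint with no records at all.

```python
# Hypothetical, heavily simplified illustration of a "stale plan" race in a
# DNS automation workflow. All names and data structures are invented for
# this example; they do not reflect AWS's actual Planner/Enactor implementation.

dns_table = {"dynamodb.us-east-1.example": ["10.0.0.1", "10.0.0.2"]}

def apply_plan(endpoint, plan_id, records, applied_plans):
    """Apply a DNS plan with no version check: the last writer wins."""
    dns_table[endpoint] = records
    applied_plans[endpoint] = plan_id

def clean_up_stale_plan(endpoint, latest_plan_id, applied_plans):
    """Purge records if the plan currently in place is not the latest one."""
    if applied_plans.get(endpoint) != latest_plan_id:
        dns_table[endpoint] = []  # the endpoint is left with no records at all

applied = {}

# A fast worker applies the newer plan #42 first...
apply_plan("dynamodb.us-east-1.example", 42, ["10.0.0.3"], applied)
# ...then a delayed worker overwrites it with the older plan #41.
apply_plan("dynamodb.us-east-1.example", 41, ["10.0.0.1"], applied)
# Cleanup sees a stale plan in place and empties the record entirely.
clean_up_stale_plan("dynamodb.us-east-1.example", 42, applied)

print(dns_table)  # {'dynamodb.us-east-1.example': []}
```

The dangerous ingredient is not any single step but the missing coordination between them: without a version check before each write, the repair logic can no longer tell a stale record from a fresh one.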

How the Failure Spread Across AWS Services

The visible DNS failure was only the first symptom. Internally, new EC2 instances continued to launch at the hypervisor layer. However, their network configuration did not complete because key state information stored in DynamoDB remained unreachable. Consequently, Network Load Balancer operations also degraded. This created a chain reaction across multiple AWS services that rely on timely network state propagation.
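The minimal sketch below illustrates that dependency in the abstract. It is not AWS’s internal code: the function name, the use of a plain DNS lookup, and the “pending” status are assumptions chosen only to show why a launch can appear to succeed while networking never completes.

```python
# Hypothetical sketch: an instance "launches" at the hypervisor layer, but its
# network configuration depends on state reachable only through the DynamoDB
# regional endpoint. During the outage, that endpoint's DNS record was empty,
# so name resolution itself failed before any API call could be made.
import socket

def configure_instance_network(instance_id, endpoint="dynamodb.us-east-1.amazonaws.com"):
    try:
        socket.getaddrinfo(endpoint, 443)  # stands in for reaching the state store
    except socket.gaierror:
        # The launch is not rolled back; the instance simply sits without a
        # working network configuration, which matches what operators observed.
        return f"{instance_id}: launched, network configuration pending"
    return f"{instance_id}: network configured"

print(configure_instance_network("i-0123456789abcdef0"))
```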

Industry expert Yan Cui noted that DNS was not the true root cause, but merely the first external signal of deeper DynamoDB automation issues.

AWS Immediate Fixes and Long-Term Improvements

AWS moved quickly to apply several immediate fixes. The company disabled the DynamoDB DNS Planner and DNS Enactor automation globally. In addition, AWS began redesigning the system to eliminate the race condition and added safeguards to prevent incorrect DNS plan updates.

Further improvements will include:

  • velocity control for Network Load Balancers to slow capacity removal during failover (sketched below)
  • stronger throttling in EC2 networking propagation systems
  • broader architectural changes to reduce cross-service impact

These steps aim to reduce the likelihood of similar region-wide disruptions.
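As a rough illustration of the velocity-control idea listed above, the sketch below caps how many capacity removals automation may perform within a sliding time window. The threshold, window length, and class name are invented for the example; AWS has not published the exact parameters it will use.

```python
# Minimal sketch of a removal "velocity" limiter: automation may take capacity
# out of service only up to a fixed rate, so a burst of failing health checks
# cannot drain a fleet all at once. Parameters are illustrative.
import time

class RemovalVelocityLimiter:
    def __init__(self, max_removals, window_seconds):
        self.max_removals = max_removals
        self.window_seconds = window_seconds
        self.removals = []  # timestamps (monotonic seconds) of recent removals

    def allow_removal(self):
        now = time.monotonic()
        # Keep only removals that still fall inside the sliding window.
        self.removals = [t for t in self.removals if now - t < self.window_seconds]
        if len(self.removals) >= self.max_removals:
            return False  # removing more capacity now would be too fast
        self.removals.append(now)
        return True

limiter = RemovalVelocityLimiter(max_removals=2, window_seconds=60.0)
for target in ["node-1", "node-2", "node-3"]:
    action = "remove" if limiter.allow_removal() else "defer removal of"
    print(f"{action} {target}")  # the third removal is deferred
```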

How Engineers Responded Across the Industry

The outage sparked intense discussion in the engineering community. Some practitioners focused on the outage’s length, while others highlighted the region’s long-term reliability record. For example, DevOps consultant Roman Siewko noted that despite the highly visible 15-hour disruption, US-EAST-1 has maintained between 99.84% and 99.95% uptime over the past five years.

Additionally, Mudassir Mustafa emphasized that reliability engineering often suffers from memory bias. Teams react strongly to rare, dramatic failures while overlooking the continuous operational work that maintains stability most of the year.

Lessons Learned from the AWS US-EAST-1 Outage

The incident reinforces several critical architectural lessons. First, multi-AZ designs do not replace multi-region resilience. Second, DNS automation remains a sensitive layer in large-scale distributed systems. Finally, dependency chains inside cloud platforms can amplify small internal defects into region-wide failures.
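For teams acting on the multi-region lesson, a small client-side failover wrapper is one possible starting point. The sketch below assumes the data is already replicated to a second region, for example via DynamoDB global tables; the table name, key shape, and region list are placeholders.

```python
# Client-side region failover sketch for DynamoDB reads, assuming the table is
# replicated to every region listed. Names and keys are illustrative only.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = ["us-east-1", "us-west-2"]  # primary first, then fallback

def get_item_with_failover(table_name, key):
    last_error = None
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region)
        try:
            response = client.get_item(TableName=table_name, Key=key)
            return response.get("Item")
        except (ClientError, EndpointConnectionError) as exc:
            last_error = exc  # try the next region before giving up
    raise last_error

# Example call; assumes an "orders" table replicated to both regions.
# item = get_item_with_failover("orders", {"order_id": {"S": "12345"}})
```

Client-side retries only help if the data actually exists in the fallback region, so a pattern like this complements, rather than replaces, replication and traffic-routing decisions made at the infrastructure level.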

For teams conducting internal reviews, AWS maintains a detailed event history page documenting every impacted service and its recovery timeline.
