The Daily Parker

Politics, Weather, Photography, and the Dog

Azure DNS failure causes widespread outage

Yesterday, Microsoft made an error in a nameserver delegation change (where they switch which computers hold their internal address book), causing large swaths of Azure to lose track of itself:

Summary of impact: Between 19:43 and 22:35 UTC on 02 May 2019, customers may have experienced intermittent connectivity issues with Azure and other Microsoft services (including M365, Dynamics, DevOps, etc). Most services were recovered by 21:30 UTC with the remaining recovered by 22:35 UTC. 

Preliminary root cause: Engineers identified the underlying root cause as a nameserver delegation change affecting DNS resolution and resulting in downstream impact to Compute, Storage, App Service, AAD, and SQL Database services. During the migration of a legacy DNS system to Azure DNS, some domains for Microsoft services were incorrectly updated. No customer DNS records were impacted during this incident, and the availability of Azure DNS remained at 100% throughout the incident. The problem impacted only records for Microsoft services.

Mitigation: To mitigate, engineers corrected the nameserver delegation issue. Applications and services that accessed the incorrectly configured domains may have cached the incorrect information, leading to a longer restoration time until their cached information expired.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. A detailed RCA will be provided within approximately 72 hours.
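The "cached information" point in the mitigation note is why the pain lingered after the fix went in: resolvers and client stacks hold onto a DNS answer for the record's TTL, so a bad delegation keeps being served from cache until it expires. Here's a toy sketch of that behavior in Python; it's not Azure's resolver, and the name, values, and TTL are invented for illustration:

    import time

    class TtlCache:
        """Minimal stale-until-expiry cache, mimicking how a resolver keeps
        a DNS answer for its TTL even after the authoritative data is fixed."""

        def __init__(self):
            self._store = {}  # name -> (value, expiry timestamp)

        def get(self, name, lookup, ttl):
            value, expires_at = self._store.get(name, (None, 0.0))
            if time.time() < expires_at:
                return value          # served from cache, even if wrong upstream
            value = lookup(name)      # re-query the authoritative source
            self._store[name] = (value, time.time() + ttl)
            return value

    # Hypothetical scenario: the authoritative record gets corrected,
    # but a client that cached the bad answer keeps it until the TTL runs out.
    authoritative = {"some-service.example": "wrong-nameserver"}
    cache = TtlCache()
    print(cache.get("some-service.example", authoritative.get, ttl=3600))  # wrong-nameserver
    authoritative["some-service.example"] = "correct-nameserver"
    print(cache.get("some-service.example", authoritative.get, ttl=3600))  # still wrong-nameserver, for up to an hour

That last line is the whole story of the slow tail of the recovery: nothing is broken at the source anymore, but every cache has to age out on its own schedule.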

If you tried to get to The Daily Parker yesterday afternoon, Chicago time, you might have gotten nothing, or you might have gotten the whole blog just fine. All I know is that I spent half an hour tracking it down from my end before Microsoft copped to the problem.
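For anyone who wants to run the same kind of check from their own end, a few lines of standard-library Python will show whether names are resolving at all (the hostnames below are placeholders, not this site's actual dependencies):

    import socket

    # Placeholder hostnames for illustration, not this blog's real configuration.
    hosts = [
        "www.example-blog.com",
        "example.blob.core.windows.net",
        "example.database.windows.net",
    ]

    for host in hosts:
        try:
            print(f"{host} -> {socket.gethostbyname(host)}")
        except socket.gaierror as err:
            print(f"{host} -> DNS lookup failed ({err})")

When several unrelated hostnames all fail to resolve at once, the problem is almost certainly upstream DNS rather than anything in your own application.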

That's not a criticism of Microsoft. In fact, they're a lot more transparent about problems like this than most other organizations. And having spent a lot of my own time trying to figure out why something has broken, I don't think half an hour is very long at all.

So, bad for Microsoft that they tanked their entire universe with a botched DNS delegation. Good for them that they got the fix in quickly and had most services back within a couple of hours.

Comments (2)

  • David Harper

5/4/2019 7:32:29 AM +00:00

    I suspect that there may be a welcome trend for major IT service providers to be more open about major FUBARs like this.  GitHub published a full explanation for their 24-hour outage back in October last year, for example, which revealed embarrassing shortcomings in their arrangements for failover.  Companies like GitHub and Microsoft know that they have to retain the confidence of the IT experts who are the direct users of their services, and IT experts don't like to be fed BS.  Full disclosure of your errors shows both that you respect your fellow techies and that you really do know what went wrong.

  • The Daily Parker

5/4/2019 3:01:11 PM +00:00

    Don't forget, GitHub now *is* Microsoft.
