Microsoft is working to improve Azure reliability across its 54 cloud regions, said Mark Russinovich, CTO of Microsoft Azure, in a blog post Monday. As part of the effort, Microsoft is adding a quality engineering team to the CTO's office, where it will work with the site reliability engineering team.
In the last year, Azure averaged 99.995% uptime across its global cloud infrastructure, according to Russinovich. To provide more redundancy in the event of an incident, Microsoft has availability zones in its 10 largest Azure regions. By 2021, it plans to add availability zones to the next 10 largest regions.
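To put the 99.995% figure in perspective, an uptime percentage can be converted into an approximate yearly downtime allowance. The sketch below is illustrative arithmetic only: the function name is hypothetical, it assumes a 365-day year, and it treats the reported global average as if it applied uniformly, which will not hold for any single region or customer.

```python
# Illustrative only: converts an average uptime percentage into yearly downtime.
# Assumes a non-leap year; the function name is hypothetical, not a Microsoft API.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the minutes of downtime per year implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # 99.995% is the average Russinovich reported for the last year.
    print(f"{downtime_minutes_per_year(99.995):.1f} minutes per year")
```

Under these assumptions, 99.995% uptime works out to roughly 26 minutes of downtime over a year.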
Microsoft is working to deploy low- or zero-impact maintenance, which includes "hot patching, live migration, and in-place migration," Russinovich said. Many of the patches Azure has deployed in the last year had "no customer impact or downtime."
For a cloud service provider, uptime is the backbone of the business. If a region goes down, failures cascade to end customers, who in turn lose business functionality (and the outage can translate directly into increased executive stress).
Three "significant incidents," each the result of "multiple failures," have Microsoft turning its attention to improving uptime, Russinovich said:
The September data center outage in Azure's South Central U.S. region
Dual multifactor authentication outages on Azure Active Directory in November
May DNS maintenance issues, which caused network connectivity interruptions
Azure operates on an enormous scale and is a favored cloud provider for many large enterprises. "More than 95% of the Fortune 500 run their workloads on our cloud," said Microsoft CEO Satya Nadella during the company's Q3 earnings call in April.
The segment is big business for Microsoft. In Q3, its Intelligent Cloud segment had revenue of $9.7 billion, up 22% year-over-year. The segment includes server products, cloud services and enterprise services.
Microsoft's efforts to improve uptime come after a number of service providers have struggled with site reliability. Shortly after its debut as a publicly traded company, Slack experienced "degraded services." Cloudflare also had an outage, which highlighted how delicately the internet is constructed.