Cloud Wars analyzes Microsoft Azure reliability, a challenge for 2020

Microsoft’s Most-Dangerous Challenge in 2020

While my #1 Cloud Wars story from 2019 is Microsoft’s excellent chance of topping $50 billion in cloud revenue for its fiscal 2020, the biggest threat to that achievement is an internal issue the company’s working hard to address.

And that near-existential threat is the level of reliability of its Azure cloud.

Six months ago, in a piece called After 3 Cloud Failures in 12 Months, Microsoft Fortifies Azure Reliability, I analyzed a blog post from Azure CTO Mark Russinovich describing those “incidents” and Microsoft’s actions in addressing them.

Throughout 2019, it was remarkable to see the torrid growth generated by Microsoft and its Azure cloud among businesses large and small, but particularly among some of the world’s largest corporations. To get a sense of that market-leading momentum and why Microsoft has become not only the unquestioned leader among the Cloud Wars Top 10 but also the most-influential tech vendor on the planet, have a look at the customer-driven pieces we posted on Cloud Wars in 2019.

But all of that growth will quickly crumble if Azure’s reputation for reliability is anything less than superb.

I have not been able to find an update from Russinovich and Microsoft on issues he outlined back in July. But here’s a recap of the key findings that I outlined in the piece noted above about the 3 Azure failures in a 12-month period. The excerpts within quotation marks are taken directly from Russinovich’s blog post.

  • Here’s how Russinovich described the three failures: “However, at the scale Azure operates, we recognize that uptime alone does not tell the full story. We experienced three unique and significant incidents that impacted customers during this time period, a datacenter outage in the South Central US region in September 2018, Azure Active Directory (Azure AD) Multi-Factor Authentication (MFA) challenges in November 2018, and DNS maintenance issues in May 2019.”
  • “Improve our understanding”: “Outages and other service incidents are a challenge for all public cloud providers, and we continue to improve our understanding of the complex ways in which factors such as operational processes, architectural designs, hardware issues, software flaws, and human factors can align to cause service incidents.”
  • “Multiple Failures” and “Intricate Interactions”: “All three of the incidents mentioned were the result of multiple failures that only through intricate interactions led to a customer-impacting outage. In response, we are creating better ways to mitigate incidents through steps such as redundancies in our platform, quality assurance throughout our release pipeline, and automation in our processes.”
  • Within Russinovich’s CTO office, Microsoft has created a Quality Engineering team that will work closely with the existing Site Reliability Engineering team to explore and create innovative reliability solutions.
  • Safe deployment: Aimed at ensuring “that all code and configuration changes go through a cycle of specific stages,” Microsoft has expanded this initiative to include software-defined infrastructure changes such as networking and DNS, Russinovich wrote.
  • Storage-account level failover: This one’s worth reading in full: “During the September 2018 datacenter outage, several storage stamps were physically damaged, requiring their immediate shut down. Because it is our policy to prioritize data retention over time-to-restore, we chose to endure a longer outage to ensure that we could restore all customer data successfully. A number of you have told us that you want more flexibility to make this decision for your own organizations, so we are empowering customers by previewing the ability to initiate your own failover at the storage-account level.”
  • Fault injection and stress testing: I also recommend reading this one in Russinovich’s own words: “Validating that systems will perform as designed in the face of failures is possible only by subjecting them to those failures. We’re increasingly fault injecting our services before they go to production, both at a small scale with service-specific load stress and failures, but also at regional and AZ scale with full region and AZ failure drills in our private canary regions. Our plan is to eventually make these fault injection services available to customers so that they can perform the same validation on their own applications and services.”

And here’s what I wrote in the conclusion to my analysis back in July:

Clearly, pushing reliability upward from 99.995% is a big challenge. But implicit as well as explicit in Microsoft’s promise to customers of its Azure cloud is that Microsoft’s size, scale, technological expertise and financial resources will shield those customers from the disruptive chaos of modern enterprise technology.

And if Microsoft intends to retain its #1 spot in the Cloud Wars, all of its cloud customers—those who’ve been affected by those 3 “incidents” as well as those that haven’t—will be demanding that the plans outlined by Russinovich become reality.

And that they do so quickly.
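To put the 99.995% figure from that conclusion in concrete terms, here is a rough back-of-the-envelope sketch. It assumes a 365-day year and treats availability as a simple fraction of total time; the function name is my own illustration and is not how Microsoft measures or reports Azure uptime.

```python
# Rough arithmetic only: converts an availability percentage into the
# downtime it implies over one year. Assumes a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Compare a few common availability levels with the 99.995% cited above.
    for pct in (99.9, 99.99, 99.995, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% availability -> {minutes:.1f} minutes of downtime per year")
```

By this arithmetic, 99.995% leaves a budget of roughly 26 minutes of downtime a year, which helps explain why a single multi-hour incident weighs so heavily on customers even when the headline number looks strong.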

Since Russinovich’s extremely important post six months ago, he’s published two smaller-scale updates: one last month on Azure Active Directory availability and one in August on Azure Virtual Machines resiliency.

It’s high time for a complete and thorough update from Russinovich on all the issues he outlined in that compelling July post.

Because this issue is not only essential to Microsoft’s own growth and success but also to that of the vast numbers of businesses across the globe that are trusting Microsoft Azure to live up to its #1 billing.

Cloud Wars

Top 10 Rankings — Dec. 30, 2019

1. Microsoft — Major IaaS deal w/#3 Salesforce follows Q1 cloud revenue of $11.6B
2. Amazon — Usually the disruptor, Jassy disrupted by Salesforce-Microsoft deal
3. Salesforce — Big Dreamforce dreams: How Benioff Plans To Defeat Oracle and SAP
4. SAP — New co-CEOs unify priorities, push customer success
5. Oracle — Ellison and SAP both claim #1 spot in enterprise apps—which will it be?
6. Google — Thomas Kurian is our CEO of the Year for driving customer-first culture
7. IBM — Q3 cloud rev. up 14% to $5B; BofA says cloud has saved billions in IT costs
8. Workday — Bhusri says Financials will become bigger than flagship HR business
9. ServiceNow — Bill McDermott takes over as CEO as Q3 revenue reaches $900M
10. Accenture — No longer breaking out cloud revenue but probably close to $10B
