CTO Mark Russinovich speaks about Microsoft boosting Azure reliability
CTO Mark Russinovich speaks about Microsoft boosting Azure reliability

Microsoft Tackles Top Challenge: Boosting Azure’s Reliability

1596 0

With Azure having become Microsoft’s centerpiece for the future, the company’s making some huge investments around ML and other technologies to boost Azure’s reliability as more global corporations bet their businesses on it.

Several months ago, I wrote about some earlier steps Microsoft was taking to address Azure’s reliability. Called After 3 Cloud Failures in 12 Months, Microsoft Fortifies Azure, the piece drew from a blog post from Azure CTO Mark Russinovich. Russinovich had described a range of approaches Microsoft is taking to boost reliability without disrupting customers’ operations in doing so.

In light of those cloud failures during a time when many large global corporations are moving mission-critical workloads to the Azure cloud, I pegged Azure reliability as Microsoft’s #1 challenge for the coming year. 

Earlier this month, Russinovich posted an update on Microsoft’s ongoing efforts. In this round, there’s no question that Microsoft is banking on dynamic ML technology to provide new improvements and capabilities. (Russinovich wrote the opening portion of the post—called Advancing no-impact and low-impact maintenance technologies—and a few leaders from his team wrote the rest.)

In a section called “The future of Azure maintenance,” the post outlines Microsoft’s big bet on machine learning: “We are investing heavily in machine learning-based insights and automation to maintain availability and reliability. Eventually, this ‘AI Operations’ model will carry out preventative maintenance, initiate automated mitigations, and identify contributing factors and dependencies during incidents more effectively than our human engineers can.” 

But in the meantime, Microsoft hopes to “minimize customer impact when updating the fleet. Today, the vast majority of updates to the host operating system are deployed in place with absolute transparency and zero customer impact using hot patching. In infrequent cases in which the update cannot be hot patched, we typically utilize low-impact memory preserving update technologies to roll out the update… Thanks to continued investments in this space, we are at a point where the vast majority of host maintenance activities do not impact the VMs hosted on the affected infrastructure.”

Here are some additional highlights from this new post, which Russinovich said is intended to highlight “several initiatives underway to keep improving platform availability, as part of our commitment to provide a trusted set of cloud services.”

The post outlines 4 approaches: Plan A, which involves Hot Patching; Plan B, which centers on Memory-Preserving Maintenance; Plan C, which is about Self-Service Maintenance; and Plan D, which is Live Migration. Here’s a look at each:

Plan A: Hot Patching.

This technique allows customers to “make targeted changes to running code without incurring any downtime for customer VMs.” It’s considered a “no-impact update technology because it directs “all new invocations of a function on the host to an updated version of that function.” For applying updates, Microsoft said it has been using hot patching wherever possible to avoid “any impact to the VMs running on that host.” Microsoft first began using hot patching in Azure in 2017, extending its use from the hypervisor to firmware hot patches sometime down the road. And since “some large host updates contain changes that cannot be applied using function-level hot patching…we endeavor to use memory-preserving maintenance.”

Plan B: Memory-preserving maintenance.

This approach “involves ‘pausing’ the guest VMs (while preserving their memory in RAM), updating the host server, then resuming the VMs and automatically synchronizing their clocks,” the post says. Since memory-preserving maintenance in Azure made its debut in 2018, 3 important improvements have been made. First, host reboots are not always required. Second, the length of the “customer pause” has been reduced. And finally, the technology’s now available on more types of VMs.  

Plan C: Self-service maintenance.

This option gives customers a specific window of time “within which they can choose when to initiate impactful maintenance on their VM(s). This initial self-service phase typically lasts around a month and empowers organizations to perform the maintenance on their own schedules so it has no or minimal disruption to users.” Afterward, Azure begins to perform maintenance automatically. The post also describes “rebootless updates” that result in “pauses” of only a few seconds. This approach is “useful for VMs running ultra-sensitive workloads which can’t sustain any interruption even if it lasts just for a few seconds.”

Plan D: Live migration.

This entails “moving a running customer VM from one ‘source’ host to another ‘destination’ host… Once most of the local state is moved, the guest VM experiences a short pause usually lasting five seconds or less. After that pause, the VM resumes running on the destination host… Today, when Azure Machine Learning algorithms predict an impending hardware failure, live migration can be used to move guest VMs onto different hosts preemptively.”

Subscribe to the Cloud Wars Newsletter for in-depth analysis of the major cloud vendors from the perspective of business customers. It’s free, it’s exclusive, and it’s great!