One of the most frequently discussed topics related to COVID-19’s multi-dimensional impact is the dramatic increase in “work from home” (WFH). A popular extension of that discussion is the various tools, applications, and technologies that are being used to make work from home...work! As it turns out, almost all of the commonly used technologies for this purpose are either directly provided by large cloud platforms or run on one of them.
I’ve written quite a bit of analysis before on “stress testing” the various cloud platforms to assess their relative strengths and weaknesses, from looking at their other core businesses as a proxy, to the breadth and quality of their global data center coverage. Widespread WFH and demand for online entertainment and education, as more adults and children stay home, are giving every cloud an unprecedented stress test as we speak.
How are they faring so far?
[NOTE: if you are interested in more cloud-related analysis on Interconnected, please check out the newsletter's Cloud Industry category.]
Azure Showing Cracks
Microsoft’s Azure is showing signs of weakness as demand surges for its various collaboration and workplace products, from Office 365 to Windows Virtual Desktop to Teams (the workplace messaging service that competes with Slack). One misleading headline that has been pushed out by Microsoft, and unfortunately picked up by various tech media outlets, is the “775% increase” in Azure cloud service usage in geographical areas that are most committed to some form of social distancing or shelter-in-place policy.
I don’t blame the Microsoft PR team for pushing a positive narrative to make their products look good; that’s their job. But putting my analyst hat on, this narrative, when not fully contextualized and understood, papers over cracks in the Azure infrastructure that are already showing.
First, this 8-plus-fold usage increase is limited to places where shelter-in-place was in full force as of the end of March. In the United States at least, that is only a small part of the country, mostly states with densely populated urban areas: New York, California, New Jersey, Michigan, etc. Touting a percentage increase is a classic trick for spinning massive growth out of a small or unknown base.
Second, in March alone, multiple issues have occurred on various Microsoft cloud services: two outages in just the last few weeks for European users of Teams, customers not getting the capacity they need in the US East region, and Xbox Live going down at a time when online gaming is surely surging. To put the timing in perspective, America and Europe only started taking COVID-19 seriously on a societal level in March. Parts of Azure began to fail almost as soon as people’s behaviors began to change.
Third, the way Azure has been dealing with and communicating about resource prioritization and changing service-level guarantees for free and paid-tier customers tells me that it has less spare capacity than you might assume of a hypercloud provider. Yes, it’s definitely the right thing to do to prioritize capacity for any service or application that is supporting healthcare or other important efforts related to combating COVID-19, as is the case with Power BI, Microsoft’s big data analytics tool. That being said, the other workarounds of “limits on free offers”, recommending “customers use alternative regions...that may have less demand surge”, and just encouraging “any customers experiencing allocation failures to retry…” indicate that there are few untapped resources to bring online to meet this demand surge. The only way forward is shifting existing resources around, keeping paying customers happy, and forgoing the usual generosity toward free users that can only happen in good times.
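To make the first point concrete, here is a minimal arithmetic sketch of why a percentage increase says little without a base. The two baselines are hypothetical numbers in arbitrary units, chosen only for illustration; Microsoft did not disclose the actual base behind the "775%" figure.

```python
def multiplier_from_percent_increase(percent_increase: float) -> float:
    """Convert a percentage increase into a multiple of the baseline."""
    return 1 + percent_increase / 100


def absolute_increase(baseline: float, percent_increase: float) -> float:
    """Extra usage implied by a percentage increase over a given baseline."""
    return baseline * percent_increase / 100


# Hypothetical baselines (arbitrary units), not real Azure figures.
small_base, large_base = 10.0, 10_000.0

print(multiplier_from_percent_increase(775))  # 8.75, i.e. "8-plus-fold"
print(absolute_increase(small_base, 775))     # 77.5 extra units
print(absolute_increase(large_base, 775))     # 77500.0 extra units
```

The same headline number describes a trivial bump or an enormous one, depending entirely on a base that was never published.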
All these cracks, and we may see more in the future, are not surprising. As I’ve noted in previous data center coverage and “stress test” analysis, most of Azure’s 54 live regions are designed with a single availability zone (AZ), which is an architectural flaw. A single-AZ region provides less redundancy and reliability than a multi-AZ setup (usually three AZs), which is the default in AWS, GCP, and parts of Alibaba Cloud. Since Microsoft’s other businesses are either hardware or office software, with the possible exception of Xbox gaming, it generally lacks operational experience in running “always on” services and preparing for unanticipated traffic surges.
And all this is happening on Microsoft’s own home-grown services running on its own cloud, not even third party services built on top of Azure.
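The single-AZ versus multi-AZ point can be illustrated with a back-of-the-envelope calculation. This is a simplified sketch, not a model of any vendor's real numbers: the 99.9% per-AZ availability is a hypothetical figure, AZ failures are assumed to be independent (in practice they are not fully so), and the workload is assumed to be architected to fail over between AZs.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def region_availability(az_availability: float, az_count: int) -> float:
    """Probability that at least one AZ is up, assuming independent failures."""
    return 1 - (1 - az_availability) ** az_count


# Hypothetical per-AZ availability, chosen only for illustration.
per_az = 0.999

for az_count in (1, 3):
    availability = region_availability(per_az, az_count)
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{az_count} AZ(s): availability={availability:.9f}, "
          f"expected downtime={downtime:.4f} min/year")
```

Under these assumptions, going from one AZ to three shrinks expected regional downtime by several orders of magnitude, which is the intuition behind treating multi-AZ as the default design.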
(Note: in order to avoid possible legal liability caused by the "775% increase" headline, Microsoft filed this 8-K report with the SEC to clear up its motivation.)
AWS Standing Strong
Meanwhile, there has been little news coming out of AWS. In this case, no news is good news. AWS also sells its own workplace enterprise software, like Chime (videoconferencing, messaging) and WorkSpaces (remote desktop), while running plenty of third-party ones that have become household names during the COVID-19-induced surge in WFH: Zoom, Slack, and the Atlassian suite, to name a few. (Worth noting: Zoom also uses Azure and its own data centers; the proportion of workloads is not clear.) AWS also backs many entertainment services that people can’t live without while stuck at home: Netflix, Hulu, Twitch (owned by Amazon), and Fortnite (owned by Epic Games, a big AWS customer).
Perhaps most important of all, AWS is the infrastructure for Amazon.com, Instacart, DoorDash, and many others, whose delivery of groceries, meals, and other goods has made them essential services in keeping the self-quarantined population alive.
It’s safe to say that AWS runs a much bigger, if not more critical, chunk of the digital world than Azure. AWS has previously developed comprehensive processes for preventing outages during natural disasters and unexpected events, perhaps due to its own epic outages in the past. And it looks like it’s holding up well so far under the current crisis.
Myth of Adding Capacity During COVID-19
As for the other clouds (GCP, Alibaba Cloud, IBM, Oracle, Tencent Cloud, etc.), not many COVID-19-related outages have been reported thus far. That’s partly because these cloud vendors combined still run a smaller portion of the Internet than AWS and Azure; they are not big enough to be stress tested on the same level.
The standard external response from any cloud vendor that’s under resource constraint, like Azure, is that it will add more capacity as quickly as possible. There is no way to know for sure how much extra capacity each cloud has or should have. But we do know what they normally do to increase capacity.
From a technical angle, there are only three options. You either (1) build additional data centers and networking, which takes years, or (2) rack more servers into your existing data centers, or (3) use software to boost throughput, performance, or multi-tenancy capacity in order to squeeze more out of the existing hardware. Under COVID-19, human movement is limited, manufacturing capacity is constrained (especially in Asia, where many of the servers are made), and supply chain and shipping capacity are reduced, so the only near-term option is software.
And when the technical options are all exhausted, the only thing you can really do is use financial incentives and disincentives to control, limit, or shift usage, which appears to be what Azure is doing.
So when these cloud vendors' PR departments try to calm your concerns with their "we are adding capacity" talking point, know that what they realistically can do is pretty limited.
At the end of the day, the economics of a cloud business, much like insurance, count on uncorrelated risks. Ideally, a platform runs many businesses with some unused capacity on reserve, but not so much that it is wasted. When some businesses’ usage spikes, they can draw on that extra capacity, and those spikes don’t cause other businesses to spike too. If the spikes are related, it’s manageable as long as it’s anticipated, as is the case with e-commerce shopping holidays like Singles Day or Black Friday: provision extra capacity ahead of time and simulate traffic surges to stress test.
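The insurance analogy can be sketched with a small simulation. Everything here is a hypothetical toy model, not data from any cloud vendor: 100 tenants, each with a 5% chance of tripling its demand, sharing a platform with 30% headroom. The only thing that changes between the two runs is whether spikes are independent or triggered by one shared shock.

```python
import random


def overflow_rate(correlated: bool, trials: int = 10_000, tenants: int = 100,
                  spike_prob: float = 0.05, base: float = 1.0,
                  spike: float = 3.0, capacity: float = 130.0,
                  seed: int = 0) -> float:
    """Fraction of trials in which total demand exceeds platform capacity."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    overflows = 0
    for _ in range(trials):
        if correlated:
            # One shared shock: either every tenant spikes or none does.
            spiking = tenants if rng.random() < spike_prob else 0
        else:
            # Each tenant spikes on its own, independently of the others.
            spiking = sum(rng.random() < spike_prob for _ in range(tenants))
        demand = spiking * spike + (tenants - spiking) * base
        if demand > capacity:
            overflows += 1
    return overflows / trials


print("independent spikes:", overflow_rate(correlated=False))
print("correlated spikes: ", overflow_rate(correlated=True))
```

Each tenant has the same individual chance of spiking in both runs, yet the shared-shock case exhausts capacity far more often. That is the sense in which the same headroom that comfortably absorbs uncorrelated spikes offers little protection against a correlated one.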
Of course, we live far from an ideal world right now. When something like COVID-19 triggers many events all at the same time, many of which are impossible to anticipate, as a cloud platform, you either have the capacity or you don’t.
(This post was updated on April 2, 2020, after initial publication, with a link to Microsoft's 8-K filing with the SEC regarding its "775% increase" blog post and an update/correction on AWS's own workplace software offering, which I failed to mention in the original version.)
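One of the hottest topics related to the social and economic impact of the COVID-19 pandemic is the increase in “work from home” (WFH). A natural extension of this topic is the discussion of the various tools, applications, and technologies that support remote work. These tools are closely tied to cloud platforms: the tools and software commonly used today are either provided directly by a large cloud platform or run on one.
These existing cracks (and there may be more in the future) are not really surprising. As I pointed out in my earlier data center analysis and “stress test” analysis, most of Azure’s 54 regions are designed with a single AZ (Availability Zone), and this architecture has an obvious flaw. Its redundancy and reliability fall far short of a multi-AZ setup (usually three AZs), which is the default in AWS, GCP, and parts of Alibaba Cloud. Because Microsoft’s other businesses, apart from Xbox gaming, are either hardware or office software, its team lacks hands-on experience operating “always on” internet services and handling unexpected traffic surges.
Meanwhile, there has been almost no news about AWS lately. No news is good news. AWS also sells its own workplace enterprise software, such as Chime (videoconferencing, messaging) and WorkSpaces (virtual desktop), while also running many third-party workplace tools that have become household names during the WFH boom that followed the COVID-19 outbreak: Zoom, Slack, the Atlassian suite, and so on. (Worth mentioning: Zoom also uses Azure and its own data centers; we have no way of knowing the exact split.) AWS also supports many entertainment services that people stuck at home can’t do without: Netflix, Hulu, Twitch (owned by Amazon), and Fortnite (owned by Epic Games, a major AWS customer).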