I’ve been writing a lot about the public cloud industry lately. In my previous two posts, I compared Alibaba Cloud’s earnings with those of Microsoft’s Azure and Google’s GCP to assess their relative size, and gave a fairly comprehensive overview of the global data center coverage of AWS, Alibaba Cloud, Azure, and GCP to evaluate their relative strengths, weaknesses, architecture, and future plans.
“The Cloud” is not only a buzzword; it’s a real secular trend in the way technology will be built for decades to come. It’s also a highly complex product to evaluate, both as a user and as an investor. One must-do when you are “kicking the tires” on a public cloud platform is checking how well it handles “stress”.
Stress, in the cloud context, means: can the cloud endure a sudden, unexpected spike in traffic or usage? This happens all the time: a new streaming TV show launches, an online shopping holiday kicks off, a celebrity gets divorced or married. There’s also the nefarious version: distributed denial-of-service (DDoS) attacks.
The most recent example is Robinhood’s outage. Robinhood is a no-commission stock trading platform which grew quickly and became a Silicon Valley fintech darling. It was completely down on March 2 and for a portion of March 3, during one of the biggest single-day rises in the stock market, caused by COVID-19’s impact on the market that’s still ongoing. We don’t know yet exactly what happened. We do know that what caused the outage was “stress on [Robinhood’s] infrastructure—which struggled with unprecedented load”. (Emphasis mine)
So how do you stress test when there’s no stress? One way is of course to simulate different scenarios and edge cases. There are methodologies for this, like Chaos Engineering. If you have the luxury, like Facebook, you can shut down some of your own data centers randomly and periodically just to see what happens.
Another way to proxy a stress test is to see if a public cloud provider uses its own cloud for its other businesses. The nature of those “other businesses” can be quite telling when assessing the quality of their cloud services. The “big four” public clouds -- AWS, Alibaba Cloud, Azure, GCP -- were all started by large tech companies that already had plenty of other successful businesses before they moved into the cloud game.
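To make the “shut something down and see what happens” idea concrete, here is a toy sketch in Python. Everything in it is my own illustration: the `Cluster` class, the data center names, and the round-robin routing are hypothetical and not how Facebook or any real chaos tool works. It only shows the shape of a drill: remove one data center at random, keep sending traffic, and count what fails.

```python
import random


class Cluster:
    """Toy service running in several data centers (all names hypothetical)."""

    def __init__(self, datacenters):
        # Map each data center name to whether it is currently up.
        self.up = {name: True for name in datacenters}

    def shut_down(self, name):
        self.up[name] = False

    def restore(self, name):
        self.up[name] = True

    def handle(self, request_id):
        """Route a request to a live data center, or fail if none is left."""
        live = sorted(name for name, is_up in self.up.items() if is_up)
        if not live:
            raise RuntimeError("total outage")
        return live[request_id % len(live)]


def chaos_drill(cluster, rounds, requests_per_round, rng):
    """Each round: shut down one random data center, send traffic, restore it.

    Returns the number of requests that could not be served.
    """
    failed = 0
    for _ in range(rounds):
        victim = rng.choice(sorted(cluster.up))
        cluster.shut_down(victim)
        for request_id in range(requests_per_round):
            try:
                cluster.handle(request_id)
            except RuntimeError:
                failed += 1
        cluster.restore(victim)
    return failed


if __name__ == "__main__":
    rng = random.Random(2020)
    redundant = Cluster(["dc-east", "dc-west", "dc-central"])
    fragile = Cluster(["dc-only"])
    print("redundant:", chaos_drill(redundant, 5, 100, rng))
    print("fragile:", chaos_drill(fragile, 5, 100, rng))
```

In this toy model the three-data-center cluster serves every request during the drill, while the single-data-center cluster fails all of them. A real drill is far messier, but the question it asks is the same.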
One core value proposition of a public cloud is its ability to guarantee that it’s “always on”, no matter what kind of stress or disaster happens. So let’s look at these four clouds from the nature of their own business as a form of stress test.
Amazon’s e-commerce business is the market leader in the U.S. and Europe, and close to number one in India (Amazon’s market share was 31.2% v. Flipkart’s 31.9% in 2018, according to market research firm Forrester). It runs its own annual online shopping holiday, Prime Day, which sold $7.16 billion USD worth of goods in 2019, up roughly 72% from 2018 ($4.17 billion USD). Amazon also has plenty of other businesses -- Alexa, Prime Video streaming, fulfillment centers -- that always need to be on.
So Amazon creates plenty of stress for its own system. Processing an e-commerce order is a complex transaction. A lot of different records need to be consistent and kept in sync at all times for an e-commerce platform to really work at scale. One completed Amazon Prime order likely means several record changes in:
- Shopper’s account;
- Credit card record;
- Warehouse’s inventory;
- Shipping records;
- Clickstream data of that shopper’s behavior for advertisement; and
- An overall source of truth database that stores all these activities as the definitive history of everything that just happened.
Ideally, all these records are updated accurately within milliseconds.
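To illustrate what “consistent and kept in sync” means, here is a minimal sketch using Python’s built-in `sqlite3` module. The table names, columns, and the `place_order` function are all hypothetical, and Amazon’s real systems are distributed and vastly more complex. The sketch only shows the core idea: either every record changes, or none does.

```python
import sqlite3


class OrderRejected(Exception):
    """Raised inside the transaction to force a rollback."""


def make_store():
    """Build a tiny in-memory store with made-up tables and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER NOT NULL);
        CREATE TABLE cards (shopper_id TEXT PRIMARY KEY,
                            credit_cents INTEGER NOT NULL);
        CREATE TABLE shipments (shopper_id TEXT, sku TEXT);
        CREATE TABLE ledger (event TEXT);
        INSERT INTO inventory VALUES ('echo-dot', 1);
        INSERT INTO cards VALUES ('alice', 5000), ('bob', 100);
    """)
    return conn


def place_order(conn, shopper_id, sku, price_cents):
    """Apply all record changes for one order atomically; return success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            stock = conn.execute(
                "UPDATE inventory SET qty = qty - 1 WHERE sku = ? AND qty > 0",
                (sku,),
            )
            if stock.rowcount != 1:
                raise OrderRejected("out of stock")
            charge = conn.execute(
                "UPDATE cards SET credit_cents = credit_cents - ? "
                "WHERE shopper_id = ? AND credit_cents >= ?",
                (price_cents, shopper_id, price_cents),
            )
            if charge.rowcount != 1:
                # The inventory change above is undone by the rollback.
                raise OrderRejected("card declined")
            conn.execute("INSERT INTO shipments VALUES (?, ?)",
                         (shopper_id, sku))
            conn.execute("INSERT INTO ledger VALUES (?)",
                         (f"order {shopper_id} {sku} {price_cents}",))
        return True
    except OrderRejected:
        return False
```

If the card is declined after the inventory has already been decremented, the rollback puts the item back on the shelf. Doing this on one machine is easy. Doing it across warehouses, payment processors, and data centers within milliseconds is the hard part.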
AWS is very much battle-tested and constantly “stressed out” by its parent company’s core e-commerce operation. Amazon has moved all of its businesses onto AWS, and off of other systems like Oracle, after a multi-year effort. This gradual movement is standard and prudent practice. Moving infrastructure always causes disruption, and if your business needs to be “always on”, these disruptions could cause real damage, both financial and reputational.
It’s worth noting that this stress is expected, and so easier to prepare for. Whether it’s Prime Day, or Black Friday and Cyber Monday, you know in advance when these things will happen. The origin story of AWS is in fact that extra capacity was provisioned for expected traffic spikes but left unused, so it was rented out, and that became a public cloud. Amazon’s businesses are generally not prone to unexpected spikes in traffic, which happen more often to social networks like Facebook, Twitter, and Weibo.
Alibaba’s core e-commerce business has many similarities to Amazon’s, even though Jack Ma likes to draw distinctions at every opportunity. Its various marketplaces, from Alibaba’s initial B2B iteration to its current consumer-facing products, Taobao and Tmall, all exhibit the level of stress and technical complexity that I’ve noted in Amazon’s e-commerce business. Unlike Amazon, Alibaba also has its own payment gateway, Alipay, run out of its affiliate, Ant Financial. Technically speaking, financial transactions are even more complex to get right than shopping orders.
Alibaba also runs its own annual shopping holiday, Singles Day, that dwarfs Amazon’s in size, sales, and sheer volume. The 2019 Singles Day generated nearly $30.5 billion USD in sales for Alibaba, more than four times bigger than Prime Day. Since Singles Day is a China-wide event, it impacts almost every online retailer, payment gateway, and bank.
The Alibaba team has been repeatedly highlighting the technical accomplishments that resulted from the 2019 Singles Day, in its most recent earnings call and various online blog posts (in Chinese). It processed 544,000 orders per second at peak traffic, handled 970 petabytes of data, and defended against six billion cyberattacks (I’m assuming these are DDoS attacks) during the 24-hour shopping spree. The recognition is well-deserved; Alibaba has basically created and survived the mother of all stress tests.
Alibaba has also migrated all of these businesses onto its own cloud, like Amazon, and off of older technologies, like Oracle databases on specialized IBM servers with EMC storage, so-called “IOE”, a popular moniker and public enemy #1 in Chinese IT circles. In fact, Alibaba Cloud was originally created to move off of IOE and save on infrastructure costs (in Chinese).
The stress tests that Alibaba Cloud has endured have the same “expected” nature as AWS’s; we all know when Singles Day will happen. Furthermore, the vast majority of traffic comes from within China, which limits the universe of scenarios and edge cases.
As I’ve written before, Alibaba Cloud’s data center footprint and redundancy are heavily concentrated in China. It is for sure a battle-tested system, perhaps even more so than AWS. Alibaba Cloud’s technical and operational expertise can certainly be applied in regions outside of China, but only once there’s customer demand and the data centers to serve it.
Google has arguably the only truly global infrastructure, because its services and users are global. Its core services -- search, Gmail, YouTube, Maps, etc. -- are used in all regions of the world, except for China, Iran, North Korea, Cuba, and a handful of Arab countries. And these services need to be always on, all the time. The GCP selling point is to bring “Google-scale” infrastructure to every company (whether you need it or not is a different story).
Google’s stress test is materially different from the e-commerce scenarios of Amazon and Alibaba in two ways:
- Google’s services, unlike a planned shopping holiday, cannot anticipate traffic spikes and must be ready wherever and whenever they happen. This is probably why Twitter is heavily invested in GCP but not in other cloud platforms: Twitter is highly prone to unexpected usage spikes, such as sudden world events, which could crash its service. This makes GCP a stronger system.
- Google’s products do not naturally lead to processing many complex transactions, like online shopping orders, offline delivery, or payments. Yes, Google does have a shopping service and payment gateway, but their sizes and traction are relatively small. The one service closest to mirroring that complexity is AdWords, which is of course an impressive product and the one cash cow that has continued to fuel Alphabet. Still, the AdWords system probably does not experience big spikes in traffic, planned or unplanned, that would stress test it. This makes GCP a weaker system for certain types of businesses that require complex transactions processed at scale and without downtime.
Aside from Azure, Microsoft has sizable businesses in enterprise software (e.g. Office 365), personal computing (e.g. Surface, Xbox), and an assortment of other revenue streams (e.g. LinkedIn). As you can see, none of these businesses have to be “always on” in the same way that an e-commerce marketplace or a search engine needs to be on. No one is that mad if they can’t check Outlook in the middle of the night. They would be if they couldn’t get that big discount on an Alexa or, worse yet, were lost on the road and needed Google Maps to take them home.
Of course, Azure is still doing amazingly well from a revenue and market share standpoint. This success has more to do with Microsoft’s years of experience in selling products into large enterprises and aggressively moving users of its non-cloud license-based products onto the same products that are now on-cloud and subscription-based. Microsoft is very good at being “enterprise ready”, but not that good at being “Internet ready”.
This deficiency shows in the way Azure’s various Regions are architected, which I discussed in detail in my previous post, “Where Are The Data Centers”. It has by far the largest number of Single-AZ Regions, which has led to outages and issues that could’ve been avoided with a multi-AZ design. Multi-AZ Region is the default in AWS, GCP, and most of Alibaba Cloud.
That being said, Walmart signed a 5-year deal with Azure in 2018. A big part of its move to aggressively challenge Amazon’s e-commerce dominance in the U.S. is to be more technically capable by being on the cloud. (Currently, Amazon accounts for almost 40% of all e-commerce sales in the U.S.; Walmart accounts for 5%.) Walmart will supposedly launch an Amazon Prime competitor soon, called Walmart+, to further intensify the competition, and it will likely be built entirely on Azure. It will be interesting to see how Azure measures up to AWS in stress situations it has never needed to deal with before.
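A back-of-the-envelope sketch shows why the number of AZs matters. The 99.9% per-zone figure below is a number I made up for illustration, not any vendor’s actual SLA, and the formula assumes zones fail independently, which real zones never fully do.

```python
HOURS_PER_YEAR = 365 * 24


def region_availability(az_availability, az_count):
    """Chance that at least one of az_count independent zones is up."""
    return 1 - (1 - az_availability) ** az_count


def expected_downtime_hours(availability):
    """Expected hours per year during which the whole region is down."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for zones in (1, 2, 3):
        availability = region_availability(0.999, zones)  # assumed per-zone figure
        print(zones, "AZ:", expected_downtime_hours(availability), "hours/year")
```

With these made-up inputs, a single zone works out to about 8.76 hours of expected downtime a year, while three independent zones bring it down to a fraction of a second. Real zones share failure modes, so the real-world gap is smaller, but the direction of the effect is the same.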
“Stress” Is What You Pay For
Data centers may be what you rent, but it’s the operational expertise of the public cloud vendors that you pay for. How does a cloud vendor gain operational expertise? By handling the stress tests that come directly from its other businesses.
That “stress” is what you pay a cloud provider for, so you (ideally) never have to stress.
Nevertheless, every cloud has outages, no matter how battle-tested it is. All four of these large cloud platforms have had multiple outages during their existence. And even if you are on the most rock-solid cloud platform humans can build, things can still screw up.
Consider the Robinhood outage that originally animated this post. Even though Robinhood’s infrastructure appears to be all on AWS, its own team did not architect its system well enough to handle the sudden spike in usage. More importantly, they did not do enough of their own stress testing, it seems, to discover the system’s weaknesses before they were exposed to the world.
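Here is a toy model of what even a simple stress test can reveal. It is purely my own illustration with made-up numbers; I have no knowledge of Robinhood’s actual architecture. The function, its parameters, and the traffic profiles are all hypothetical. The service handles a fixed number of requests per tick, holds a bounded backlog, and drops anything beyond that.

```python
def simulate_load(arrivals_per_tick, capacity_per_tick, max_queue):
    """Replay a traffic profile against a fixed-capacity service.

    Returns (served, dropped, backlog_left) as request counts.
    """
    backlog = served = dropped = 0
    for arrivals in arrivals_per_tick:
        backlog += arrivals
        done = min(backlog, capacity_per_tick)  # work finished this tick
        served += done
        backlog -= done
        if backlog > max_queue:  # the queue overflows, excess is lost
            dropped += backlog - max_queue
            backlog = max_queue
    return served, dropped, backlog


if __name__ == "__main__":
    normal_day = [100] * 10
    spike_day = [100] * 5 + [1000] * 2 + [100] * 3
    print("normal:", simulate_load(normal_day, 150, 200))
    print("spike:", simulate_load(spike_day, 150, 200))
```

On the normal day nothing is dropped. On the spike day, the same system with 50% headroom loses more than half of all requests. You would much rather learn that from a simulation than from your users.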
If you don’t do your own stress test, even the best cloud can’t save you.