I’ve been writing a lot about the public cloud industry lately. In my previous two posts, I compared Alibaba Cloud’s earnings with those of Microsoft’s Azure and Google’s GCP to assess their relative size, and gave a fairly comprehensive overview of the global data center coverage of AWS, Alibaba Cloud, Azure, and GCP to evaluate their relative strengths, weaknesses, architecture, and future plans.

“The Cloud” is not just a buzzword; it’s a real secular trend in the way technology will be built for decades to come. It’s also a highly complex product to evaluate, both as a user and as an investor. One must-do when you are “kicking the tires” on a public cloud platform is to check how well it handles “stress”.

Stress, in the cloud context, means: can the cloud endure a sudden, unexpected spike in traffic or usage? This happens all the time: a new streaming TV show launches, an online shopping holiday kicks off, a celebrity gets divorced or married. There’s also the nefarious version: distributed denial-of-service (DDoS) attacks.

The most recent example is Robinhood’s outage. Robinhood is a no-commission stock trading platform that grew quickly and became a Silicon Valley fintech darling. It was completely down on March 2 and for a portion of March 3, during one of the biggest single-day rises in the stock market, amid the still-ongoing market turmoil caused by COVID-19. We don’t know yet exactly what happened. We do know that the outage was caused by “stress on [Robinhood’s] infrastructure”, which “struggled with unprecedented load”. (Emphasis mine)

So how do you stress test when there’s no stress? One way is of course to simulate different scenarios and edge cases. There are methodologies for this, like Chaos Engineering. If you have the luxury, as Facebook does, you can shut down some of your own data centers randomly and periodically just to see what happens.
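To make the idea a bit more concrete, here is a minimal sketch of a chaos experiment in Python. Everything in it is hypothetical: `Replica`, `serve`, and `chaos_experiment` are names I made up for illustration, and the "data centers" are just in-memory objects. It only shows the shape of the exercise: break something at random, then measure whether users would have noticed.

```python
import random


class Replica:
    """A hypothetical stand-in for one data center or service replica."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve(replicas, request):
    """Try each replica in turn; fail only if every one of them is down."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("total outage: no replica could serve the request")


def chaos_experiment(replicas, kill_count, requests=100, seed=0):
    """Switch off `kill_count` random replicas, then return the success rate."""
    rng = random.Random(seed)
    for victim in rng.sample(replicas, kill_count):
        victim.alive = False
    succeeded = 0
    for i in range(requests):
        try:
            serve(replicas, f"req-{i}")
            succeeded += 1
        except RuntimeError:
            pass
    return succeeded / requests


fleet = [Replica(f"dc-{i}") for i in range(3)]
print("success rate with 2 of 3 replicas killed:", chaos_experiment(fleet, kill_count=2))
```

A real chaos experiment runs against live infrastructure with real traffic and a carefully limited blast radius; the sketch only captures the loop of injecting a failure and checking the outcome.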

Another way to proxy a stress test is to see whether a public cloud provider uses its own cloud for its other businesses. The nature of those “other businesses” can be quite telling when assessing the quality of its cloud services. The “big four” public clouds -- AWS, Alibaba Cloud, Azure, GCP -- were all started by large tech companies that already had plenty of other successful businesses before they moved into the cloud game.

One core value proposition of a public cloud is its ability to guarantee that it’s “always on”, no matter what kind of stress or disaster happens. So let’s look at these four clouds from the nature of their own business as a form of stress test.

AWS


Amazon’s e-commerce business is the market leader in the U.S. and Europe, and close to number one in India (Amazon’s market share was 31.2% vs. Flipkart’s 31.9% in 2018, according to market research firm Forrester). It runs its own annual online shopping holiday, Prime Day, which sold $7.16 billion worth of goods in 2019, up 73% from 2018 ($4.17 billion). Amazon also has plenty of other businesses -- Alexa, Prime Video streaming, fulfillment centers -- that always need to be on.

Amazon Services

So Amazon creates plenty of stress for its own system. Processing an e-commerce order is a complex transaction. A lot of different records need to be consistent and kept in sync at all times for an e-commerce platform to really work at scale. One completed Amazon Prime order likely means several record changes in:

  • Shopper’s account;
  • Credit card record;
  • Warehouse’s inventory;
  • Shipping records;
  • Clickstream data of that shopper’s behavior for advertisement; and
  • An overall source of truth database that stores all these activities as the definitive history of everything that just happened.

Ideally, all these records are updated accurately within milliseconds.

AWS is very much battle-tested and constantly “stressed out” by its parent company’s core e-commerce operation. Amazon has moved all of its businesses onto AWS, and off of other systems like Oracle, after a multi-year effort. This gradual movement is standard and prudent practice. Moving infrastructure always causes disruption, and if your business needs to be “always on”, these disruptions could cause real damage, both financial and reputational.

It’s worth noting that this stress is expected, and therefore easier to prepare for. Whether it’s Prime Day, or Black Friday and Cyber Monday, you know in advance when these things will happen. In fact, AWS was created from extra capacity that was provisioned for expected traffic spikes but left unused; that spare capacity was rented out and became a public cloud. Amazon’s businesses are generally not prone to unexpected spikes in traffic, which happen more to social networks like Facebook, Twitter, and Weibo.

Alibaba Cloud

Alibaba’s core e-commerce business has many similarities to Amazon’s, even though Jack Ma likes to draw distinctions at every opportunity. Its various marketplaces, from Alibaba’s initial B2B iteration to its current consumer-facing products, Taobao and Tmall, all exhibit the level of stress and technical complexity that I’ve noted in Amazon’s e-commerce business. Alibaba also has its own payment gateway, Alipay, run by its affiliate Ant Financial, which is something Amazon does not have. Technically speaking, financial transactions are even more complex to get right than shopping orders.

Alibaba Services

Alibaba also runs its own annual shopping holiday, Singles Day, that dwarfs Amazon’s in size, sales, and sheer volume. The 2019 Singles Day generated nearly $30.5 billion in sales for Alibaba, more than four times Prime Day’s total. Since Singles Day is a China-wide event, it impacts almost every online retailer, payment gateway, and bank.

The Alibaba team has repeatedly highlighted the technical accomplishments of the 2019 Singles Day, in its most recent earnings call and in various online blog posts (in Chinese). During the 24-hour shopping spree, it processed 544,000 orders per second at peak traffic, handled 970 petabytes of data, and defended against six billion cyberattacks (I’m assuming these are DDoS attacks). The praise is well-deserved; Alibaba has basically created and survived the mother of all stress tests.

Alibaba has also migrated all of these businesses onto its own cloud, like Amazon, and off of older technologies, like Oracle databases on specialized IBM servers with EMC storage, so-called “IOE”, a popular moniker and public enemy #1 in Chinese IT circles. In fact, Alibaba Cloud was originally created to move off of IOE and save on infrastructure costs (in Chinese).

The stress tests that Alibaba Cloud has endured have the same “expected” nature as AWS’s; we all know when Singles Day will happen. Furthermore, the vast majority of traffic comes from within China, which limits the universe of scenarios and edge cases.

As I’ve written before, Alibaba Cloud’s data center footprint and redundancy are heavily concentrated in China. It is for sure a battle-tested system, perhaps even more so than AWS. Alibaba Cloud’s technical and operational expertise can certainly be applied in regions outside of China, but only once there is customer demand and the data centers to serve it.

GCP


Google has arguably the only truly global infrastructure, because its services and users are global. Its core services -- search, Gmail, YouTube, Maps, etc. -- are used in all regions of the world, except for China, Iran, North Korea, Cuba, and a handful of Arab countries. And these services need to be always on, all the time. The GCP selling point is to bring “Google-scale” infrastructure to every company (whether you need it or not is a different story).

Google Services

Google’s stress test is materially different from the e-commerce scenarios of Amazon and Alibaba in two ways:

  1. Google’s services cannot anticipate traffic spikes, unlike a planned shopping holiday, and must be ready wherever and whenever they happen. This is probably why Twitter is heavily invested in GCP, but not in other cloud platforms: Twitter is itself highly prone to unexpected usage spikes, such as sudden world events, which could crash its service. This makes GCP a stronger system.
  2. Google’s products do not naturally lead to processing many complex transactions, like online shopping orders, offline delivery, or payments. Yes, Google does have a shopping service and a payment gateway, but their sizes and traction are relatively small. The one service closest to mirroring that complexity is AdWords, which is of course an impressive product and the one cash cow that has continued to fuel Alphabet. Still, the AdWords system probably does not experience big spikes in traffic, planned or unplanned, that would stress test it. This makes GCP a weaker system for certain types of businesses that require complex transactions processed at scale and without downtime.

Azure


Aside from Azure, Microsoft has sizable businesses in enterprise software (e.g. Office 365), personal computing (e.g. Surface, Xbox), and an assortment of other revenue streams (e.g. LinkedIn). As you can see, none of these businesses has to be “always on” in the same way that an e-commerce marketplace or a search engine does. No one is that mad if they can’t check Outlook in the middle of the night. They would be if they couldn’t get that big discount on an Alexa or, worse yet, were lost on the road and needed Google Maps to take them home.

Microsoft Services

Of course, Azure is still doing amazingly well from a revenue and market share standpoint. This success has more to do with Microsoft’s years of experience in selling products into large enterprises and aggressively moving users of its non-cloud license-based products onto the same products that are now on-cloud and subscription-based. Microsoft is very good at being “enterprise ready”, but not that good at being “Internet ready”.

This deficiency shows in the way Azure’s various Regions are architected, which I discussed in detail in my previous post, “Where Are The Data Centers”. Azure has by far the largest number of single-AZ Regions, which has led to outages and issues that could’ve been avoided with a multi-AZ design. A multi-AZ Region is the default in AWS, GCP, and most of Alibaba Cloud.
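A back-of-the-envelope calculation shows why the number of AZs matters. The sketch below assumes each zone is up 99.9% of the time and that zones fail independently of each other. Both are my own simplifying assumptions, not published figures for any cloud, and real zones in the same Region can share failure causes, so treat the results as optimistic.

```python
HOURS_PER_YEAR = 24 * 365


def region_unavailability(zone_availability, zones):
    """Probability that every zone is down at once, assuming independent failures."""
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("a Region needs at least one zone")
    return (1.0 - zone_availability) ** zones


def expected_downtime_hours(zone_availability, zones):
    """Expected hours per year during which no zone in the Region is up."""
    return region_unavailability(zone_availability, zones) * HOURS_PER_YEAR


# Hypothetical 99.9% per-zone availability, for illustration only.
for zones in (1, 2, 3):
    hours = expected_downtime_hours(0.999, zones)
    print(f"{zones} AZ(s): about {hours * 3600:.3g} seconds of expected downtime per year")
```

Under these assumptions a single-AZ Region is down for hours each year, while adding a second zone cuts that by a factor of a thousand, provided the application is actually built to fail over between zones.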

That being said, Walmart signed a 5-year deal with Azure in 2018. A big part of its move to aggressively challenge Amazon’s e-commerce dominance in the U.S. is to be more technically capable by being on the cloud. (Currently, Amazon accounts for almost 40% of all e-commerce sales in the U.S., Walmart accounts for 5%.) Walmart will supposedly launch an Amazon Prime competitor soon, called Walmart+, to further intensify the competition, which will likely all be built on Azure. It will be interesting to see how Azure measures up to AWS in stress situations it has never needed to deal with before.

“Stress” Is What You Pay For

Data centers may be what you rent out, but it’s the operational expertise of the public cloud vendors that you pay for. How does a cloud vendor gain operational expertise? By handling the stress tests that come directly from its other businesses.

That “stress” is what you pay a cloud provider for, so you (ideally) never have to stress.

Nevertheless, every cloud has outages, no matter how battle-tested it is. All four of these large cloud platforms have had multiple outages during their existence. And even if you are on the most rock-solid cloud platform humans can build, things can still screw up.

Consider the Robinhood outage that originally animated this post. Even though Robinhood’s infrastructure appears to be all on AWS, its own team did not architect its system well enough to handle the sudden spike in usage. More importantly, it seems the team did not do enough stress testing of its own to discover the system’s weaknesses before they were exposed to the world.
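What would "your own stress test" look like at its simplest? Here is a toy simulation, not a real load-testing tool: a server that can process a fixed number of requests per tick, a bounded queue in front of it, and a traffic spike that gets bigger on every run until requests start getting dropped. All the names and numbers are hypothetical, and nothing here reflects Robinhood's actual system.

```python
def simulate(capacity_per_tick, queue_limit, arrivals):
    """Feed per-tick arrivals into a server with a bounded queue.

    Returns the total number of requests dropped because the queue was full.
    """
    queued = 0
    dropped = 0
    for incoming in arrivals:
        queued += incoming
        queued -= min(queued, capacity_per_tick)  # serve as many as capacity allows
        if queued > queue_limit:
            dropped += queued - queue_limit
            queued = queue_limit
    return dropped


def find_breaking_point(capacity_per_tick, queue_limit, baseline,
                        spike_ticks=10, max_multiplier=50):
    """Return the smallest spike multiplier that causes drops, or None if none does."""
    for multiplier in range(1, max_multiplier + 1):
        arrivals = ([baseline] * 5
                    + [baseline * multiplier] * spike_ticks
                    + [baseline] * 5)
        if simulate(capacity_per_tick, queue_limit, arrivals) > 0:
            return multiplier
    return None


# Hypothetical system: serves 100 requests per tick, queues up to 200 more,
# and normally receives 40 requests per tick.
print("breaks at a spike of", find_breaking_point(100, 200, 40), "times normal traffic")
```

The point is not the model but the habit: find out at what multiple of normal traffic your system falls over before the market finds out for you.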

If you don’t do your own stress test, even the best cloud can’t save you.






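The most recent example is the collapse of Robinhood’s entire platform. Robinhood is a commission-free stock trading platform that has grown quickly in recent years and become a darling of Silicon Valley’s fintech scene. Its platform was completely down for all of March 2 and part of March 3; worse still, that day also saw one of the largest single-day gains in U.S. stock market history. (The sharp volatility that the novel coronavirus, COVID-19, has brought to the market is still ongoing.) The public does not yet know exactly what happened behind the scenes. But from the apology letter by Robinhood’s CEO, we can see that the outage was caused by “stress on [Robinhood’s] infrastructure”, which “struggled with unprecedented load”.

So how do you stress test when there is no real-world stress? One way is to simulate different scenarios. There are many methods for this kind of simulation, such as Chaos Engineering. If you have Facebook’s scale, you can randomly and periodically shut down machines in your own data centers to run a similar simulation.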


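One core value of a public cloud is its ability to stay “always on”, no matter what kind of stress or disaster happens. So let’s look at whether these four cloud vendors can withstand stress, judging by the nature of their other businesses.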


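Amazon’s e-commerce business is the leader in the U.S. and Europe, and is close to number one in India (according to market research firm Forrester, Flipkart’s share of the Indian market in 2018 was 31.9%, versus Amazon’s 31.2%). Amazon has its own annual online shopping holiday, Prime Day. Its 2019 sales were $7.16 billion, up 73% from 2018 ($4.17 billion). Amazon also has plenty of other businesses -- Alexa, Prime online video, fulfillment centers, and so on. All of them need to be “always on”.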

Amazon Services


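  • Shopper’s account;
  • Credit card record;
  • Warehouse inventory table;
  • Delivery records;
  • Clickstream data for that shopper (for advertising products); and
  • An overall database that records a historical backup of everything that happened.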


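In serving its parent company’s core e-commerce operation, AWS is under constant, heavy stress. After years of effort, Amazon has moved all of its businesses onto AWS, fully replacing older systems such as Oracle. This gradual approach to migration is the industry standard. Moving infrastructure always causes problems, and if your business needs to be “always on”, you have to be even more careful. Any service disruption can cause great financial and reputational damage.

It’s worth noting that a predictable stress is easier to handle. Whether it’s Prime Day, or Black Friday and Cyber Monday, these are all known in advance. The story behind AWS’s start is in fact this: a lot of extra machine capacity was provisioned for traffic spikes, yet plenty of system resources were still left over, so Amazon began renting out the surplus, and that became a public cloud. Amazon’s businesses usually do not see unexpected surges in traffic; those happen more to social networks like Facebook, Twitter, and Weibo.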



Alibaba Services

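As everyone knows, Alibaba also runs its own shopping day, Singles Day. It dwarfs Amazon’s in scale, sales, and transaction volume. The 2019 Singles Day generated nearly $30.5 billion in sales for Alibaba, more than four times Prime Day’s total. Since Singles Day is a China-wide event, it also affects almost every e-commerce company, electronic payment gateway, and bank in the country.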






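Google is arguably the only company that both has and needs a truly global infrastructure. Its core services -- search, Gmail, YouTube, Maps, etc. -- and its users really are everywhere in the world, at all times, except of course for China, Iran, North Korea, Cuba, and a handful of Arab countries. These services must be “always on”. GCP’s selling point is to bring “Google-scale” infrastructure to every company. Whether you need it or not is a different story.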

Google Services


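  1. Google’s traffic spikes are hard to predict, unlike a planned shopping holiday, so it must be prepared for them to happen anywhere, at any time. This is perhaps why Twitter has invested heavily in working with GCP, and not with other cloud platforms. Twitter is the most prone to unexpected spikes: major world events erupt suddenly on Twitter and affect the stability of its service. From this angle, GCP is a stronger system.
  2. Google’s products do not need to process complex transactions at scale, such as online shopping orders, offline delivery, or payments. Google does have its own shopping service and payment gateway, but they are relatively small, with relatively few users. The service whose complexity comes closest to e-commerce transactions is probably AdWords, a very successful product and the cash cow that propels the Alphabet ship. Even so, AdWords, as an ad auction system, is comparatively easy to control and rarely faces traffic spikes. From this angle, GCP may be a weaker choice for certain types of businesses, especially those with complex processing patterns.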


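Aside from Azure, Microsoft has sizable businesses in enterprise software (e.g. Office 365), personal computing (e.g. Surface, Xbox), and other revenue streams (e.g. LinkedIn). But clearly, none of these businesses has to be “always on”; their uptime needs are entirely different from those of e-commerce or a search engine. No one gets mad if they can’t get into Outlook and read work email in the middle of the night. But if they can’t buy a discounted Alexa, or are lost while driving and need Google Maps to navigate home, users react quite differently.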

Microsoft Services


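This deficiency already shows in the architecture of Azure’s data center Regions. I discussed this in detail in my previous post, “Where Are The Data Centers”. Compared with its competitors, Azure has the most single-AZ Regions, which has led to serious outages and service issues that a multi-AZ design could have avoided. In AWS, GCP, and most of Alibaba Cloud, the multi-AZ Region design is the default.

That being said, Walmart signed a 5-year partnership agreement with Azure in 2018. A large part of its bid to challenge Amazon’s e-commerce dominance in the U.S. depends on using cloud technology to raise its own technical capabilities. (Currently, Amazon accounts for nearly 40% of all e-commerce sales in the U.S., while Walmart accounts for only 5%.) Walmart will soon launch a product that competes directly with Amazon Prime, called Walmart+. It will very likely be built entirely on Azure. The moment that reveals which of these two clouds, AWS or Azure, can withstand the stress and stay “always on” is coming soon.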





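Back to the Robinhood outage. Even though Robinhood’s infrastructure appears to be all on AWS, its own team did not build its backend systems well enough to handle the sudden traffic spike, nor did it do enough simulated stress testing to expose the system’s weaknesses first.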