I’ve been writing a lot about the public cloud industry lately. In my previous two posts, I compared Alibaba Cloud’s earnings with those of Microsoft’s Azure and Google’s GCP to assess their relative size, and gave a fairly comprehensive overview of the global data center coverage of AWS, Alibaba Cloud, Azure, and GCP to evaluate their relative strengths, weaknesses, architecture, and future plans.

“The Cloud” is not just a buzzword; it’s a real secular trend in the way technology will be built for decades to come. It’s also a highly complex product to evaluate, both as a user and as an investor. One must-do when you are “kicking the tires” on a public cloud platform is checking how well it handles “stress”.

Stress, in the cloud context, means: can the cloud endure a sudden, unexpected spike in traffic or usage? This happens all the time: a new streaming TV show launches, an online shopping holiday kicks off, a celebrity gets divorced or married. There’s also the nefarious version: distributed denial-of-service (DDoS) attacks.

The most recent example is Robinhood’s outage. Robinhood is a no-commission stock trading platform that grew quickly and became a Silicon Valley fintech darling. It was completely down on March 2 and for a portion of March 3, during one of the biggest single-day rises in the stock market, amid the still-ongoing market volatility caused by COVID-19. We don’t know yet exactly what happened. We do know that what caused the outage was “stress on [Robinhood’s] infrastructure—which struggled with unprecedented load”. (Emphasis mine)

So how do you stress test when there’s no stress? One way, of course, is to simulate different scenarios and edge cases. There are methodologies for this, like Chaos Engineering. If you have the luxury, as Facebook does, you can shut down some of your own data centers randomly and periodically just to see what happens.
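To make the idea concrete, here is a minimal sketch of a chaos-style experiment in Python. It is a toy model, not how Facebook or any real chaos tool works: the data center names, the `serve` routing rule, and the `chaos_experiment` helper are all hypothetical. It randomly takes replicas offline and checks whether requests still get served.

```python
import random

def serve(request_id, replicas):
    """Route a request to the first healthy replica; return None if all are down."""
    for name, healthy in replicas.items():
        if healthy:
            return f"{name} handled request {request_id}"
    return None

def chaos_experiment(replica_names, kill_count, requests=100, seed=0):
    """Randomly take down `kill_count` replicas, then measure the success rate."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    replicas = {name: True for name in replica_names}
    for victim in rng.sample(replica_names, kill_count):
        replicas[victim] = False  # simulated failure
    served = sum(serve(i, replicas) is not None for i in range(requests))
    return served / requests

names = ["dc-east", "dc-west", "dc-central"]  # hypothetical data centers
print(chaos_experiment(names, kill_count=2))  # 1.0: one replica left, still serving
print(chaos_experiment(names, kill_count=3))  # 0.0: everything down, total outage
```

A real experiment would inject failures into live or staging infrastructure and watch real metrics; the point here is only the shape of the exercise: break something on purpose, then measure.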

Another way to proxy a stress test is to see whether a public cloud provider uses its own cloud for its other businesses. The nature of those “other businesses” can be quite telling when assessing the quality of its cloud services. The “big four” public clouds -- AWS, Alibaba Cloud, Azure, GCP -- were all started by large tech companies that already had plenty of other successful businesses before they moved into the cloud game.

One core value proposition of a public cloud is its ability to guarantee that it’s “always on”, no matter what kind of stress or disaster happens. So let’s look at these four clouds through the nature of their parent companies’ own businesses, as a form of stress test.

AWS

Amazon’s e-commerce business is the market leader in the U.S. and Europe, and close to number one in India (Amazon’s market share was 31.2% vs. Flipkart’s 31.9% in 2018, according to market research firm Forrester). It runs its own annual online shopping holiday, Prime Day, which sold $7.16 billion worth of goods in 2019, up 73% from 2018 ($4.17 billion). Amazon also has plenty of other businesses -- Alexa, Prime Video Streaming, fulfillment centers -- that always need to be on.

Amazon Services

So Amazon creates plenty of stress for its own system. Processing an e-commerce order is a complex transaction. A lot of different records need to be consistent and kept in sync at all times for an e-commerce platform to really work at scale. One completed Amazon Prime order likely means several record changes in:

  • Shopper’s account;
  • Credit card record;
  • Warehouse’s inventory;
  • Shipping records;
  • Clickstream data of that shopper’s behavior for advertisement; and
  • An overall source of truth database that stores all these activities as the definitive history of everything that just happened.

Ideally, all these records are updated accurately within milliseconds.
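The “all records stay in sync” requirement is easiest to see in a toy example. The sketch below uses Python’s built-in `sqlite3` module and a made-up three-table schema (`inventory`, `charges`, `shipments`); it is an illustration under those assumptions, not a description of how Amazon actually stores orders. Either every record change for an order is applied, or none is.

```python
import sqlite3

def place_order(conn, shopper_id, sku, price_cents):
    """Apply every record change for one order atomically: all or nothing."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO charges (shopper_id, amount_cents) VALUES (?, ?)",
                (shopper_id, price_cents))
            cur = conn.execute(
                "UPDATE inventory SET quantity = quantity - 1 "
                "WHERE sku = ? AND quantity > 0", (sku,))
            if cur.rowcount == 0:
                raise ValueError("out of stock")  # undoes the charge above
            conn.execute(
                "INSERT INTO shipments (shopper_id, sku) VALUES (?, ?)",
                (shopper_id, sku))
        return True
    except ValueError:
        return False

# Hypothetical schema with a single unit of one product in stock.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER);
    CREATE TABLE charges (shopper_id INTEGER, amount_cents INTEGER);
    CREATE TABLE shipments (shopper_id INTEGER, sku TEXT);
    INSERT INTO inventory VALUES ('echo-dot', 1);
""")
print(place_order(conn, 1, "echo-dot", 2999))  # True: last unit sold
print(place_order(conn, 2, "echo-dot", 2999))  # False: nothing charged or shipped
```

On a single machine this is easy. The hard part, and the part that a shopping holiday stresses, is keeping the same guarantee when these records live in different systems and thousands of orders arrive every second.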

AWS is very much battle-tested and constantly “stressed out” by its parent company’s core e-commerce operation. After a multi-year effort, Amazon has moved all of its businesses onto AWS and off of other systems like Oracle. This gradual migration is standard and prudent practice. Moving infrastructure always causes disruption, and if your business needs to be “always on”, these disruptions can cause real damage, both financial and reputational.

It’s worth noting that this stress is expected, and therefore easier to prepare for. Whether it’s Prime Day, Black Friday, or Cyber Monday, you know in advance when these things will happen. In fact, AWS was created from extra capacity that was provisioned for expected traffic spikes but otherwise sat unused, so Amazon rented it out, and that became a public cloud. Amazon’s businesses are generally not prone to unexpected spikes in traffic, which happen more to social networks like Facebook, Twitter, and Weibo.
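A quick back-of-the-envelope sketch shows why provisioning for a known peak leaves so much capacity idle. The demand numbers and the 20% headroom below are invented purely for illustration; they are not Amazon’s figures.

```python
def idle_capacity(hourly_demand, headroom=0.2):
    """Return (provisioned capacity, average idle fraction) when sizing for the peak."""
    provisioned = max(hourly_demand) * (1 + headroom)
    average = sum(hourly_demand) / len(hourly_demand)
    return provisioned, 1 - average / provisioned

# Hypothetical demand in arbitrary units: a quiet baseline plus one holiday spike.
demand = [100] * 23 + [1000]
provisioned, idle = idle_capacity(demand)
print(f"provisioned={provisioned:.0f}, idle on average={idle:.0%}")
```

With these made-up numbers, almost nine-tenths of the provisioned capacity sits unused on average, which is the kind of slack that can be rented out.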

Alibaba Cloud

Alibaba’s core e-commerce business has many similarities to Amazon’s, even though Jack Ma likes to draw distinctions at every opportunity. Its various marketplaces, from Alibaba’s initial B2B iteration to its current consumer-facing products, Taobao and Tmall, all exhibit the level of stress and technical complexity that I’ve noted in Amazon’s e-commerce business. Alibaba also has its own payment gateway, Alipay, run out of its affiliate, Ant Financial, something Amazon does not have. Technically speaking, financial transactions are even more complex to get right than shopping orders.

Alibaba Services

Alibaba also runs its own annual shopping holiday, Singles Day, which dwarfs Amazon’s in size, sales, and sheer volume. The 2019 Singles Day generated nearly $30.5 billion in sales for Alibaba, more than four times Prime Day’s total. Since Singles Day is a China-wide event, it impacts almost every online retailer, payment gateway, and bank.

The Alibaba team has repeatedly highlighted the technical accomplishments of the 2019 Singles Day in its most recent earnings call and in various online blog posts (in Chinese). During the 24-hour shopping spree, it processed 544,000 orders per second at peak traffic, handled 970 petabytes of data, and defended against six billion cyberattacks (I’m assuming these are DDoS attacks). The praise is well-deserved; Alibaba has basically created and survived the mother of all stress tests.

Like Amazon, Alibaba has also migrated all of these businesses onto its own cloud and off of older technologies: Oracle databases on specialized IBM servers with EMC storage, the so-called “IOE”, a popular moniker and public enemy #1 in Chinese IT circles. In fact, Alibaba Cloud was originally created to move off of IOE and save infrastructure cost (in Chinese).

The stress tests that Alibaba Cloud has endured have the same “expected” nature as AWS’s; we all know when Singles Day will happen. Furthermore, the vast majority of traffic comes from within China, which limits the universe of scenarios and edge cases.

As I’ve written before, Alibaba Cloud’s data center footprint and redundancy are heavily concentrated in China. It is for sure a battle-tested system, perhaps even more so than AWS. Alibaba Cloud’s technical and operational expertise can certainly be applied in regions outside of China, but only once there is customer demand and there are data centers to serve it.

GCP

Google has arguably the only truly global infrastructure, because its services and users are global. Its core services -- search, Gmail, YouTube, Maps, etc. -- are used in all regions of the world, except for China, Iran, North Korea, Cuba, and a handful of Arab countries. And these services need to be always on, all the time. The GCP selling point is to bring “Google-scale” infrastructure to every company (whether you need it or not is a different story).

Google Services

Google’s stress test is materially different from the e-commerce scenarios of Amazon and Alibaba in two ways:

  1. Google’s services cannot anticipate traffic spikes the way a planned shopping holiday can be anticipated, and must be ready wherever and whenever they happen. This is probably why Twitter is heavily invested in GCP, but not in other cloud platforms: it is also highly prone to unexpected usage spikes, such as sudden world events, which could crash its service. This makes GCP a stronger system.
  2. Google’s products do not naturally involve processing many complex transactions, like online shopping orders, offline delivery, or payments. Yes, Google does have a shopping service and a payment gateway, but their size and traction are relatively small. The one service that comes closest to that complexity is AdWords, which is of course an impressive product and the cash cow that continues to fuel Alphabet. Still, the AdWords system probably does not experience big spikes in traffic, planned or unplanned, that would stress test it. This makes GCP a weaker system for certain types of businesses that require complex transactions processed at scale and without downtime.

Azure

Aside from Azure, Microsoft has sizable businesses in enterprise software (e.g. Office 365), personal computing (e.g. Surface, Xbox), and an assortment of other revenue streams (e.g. LinkedIn). None of these businesses has to be “always on” in the same way that an e-commerce marketplace or a search engine does. No one is that mad if they can’t check Outlook in the middle of the night. They would be if they couldn’t get that big discount on an Alexa or, worse yet, were lost on the road and needed Google Maps to take them home.

Microsoft Services

Of course, Azure is still doing amazingly well from a revenue and market share standpoint. This success has more to do with Microsoft’s years of experience in selling products into large enterprises, and with aggressively moving users of its non-cloud, license-based products onto cloud-based, subscription versions of the same products. Microsoft is very good at being “enterprise ready”, but not that good at being “Internet ready”.

This deficiency shows in the way Azure’s various Regions are architected, which I discussed in detail in my previous post, “Where Are The Data Centers”. Azure has by far the most Single-AZ Regions, which has led to outages and issues that could have been avoided with a multi-AZ design. Multi-AZ Regions are the default in AWS, GCP, and most of Alibaba Cloud.
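Why does a multi-AZ design matter so much? A rough way to see it is the standard redundancy formula, sketched below. It assumes zones fail independently and that an application can keep running on any single surviving zone. Both are simplifications, and the 99.9% per-zone availability is a made-up number, not any vendor’s published figure.

```python
def region_availability(zone_availability, zones):
    """Probability that at least one of `zones` independent zones is up."""
    return 1 - (1 - zone_availability) ** zones

MINUTES_PER_YEAR = 365 * 24 * 60

for zones in (1, 2, 3):
    availability = region_availability(0.999, zones)  # hypothetical per-zone figure
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{zones} zone(s): about {downtime:.3f} minutes of downtime per year")
```

Under these assumptions, going from one zone to two cuts expected downtime by a factor of a thousand. Real zones share failure modes (a bad software rollout, a regional network issue), so the real-world gain is smaller, but the direction holds.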

That being said, Walmart signed a 5-year deal with Azure in 2018. A big part of its move to aggressively challenge Amazon’s e-commerce dominance in the U.S. is to become more technically capable by being on the cloud. (Currently, Amazon accounts for almost 40% of all e-commerce sales in the U.S.; Walmart accounts for 5%.) Walmart will supposedly launch an Amazon Prime competitor soon, called Walmart+, to further intensify the competition, and it will likely all be built on Azure. It will be interesting to see how Azure measures up to AWS in stress situations it has never needed to deal with before.

“Stress” Is What You Pay For

Data centers may be what you rent out, but it’s the operational expertise of the public cloud vendors that you pay for. How does a cloud vendor gain operational expertise? By handling the stress tests that come directly from its other businesses.

That “stress” is what you pay a cloud provider for, so you (ideally) never have to stress.

Nevertheless, every cloud has outages, no matter how battle-tested it is. All four of these large cloud platforms have had multiple outages during their existence. And even if you are on the most rock-solid cloud platform humans can build, things can still screw up.

Consider the Robinhood outage that originally animated this post. Even though Robinhood’s infrastructure appears to be all on AWS, its own team did not architect its system well enough to handle the sudden spike in usage. More importantly, it seems they did not do enough of their own stress testing to discover the system’s weaknesses before they were exposed to the world.

If you don’t do your own stress test, even the best cloud can’t save you.
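What might “your own stress test” look like at its very simplest? The sketch below ramps up concurrency against a stand-in service and reports the error rate at each level. `FakeService` and `stress_test` are hypothetical names invented for this illustration; a real test would target your actual endpoints with a proper load-testing tool.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class FakeService:
    """Stand-in for a real backend: handles at most `capacity` requests at once."""
    def __init__(self, capacity, work_seconds=0.01):
        self._slots = threading.Semaphore(capacity)
        self._work_seconds = work_seconds

    def handle(self):
        # Reject immediately when every slot is busy (a simulated overload error).
        if not self._slots.acquire(blocking=False):
            return False
        try:
            time.sleep(self._work_seconds)  # pretend to do work
            return True
        finally:
            self._slots.release()

def stress_test(service, concurrency, total_requests):
    """Fire `total_requests` calls with `concurrency` workers; return the error rate."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: service.handle(), range(total_requests)))
    return results.count(False) / total_requests

service = FakeService(capacity=8)
for concurrency in (4, 8, 32):
    print(concurrency, stress_test(service, concurrency, total_requests=200))
```

At or below the service’s capacity the error rate stays at zero; push concurrency past it and requests start failing. Finding that breaking point in a test, rather than on the busiest trading day of the year, is the whole point.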


压力测试公有云:阿里云、AWS、Azure、GCP

最近写了几篇关于公有云的文章,谈到了阿里云的收益与微软的Azure和谷歌的GCP的比较,来看看它们的相对规模,并对AWS、阿里云、Azure和GCP的全球数据中心覆盖范围进行了较全面的分析,以评估它们相对的优势、劣势、架构和未来布局。

“云”不仅是一个行业热点,而是一个真正的长期趋势。未来几十年新技术产品都会搭建在云上。作为用户和投资人,它也是个很复杂产品,很难评估。当你在评估一个公有云平台的时候,一件必须做的事情就是看它如何“承受压力”。

给云的”压力”是看平台能否承受突然的、意外的流量洪峰或使用高峰。这种情况经常发生:一部新电视剧火了,一个网络购物节,一个名人离婚或结婚了。还有更恶毒的版本:分布式拒绝服务攻击(DDOS)。

最近的一个例子是Robinhood整个平台垮了。Robinhood是一个免费股票交易平台,这几年发展迅速,成为硅谷金融科技圈的宠儿。它的平台在3月2日整天和3月3日的部分时间完全崩溃了,更不凑巧的是那天也是美股历史产生最大单日涨幅的其中一天。(因为新型冠状病毒COVID-19给市场带来的大幅度波动仍在继续。)背后具体发生了什么,公众还不知道。但从Robinhood CEO的道歉信中,我们可以看到,导致服务中断的原因是“Robinhood的基础设施受到了压力,基础设施受到了前所未有的超负荷”。

那么,在没有现实压力的情况下,如何进行压力测试呢?一种方法即是模拟不同的场景。做这种模拟的方法有很多,比如混沌工程(Chaos Engineering)。如果你有像Facebook的规模,可以随机定期关闭自己数据中心的机器,来做类似模拟。

还有一种压力测试的方式,就是看某公有云供应商是否将本公司的其他业务放在上面,时时刻刻地历练自己的云。在评估云服务质量时,这些“其他业务”的性质很说明问题。对于“四大”公有云——AWS、阿里云、Azure、GCP——来说,它们都是先以其他科技产品起家,做成功了后再开始进入公有云市场的。

公有云的一个核心价值是,无论发生什么样的压力或灾难,它都有能力保持”always on”。因此,我们从他们其他业务的本质来看看这四个云厂商能不能扛压。

AWS

亚马逊的电商业务是美国、欧洲的领导者,在印度也接近第一(根据市场研究所Forrester的数据,Flipkart在2018年的印度市场份额为31.9%,亚马逊为31.2%)。亚马逊有自己一年一度的网络购物节,Prime Day。2019年销售额为71.6亿美元,比2018年增长73%(41.7亿美元)。亚马逊还有很多其他的业务——Alexa、Prime在线视频、物流中心等等。这些业务都需要“always on”。

Amazon Services

因此亚马逊经常给自己的系统制造大量压力。从技术层面看,处理一个电商订单其实很复杂。许多不同的记录都需要保持一致、始终同步。一份亚马逊上下的订单可能意味着好几个记录都需要更新,比如:

  • 购物者帐户;
  • 信用卡记录;
  • 仓库存货表;
  • 送货记录;
  • 该购物者点击流数据(为做广告产品)
  • 一个总数据库记录所有事情的历史备份

理想情况下,所有这些记录都应在毫秒内准确完成更新。

AWS在服务母公司的核心电商运营中,可算是“压力重重”。经过多年的努力,亚马逊已经将所有业务转移到AWS上,完全替掉甲骨文等其他老系统。这种循序渐进迁移方式是行业标准。移动基础设施总会出现问题,如果自己的业务需要“always on”,就更要小心。任何服务中断都可能导致极大的财务和声誉损害。

值得注意的是,应对一个可以预测的压力是比较容易的。无论是Prime Day,还是黑五和网购周一(Cyber Monday),都是提前就知道的事情。AWS起家背后的故事其实就是:额外安排了很多机器容量准备流量洪峰,但还是剩下很多系统资源,所以开始出租多余的资源,就这么变成了公有云。亚马逊的业务通常不会出现意外的流量增长,这种情况更多地发生在Facebook、Twitter和微博等社交网络上。

阿里云

阿里巴巴的核心电商业务与亚马逊有许多相似之处,尽管马云经常努力地想与亚马逊划清区别。阿里的每个产品,从最初的B2B业务,到淘宝和天猫,都有我在亚马逊电商业务中提到的同样的压力点和技术复杂性。阿里还有支付宝(AliPay),而亚马逊并没有支付产品。从技术上讲,金融交易比网购交易还要更复杂些。

Alibaba Services

众所周知,阿里也经营着自己的购物节,“光棍节”。从规模和销售额到交易量,都让亚马逊相形见绌。2019年的光棍节为阿里创造了近305亿美元的销售额,是Prime Day的四倍多。由于光棍节是一个全中国的活动,它也影响到全国几乎所有的电商、电子支付端口和银行。

阿里团队在最近的季报和各种博客文章中不断强调2019年光棍节所验证的技术成就。在那天的疯狂购物中,最高峰流量每秒处理了544000个订单,970 PB(拍字节)的数据,并防御了60亿次网络攻击(我默认这些是DDoS攻击)。这成就当之无愧:阿里巴巴基本上创造并经受住了所有压力测试的极致版。

和亚马逊一样,阿里也把自己所有的业务迁移到了自己的阿里云上,脱胎换骨,替掉了IOE。十年前创造阿里云的最初动机,其实就是想有一天完全不用IOE,来节省设施成本。

阿里云经受的压力测试与AWS的“可预期性“相同,大家都知道光棍节是哪天。此外,它的绝大多数流量来自中国境内,这也减少了一些边缘场景的种类和可能性。

正如我之前所写到的,阿里云的数据中心规模和冗余度主要集中在中国。阿里云是个身经百战的云平台,甚至比AWS还要靠得住。阿里云团队累计的实战经验也是可以应用到中国以外的地区的,但首先还是要有客户的需求和看得见摸得着的数据中心。

GCP

谷歌可以说是唯一一个拥有和需要真正全球化基础设施的公司。它的核心服务——搜索、Gmail、Youtube、地图等——和用户的确是全世界哪里都有、随时都有,当然除了中国、伊朗、朝鲜、古巴和少数阿拉伯国家以外。这些服务必须是“always on”。GCP的卖点就是把“Google 规模”的基础设施带给每家公司。至于你需不需要那就是另一回事了。

Google Services

谷歌业务给GCP带来的压力测试在两个方面与亚马逊和阿里巴巴的电商场景有实质性的不同:

  1. 谷歌的流量洪峰很难预测,不像安排好的购物节日,所以要做好随时随地可能发生的准备。这也许就是为什么Twitter大量投入与GCP的合作,而没投入其他云平台的原因。Twitter是最容易出现意外高峰的,世界大事都会在Twitter上突然爆发,影响服务的稳定。从这点看GCP是个更强大的系统。
  2. 谷歌的产品并不需要大规模处理复杂交易,如网购订单、离线送货或支付。虽然谷歌有自己购物服务和支付端口,但规模相对较小,用户相对较少。产品中复杂性与电商交易最接近的服务可能就是AdWords,一个非常成功的产品,推动Alphabet大船的摇钱树。尽管如此,AdWords作为一个拍卖广告的系统还是比较好控制的,很少经受流量洪峰的冲击。从这点看GCP对于某些类型的业务,尤其是处理模式复杂的业务,可能是个较弱的选择。

Azure

除了Azure之外,微软在企业服务软件(如Office 365)、个人应用(如Surface、XBox)和其他收入来源(如LinkedIn)都有相当规模的业务。但显而易见,这些业务都不必“always on”,和电商或搜索引擎的在线需求截然不同。半夜如果上不了Outlook,看不到公司邮件,没人会生气。但如果买不到打折的Alexa,或者开车迷路需要谷歌地图导航回家,那用户反应就不一样。

Microsoft Services

当然,从收入和市场份额的角度来看,Azure还是个表现杰出的产品。这一成功更多归功于微软多年累计的面对大型企业的销售经验,以及把现有客户的非云版产品转移到云版订阅的同样产品的积极性。微软擅长“企业级”服务,但并不一定擅长“互联网级”的需求。

这一缺陷已经在Azure数据中心Region的架构上有所体现。我在上一篇文章“数据中心都在哪儿?”中详细讨论了这一点。相比竞争对手,Azure有最多的单AZ Region,从而导致了严重的停机和服务问题,这是采用多AZ设计本可以避免的问题。在AWS、GCP和大部分阿里云中,多AZ Region设计是默认的。

不管怎么说,沃尔玛已在2018年与Azure签署了长达5年的合作协议。其挑战亚马逊在美国电商霸主地位的举措,很大一部分取决于通过云端技术来提高自己的技术能力。(目前,亚马逊占美国所有电商销售额的近40%,沃尔玛仅占5%。)沃尔玛近期将推出与Amazon Prime直接对抗的产品,叫Walmart+。这个产品很有可能都搭建在Azure上。AWS和Azure这两个云系统谁能抗压,谁能保持“always on“,见分晓的时刻就快到了。

买的就是“压力”

表面上看,用云就是租公有云厂商数据中心的资源。其实你买的是公有云背后团队累计的运维经验和知识。那这些经验和知识从何而来呢?通过每时每刻服务自己公司其他业务的实战“压力“而来。

买了别人的“压力”,自己(理想情况下)就没什么压力咯。

然而,每个公有云都出过问题,不管经过了多少考验。这四大云平台都经历过多次中断。即使你在用最坚固靠谱的云平台,事情还是可能搞砸的。

回到Robinhood的宕机事件。尽管Robinhood的基础设施似乎都在AWS上,但自己的团队没有很好地构建后台系统来应对突如其来的流量洪峰,也没有做足够的模拟压力测试,没能抢在系统的弱点暴露于公众之前先发现它们。

如果自己不做压力测试,再好的云也救不了你。
