I used to work at an open source distributed database startup called PingCAP. As the exec driving the company’s global market expansion effort, I talked to a lot of cloud architects and distributed systems engineers (as potential customers) about how a distributed database can be used for “disaster recovery”. It is a term used often by technical folks whose day (and night) job it is to keep things up and running at all times, no matter what happens.

When I first heard the term “disaster recovery” in the cloud context, I thought it was a bit dramatic. What sort of “disaster” are we talking about here that needs military-level operational practices, multiple backup plans, and sophisticated technologies to “recover” from? The answers I often got were along the lines of: “You never know. It could be anything! An earthquake. A typhoon. A pack of rats chewing up cables.”

As I’ve spent more time in the cloud and software infrastructure industry, I’ve grown to appreciate the reality that shit happens! Disasters are unavoidable! They may not happen all the time, but when they do, if you don’t have the right technology, architecture, or operations in place to recover, these disasters will ruin you.

A week ago, such a disaster happened in South Korea. It ruined a CEO, triggered an investigation by the country’s president, and illuminated the awkward dilemma of data residency for a small country.

Kakao’s Disaster

The disaster I’m referring to is the fire that broke out in a data center near Seoul and lasted more than 10 hours, causing outages across all services provided by Kakao, the Korean tech giant. Kakao’s suite of apps, which covers messaging, ride hailing, maps, gaming, webtoons, payments, and banking, is used by over 90% of South Korea’s Internet users and by hundreds of thousands of the Korean diaspora living around the world. It’s a legit SuperApp. If you travel to South Korea, as I did earlier this year, the first thing you do is install Kakao Talk and Kakao Map, so you can message people and find your way around.

Since the fire, the co-CEO of Kakao responsible for its data operations has resigned. South Korea’s president, Yoon Suk-yeol, is launching an investigation, forming a new "Digital Crisis Management Headquarters", and raising alarms about Kakao’s monopolistic power. There are even conspiracy theories about the fire being an act of sabotage, not an accident.

The intrigue of monopoly, investigation, sabotage, and more will keep this firestorm burning (pun intended) for weeks to come. But it all steers away from the root problem that this fire illuminates: is robust data center disaster recovery practical in a small country?

Because disasters are unavoidable, industry best practice for implementing “disaster recovery” is not to avoid disasters, but to mitigate damages by what is commonly known as “geo-replication.” In layman's terms, “geo-replication” for a tech company means making copies of apps, services, and users’ data and putting those copies in different data centers that are geographically far away as backups. If a disaster hits one data center, you turn on one of your backups to keep your apps and services running (albeit with slower performance due to the greater geographical distance), while you recover from the disaster, e.g. by putting out a fire.
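
The failover idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `DataCenter` class, the site names, and the distances are all hypothetical, and the hard parts of real geo-replication (keeping copies in sync, detecting failures) are left out.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    distance_km: float  # distance from the main user base (hypothetical numbers)
    healthy: bool = True


def pick_serving_site(sites):
    """Return the closest healthy site, or None if every site is down."""
    healthy_sites = [site for site in sites if site.healthy]
    return min(healthy_sites, key=lambda site: site.distance_km, default=None)


# A primary close to users and a backup far away.
primary = DataCenter("primary-near-users", distance_km=20)
backup = DataCenter("backup-far-away", distance_km=1100)

before = pick_serving_site([primary, backup])

# A disaster hits the primary; traffic moves to the slower, farther backup.
primary.healthy = False
after = pick_serving_site([primary, backup])

# With no backup at all, there is nothing left to serve from.
no_backup = pick_serving_site([primary])

print(before.name, "->", after.name, "| without a backup:", no_backup)
```

The last case is the one that matters for this story: with a single site and no replica, an outage lasts as long as the disaster does.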

At first glance, it does not look like Kakao had a geo-replication scheme in place that could support disaster recovery. SK Holdings C&C operates the data center that caught fire, and it appears to be the only data center the company operates in South Korea! If that is indeed the case, then there is no other SK-run data center for Kakao to geo-replicate to. That would explain the long outage: it is otherwise unthinkable that a data center fire could last more than 10 hours without a backup data center being turned on to keep Kakao’s apps functioning while the fire was being put out.

Data center locations can often be treated as trade secrets, so perhaps SK does have another data center that we don't know about. Even so, given South Korea’s size as a country, would it be far enough to sufficiently support disaster recovery regardless?

How Far is “Far Enough”?

There are no hard and fast rules for how far is “far enough” for geo-replication. But we do have some reference points to draw from in large countries with advanced data center layouts.

In the US, it is well-known that the densest concentration of data centers is in Northern Virginia. Why? Because the location is close to both dense cities of east coast Internet users and Washington DC, where data center operators (AWS, GCP, Equinix, Digital Realty, etc.) can leverage existing, high-quality digital infrastructure already built for the US government. A common backup location for the Northern Virginia cluster is Council Bluffs, Iowa, a small town in the midwestern state that has become a hot destination for building data centers. (For more on why Iowa is attractive for data centers, see this deep dive from The Atlantic a few years ago.)

How far is Northern Virginia from Council Bluffs? About 1,100 miles (~1,770 kilometers).

Let’s look at another reference point, in China. It is also well-known that Beijing has a cluster of data centers, for more or less the same reason as Northern Virginia – high concentration of Internet users, access to high-quality infrastructure that serves the government, etc. A common backup location is Xi’an – home to the Terracotta warriors, birthplace of dumplings (disputable), and data centers that store backup copies if a disaster hits Beijing.

How far is Beijing from Xi’an? About 1,100 kilometers (~700 miles).

Whether the rule of thumb is 1,100 miles or 1,100 kilometers, that is at least twice as far as the entire north-south “height” of South Korea. The distance from Seoul to Jeju Island, South Korea’s southernmost territory (and where Kakao’s corporate HQ happens to be), is only about 450 kilometers.
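
To make the comparison concrete, here is a minimal sketch that computes the straight-line (great-circle) distance from Seoul to Jeju with the haversine formula and compares it with the two reference figures quoted above. The coordinates are approximate city-center values that I am assuming for illustration, and the function and variable names are my own.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-center coordinates (assumed for illustration).
SEOUL = (37.5665, 126.9780)
JEJU = (33.4996, 126.5312)

# Reference separations quoted in the article, in kilometers.
REFERENCE_KM = {
    "Northern Virginia to Council Bluffs": 1770,
    "Beijing to Xi'an": 1100,
}

seoul_jeju_km = great_circle_km(*SEOUL, *JEJU)
ratios = {pair: km / seoul_jeju_km for pair, km in REFERENCE_KM.items()}

print(f"Seoul to Jeju: {seoul_jeju_km:.0f} km")
for pair, ratio in ratios.items():
    print(f"{pair} is {ratio:.1f}x as far")
```

The straight-line result lands close to the 450 kilometers cited above, and both reference separations come out at more than twice that.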

To provide good user experiences, proximity matters. To provide good disaster recovery, distance matters.

It makes 100% sense for SK to build a data center in the suburbs of Seoul. It makes 100% sense for Kakao and other Korean tech companies to use this data center to serve their millions of users, mostly concentrated in Seoul. However, to make disaster recovery work, it makes little sense for the backup data center to be located inside South Korea given its size.

And that brings us to the awkward dilemma of data residency for a small country.

Data Residency for Small Countries

Laws and policies around data residency have been popping up in many countries in the last few years. Since the promulgation of the EU’s GDPR, and with a growing consensus among national governments that their citizens' data is worth at least something, countries like China, India, Brazil, Nigeria, and many others have all started legislating their own flavor of GDPR to exert control over their people’s data. As the most “wired” country in the world, South Korea has been developing its own “flavor” as well!

Although the justification for data residency can be different – some highlight national security concerns, some note personal privacy issues, some are straight up rent-seeking from rich tech companies – the way to comply is pretty much the same: store Country X citizens’ data in data centers located in Country X and nowhere else.

As our discussion of the “Kakao fire” has hopefully illustrated, the “nowhere else” part can be tricky if there is physically not enough land within your own borders to put enough distance between a data center and its geo-replicated backup.

The tradeoff between data residency and disaster recovery presents an awkward dilemma for small but technologically-advanced countries.
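
The dilemma can be written down as two checks that a backup site has to pass at once. The sketch below is an illustration, not a compliance tool: the `Site` class, the site names, and the placeholder country code `XX` are hypothetical, the 450-kilometer figure is the Seoul-to-Jeju distance from earlier, and the 1,100-kilometer threshold is the shorter of the two rules of thumb.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    country: str            # country code where the site is located
    km_from_primary: float  # distance from the primary data center


def meets_residency(site, home_country):
    """Residency rule: the data stays inside the home country."""
    return site.country == home_country


def meets_dr_distance(site, min_km):
    """Disaster recovery rule: the backup is at least min_km from the primary."""
    return site.km_from_primary >= min_km


HOME = "KR"
MIN_KM = 1100  # shorter rule of thumb from the article

candidates = [
    Site("jeju-backup", "KR", 450),       # in-country, Seoul-to-Jeju distance
    Site("overseas-backup", "XX", 1100),  # hypothetical site in a neighboring country
]

resident = [s.name for s in candidates if meets_residency(s, HOME)]
distant = [s.name for s in candidates if meets_dr_distance(s, MIN_KM)]
both = [s.name for s in candidates
        if meets_residency(s, HOME) and meets_dr_distance(s, MIN_KM)]

print("satisfies residency:", resident)
print("satisfies distance:", distant)
print("satisfies both:", both)
```

Under these assumptions each candidate passes one check and fails the other, so the list of sites that satisfy both is empty.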

This is also where data locality and foreign relations with neighboring countries interconnect in interesting ways. After all, many EU countries are just as small and technologically advanced as South Korea. GDPR has managed to function (so far) in large part because two decades’ worth of trust had been seeded among EU countries, so that when GDPR took effect in 2018, most members felt comfortable being treated as a single data collective without fear of being taken hostage. Scandinavian countries that use the Stockholm data center are not too concerned about geo-replicating to Frankfurt or Paris.

The same level of trust, unfortunately, does not exist among East Asian countries. For a data center in Seoul to have a backup located 1,000 miles (or even 1,000 kilometers) away, that backup would have to be somewhere in China, Japan, Russia, or the middle of the ocean.

Plenty of grievances, both historical and current, still exist among all these countries, while relationships and alliances are shifting constantly. Of course, I’m not saying that if the backup data center is located in Osaka or Shanghai, Japan or China will take Kakao's user data hostage to harm South Korea. But it is a risk that has to be negotiated and minimized, not assumed away.

When the “Kakao fire” investigation wraps up, the public conclusion is unlikely to be: “disasters like this fire are unavoidable in data centers, South Korea is too small for proper geo-replication, so we need to strengthen relationships with our neighbors to architect robust disaster recovery while enforcing data residency.”

But it might as well be.

Kakao, Fire, and the Dilemma of Data Residency

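(The Chinese edition of this article was compiled by reader Ben Yu, and I published it after making some edits. Many thanks to Ben for the contribution!)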

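I used to work at an open source distributed database startup called PingCAP, as the executive responsible for driving the company’s global market expansion. At the time, I talked with many cloud architects and distributed systems engineers (all of them potential customers) about how a distributed database can handle “disaster recovery.” Disaster recovery is an industry term that operations staff know well, because their daily job is to make sure every system keeps running normally and to head off “disasters.”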

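When I first heard the term, my honest first reaction was that it sounded a bit too dramatic. What kind of “disaster” would call for military-grade operational practices, multiple backup plans, and sophisticated technology to “recover” from? The answer I got was basically: “Anything is possible: earthquakes, typhoons, even rats chewing through cables.”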

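Only after accumulating years of experience in the cloud computing and software infrastructure industry did I gradually realize that such disasters are not far-fetched! They may not happen all the time, but when they do, lacking the right technology, architecture, or operations to recover is a devastating blow.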

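A week ago, just such a disaster struck South Korea, leading a CEO to resign and triggering an investigation by the country’s president. The incident also exposed the awkward dilemma of data residency for a small country.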

Kakao’s Disaster

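The disaster I am talking about is a fire at a data center near Seoul. It burned for more than 10 hours and knocked out all the services provided by Kakao, the Korean tech giant. Kakao’s apps cover instant messaging, ride hailing, maps, gaming, webtoons, payments, and banking, and they are used by over 90% of South Korea’s Internet users as well as hundreds of thousands of Koreans living around the world. When I went to South Korea this year, the first thing I did was install Kakao Talk and Kakao Map so that I could message people and navigate.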

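After the fire, the co-CEO responsible for Kakao’s data operations resigned. South Korea’s president, Yoon Suk-yeol, is launching an investigation, setting up a new “Digital Crisis Management Headquarters,” and warning about Kakao’s monopoly over the industry. There are even conspiracy theories claiming the fire was deliberate sabotage rather than an accident.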

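Monopoly, investigation, sabotage, and further intrigue mean this disaster will not be over quickly. But all of that discussion drifts away from the essential question the fire raises: can a reliable set of data center disaster recovery measures actually be carried out in a small country?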

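Because disasters themselves are unavoidable, the industry best practice for “disaster recovery” is not to avoid disasters but to limit the damage through “geo-replication.” Put simply, for a tech company, geo-replication means making copies of applications, services, and user data and keeping those copies as backups in another data center that is geographically far away. If a disaster hits one data center, you switch over to a backup copy to keep applications and services running (albeit more slowly, because of the greater geographical distance) until the disaster is resolved, for example until the fire is put out.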

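At first glance, Kakao does not seem to have set up a proper geo-replication scheme to support disaster recovery. The data center that caught fire is operated by SK C&C, and it appears to be the only data center the company runs in South Korea. If so, there was no other data center for Kakao to geo-replicate to. That would explain why the outage lasted so long, because it is hard to imagine a data center fire lasting more than 10 hours with no backup data center switched on to keep Kakao’s software running while the fire was being put out.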

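Data center locations are often treated as trade secrets, so SK Group may well have another data center that we do not know about. Even so, given South Korea’s geographic size, would it be far enough away to adequately support disaster recovery?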

How Far Away Does a Backup Data Center Need to Be?

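There is no real standard for how far is far enough. But we can draw on the experience of some large countries with advanced data center layouts.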

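In the US, it is well known that the densest concentration of data centers is in Northern Virginia. Why? Because the area is close to both the user-dense cities of the east coast and Washington DC, where data center operators (AWS, GCP, Equinix, Digital Realty, etc.) can leverage existing high-quality digital infrastructure built for the US government. A common backup location for the Northern Virginia cluster is Council Bluffs, Iowa, a small town in the Midwest that has become a popular place to build data centers. (For more on what makes Iowa attractive for data centers, see this article from The Atlantic a few years ago.)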

How far is Northern Virginia from Council Bluffs? About 1,770 kilometers.

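As for China, it is not hard to imagine that Beijing has a large number of data centers, for roughly the same reasons as Northern Virginia: a high concentration of Internet users, high-quality infrastructure that serves the government, and so on. A common backup location is Xi’an. If disaster strikes a data center in Beijing, the backup data in Xi’an comes into play.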

How far is Beijing from Xi’an? About 1,100 kilometers.

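1,100 kilometers is already twice the entire north-south span of South Korea. From Seoul to Jeju Island, South Korea’s southernmost territory (and, as it happens, the location of Kakao’s corporate headquarters), it is only about 450 kilometers, which creates a built-in problem that cannot be solved.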

To provide a good user experience, (short) distance matters. To provide a good “disaster recovery” architecture, (long) distance matters.

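For SK Group, building a data center in the suburbs of Seoul is entirely reasonable. For Kakao and other Korean tech companies, using that data center to serve millions of users, most of whom are concentrated in Seoul, is reasonable too. However, given the scale and distance that a backup data center requires, placing it inside South Korea makes little sense if disaster recovery is to work properly, because it does not reduce the risk.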

This problem exposes the awkward position that data residency requirements create for every small country.

The Awkwardness of Data Residency for Small Countries

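Over the past few years, laws and policies around data residency have sprung up in many countries like mushrooms after rain. Since the EU’s GDPR was published, national governments have increasingly agreed that their citizens’ data is worth at least something, and countries such as China, India, Brazil, Nigeria, and many others have begun legislating their own versions of GDPR to exert control over that data. As one of the most Internet-developed countries in the world, South Korea has been developing its own data laws as well.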

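Although the justifications for data residency vary (some stress national security, some point to personal privacy, and some are simply rent-seeking from tech companies that are raking it in), the way to comply is almost always the same: store Country X citizens’ data in data centers located inside Country X, and nowhere else.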

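As our discussion of the Kakao fire has shown, the whole thing gets tricky if there is physically not enough land to implement disaster recovery, so that geo-replication cannot be done within your own borders.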

The tradeoff between data residency and disaster recovery leaves small, technologically advanced countries in an awkward dilemma.

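It also begins to quietly link data localization with diplomatic relations with neighboring countries. After all, many EU countries are as small as South Korea and similarly advanced technologically. GDPR has been able to function so far largely because EU countries had built up 20 years of trust, so that when GDPR took effect in 2018, most member states felt comfortable being treated as a single data collective without worrying about being taken “hostage.” Scandinavian countries that use the Stockholm data center are not too worried that geo-replication in Frankfurt or Paris will be used as a “diplomatic bargaining chip.”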

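East Asian countries, however, have not reached the same level of trust. If a data center in Seoul is to have a backup 1,000 kilometers away, it can basically only be located in China, Japan, Russia, or somewhere in the middle of the ocean.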

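While diplomatic relationships and alliances keep shifting, plenty of grievances remain among all these countries, whether inherited from history or caused by present-day issues. Of course, I am not saying that if the backup data center were in Osaka or Shanghai, Japan or China would hold Kakao’s user data hostage to harm South Korea. But this risk has to be negotiated and minimized, not assumed away.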

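When the Kakao fire investigation wraps up, the public is unlikely to reach this conclusion: “Disasters like this are unavoidable in data centers, and South Korea is too small for proper geo-replication, so we need to strengthen relationships with our neighbors in order to build robust disaster recovery while enforcing data residency.”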

But then again, who can say for sure?