A Framework to (Dis)trust and Verify TikTok

Last week, New York Times tech columnist Kevin Roose wrote a well-reasoned piece arguing that the U.S. government should not ban TikTok, but instead use it as an example and an opportunity to build a stronger regulatory framework around tech products. (To my surprise, he also cited my proposal to open source portions of ByteDance’s internal RBAC implementations as a possibility.)

Since then, the drama around TikTok has only intensified. Whether TikTok ends up getting bought by Microsoft or outright banned, it’s fair to say that the outcome will be the result of a one-off, arbitrary decision, not a generalizable framework.

To be clear, I don’t personally care if TikTok gets banned. Sure, I’ll be a lot less entertained during my post-dinner food coma, but life will go on. However, whatever we do to TikTok must be based on evidence and Due Process, and must be done in a way that can be applied to other tech products.

There is a way to establish such a framework, based on technology, not politics.

In my eyes, there are three often-conflated but distinct issues with TikTok that should be dealt with separately:

Sending data to China
Gathering data on users
Being a tool of foreign influence

Here’s how we can deal with them, even within the government’s current set of agencies, authorities, and capacities, if we choose to.

Data Transfer to China

The biggest national security concern with TikTok is the possibility of transferring American users’ data to China, where the Chinese government can use and abuse it. In my opinion, this is actually the easier problem to regulate given how cloud infrastructure works.

TikTok claims it stores American user data only in data centers in the United States, with a backup replica in Singapore. We know it uses AWS and GCP for its cloud infrastructure. We also know that it has leased additional capacity from DLR, one of the largest third-party data center providers, at its Ashburn, Virginia location.

Every cloud data center maintains a detailed log of the traffic going in and out of its network. It’s a standard service, commonly known as Flow Logs. Tracking this information is important for internal troubleshooting, compliance, and billing customers. (As I’ve noted in my many previous posts on the cloud industry, selling network bandwidth is very profitable.)

Thus, it’s quite straightforward for the relevant government agency, likely the Department of Justice in this case, to ask ByteDance, AWS, GCP, and DLR to cooperate by submitting network logs of the traffic going out of all the relevant data centers, to verify (not speculate) whether data is being transferred to China. This can be done on a monthly, weekly, or probably even daily basis. The moment anything is transferred to China, TikTok is banned. Plain and simple. We no longer need to judge TikTok by its PR statements; we can validate its claims with technology. The same request can be made, in cooperation with the Singaporean government, to whichever third-party data center provider ByteDance uses in Singapore. (Based on this interesting analysis by a French cybersecurity researcher, ByteDance appears to use AWS in Singapore as well.) Since we are dealing with American user data, the request would be reasonable and the jurisdictional nexus should be clear.
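To make the verification step concrete, here is a minimal sketch of what an auditor's check over exported flow logs could look like. It assumes records in the default AWS VPC Flow Logs text format (version 2). The `WATCHLIST_CIDRS` ranges are placeholder documentation addresses, not real network ranges in China, and `parse_record` and `flag_egress` are hypothetical helper names. A real audit would also need an authoritative IP-to-country mapping and would have to account for proxies and relays, which flow logs alone cannot reveal.

```python
import ipaddress
from collections import defaultdict

# Field order of the default AWS VPC Flow Logs record (version 2).
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()

# Hypothetical watchlist: a placeholder documentation range (RFC 5737),
# standing in for whatever address ranges a regulator decides to monitor.
WATCHLIST_CIDRS = [ipaddress.ip_network("203.0.113.0/24")]


def parse_record(line):
    """Split one space-delimited flow log line into a dict, or return None."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def flag_egress(lines, watchlist=WATCHLIST_CIDRS):
    """Return total accepted bytes sent to each watched destination address."""
    totals = defaultdict(int)
    for line in lines:
        rec = parse_record(line)
        # Skip malformed lines, rejected traffic, and NODATA/SKIPDATA records.
        if rec is None or rec["action"] != "ACCEPT" or rec["log_status"] != "OK":
            continue
        try:
            dst = ipaddress.ip_address(rec["dstaddr"])
        except ValueError:
            continue
        if any(dst in net for net in watchlist):
            totals[rec["dstaddr"]] += int(rec["bytes"])
    return dict(totals)


# Made-up sample records, for illustration only.
sample = [
    "2 123456789010 eni-0a1b2c3d 10.0.1.5 203.0.113.12 49152 443 6 20 4249 1596585600 1596585660 ACCEPT OK",
    "2 123456789010 eni-0a1b2c3d 10.0.1.5 198.51.100.7 49153 443 6 10 1200 1596585600 1596585660 ACCEPT OK",
    "2 123456789010 eni-0a1b2c3d 10.0.1.5 203.0.113.12 49154 443 6 5 600 1596585660 1596585720 REJECT OK",
]
print(flag_egress(sample))  # {'203.0.113.12': 4249}
```

Only the first sample record counts: the second goes to an address outside the watchlist, and the third was rejected. Run daily over every relevant data center's logs, an empty result is the evidence that nothing went to the watched ranges.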

Furthermore, monitoring Flow Logs and their equivalents is a clean, clear, and credible way to protect US data sovereignty. It’s a framework that can be applied to WeChat, the games owned or backed by Tencent (like League of Legends and Fortnite), the Russian-made FaceApp that put a scare in a lot of people last year, and any new product in the future that requires some scrutiny.

It’s generalizable and fair, not erratic and arbitrary. And the information can be shared with the public to build trust and peace of mind. As an American who still has TikTok installed on my phone, if any data is transferred to China, I want to know!

Data Collection

It’s important to separate the concern about TikTok’s data collection from the national security concern of data transfer. We shouldn’t conflate the two. There is no geopolitical, US-China angle to the data collection problem if there’s no data outflow to China.

Currently, the Federal Trade Commission (FTC) is the main enforcing authority. TikTok is already on the FTC’s naughty list, having been fined $5.7 million in early 2019 for illegally collecting data on kids under 13, in violation of the Children's Online Privacy Protection Act (COPPA). Of course, there was also the $5 billion fine levied on Facebook last year. And it looks like the FTC is about to fine Twitter up to $250 million for using data collected for security purposes to do ad-targeting.

There is not yet any industry consensus on whether TikTok’s data collection practices are materially worse than those of its American competitors or just as bad. If you read the French researcher’s analysis I cited above, it’s “just as bad”. If you read this equally fascinating reverse-engineering by a system administrator posted on Reddit, it’s demonstrably worse.

It’s worth noting that TikTok (and all of ByteDance’s consumer apps) is algorithm-driven, not social-driven. Thus, treating it as another social media product is misleading. There’s nothing “social” about using TikTok per se. You don’t have to “friend” anyone, “connect” with anyone, or even “follow” anyone if you don’t want to. The app collects data on how you watch and interact with its initial feed and adapts accordingly and algorithmically. The resulting behavior, at least for me, is: I scroll, laugh, scroll, scroll, laugh, and 30 minutes later, same thing.
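As an illustration of what “adapts accordingly and algorithmically” can mean, here is a toy sketch of an interest-driven feed. It is not TikTok's algorithm, which is not public. The topic labels, the `watch_fraction` signal, the `ToyInterestFeed` class, and the scoring rule are all assumptions made up for this example.

```python
from collections import defaultdict


class ToyInterestFeed:
    """Ranks videos by a per-topic interest score learned from watch behavior."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.interest = defaultdict(float)  # topic -> running interest score

    def record_view(self, topic, watch_fraction, liked=False):
        """Update interest from one view; no friends or follows are involved."""
        signal = watch_fraction + (1.0 if liked else 0.0)
        # Exponential moving average: recent behavior counts more.
        old = self.interest[topic]
        self.interest[topic] = old + self.learning_rate * (signal - old)

    def rank(self, candidates):
        """Order (video_id, topic) pairs by the viewer's current interest."""
        return sorted(candidates, key=lambda c: self.interest[c[1]], reverse=True)


feed = ToyInterestFeed()
feed.record_view("comedy", watch_fraction=1.0, liked=True)  # watched fully, liked
feed.record_view("tourism", watch_fraction=0.1)             # skipped quickly
feed.record_view("cooking", watch_fraction=0.6)

candidates = [("v1", "tourism"), ("v2", "comedy"), ("v3", "cooking")]
print(feed.rank(candidates))
# [('v2', 'comedy'), ('v3', 'cooking'), ('v1', 'tourism')]
```

Even this toy version never consults a social graph. Its only input is viewing behavior, which is why a product built this way has an appetite for as many behavioral signals as it can get.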

Given this characteristic, from a pure product and business perspective, TikTok needs to collect as much data as possible to fuel its algorithms. The Deep Learning flavor of AI has been dominating the industry. While the various algorithms and models have been abstracted and commoditized via open source libraries like TensorFlow, PyTorch, and Keras, making those models useful requires as much data as you can get your hands on. As a product, TikTok is more akin to YouTube than Facebook. Thus, if TikTok does collect more data than Facebook, it’s not done out of malice per se; it’s part of the product.

(Aside: Eugene Wei wrote a great post analyzing how TikTok relied heavily on algorithms to build its users’ “interest graph”, skipping over the “social graph”, to penetrate the American and Indian markets. Worth reading.)

What’s missing in the U.S. is a nationwide legal framework that governs the intersection of data collection, privacy rights, and digital commerce. What data can an app collect? What can an app do and not do with that data? Is it ok for an app to collect my phone’s OS information (iOS or Android) and model (iPhone 8 or Nexus 6P) for security reasons? What about ad-targeting me with higher airfares because I use an iPhone rather than an Android phone, so the app’s algorithm thinks I have more money (similar to what travel sites have done to Mac users)?

These rules must be generalizable enough to deal with all ad-driven tech products, whether it’s TikTok, Instagram, or YouTube. Putting it differently: singling out TikTok regarding its data collection (again, not data transfer) doesn’t solve the problem at its core.

With TikTok’s cultural relevance in America, we can use this opportunity to establish this long-overdue framework, perhaps drawing from both the EU’s GDPR regime and California’s own consumer privacy regulations. This will require an act of Congress. Until that happens, the FTC should continue to aggressively enforce existing laws with fines and injunctions to keep nefarious data collection practices in check.

Foreign Influence

This problem is the hardest to regulate and also the most near and dear to my heart. I spent a good number of years in my 20s working at the grassroots level of presidential campaigns, from organizing volunteers, to registering people to vote, to protecting those same voters from suppression and intimidation.

One of the biggest fears about TikTok is that it can be used by the Chinese government as a platform to influence Americans, especially during an election. In my opinion, this fear is valid but minor. If you’ve spent any time on the grassroots level of any election in the U.S., you would know that the election infrastructure is brittle at best. I can easily think of a handful of other much more pressing issues than TikTok that must be addressed to preserve the integrity of the American electoral system:

Insufficient number of voting machines; mail-in ballot irregularities; many voting machines still running Windows 2000; voter suppression practices in minority communities; general human incompetence at polling sites; Facebook and Twitter; the Russian government; intimidation by White gun owners in predominantly Black neighborhoods on Election Day (yes this happened). The list goes on...

Another reason TikTok is not a pressing issue is that the Chinese government, so far, has shown little sophistication in driving cultural wedges among Americans in the way the Russians did in 2016. But they are trying, and that deserves attention. Back in February, when the pandemic had gone global and China was still in a nationwide lockdown, a few out-of-place tourism videos of both Wuhan and Guangzhou (one of the hardest hit cities in China other than Wuhan) showed up in my TikTok “For You” feed. I didn’t “like” them, and similar videos never showed up again. I guess the algorithm worked the way it’s supposed to. But I don’t know why those videos showed up in the first place. Nobody knows. The algorithms are opaque.

The heart of TikTok’s foreign influence problem is the algorithm. Like I’ve said in a previous post, there is “no Due Process in an algorithmic world”. Due Process requires that decisions be made publicly and not arbitrarily. An algorithm is an automated decision-making process. If an algorithm has no Due Process, its decisions have no legitimacy. The millions of decisions that the TikTok algorithm makes every minute and second to determine what we see next are anything but public.

How do we make these algorithms more transparent? The right way is to open source them. It’s not a big departure from where the industry is -- most AI frameworks that make up the building blocks of an AI-based application are already open sourced. To achieve meaningful transparency, what needs to be open sourced is:

The number of parameters used in the algorithm, and how these parameters are engineered and applied to the user data collected.

The importance of this information can be easily illustrated with GPT-3, OpenAI’s new AI model. GPT-3 has 175 billion parameters. Its predecessor, GPT-2, has 1.5 billion parameters. That’s an increase of more than 100x. Combined with a larger dataset to train on, no wonder GPT-3 can do all kinds of “magical” things. And all this information is public. The same transparency can and should be applied to TikTok and its competitors, so regulators can continuously verify whether and how their products can be gamed by sources of foreign influence.
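For readers less familiar with the term, a model's “parameters” are the learned numbers (weights and biases) inside it. The sketch below counts them for a small fully connected network and reproduces the GPT-2 to GPT-3 ratio from the figures above. The layer sizes are arbitrary illustrative values, and `count_dense_parameters` is a hypothetical helper, not part of any library.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    A layer mapping n_in inputs to n_out outputs has n_in * n_out weights
    plus n_out biases.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Arbitrary example: 100 input features, two hidden layers, 10 outputs.
print(count_dense_parameters([100, 64, 32, 10]))  # 8874

# Parameter counts cited in the text.
gpt2_params = 1.5e9
gpt3_params = 175e9
print(round(gpt3_params / gpt2_params, 1))  # 116.7
```

A disclosed parameter count, together with a description of the input features, gives a regulator a first measure of how much a model can learn from the data it is fed.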

This approach may seem overly idealistic, but many US agencies already have some open source muscle. I noted in “COVID, Open Source, Industrial Policy” that in 2016, the Obama administration released the Federal Source Code policy, which requires all Federal agencies to open source at least 20% of their new custom-built software. Since then many agencies have done exactly that, including some key participants of CFIUS, like the Treasury, Justice, and Homeland Security. (CFIUS is of course the central regulatory body in this TikTok saga.) Today, anyone can find and use the code open sourced from these departments on code.gov.

There’s no doubt that compelling the likes of Facebook, YouTube, Twitter, and TikTok to open source the inner workings of their algorithms will be difficult. But it’s exactly that: difficult. Not impossible. Given the shakiness of the entire American institution, there is no better time for regulators to do something big and difficult that gets at the core of the problem.

Distrust is Easy, Verify is Real Work

A recent speech by Secretary of State Mike Pompeo made waves with its shift in attitude towards China, from “trust and verify” to “distrust and verify”. Yesterday, he followed up with an expansion of the so-called “Clean Network”, targeting business practices related to Chinese tech firms, from Huawei, to the cloud platforms of Alibaba, Tencent, and Baidu, to even undersea cables. While the announcement was full of the language of “distrust,” it contained no information on “verify”.

Distrust is easy; “verify” is where the real work lies. If we don’t do the real work to understand the technology, product, algorithms, parameters, and the technical and regulatory tools we do have at our disposal, it doesn’t matter if we trust or distrust.

Like I mentioned in the beginning, whatever we do to TikTok, we must do so with evidence and Due Process. Otherwise, we aren’t doing the real work to verify. Otherwise, America is no longer American.

America is a nation of laws. Laws are only meaningful and credible if there’s procedural justice in the outcomes when the laws are applied to reality. As I’ve hopefully laid out, there are technologies and tools at American legislators’ and regulators’ disposal to do the real work to verify and protect the American people.

The question is: given how attractive China is as a political piñata, are they even interested?

Kevin Xu
