Ever since Anthropic released its report in February on three Chinese AI labs distilling its advanced models to train their own, I have been thinking a lot about adversarial distillation. 

I carried these thoughts to China when I recently visited most of the leading labs there, including the three accused. I tracked how the issue quickly evolved into action and legislative proposals in Washington. I have watched a growing narrative take hold that China, the entire country, distills (code for steals) to make progress in AI, rather than three labs out of many. And with Mythos and Fable coming under the crosshairs of government export control, I’m seeing many confidently predict that delayed or limited release of American frontier models means the end of China’s industrial-scale stealing via distillation is near.

(You can say, I have been distilling about distillation, hey yo!!)

After sitting on this issue for almost four months (I distill slow), here’s what I have learned and what I think is missing or misunderstood in the distillation debate.

 

Distillation Is Data Strategy

There are many legitimate and proper reasons to use distillation in the broader AI landscape, from pure research, to shrinking a large model into a smaller one that can be deployed on device, to refining a general-purpose large model into a more specialized vertical model.

In the specific case of adversarial distillation, however, the primary reason why certain labs do it is data. It is a quick fix for a lack of the data needed to improve model capabilities, either generally or for specific use cases, e.g. coding, computer use, etc.

If you grasp this dimension, then why those three labs – DeepSeek, Moonshot, and MiniMax – were called out by Anthropic, but not other Chinese labs, is not so mysterious. Their relevant commonality is not that they are all from China, but that they are all independent labs struggling to obtain quality data to fuel their progress.

It is helpful to sort the universe of AI labs in China into two categories: independent labs, like DeepSeek, Moonshot, MiniMax, Z, Stepfun, etc., and labs that live inside a giant tech company, like Qwen in Alibaba, Mimo in Xiaomi, Seed in ByteDance, Hunyuan in Tencent, etc. The independent labs resort to distillation as a shortcut because they are more desperate for quality data. The labs that live under a big tech umbrella have access to real-world usage data from other business units inside the corporate family, and they can (and do) leverage those data sources. This dynamic is the same as Google’s Gemini models leveraging YouTube data because they all live under the Alphabet umbrella.

Does this mean big tech labs in China don’t practice adversarial distillation? Not necessarily, but at least the Almighty Anthropic has not caught them with their hands in the cookie jar, or they would have been called out publicly already. But even if they do, they don’t need to rely on distillation nearly as badly as the independent labs. Any measure to shut off or deter adversarial distillation won’t affect them as much.

Thus, data or lack thereof is why some Chinese labs do adversarial distillation, but certainly not all. It is not an industrial-scale, country-wide, anti-imperial socialist undertaking.

No Data Advantage in China

One myth I want to bust here and now is that China has a data advantage when it comes to frontier AI model development. It does not.

One thing I picked up on, and which others in my group echoed during our recent trip there, is that there is no full-fledged data industry in China to feed the AI labs. There is no Scale, Surge, Mercor, Handshake, or any of the other data vendors that pay software engineers or management consultants to label and evaluate data, then sell that output to OpenAI or Anthropic for multiple millions of dollars.

All this flies in the face of a common view (and misconception) that China has a data advantage because it has a lot of people, they generate a lot of data, and there is no data privacy in China (another misconception), so China has a ton of data for AI. For some verticals, like manufacturing and supply chain, this is true, because there is just a lot of manufacturing and supply chain in China. But when it comes to general purpose knowledge or information that is useful to train large frontier models, much of which is derived from the Internet or structured knowledge work, China has no advantage whatsoever. (Let’s not forget, the Chinese internet also shrinks from time to time.)

I distinctly remember a researcher we met telling us how much he loves Common Crawl, the massive open dataset crawled from the web that every lab used to start its AGI journey. His team later shared that they had tried some data vendors in China, but the quality was poor and could not be guaranteed, so it was a waste of time and limited resources.

So for some labs, resorting to distillation is almost an act of desperation (and probably some laziness too), even if unseemly. Adversarial distillation is a shortcut to harvesting useful data, specifically chain of thought reasoning traces for post-training certain capabilities. Nothing more, nothing less. 

Anti-Distillation As Policy

Knowing all this, is it worth making such a big deal out of three (ok, maybe four or five) independent labs’ adversarial distillation?

To be clear, Anthropic has every right to deny behaviors that violate its terms of service or that it simply does not like, much in the same way that my local restaurant can hang up a sign proclaiming its right to refuse service to anyone. As a supporter of free enterprise and free markets, I will always defend the right of Anthropic or any private enterprise to do so.

But if your entire professional and personal identity rests on a hardcore commitment to keep China behind the US in the “race” to AGI, I don’t think anti-distillation policy is how you get there. At least, if you think what I shared in this post makes any sense, you should temper your expectations.

Lack of a robust data industry is a problem, but not a hard problem. It is not nearly as hard as building fully domestic EUV lithography machines at scale. This problem is also mostly a market dynamic one, not a technical one. If there is demand for a data industry, which is low-margin and low-barrier-to-entry, that demand will get met in China in the same way that American labs’ data hunger is fed. 

Does a regulated, delayed, or gated release of Mythos, Fable, or more advanced models in the future change the dynamic?

Not really, because not every lab from China needs the distillation shortcut. If this shortcut is turned off for good, it will be a temporary setback for the labs that have come to rely on this tactic, then life moves on. It will hurt the independent labs the most, less so the big tech labs. Let’s be honest, the hardware usage data from Xiaomi or video data from ByteDance is at least as good as Apple’s or Google’s.

In fact, I shared a thought with Nathan Lambert a while back, when I was still distilling on this issue, that letting some distillation happen may not be a bad idea, if the goal is to keep a competitor behind. After all, contrary to the human experience, the student model (the distilled one) can never be more capable than the teacher model. But since humans are still the ones building AI (recursive self-improvement be damned), the smart, resourceful, and hardworking students do end up more capable than the teacher, whether the teacher keeps teaching or not.