3x3 AI Video Matchup: US vs China

The Paris Olympics may be over, but who says we have to stop the heated competition in bizarrely mutated sporting events, seen only at the Olympics and fueled by irrationally intense nationalism?

As a lifelong basketball fan, I find 3x3 basketball to be one of the more bizarrely mutated Olympic events. And as a full-time observer and occasional commentator on the US-China AI competition, I find the increasing fervor of techno-nationalism inching ever closer to the realm of irrationality. (I don’t absolve myself of some responsibility here, having written quite a bit about AI through this geopolitical lens.)

In the spirit of friendly, fun “competition”, let’s take the silly format of 3x3 basketball, match up three text-to-video AI generators from the US against three from China, and see which side is “better”. This matchup is only possible because, in recent days, several Chinese companies have released new text-to-video offerings that are easily accessible. Unlike their LLM chatbot cousins, anyone outside of China, or unfamiliar with China, can play with them; no Chinese cell phone number or ID is needed to register an account. You just need an email to get a free account and stand in line for those GPUs to do some inferencing!

For this matchup, I chose three leading AI video generation startups for Team USA: Runway, Luma, and Pika, and three rough equivalents for Team China: Vidu, Zhipu, and Kling. (The much-hyped Sora is still not publicly accessible.) I gave each of them the same culturally neutral prompt: “two sports cars racing each other in the snow.” And I only used the free offering from each provider, putting myself in the shoes of a user from anywhere in the world kicking the tires on a new AI product.

Here are the videos: 

Team USA

Runway:

[video: 0:04]

Luma:

[video: 0:05]

Pika:

[video: 0:03]

Team China

Vidu:

[video: 0:03]

Zhipu:

[video: 0:06]

Kling:

[video: 0:05]

Quick Evaluation

Personally, I find Runway’s output to be the highest fidelity and the closest match to my “fast and furious” taste. But both Kling and Vidu generated solid videos with an intense, car-racing feel as well. Weirdly, only Pika seemed to grasp the number “two” in the prompt accurately; all the other models gave me an extra third car at some point in the video. (Thank you, AI?)

Zhipu’s output was the worst in quality, but in all fairness, it was likely also the smallest model (at 2 billion parameters), and I ran it using a Hugging Face Space’s limited resources. It is the only open-source “player”; the other five “players” all had a more polished, productized interface to boot.
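Since Zhipu’s model is the only “player” anyone can download and run themselves, here is a minimal sketch of what that looks like. It assumes (my assumption, not something stated above) that the 2-billion-parameter checkpoint is Zhipu’s open-source CogVideoX-2b, run through Hugging Face’s diffusers library:

```python
# Minimal sketch: running an open-source text-to-video model locally.
# Assumes Zhipu's CogVideoX-2b checkpoint and the diffusers library.
# pip install torch diffusers transformers accelerate imageio-ffmpeg
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",       # the 2B-parameter open-source checkpoint
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
)
pipe.to("cuda")

# The same culturally neutral prompt used in the matchup above.
prompt = "two sports cars racing each other in the snow"

frames = pipe(
    prompt=prompt,
    num_inference_steps=50,  # denoising steps; fewer is faster but rougher
    guidance_scale=6.0,      # how strongly to follow the prompt
    num_frames=49,           # roughly six seconds at 8 frames per second
).frames[0]

export_to_video(frames, "race_in_snow.mp4", fps=8)
```

The free Hugging Face Space runs the equivalent of this on shared hardware, which is where the queueing and the few minutes of wait time come from.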

Luma’s video was decent but took by far the longest to generate, on the order of hours, because its GPU queue was long. All the other generators, even Zhipu on Hugging Face, took only a few minutes. On the ever-so-important “time to value” dimension, Luma was the least attractive.

So who won this great technological rivalry of a 3x3 matchup? I would give Team USA a slight edge, though a couple of missed free throws (or crappier training data) and a favorable bounce here and there (or better inference optimization) could have easily sent this game to overtime.

Therein lies a larger takeaway. Unlike LLM chatbots, which are culturally, linguistically, and contextually dependent, genAI video could potentially be a more globally competitive realm with a more level playing field.

Bullish on GenAI Video

Regular readers of Interconnected know that I have been consistently bearish for some time on the global prospects of LLM chatbots from Chinese tech companies. Benchmarks aside, releasing a chatbot that is marginally better in Chinese than the latest ChatGPT or Claude, while equal or worse in every other respect, just won’t make an attractive product globally, when Chinese is not the lingua franca of anything and likely won’t be for some time. Combine that with a stifling domestic environment for any “soft tech” that could influence public discourse, plus a corpus of Internet data that is shrinking and getting harder to parse, and Chinese LLM chatbot makers have been dealt a tough hand; they have their work cut out for them.

But genAI video could be a different story.

On the data front, Team China may have a slight advantage. It is well-known at this point (I hope) that Chinese tech platforms boast the most robust video-first products, from livestream e-commerce to short-video social networks. If you are bullish on Google’s AI prospects because of its rich library of video data from YouTube, then the same theory of the case applies to Tencent, Alibaba, ByteDance, and Kuaishou, each of which has its own rich, ever-growing library of video data.

On the compute side, due to export controls, Team USA still holds a decisive advantage, and there is no easy way for China to overcome this challenge with homegrown hardware. Even if you assume the best-case scenario for Huawei’s latest Ascend 910C chips, rumored to be competitive with Nvidia’s H100, producing those chips at scale with high yields is the tougher hurdle. Until that hurdle is cleared, Chinese AI companies’ appetite for GPUs will go unsatiated.

On the engineering front, although Team USA holds the lead in attracting and retaining elite AI R&D talent, distributed systems and infrastructure engineering capabilities are more evenly spread. Based on my assessment, large-scale systems engineering talent in China is globally competitive, having accumulated more than a decade of real-world experience operating massive Internet companies. This non-R&D talent could give Team China an edge when it comes to using infrastructure software to optimize the deployment and inference workloads of genAI video models to reduce cost, while their GPU-rich American counterparts may have less motivation to do so.
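To make that point a bit more concrete, here is a small-scale sketch of the kind of knob-turning this entails, reusing the hypothetical CogVideoX-2b / diffusers setup from the earlier sketch. (Production teams work at the level of serving infrastructure, batching, and custom kernels, but the trade-off is the same: spend engineering effort to run the same model on cheaper hardware.)

```python
# Sketch: trading latency for hardware cost at inference time, reusing the
# hypothetical CogVideoX-2b / diffusers setup from the earlier example.
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16,  # half precision: roughly half the memory of float32
)

# Stream weights between CPU and GPU one sub-module at a time, so generation
# fits on a much cheaper GPU at the cost of extra latency per video.
pipe.enable_model_cpu_offload()

# Decode the video latents in tiles instead of all at once, cutting the VAE
# decoder's peak memory for longer or higher-resolution clips.
pipe.vae.enable_tiling()

frames = pipe(
    prompt="two sports cars racing each other in the snow",
    num_inference_steps=50,
).frames[0]
```

None of this touches the model’s weights or output quality; it only changes how the same computation maps onto hardware, which is exactly the kind of cost lever a GPU-poor team is motivated to pull first.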

On the safety side, there is no distinguishable difference between how a US vendor and a Chinese vendor would treat the usual panoply of AI-generated video concerns: deepfakes, violence, pornography, misinformation. Whether you are Runway or Vidu, all these concerns have to be safeguarded against with equal vigor in order to stay competitive in the marketplace.

Thus, the playing field for genAI video is relatively fair, equal, and devoid of cultural dependencies. Assuming a customer is a benign user of genAI video technologies (e.g., making videos to promote a business, train employees, or demo products), prompting a model to do one’s video-making bidding is a fairly straightforward exercise. The evaluation criterion only needs to be which product offers the best bang for the buck. In other words, the way it should be in a free marketplace.

As a technologist at heart, I have always hoped (often naively) that we live in a world where, pound for pound, the best technology or product wins. I know that rarely happens, even during the most peaceful of times, let alone the present day. But true to my optimistic naivete, I can see genAI video being a candidate with all the attributes of a fair, global, and borderless competition.

May the best AI car in the race win!