Among all the Big Tech companies that are planting flags in the red-hot generative AI arms race, I think the most trustworthy one may actually be Meta. Yep, Zuck’s Meta.
I say this because its recently released large language model (LLM), LLaMA (Large Language Model Meta AI), is open sourced and has the clearest set of disclosures around the model’s biases along gender, religion, race, and six other dimensions, compared to other similar LLMs.
Longtime readers of Interconnected know that I have deep convictions in open source as both a superior technology development model and the best methodology for building trust among users, products, and the regulators who have oversight over them. That’s a long-winded way of saying: sunlight is the best disinfectant. (See my extensive archive of previous writings on this topic.)
Surprisingly or unsurprisingly, Meta, the one company that may have the most “trust deficit” with the American public, is turning to open source to bolster its AI’s trustworthiness. Meanwhile, Meta’s peers are holding their AI models close to the vest as their “secret sauce.” This divergence in approach will play out in a much bigger way as all forms of generative AI applications become more pervasive.
What made LLaMA stand out is its “model card.”
A model card is an emerging standard in the machine learning research space, where every newly-trained model publicly shares a set of performance benchmarks, intended use cases, and bias metrics along multiple cultural and demographic dimensions. A model card is usually a high-level summary of this information, similar to an open source project’s readme page or introductory sections of its documentation.
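To make the idea concrete, here is a minimal sketch of what a model card could look like as a data structure. The class name, field names, and example values are all hypothetical, chosen only to mirror the three ingredients described above (benchmarks, intended uses, bias metrics); this is not the format of any real model card.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModelCard:
    """Hypothetical, simplified model card; all field names are illustrative."""

    model_name: str
    intended_uses: List[str] = field(default_factory=list)
    # Benchmark name -> score (placeholder values, not real results).
    benchmarks: Dict[str, float] = field(default_factory=dict)
    # Bias dimension -> score, e.g. "gender", "religion".
    bias_scores: Dict[str, float] = field(default_factory=dict)

    def missing_sections(self) -> List[str]:
        """Return the names of the sections that were left empty."""
        sections = {
            "intended_uses": self.intended_uses,
            "benchmarks": self.benchmarks,
            "bias_scores": self.bias_scores,
        }
        return [name for name, value in sections.items() if not value]


# A card that documents usage and benchmarks but discloses no bias scores.
card = ModelCard(
    model_name="example-model",
    intended_uses=["research on language models"],
    benchmarks={"example-benchmark": 0.5},  # placeholder number
)
print(card.missing_sections())  # ['bias_scores']
```

A check like `missing_sections()` is one simple way to see at a glance which disclosures a given card skips.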
This approach was first articulated in an academic paper in 2018, “Model Cards for Model Reporting”, by Margaret Mitchell, an AI ethics researcher who used to work at Google and now works at Hugging Face (a developer platform specifically for machine learning). Since then, writing a model card as part of the release process of a new machine learning model has become more common. However, the quality of the model cards is all over the place.
LLaMA’s model card is one of the clearest and most transparent ones I’ve seen yet. Among many things, it lays out four model sizes (7B, 13B, 33B, and 65B parameters), an “Intended Use” section, and detailed information on the type of training data that went into building this model. Most importantly, it discloses a set of bias scores along gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status, where the lower the score, the less “biased” the model. Here is how LLaMA scored:
My point of highlighting the biases section of LLaMA’s model card is not to judge how this model is biased towards whom; you can’t take a simplistic reading of these scores to say LLaMA is somehow less “racist” and more “ageist.” The bigger picture is that every machine learning model, especially the many LLMs that are powering a Cambrian explosion of chatbots, should have a clear bias disclosure like LLaMA’s, but very few do.
OpenAI’s GPT-3 model card does not disclose these bias scores. To find out how GPT-3 is biased along the same nine dimensions, you would have to dig into the LLaMA academic paper, which presents a side-by-side comparison:
The Meta AI team, of course, did this at least in part to make LLaMA look good, though its average bias score is less than one point lower than GPT-3’s. In fact, GPT-3 actually scores better than LLaMA in five of the nine categories, so OpenAI has nothing to be ashamed of.
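A lower average and a lower score in most categories are two different things, which is how both statements above can hold at once. The sketch below shows the arithmetic with made-up numbers; the category names and scores are placeholders, not figures from the LLaMA paper, and the plain unweighted mean used here is an assumption that may differ from how the paper aggregates its average.

```python
from statistics import mean
from typing import Dict, List, Tuple


def compare_bias(a: Dict[str, float], b: Dict[str, float]) -> Tuple[float, List[str]]:
    """Compare two models' bias scores, where lower means less biased.

    Returns (mean(a) - mean(b), categories where a scores strictly lower than b).
    """
    if a.keys() != b.keys():
        raise ValueError("both models must report the same categories")
    gap = mean(a.values()) - mean(b.values())
    a_lower = [category for category in a if a[category] < b[category]]
    return gap, a_lower


# Made-up scores for two hypothetical models across three categories.
model_x = {"cat_a": 50.0, "cat_b": 80.0, "cat_c": 80.0}
model_y = {"cat_a": 75.0, "cat_b": 70.0, "cat_c": 70.0}

gap, x_lower = compare_bias(model_x, model_y)
print(gap < 0)   # True: model_x has the lower (better) average
print(x_lower)   # ['cat_a']: yet model_y is lower in two of three categories
```

One large gap in a single category can pull the average down even when the other model comes out ahead in most categories.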
This raises the question: how come no other LLMs – from GPT-3.5 and LaMDA (Google), to ERNIE (Baidu) and Chinchilla (DeepMind, so also Google) – lay out their bias scores as clearly as Meta’s LLaMA? What is there to hide?
If these other models also made similar disclosures, then the AI research community, as well as the general public, would have a starting point to investigate and operate in good faith, rather than dragging AI into a human culture war and wasting our breath arguing over whether ChatGPT is “woke.”
Model Is Not Moat
Implicit in Meta’s decision to open source LLaMA is a view that the model itself is not that valuable to its business. It is a rather enlightened view and one I happen to share.
All machine learning based AI use cases – LLMs for text generation, Stable Diffusion for image creation, prediction engines for content recommendation – are subject to the “garbage in, garbage out” problem. If you use the most vitriolic threads on Reddit as the training data for an LLM, a chatbot built using that LLM will sound toxic and hateful, no matter how advanced that model is. If your training data only contains content written in American English, that chatbot won’t perform well when you speak to it in German, let alone in Japanese.
That’s why the results that a chatbot like ChatGPT generates – the “answers” you get when you chat with it – are technically called “inferences.” They are outputs inferred from the data that was used during training.
The business value lies in the cleanliness and quality of the training data, and in how closely those data match the task that the AI application is supposed to solve. If you are building a self-driving system, having the clearest and most updated driving and road condition data is the most important thing. If you are building a health insurance chatbot, like I have foolishly tried to do before, having accurate insurance data and medical procedure pricing is the most important thing. Other unrelated data are just noise and can have adverse effects on the model’s performance.
This is not to say that the models are not valuable. They are. Being able to build models with hundreds of billions of parameters that generate more accurate inferences faster, all the while consuming less computation when deployed, is no easy feat. But this “model value” will become more of a “tax” and less of a product differentiator over time. This progression is already unfolding. The speed with which Salesforce has launched ChatGPT-like functions in Slack and its vast suite of sales and marketing SaaS products, so soon after Microsoft did the same in its own vast suite of enterprise applications, is a case in point. Smaller SaaS companies like ZoomInfo and Notion are doing the same thing. They are all using the same set of OpenAI models under the hood.
The model is not the moat. An advanced model can transform quality data into business value. But the reverse is not true: without quality data, a model is useless no matter how advanced it is – like a brand new highway without cars.
On the other hand, keeping models secret in a black box could breed confusion and distrust. This sense of secrecy may create intrigue and attract more customers in the short term, but over time more users will want to know the “how” and “why” behind the model. If those questions are not answered openly and transparently, usage may decrease, which may lead to less quality user data as feedback and ultimately hurt the business. This problem is less of an issue for enterprise software companies like Salesforce (B2B companies don’t get dragged to Congress to testify). But for social media companies like Meta, being open and transparent with their AI models is both good for building trust and good for business. I have shared similar views on open sourcing Twitter’s algorithm and open sourcing TikTok’s internal access system, in the same vein.
In tech, where data is plentiful but trust is scarce, open sourcing your AI models and algorithms is a rare scenario where you can have your cake and eat it too. Meta seems to have figured that out and is going above and beyond its competitors with the way it released and open sourced LLaMA.
It is an example that others should follow. It is an example that OpenAI should have set, given its original mission and commitment to open source. With harsh critiques coming from the likes of Elon Musk, one of the organization’s original co-founders, and others, OpenAI, Google, and their peers may be forced to follow Meta’s example eventually, before trust begins to erode.
Has Meta Really Become a Trustworthy AI Company?
Among all the tech giants taking part in the red-hot AIGC race, I think the most trustworthy one going forward may be Meta. That’s right, Zuckerberg’s Meta.
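Surprisingly (or unsurprisingly), Meta, the company the American public trusts the least, is leaning on open source to strengthen the credibility of its AI. Meanwhile, Meta’s peers treat their AI models as their own “secret weapon.” The long-term effects of these two different approaches are well worth watching, because generative AI applications of all kinds will only become more widespread and mainstream.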
What stands out most about LLaMA is its “model card.”
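LLaMA’s model card is the clearest and most transparent one I have seen so far. It lists the parameter sizes of the model’s four versions (7B, 13B, 33B, and 65B parameters), has a section devoted to “Intended Use,” and gives detailed information on the types of data used to train the model. Most importantly, it discloses a set of bias scores on gender, religion, race/color, sexual orientation, age, nationality, physical appearance, and socioeconomic status; the lower the score, the “less” biased the model. Here are LLaMA’s scores: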
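To be clear, the point of looking at the bias scores in LLaMA’s model card is not to judge whom this model is biased against; you cannot simply use these scores to say that LLaMA is somewhat less “racist” but more “ageist.” The more important point is that every machine learning model, especially the LLMs powering so many chatbots, should disclose clear bias scores the way LLaMA does, but few do.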
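Of course, the Meta AI team published both LLaMA’s and GPT-3’s bias scores in its own paper at least partly to make LLaMA look good, even though its average bias score is only slightly lower than GPT-3’s. In fact, GPT-3 scores lower than LLaMA in five of the nine bias categories, so OpenAI has nothing to be ashamed of.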
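All machine learning based AI use cases (generating text with LLMs, creating images with Stable Diffusion, recommending content with prediction engines) are affected by “garbage in, garbage out.” If the most vicious posts on Reddit are used as training data, a chatbot built on that LLM will be full of malice and hatred, no matter how advanced the model itself is. If the training data contains only text written in American English, then talking to that chatbot in German will not work very well, let alone in Japanese.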
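That is why the results generated by a chatbot like ChatGPT, meaning the “answers” you get when chatting with it, are strictly speaking called “inferences.” They are “answers” inferred from the data used during training.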
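I am not saying that models have no value. They do. Building a model with hundreds of billions of parameters that generates more accurate “inferences” faster, while consuming less compute when deployed, is no easy feat. But over time this kind of “model value” will become a “tax” rather than a source of product differentiation. This shift is already happening. Not long after Microsoft rolled out ChatGPT features in its own enterprise software, Salesforce launched similar features in Slack and its own sales and marketing SaaS products. Smaller SaaS companies such as ZoomInfo and Notion are doing the same thing. These are all examples of “the model is not the moat,” because underneath they are all using the same OpenAI models.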
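An advanced model can turn high-quality data into business value. But the reverse does not hold: without high-quality data, a model is useless no matter how advanced it is, like a brand-new highway with no cars.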