Today’s post is a guest contribution from Shawn Xu (no relations). Shawn is one of the early engineers at, got his Masters in Human Computer Interaction at Carnegie Mellon, and writes an informative WeChat public account on SaaS called 硅谷成长攻略.

(The audio version of this post can be found on the Interconnected YouTube channel):

Type "Databricks" into Google and you'll likely see the first few suggested entries being "ipo", "careers", and "valuation". That's enough to suggest the massive hype around this Big Data unicorn -- the same kind of hype seen with Robinhood recently, or a few years back with Uber and Airbnb.

However, unlike the sharing economy and fintech, Databrick's products and core business remain known to only a small community of VCs, data scientists, and nerdy engineers like me.

This article serves to help more people -- engineer or not -- navigate Databricks, a relatively lowkey unicorn. To do that, I won’t jump straight into the technical weeds of Spark and Delta, but will instead begin with a high-level view of the broader Big Data landscape to explain how Databricks fits into the big picture and where it's headed.

Big Data Landscape in 30 Seconds

While the Big Data market is a $100B+ dollar beast with double-digit year-over-year growth, it really boils down to three key questions, two fundamental concepts, and one common goal (or the holy grail).

Three questions:

  • What is the data? (systems-generated data, commercial data, user data, etc.)
  • What to do with the data? (business intelligence, analytics, machine learning, dashboarding, etc.)
  • How to move data from A to B? (ETL (Extract, Transform, Load) and its many variants)

The answer to each question has room for a dozen different solutions. Each solution has enough market demand and commercial interest to support numerous public companies.

Two fundamentals:

  • Storage (data persisted to a physical disk, either on-premise, or in the cloud)
  • Compute (renting a server to perform some computational tasks on the stored data)

Compute is usually where most of the money is spent, meaning that keeping servers up and running is significantly more expensive than allocating physical space on disks to store data. There are also other areas of spending like networking, but that’s less relevant in our discussion about Databricks and Big Data. (For a full explanation of how storage, compute, and networking work together, see Kevin Xu’s earlier post: “How To Tax Cloud Computing?”)

One Holy Grail

The one common goal that every company wants to achieve with their data is: an automated journey that takes raw datasets and turns them into business value.

In the pre-web days, when most of our data could sit comfortably in Excel spreadsheets, this was easy: deploying a VBScript with several functions would suffice. Now, we are drowning in a gazillion YouTube video watch events -- start, pause, skip ads, etc. -- being collected, parsed, aggregated, and eventually turned into one single dashboard. And all this happens on an hourly basis.

A Thousand Paths

While the end goal is common, there are (literally) a thousand paths to reach this holy grail. Every SaaS or open source company operating in the Big Data domain falls into one or many parts of this “data journey”. While the cloud giants (AWS, Azure, GCP) are all attempting to build an all-in-one service, there's plenty of room for smaller vendors with creative solutions to find a foothold, grow, and become a giant themselves. Databricks is one of these examples.

To appreciate how vast the Big Data landscape has grown to become, I find Matt Turck's landscape graphic (updated yearly) quite handy. Over just the last few years, not only has the number of companies in each category grown, but new clusters of solutions are also popping up to form sub-categories.

So, what does Databricks do?

If you're confused by the tagline on Databricks's homepage ("All your data, analytics and AI on one Lakehouse platform"), you are not alone. But with the big picture I just laid out in mind, let’s see if we can make some sense out of this tagline and the company itself.

In essence, Databricks is a compute company that presents a low-code solution to pursue the holy grail by answering the two of the three key questions:

  • What to do with the data?
  • How to move data from A to B?

Why does answering these two questions make Databricks a valuable company? An analogy that may help illustrate this point is the more well-known example of Arduino, an open source electronic prototyping platform.

Imagine you are tasked to build an FM radio with a bunch of electronic components. It's daunting, at least if you are like me, with little hands-on hardware building experience.

Make a radio from electronics parts? (Left: Youtube - GreatScott!, Right: Wikipedia)

That daunting feeling is what engineers feel when building the “data journey” with barebone software components: renting virtual machines from AWS and writing Python scripts to compute on some datasets. You'll soon run into roadblocks like: how to coordinate multiple machines to parallelize the process, how to handle machines that go rogue and lose connection, etc.

Arduino made prototyping and building digital devices easy. Instead of starting from switches and circuit breakers, the platform provided three important offerings to appeal to non-hardware engineers:

  • A higher-level language that packages up useful functions to control the Arduino board. (For example, simple tasks like reading from a digital pin would take a non-trivial amount of C/C++ coding, but in Arduino it’s just a single function call: digitalRead(pin).)
  • An ecosystem based on breadboards, where modules can be easily pushed into the sockets to facilitate rapid prototyping.
  • A feasible path from turning Aruidno prototypes into production-ready hardware.

If you think Arduino’s platform is powerful, then it’s not hard to see how Databricks is also powerful, because it also offers three core building blocks, but for Big Data:

  • Apache Spark: an open source engine abstracting away the complexity of working with distributed computing. Spark makes the task of taking one petabyte of data stored in 1,000 machines and multiplying a column of each individual record by 2, while at the same time staying resilient to random machine outages, as simple as a single command: .map(lambda x: x * 2). This command is the same as if you were to do the same task on a single machine. Spark handles all the underlying complexities and only surfaces the higher-level functions, packaged in Python, Scala, and SQL.

The founders of Databricks, a bunch of “Berkeley hippies”, as CEO Ali Ghodsi calls themselves, were the original creators of Spark. These “hippies” live and breathe distributed computing.

  • Notebooks: a plug-and-play environment to prototype with data. Consider Notebooks as the breadboards -- an interface to get quick feedback when working with unfamiliar datasets. Each “cell” in a Notebook can run a few lines of code and show results instantly, much like an inserted Arduino component that performs some tasks.
An example Notebook with three cells and their results running

Notebooks are not unique to Databricks; it’s a common interface to many data scientists and analysts. Jupyter Notebook is another popular solution. Yet, when Spark is combined with Databricks’s Notebooks, a data scientist gets superpower. These Notebooks, by default, are connected to an enormous amount of distributed computing power via Spark, thus can run calculations on really big datasets -- something a single machine cannot achieve.

As it turns out, connecting Notebook to underlying computing resources at scale is a non-trivial engineering workload. It’s the type of work that Databricks’s core users, mostly data scientists and analysts, find boring and burdensome. In many companies, these are the same people, who are in charge of discovering the data “holy grail”! Much like people who use Arduino, they rarely have a complete design in mind to begin with. Their work is a lot of prototyping and trial and error, whether it's cleaning the data or training a machine learning model. Databricks comes in at the right moment to provide them with the familiar Notebook interface, plus the distributed computing “superpower” to make working with Big Data easy.

You may recall my earlier post on Low Code No Code and the core value proposition of these products: freeing engineers from low-impact programming. Databricks tells a perfect low-code story with its plug-and-play Notebook offering for data scientists and analysts.

  • Jobs: a pathway from development to production. So we have a dozen Notebook cells prototyped and functioning. What now? We don't want to manually run each cell whenever there is new data coming in. Also to bring down prototyping cost, we typically only bring in a subset of data to work against, not the entire dataset.

    Databricks takes a genius approach by offering a feature called "Job". A “job” grabs a working Notebook and runs it on a user-defined schedule with an auto-scaling mechanism, so it can work with the full dataset in production.

    This significantly cuts down the time required to take a working prototype from development to production. The power of Spark really comes into play here -- the same code can seamlessly transition from handling a few gigabytes of data on a few machines to processing many petabytes of data across thousands of machines.

How Does Databricks Make Money?

Like I mentioned, Databricks is a compute company. Like other companies in the category, it makes money by keeping servers up for customers.

It is worth noting that Databricks does not own any of these servers. Instead, the whole system is built upon the infrastructure of the major cloud providers (AWS, Azure, GCP). For Databricks users, their bills largely break down into two parts: the cost of keeping these machines up and running (charged by the underlying cloud provider), and the "add-on" functionalities like Notebooks and Jobs that Databricks provide on top of these machines. Both types of costs are charged on an hourly basis -- a strictly pay-as-you-go model.

Databricks is not the only company building on the shoulders of the “cloud giants”. The other well-known player and fierce competitor of Databricks, Snowflake, also employs a similar approach. Snowflake’s easy-to-use cloud data warehouse is backed by the same set of cloud infrastructure building blocks.

Is it possible for capable engineers to use the same building blocks, like AWS EC2 and S3, to build an in-house version of Databricks to reduce cost?

Certainly possible, but likely not worth it, at least not until this company reaches a certain scale. Going back to our all-code vs low-code discussion, just because you could build something doesn't mean you should. Databricks offers features like cluster auto-hibernation, sharing, and permission management to convince users that paying Databricks is better money spent than devoting engineering hours to build and maintain something similar in-house.

"All your data, analytics and AI on one Lakehouse platform"

By now, I hope this tagline is starting to make a bit more sense. Databricks is basically trying to own the entire “data journey”, which means more data, more machines, more servers to keep running, and more money for the company. Of course, Databricks customers get value by not having to do any of this work themselves.

Looking at Databricks's roadmap, it isn't hard to notice its ambition to expand beyond just a "Notebook" company (which many people consider it to be right now).

While data scientists and analysts love Notebooks and don’t mind paying the "low-code premium", data engineers are a different breed. They work closer to the data sources and are responsible for data ingestion, maintaining infrastructure, and performance tuning. These engineers value flexibility and don’t mind writing some extra code to get the job done the way they like it.

An example split of talent and tools along the data journey

Data engineers work on the earlier phases of the “data journey”, which also happen to be compute-intensive and costly (in other words, a good source of revenue). To win them over, Databricks's solution is to invest more in open source, namely by introducing Delta Lake (where Lakehouse comes from), an open source storage layer. Open-sourcing gives these engineers the much needed flexibility to exert control over data governance, movement, and reliability. And while anyone can use the ".delta" format with any solution, not just Databricks’s solution, it is supported out of the box in Databricks Notebooks. Combining that with features like "Live Table", the whole platform seeks to win over even the most hardcore data engineers by providing observability and tunability.

In the long run, getting all the data personas -- engineers, scientists, analysts -- working on the same platform is tremendously valuable in obtaining the data “holy grail”. The shared context, visibility, and collaboration lead to more agile organizations and faster product lifecycles.

Databricks wants all these personas to live in the same “Lakehouse”. It’s not quite there yet. It’s still building this “house”. But for an eight-year-old company, it’s making impressive progress.

To read all previous posts, please check out the Archive section. New content will be delivered to your inbox once a week. Follow and interact with me on: Twitter, LinkedIn, Clubhouse (@kevinsxu).



在Google 中输入“Databricks”,看到的前几个建议条目很可能是“ipo”、“工作机会”和“估值”。 这足以表明大家对这个大数据独角兽的十足兴趣——如同科技圈对这几年的Robinhood、几年前的Uber和 Airbnb广泛的热情。


这篇文章旨在帮助更多的人——无论是不是业内人士——理解Databricks,这个相对低调的独角兽。 因此,我不会上来对Spark和Delta之类的技术大聊特聊,而是先快速的纵览大数据这个独特领域,再解释Databricks是如何融入其中,前景又如何。




  • 要处理什么数据? (系统生成的日志,或是商用数据集,或是用户数据,等等)
  • 如何使用数据? (商业智能BI,商业分析,机器学习,图表,等等)
  • 如何把数据从源头向下游汇集和转化? (ETL数据管道,以及各类变种)



  • 存储(数据存放在物理磁盘上。无论是在本地的数据中心里,还是云上)
  • 计算(租用一个服务器来对数据进行计算)

计算是花销巨头,原因是,服务器的运行(按时间计算)比为磁盘的占地(空间)昂贵得多。当然,完整的来说,网络也是基础之一,但与这篇Databricks的讨论相关度不大。 (关于存储、计算和网络如何协同工作,请参阅Kevin Xu之前的帖子:“如何对云计算征税?”



在互联网风靡之前,我们的大部分数据还保存在 Excel 电子表格中时,这并不难:写几个VBScript就足够了。 现在,我们的每一个YouTube视频观看事件——开始、暂停、跳过广告等,都被收集、解析、聚合,最终变成某个高管屏幕前的柱状图。 而这一切,很可能时刻都在发生。


虽然最终目标类似,但有可以实现的方式却十分之多。每个在大数据领域运营的 SaaS 或开源公司,都致力于这条“数据流水线”的一个或多个部分。 虽然云巨头(AWS、Azure、GCP)在尝试构建一体化的服务和生态,但拥有创造性解决方案的小型供应商仍有足够的空间找到立足点,并成长并成为巨头。 Databricks、Snowflake就是典型的例子。

如果想了解当今的大数据领域是怎样一个庞然大物,Matt Turck的年度大数据版图很有参考价值。 在过去的几年里,不仅每个类别里的公司数量都有所增长,新的类别也在不断涌现。


如果你对Databricks官方主页上的标语(“一个 Lakehouse 平台,承载所有数据、分析和人工智能”)感到困惑,那你一定不是唯一一个。 但有了上面的概述,让我们看看能否从这个标语得到一些启发,来了解Databricks。


  • 如何使用数据?
  • 如何把数据从源头向下游汇集和转化?



用电子元件造一个FM收音机? (左: Youtube - GreatScott!, 右: Wikipedia)



  • 一种更容易调度Arduino硬件的高抽象语言。(比如,基本如“从一个数据针读取数据”的简单操作,用C或者C++写,工作量不小。而在Arduino平台上,这只用调用一个函数digitalRead(pin)就行了)。
  • 一个架构在面包板上的硬件生态,每一个模块都可以毫不费力地插入拔出,方便快速迭代。
  • 一套可行的、从Arduino原型到可售卖版本的转化流程。


  • Apache Spark:一个开源引擎,通过高抽象语言,极大地降低了分布式计算的门槛。设想一下,现在我们手头有1个PB的数据,分布在1000个机器组成的集群上,现在需要把每条数据里的某个值乘以2,并且处理好过程中任何机器断线造成的结果丢失。用Spark的Python接口来写,只需要一行 x: x*2)。这和在单个机器上对一个数组进行操作完全一样——这种便利背后,Spark包装了所有与机器集群打交道的复杂度,让用户可以用他们熟悉的Python、Scala、SQL语言发布命令。

Databricks的创始人们——用CEO Ali Ghodsi的自嘲来说,一群“伯克利嬉皮士”——是Spark框架的始作俑者。这些大数据嬉皮士们在分布式计算世界里是举足轻重的人物。

  • 数据笔记本:一个即时交互、反馈迅速的数据沙盒。把这个笔记本想象成Arduino的面包板,插上不同的组件,可以快速获得功能上的反馈。在数据笔记本中,用户可以在每一个“单元格”里跑几行代码,实现一个小功能,并实时给出计算结果。

数据笔记本并不是Databricks特有的功能。事实上,这是数据科学家和分析师最常用的工具之一,Jupyter Notebook可以说是最早、最知名的提供商。然而,Spark使得Databricks的数据笔记本获得了超能力:每一个数据笔记本都可以连接上数量巨大的服务器集群,并且通过Spark对大量数据做并行计算。这是单个机器上的数据笔记本远远无法达到的。




  • 任务集:从开发到生产的无缝衔接。现在我们有一个笔记本,其中包含一堆单元格,实现了一个初步的数据流水线,接下来呢?我们自然不想每次有新的数据进来,都需要手动地一个个单元格跑一遍。另外,在开发过程中,为了加快运算速度和降低成本,我们一般只会采用数据集的部分(一个样本),而非整个数据集。




归根究底,Databricks 是一家提供“计算”的公司。 与该类别的其他公司一样,它通过租出服务器的计算资源来赚钱。

值得注意的是,Databricks并不真的拥有这些服务器。 相反,其整个系统建立在主要的云提供商的基础服务上(AWS、Azure、GCP)。 对Databricks用户来说,他们的账单主要分为两部分:常规的服务器租用成本(由底层云提供商收取),以及Databricks在这些服务器上“附加”的功能费用,如数据笔记本和任务集。 这两个成本都是按小时收费的——常见的按用量收费方式。


那么,工程师们是否可以使用相同的构建模块(例如 AWS EC2 和 S3),来自己搭一个“内部版本的Databricks”,来避免给Databricks交钱,降低成本?







在“数据流水线”上,这些数据工程师们往往占据上游,也是运算量、成本极大的一个环节(换句话说,重要的收入来源)。为了赢得他们的青睐,Databricks采用的策略是拥抱开源,尤其是即Spark之后,推出了Delta Lake——一个开源的数据存储方案(这也是Lakehouse的名字来源)。开源的方案给了这些工程师们所需的自由度,使得他们能够更可预期地掌握、挪动数据。Delta的开源,意味着任何人都可以在其方案里使用优秀的“.delta”存储格式,但是这在Databricks的数据本里,有着最为方便的原生支持。与新的功能,比如”Live Table“一起,Databricks平台已经开始赢得一些硬核数据工程师的青睐。