Great post from our friends at Source link
There is a lot of discussion surrounding Snowflake and Databricks in determining which modern cloud solution is better for analytics. However, both solutions were purpose-built to handle different tasks, so neither should be compared from an “apples to apples” perspective.
With that in mind, I’ll do my best to break down some of the core differences between the two and share the pros/cons of each as unbiasedly as possible. Before diving into the weeds of Snowflake and Databricks though, it is important to understand the overall ecosystem.
The Difference Between Cloud Data Warehouses, Data Lakes, and Data Lakehouses
A data warehouse is an analytics database and central repository for data that has already been structured and filtered. The main users of a data warehouse are typically data analysts who have a vast knowledge of SQL and are extremely competent at manipulating data. Warehouse data is optimized and transformed so that it can be accessed extremely quickly. Data within the warehouse is queried and analyzed to derive insights and build reports and dashboards to drive business outcomes.
A data lake is similar to a data warehouse in that it collects and stores data. However, data lakes are designed to handle Big data or large amounts of raw data, unstructured data, or semi-structured data. Due to this, data lakes typically house much larger quantities of data compared to a warehouse and address more use cases. Data lakes were originally adopted by large enterprise companies like Yahoo and Google in the early 2000s because they were acquiring such large amounts of data.
Data lakes are extremely proficient at handling streaming data that is being ingested in real-time. In most cases, some data is loaded into a data lake to be stored for future use. Because data lakes consist of raw unprocessed data, a data scientist with specialized expertise is typically needed to manipulate and translate the data. One of the main use cases for a data lake is to leverage it for batch processing, while also enabling self-service ELT for data ingestion, processing, and scheduling to automate the complexity of building and maintaining data pipelines.
A data lakehouse tries to take the best of both the warehouse and the data lake and combine them into one platform. It’s often thought of as a new architecture that places a high emphasis on open data management, whereas the architecture for warehouses and data lakes is relatively simple. The goal of a data lakehouse is to eliminate multiple query engines by running analytics directly against a data lake itself. To learn more, check out our deep dive on data lakehouses.
What Is Snowflake?
Snowflake is a cloud-based data warehouse that is purpose-built to run on any of the major cloud providers like AWS, Microsoft Azure, and Google Cloud Platform (GCP). It is a Software-as-a-Service (SaaS) solution that is designed to enable businesses to collect and consolidate data into a centralized location for analytics purposes.
Snowflake is considered one of the largest companies in the cloud data warehousing industry. At its core, Snowflake is a self-serve platform built to support Business Intelligence use cases. It lets users leverage SQL to query data and create reports and dashboards to drive business decisions.
What Is Databricks?
Similar to Snowflake, Databricks is a cloud-based data platform. However, Databricks is a data lake, not a data warehouse. Today though, Databricks falls more into the category of a data lakehouse. Whereas Snowflake focuses on analytics and reporting, Databricks places a higher emphasis on machine learning, data science, and streaming use cases. Supporting multiple development languages, Databricks is more tailored towards data engineers and data scientists.
As a big data platform, Databricks is largely used to store large amounts of raw unprocessed data. Databricks is based on Apache Spark and was actually built by the original creators of Spark. In simple terms, it is a management layer built around Apache Spark’s distributed computing framework, designed to remove the burden of managing and maintaining infrastructure.
Snowflake is a serverless solution based on ANSI SQL that has completely separated storage and compute processing layers. Snowflake leverages massively parallel processing (MPP) to process queries, with every individual virtual warehouse (i.e., compute clusters) storing a portion of the entire data set locally. Snowflake uses micro partitions to organize and internally optimize data into a compressed columnar format so that it can be kept in cloud storage.
This is all done automatically because Snowflake manages everything from file size, compression, structure, metadata, statistics, and other data objects that are not directly visible to the user and can only be accessed through SQL queries. All processing within Snowflake is done using compute clusters known as virtual warehouses, which consist of several MPP nodes.
As a SaaS solution Snowflake manages a lot on the back end from user requests, infrastructure management, metadata, authentication, query parsing, access control, optimization, etc. Best of all it runs on all three major clouds, AWS, GCP, and Azure. Snowflake’s warehousing technology makes it really simple for users to query data extremely fast with SQL.
Databricks is similar to Snowflake in that it is a SaaS solution, but the architecture is quite different because it is based on Spark. Spark is a multi-language engine built around single nodes or clusters that can be deployed in the cloud. Similar to Snowflake, Databricks currently runs on AWS, GCP, & Azure.
It operates out of a control plane and a data plane. The data plane includes any data that is processed and the control plane includes all backend services managed by Databricks. Like Snowflake, Databricks is also serverless, which means that it supports near-unlimited concurrency. There are several core components to Databricks architecture:
Databricks Delta Lake
Delta lake is Databricks’ version of a data warehouse. It is an additional storage system that is built to run on top of a conventional data lake. At its core, Delta Lake unifies streaming and batch data processing on existing data lakes and acts as the middle ground between Spark computing and cloud storage. Delta Lake architecture consists of 3 table types:
- Bronze Tables: raw data
- Silver Tables: slightly clean data – not ready for consumption
- Gold Tables: clean consumable data
All data kept within Delta tables is stored as a parquet file in cloud storage.
Databricks Delta Engine
Delta Engine is a high-performance query engine that is compatible with Apache spark that helps process the data in Delta Lake. It works to accelerate the performance of Delta Lake for SQL and DataFrame workloads. This is done through its improved query optimizer, which acts as a caching layer strategically placed between the execution layer and cloud object storage, and an execution engine that is written in C++.
Notebooks contain runnable code, visualizations, and narrative text that can be accessed through a web-based interface. With these notebooks, users can collaboratively build models using Scala, R, SQL, and Python.
MLflow is another open-source platform that Databricks created to configure machine learning environments and run tests from existing libraries like Spark, TensorFlow, ONNX, etc. to help manage machine learning and data science lifecycles reliably at scale. MLflow has three core components, experiment tracking, model management, and model development.
Snowflake has an auto-scaling and auto suspend feature that stops and starts clusters during idle and busy periods. Snowflake does not let users resize nodes, but clusters can be resized in a single click. Users can even autoscale up to ten warehouses with a limit of twenty DML per queue in a single table.
Databricks also has an auto-scaling feature where clusters spin up or down depending on usage from both individual queries and concurrent users. However, making changes within Databricks often requires substantially more effort as the UI is more complex since it is designed for data scientists.
Security and Compliance
Within Snowflake, all data is automatically encrypted at rest. Likewise, all data stored within Databricks is encrypted and all communications that take place between the control plane and data plane happens within the cloud provider’s private network.
Both solutions provide RBAC (role-based access control). Snowflake and Databricks are both compliant with an assortment of regulations and certifications like SOC 2 Type II, ISO 27001, HIPAA, GDPR, etc.
However, unlike Snowflake, Databricks does not have a storage layer because the service runs on top of object-level storage like AWS S3, Azure Blob Storage, Google Cloud Storage, etc.
Snowflake and Databricks both support semi-structured (Parquet, Avro, Orc, CSV, JSON) and structured data. As a data lakehouse, Databricks is designed to handle any time type or format of data including unstructured data. Snowflake announced support for unstructured data in September of 2021. Databricks definitely has a leg up in this area because it supports multiple development languages.
Snowflake was really designed as a data analyst tool with its core competencies based around SQL. SQL is a great tool, but it is mainly useful at handling data transformations. Databricks’ Spark engine is much more suited for addressing streaming, ML, AI, and data science workloads because users can leverage languages. Snowflake only just recently announced support for Python, Java, and Scala with the introduction of Snowpark.
Databricks has eliminated a large amount of the infrastructure effort that was associated with managing and operating Spark, but there is still a lot of manual input required on the user’s part to resize clusters, update configurations, and switch computing options. Databricks also has a high barrier to entry because the learning curve is much steeper.
Snowflake is much simpler because it is SQL-based – it only takes a few mouse clicks to get started. Databricks allows users to secure logs, control job properties and ownership, in addition to job execution. Snowflake also provides granular control over objects, roles, users, privileges, access, etc.
Snowflake has two unique features known as Time Travel and Fail-safe. Snowflake’s Time Travel feature preserves a state of data before it is updated. In general Time Travel is limited to one day, but Enterprise customers can specify a period of up to 90 days. This feature can be applied to tables, schemas, and databases. Fail-safe is a 7-day period that begins immediately after the Time Travel retention period ends and is used to protect and recover historical data.
Databricks Delta Lake also has a Time Travel feature, that works very similar to Snowflake. Data stored within Delta Lake is automatically versioned so that historical versions of that data can be accessed for future use. One of the major advantages of Databricks lies in the fact, that it runs on Spark and since Spark is based on object-level storage, Databricks does not actually ever store any data. This also means that Databricks could address on-prem use cases.
Snowflake’s billing model is based on individual warehouse usage. Warehouses come in several sizes (X-Small, Small, Medium, Large, X-Large, etc.). The cost and server/cluster number for all of these sizes vary drastically. Base pricing for an X-small Snowflake warehouse at approximately $0.00056 per second. There is likely some sort of discount depending on the tier chosen (i.e. Standard, Standard +, Enterprise, and Business Critical).
However, with each increase in warehouse size, the pricing doubles. Snowflake does offer several plans to pre-purchase credits based on usage. In most cases, the pre-purchase capacity plans offer lower rates compared to the on-demand price. Snowflake’s “pay-as-you-go” model has eliminated the expensive capacity that companies had to purchase with traditional warehouse solutions. On-demand storage costs for Snowflake begin at $40 per month and $23 per TB for upfront customers.
Storage within Databricks is substantially and cheaper compared to Snowflake because everything is stored within the customer’s own object-level storage environment and this can be highly optimized as some data may not need to be accessed as frequently. For example, storage in S3 starts at $23 per terabyte, but this could get substantially lower depending on the scale of data and how frequently it needs to be accessed.
Compute pricing for Databricks is based on DBUs (Databricks processing units). There are three business price tiers, Classic, Premium, and Enterprise. Pricing starts as low as $0.07 per DBU and can get as high as $0.65 depending on the DBU size.
As managed SaaS services, Snowflake and Databricks both do a really good job of handling all of the back-end infrastructure required to get their solutions up and running. Since Databricks, is based around Spark though, more manual input and fine-tuning is required to fully leverage the solution. Both solutions can run in multiple different cloud environments though.
Snowflake and Databricks are such drastically different solutions that it is really difficult to compare them from a performance standpoint. Every benchmark can be tailored to tell a slightly different story. The recent drama surrounding the recent TPC-DS benchmark done by Databricks is a good example of this.
Snowflake and Databricks enable slightly different use cases and neither is necessarily better than the other as it relates to a head-to-head solution. However, it is important to note that Snowflake optimizes all storage at the time of ingestion for data access, which can make it a better solution for interactive queries.
The Main Difference Between Databricks and Snowflake
Snowflake is extremely robust at handling production-level Business Intelligence workloads, where reports and dashboards are needed to run or be produced on a consistent basis. It was designed to replace the traditional data warehouse and provide faster performance.
However, it is not optimized to handle gigantic amounts of data, specifically streaming use cases because it is based around SQL. Snowflake was built to democratize data and facilitate the work of data analysts in a simple fashion. It is beginning to branch out and handle more workloads, but the core use case it will always be is data warehousing.
As a solution that is based on Spark, Databricks lakehouse platform enables a broader range of capabilities, specifically around ELT, data science, and machine learning. Databricks lets customers keep data in their own managed object storage and provides a managed Delta Lake (data processing engine) and Delta Engine (SQL query engine) experience on top.
With Databricks Delta Lake and Delta Engine, the platform can basically do everything that Snowflake can do and more. The architecture is designed to cover all data workloads. However, Databricks is a much more complex tool, optimizing the platform and building a fully functional lakehouse can take some time, and there is substantially more maintenance to keep everything up and running compared to Snowflake.
The easiest way to understand the difference between Snowflake and Databricks is to think of a box of Lego. Databricks provides all of the pieces and the instruction manual to set up a fully functional data lakehouse, and Snowflake comes pre-built and ready to be used for immediate value.
At the end of the day, both data platforms can enable high-performance SQL queries and provide benefits for customers. In many cases, Databricks and Snowflake often work parallel, with Databricks handling data engineering, ETL/ELT, data science, and streaming workloads, and Snowflake handling production data to power analytics.
What Comes After Databricks and Snowflake? (Hint: Reverse ETL)
Databricks and Snowflake both work to consolidate siloed data into a central repository for analysis, so that data analysts, data scientists, and engineers can make use of that data. The problem is, this data is only ever used for reporting purposes. It only exists in a report or Dashboard, so data lakehouses and data warehouses, which were originally adopted to de-silo data have inadvertently become a data silo in themselves.
All of the data within these platforms is only accessible to technical users. In reality, the business teams like sales, marketing, support, customer success, etc. are the ones who need access to this information the most. Data needs to be operationalized and this is the exact problem that Hightouch solves with Reverse ETL.
#Databricks #Snowflake #Definitive #Guide