Great post from our friends at Source link
The cloud data warehouse sits at the center of every modern data stack. Without a cloud-based data warehouse, it is nearly impossible to derive insights from your data. At its core, a data warehouse is an analytics platform where information from various data sources is stored for analysis.
This data is used to make high-level decisions and answer pressing business queries. Today, every company is either already leveraging a data warehouse or in the process of adopting one (if you are reading this, then you are probably the latter). Although there are several key players in the data warehousing space, this post will focus solely on BigQuery and Snowflake.
What Is Snowflake?
Snowflake is a Software-as-a-Service (SaaS) based warehouse solution that can run on any of the popular cloud providers (AWS, Azure, GCP). It was purpose-built for the cloud and has a few key components which make it extremely unique compared to other cloud data warehouses. Snowflake was launched publicly in 2014 and has since become a major player in the data warehousing industry being valued at $90.35 billion as of October 2021.
Snowflake was developed in the cloud and for the cloud. This means it comes with zero baggage and almost no management or operational overhead. As a native SaaS service, Snowflake handles all of the backend infrastructure so that you can focus on doing what matters most: deriving insights from your data. Snowflake offers a ton of scalability enabling near-unlimited concurrent queries.
What Is BigQuery?
Google BigQuery was first launched in 2010 as a part of Google Cloud Platform and was one of the very first data warehouse solutions available in the market. However, at the time, it was largely thought of as a complex query engine. Google BigQuery has come a long way since then and it is no longer the same solution.
Similar to Snowflake, with BigQuery you don’t have to set up or maintain any infrastructure. Instead, you can focus on discovering meaningful insights using standard SQL. Google BigQuery is completely native to Google and doesn’t run on any other cloud provider.
Snowflake is a completely serverless solution that has fully separated storage from compute and is based on ANSI SQL. Its architecture is based on an assortment of traditional shared-disk and shared-nothing architectures to provide you with the best of both worlds. It makes your data available to all compute nodes in the platform by using a central repository for persisted data.
Snowflake leverages MPP (massively parallel processing) to process all of your queries. This means that each individual compute cluster (virtual machine or server) stores a portion of your entire data set locally. For storage, Snowflake organizes your data into separate micro partitions that are then internally optimized and compressed into columnar storage.
In fact, all of the data that is loaded into Snowflake is reorganized, optimized, and compressed into a columnar format so that it can be kept in cloud storage. Snowflake automatically handles all aspects of data storage as it relates to file size, structure, compression, metadata, statistics, and other data objects that are only accessible through SQL queries and are not directly visible to you.
Processing within Snowflake is done using “virtual warehouses” or clusters of compute resources. Each warehouse is an MPP that is composed of several nodes. Snowflake’s cloud services layer coordinates all activities across Snowflake to handle everything from user requests, authentication, infrastructure management, metadata management, query parsing and optimization, access control, etc.
Google BigQuery is very similar to Snowflake in that it is serverless and separated storage from compute. It is also based on ANSI SQL. However, its architecture is quite different. BigQuery uses a vast set of multi-tenant services driven by specific Google infrastructure technologies like Dremel, Colossus, Jupiter, and Borg. Computing in Google BigQuery is done using Dremel which is a large multi-tenant compute cluster used to execute SQL queries.
Dremel does the heavy lifting by turning your SQL queries into execution trees. Tree leaves in BigQuery are called “slots”. They read data from storage and do the necessary computation. The branches of the tree are called “mixers” and they handle all aggregations. A single user on your team can harness thousands of slots to execute queries on an as-needed basis.
Similar to Snowflake, BigQuery compresses data into a columnar format to store data in Colossus, which is Google’s global storage system. Colossus manages data replication, recovery, and distributed management so that you are not reliant on any single point of failure. BigQuery uses Google’s Jupiter network to move your data rapidly from one location to another. All hardware resource allocation and orchestration in BigQuery is done through Borg, which is Google’s precursor to Kubernetes.
Snowflake offers an auto-scaling and auto suspend feature that enables clusters to stop or start during either busy or idle periods. With Snowflake your users cannot resize nodes, but they can resize clusters in a single click. Additionally, Snowflake enables you to auto-scale up to 10 warehouses with a limit of 20 DML per queue in a single table.
On a similar note, BigQuery automatically provisions your additional compute resources as needed and takes care of everything behind the scenes. However, with BigQuery there is a limit to 100 concurrent users by default. Both platforms let you scale up and down automatically based on demand. Additionally, Snowflake gives you the ability to isolate workloads across businesses in different warehouses so that different teams can operate independently with no concurrency issues.
Security and Compliance
Snowflake automatically provides encryption for data at rest. However, it does not provide granular permissions for columns, but it does provide permissions for schemas, tables, views, procedures, and other objects. Conversely, BigQuery provides security at a column-level as well as permissions on datasets, individual tables, views, and table access controls.
Since BigQuery is a native Google offering, you also have the ability to take advantage of other Google Cloud services that have built-in security and authentication to BigQuery, making integrations much easier. Snowflake does not provide any built-in virtual private networking. However, if Snowflake is hosted in AWS, AWS PrivateLink can address this issue.
On the other hand, BigQuery gives you the ability to leverage Google’s virtual private cloud. Both BigQuery and Snowflake are compliant with HIPAA, ISO 27001, PCI DSS, SOC 1 TYPE II AND soc 2 TYPE II, etc.
Both platforms support structured and semi-structured data (Avro, Parquet, Orc, CSV, JSON) and as of September 20th, 2021, Snowflake announced support for unstructured data and has made it available within public preview.
BigQuery and Snowflake enable you to manage user roles, permissions, and data security. All performance tuning happens automatically and as your data volume grows and queries become more complex, each platform automatically scales in the background to address your needs.
Additionally, since each solution is offered as a SaaS service, all of the underlying maintenance and infrastructure is handled for you. BigQuery automatically handles everything and Snowflake lets administrators scale compute and storage layers independently. This means you can isolate workloads without having to deal with the sizing and permission effort associated with virtual warehouses in Snowflake.
BigQuery and Snowflake each do a really good job when it comes to protecting your data. Snowflake has two features to help with this, Time Travel and Fail-safe. With Time Travel, Snowflake preserves a state of your data before it is updated. The standard retention period for Time Travel is one day (Enterprise customers can specify a period of up to 90 days). Time Travel can be applied to databases, schemas, and tables.
With Fail-safe Snowflake can recover historical data. This period is non-configurable and starts immediately after the time travel retention period ends. Although you must ask Snowflake to initiate the recovery, this feature is designed to allow Snowflake to recover any data that may have been damaged or lost due to extreme operational failures.
Administrators in BigQuery can easily revert changes without having to deal with the hassle of a recovery. BigQuery keeps a complete seven-day history of all changes against its tables. However, to preserve table data for longer than seven days, BigQuery offers a feature called table snapshots (snapshots are used to preserve the contents of a table at a particular point in time).
Snowflake’s pricing model bills based on the usage of each individual warehouse, so the cost is largely dependent on your overall usage. Snowflake has several warehouse sizes (X-Small, Small, Medium, Large, X-Large, etc.) which all drastically vary in cost and server/cluster amount. However, Snowflake’s base pricing for an X-small warehouse starts at about $0.00056 per second.
Pricing doubles with each increase in warehouse size. Snowflake has several plans that allow you to pre-purchase credits to cover usage. This is good since the upfront costs of Snowflake’s pre-purchase capacity plans offer lower rates compared to the on-demand option.
On the other hand, BigQuery charges for the number of bytes scanned or read. BigQuery offers on-demand pricing and flat-rate pricing. On-demand pricing charges you for the number of bytes processed in a given query at a rate of $5 per TB (the first TB of data processed per month is completely free of charge). With BigQuery’s flat-rate pricing model, you purchase slots (virtual CPUs) or dedicated resources to run your queries. The monthly cost for 100 slots is around $2,000 (this can be lowered to $1,700 with an annual commitment)
Storage costs for both Snowflake and BigQuery are relatively cheap. Snowflake charges $40 per TB per month for on-demand customers and $23 per month for upfront customers. On the other hand, BigQuery charges $20 per TB for active storage and $10 per TB for inactive storage.
As a native SaaS offering Snowflake was designed to run on any of the major cloud providers (AWS, GCP, Azure). BigQuery is a native Google Cloud offering, meaning that BigQuery is only available if you are on the Google Cloud Platform.
Out of the box with no fine-tuning, Snowflake tends to outperform every data warehouse including BigQuery on query times, having faster performance and execution times. Snowflake and BigQuery are probably more alike than unlike.
BigQuery will most likely be more efficient and have lower compute costs if you are running lots of queries occasionally with a high idle time. On the other hand, if you have more predictable and continuous usage it is likely that it will be more cost-effective to leverage Snowflake.
The Main Differences Between Snowflake and BigQuery
There are several key differences to note between Snowflake and BigQuery. Firstly, scaling within Snowflake is not entirely automatic. It requires you to give some input. On the other hand, BigQuery handles everything automatically for you.
Secondly, Snowflake runs on any of the major cloud providers, whereas BigQuery only runs on GCP. Snowflake is a full SaaS solution and BigQuery is a PaaS solution. Additionally, Snowflake has a unique feature called Secure Data Sharing which gives you the ability to share selected objects in a database with other Snowflake accounts. With Secure Data Sharing, no data is actually copied or transferred between accounts because everything happens in Snowflake’s unique services layer and metadata store.
BigQuery does not have a data sharing feature. However, BigQuery does give you the ability to create authorized views to share query execution results with particular users or groups without giving them access to the underlying tables. BigQuery also has a feature called BigQuery ML which lets you create and execute machine learning models which can greatly improve your query performance.
BigQuery definitely has the edge on Snowflake when it comes to machine learning and real-time streaming workloads. Ultimately, the use case you are trying to solve should be at the forefront of every decision you make when it comes to choosing a new cloud data platform.
What Comes After Snowflake and BigQuery? (Hint: Reverse ETL)
The entire purpose of adopting a modern cloud data warehouse is to consolidate data silos into a centralized data repository so that analysts can leverage business intelligence tools for analytics and reporting purposes to create a single source of truth. In reality, data warehouses just create a larger data silo for your team.
A data warehouse gives your team the ability to access all of your data in one place and to create high-level dashboards and reports for your key stakeholders, but none of this information is actionable for your other business teams. After all, data is only so useful when it is in a report.
This is the exact problem Reverse ETL solves. “Reverse ETL is the process of copying data from a central data warehouse to operational systems of record, including but not limited to SaaS tools used for growth, marketing, sales, and support.”
#BigQuery #Snowflake #Definitive #Guide