Azure SQL Data Warehouse (SQL DW) is a Massively Parallel Processing (MPP) Database as a Service that can run complex queries across petabytes of data quickly. It is Microsoft's direct competitor to Snowflake, Amazon Redshift, and Google BigQuery. To the end user it behaves much like SQL Server; behind the scenes, however, SQL DW distributes the storage and processing of data across many nodes. While this can drastically improve performance for data warehouses larger than a few terabytes, it may not be an ideal solution for smaller implementations.
- Compute vs Storage Costs
- Node-based Architecture
- Table Distribution
- When is Azure SQL Data Warehouse preferred over SQL Database?
- More Resources
Compute vs Storage Costs
Due to its high processing power, Azure SQL Data Warehouse can cost significantly more than SQL Database. To give you an idea, SQL Database can start as low as $5-30 USD/month, while Data Warehouse starts around $1,000 USD/month.
However, SQL DW helps users better manage costs by separating the computation and storage of their data. Users can pause the service, releasing the compute resources back into Azure. While paused, users are only charged for the storage currently in use (roughly $125 USD/month/terabyte). During this time, your data remains intact but unavailable to queries. Resuming SQL DW re-allocates compute resources to your account; your data comes back online and charges resume.
Similar to Azure SQL DB's DTUs, compute in SQL DW is measured in Data Warehouse Units (DWUs). Adjusting DWUs will increase or decrease both the performance and the cost of the service.
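Scaling DWUs can be done in T-SQL from the master database. A minimal sketch, where the warehouse name and service objective are placeholders:

```sql
-- Run against the master database; 'MyDataWarehouse' is a placeholder name.
-- Scale down to DW200 to reduce cost during off-peak hours.
ALTER DATABASE MyDataWarehouse
MODIFY (SERVICE_OBJECTIVE = 'DW200');

-- Check the current service objective of each database.
SELECT db.name, so.service_objective
FROM sys.database_service_objectives so
JOIN sys.databases db ON db.database_id = so.database_id;
```

The scale operation is asynchronous; the warehouse briefly drops connections while compute is re-allocated.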
Node-based Architecture
Processing of data in SQL DW is distributed across many nodes of different types.
The Control node is the brain of the data warehouse. It accepts all queries and runs the MPP engine to optimize and coordinate parallel queries.
Compute nodes provide the computational power. Tables are distributed across the compute nodes to balance and scale out their processing. The number of compute nodes ranges from 1 to 60, determined by the service level.
The Data Movement Service (DMS) moves data across compute nodes when required by certain queries.
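These nodes can be inspected through SQL DW's system views. A small sketch, assuming access to the `sys.dm_pdw_nodes` and `sys.dm_pdw_dms_workers` DMVs (the column selection here is illustrative):

```sql
-- List the nodes backing the warehouse (type is CONTROL or COMPUTE).
SELECT pdw_node_id, type, name
FROM sys.dm_pdw_nodes;

-- Inspect recent data-movement work coordinated by the DMS.
SELECT request_id, type, status, total_elapsed_time
FROM sys.dm_pdw_dms_workers;
```

Frequent, long-running DMS activity is often a sign that table distributions do not match the query workload.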
Table Distribution
To balance processing across many nodes, tables must be split up, a process also known as sharding. The distribution method determines how the rows of a table are split across nodes.
Round-robin (default) - Distributes rows evenly across nodes at random. Not optimized for query performance, as there is no logic to how the data is split; joins to round-robin tables often require shuffling data, which takes more time.
Ideal for Staging Tables.
Replicated - Replicates all rows of the table on every node. As you can imagine, load times are not optimal. However, queries against these tables are fast because shuffling of data is never necessary.
Ideal for Dimension Tables smaller than 2 GB.
Hash - Rows are distributed across nodes using a "distribution column". SQL DW hashes this column to assign each row to a node, keeping all rows with the same value on the same node. This option delivers the highest query performance for joins and aggregations on large tables.
Ideal for Fact Tables and Dimensions larger than 2 GB.
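All three methods are chosen in the table's `WITH` clause at creation time. A sketch using hypothetical staging, dimension, and fact tables:

```sql
-- Staging table: round-robin spreads rows evenly with no distribution logic.
CREATE TABLE stg_Sales
(
    SaleId    INT,
    ProductId INT,
    Amount    DECIMAL(18, 2)
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Small dimension: a full copy on every compute node avoids data shuffling.
CREATE TABLE dim_Product
(
    ProductId   INT NOT NULL,
    ProductName NVARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

-- Large fact table: hash on the column most often joined or grouped on.
CREATE TABLE fact_Sales
(
    SaleId    INT NOT NULL,
    ProductId INT NOT NULL,
    SaleDate  DATE,
    Amount    DECIMAL(18, 2)
)
WITH (DISTRIBUTION = HASH(ProductId), CLUSTERED COLUMNSTORE INDEX);
```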
Choosing a Distribution Column
To help choose a distribution column that will result in the best performance, you can follow these guidelines.
- No Updates - Distribution columns cannot be updated
- Even Distribution of Values - for best performance, all distributions should have the same number of rows, since a query can only be as fast as its largest distribution. To achieve this, aim for columns that have:
- Many unique values - the more unique values, the better the chance of an even distribution
- Few or No Nulls
- Not a Date Column - if all users filter on the same date, a single node ends up doing all the processing.
- JOIN or GROUP BY Column - selecting a distribution column that is commonly used in a JOIN or GROUP BY clause reduces the amount of data movement needed to process a query.
- If no good option - Create a Composite Column using multiple join columns
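Because the distribution column cannot be updated or changed in place, a common pattern is to check for skew and then rebuild the table via CREATE TABLE AS SELECT (CTAS). A sketch, with `fact_Sales` and `CustomerId` as placeholder names:

```sql
-- Check space and row counts per distribution to spot skew.
DBCC PDW_SHOWSPACEUSED('fact_Sales');

-- Rebuild the table with a new hash column, then swap the names.
CREATE TABLE fact_Sales_new
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM fact_Sales;

RENAME OBJECT fact_Sales TO fact_Sales_old;
RENAME OBJECT fact_Sales_new TO fact_Sales;
```

If the largest distribution holds far more rows than the rest, queries will bottleneck on that one node regardless of the DWU level.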
When is Azure SQL Data Warehouse preferred over SQL Database?
> 1TB Database
Since tables in Azure SQL DW are always spread across 60 distributions, performance gains are typically not realized until your data warehouse exceeds 1-5 TB. As a general rule, data warehouses smaller than 1 TB will perform better on Azure SQL DB than on SQL DW.
> Billion Row Tables
Database size is not the only consideration. Since distribution happens at the table level, if all of your tables have fewer than 100 million rows you may not see a significant performance boost from SQL DW.
< 128 Concurrent Queries
Once SQL DW has received more than 128 concurrent queries, it will begin to queue them on a first-in, first-out basis. Azure SQL DB and Analysis Services can support many more concurrent queries. To work around this limitation, Microsoft recommends feeding data into an Analysis Services server for higher-concurrency demand.
1:1 Feature Parity is not required
While Azure SQL DW supports many of the read and write operations of traditional SQL Server, there are significant differences: MERGE statements are not supported, and INSERT and DELETE operations have some limitations as well. This will typically require completely redeveloping an existing solution for Azure SQL DW.
How tables are distributed should be based on how users query the data, and the chosen approach can drastically affect performance. SQL DW is not a magic fix for all of your query performance problems; just like a data warehouse running on traditional SQL Server, it requires monitoring and tuning of the distribution keys to ensure the best performance.
Table distribution works best when your queries are relatively predictable, using the same JOIN and GROUP BY clauses, so SQL DW may not be the best solution for data discovery. In addition, tools like Power BI can run many queries at once to handle filters, so Analysis Services is still recommended as an intermediate layer for Power BI.