Redshift distribution keys

8/5/2023

Choosing the right distribution styles is crucial when it comes to optimising AWS Redshift performance, and if you don’t get this right, it can lead to a significant slowdown in query performance. The distribution style is how a table’s data is distributed across the nodes in AWS Redshift. For instance, a distribution style of ‘All’ copies the data across all nodes. You apply distribution style at table level, i.e. for each table in your cluster, you tell AWS Redshift how you want to distribute it: All, Even or Key.

We have found that how you specify distribution style is super important in terms of ensuring good query performance for queries with joins. The options you choose here also have an impact on data storage requirements, required cluster size and the length of time it takes to execute the ‘Copy’ command. There are 3 different distribution styles and it’s important to understand each one, as well as how they work together.

‘All’ – This is the simplest distribution style. If you set a distribution style of ‘All’, you instruct Redshift to simply make a copy of the table on every node in the cluster. The upside of this is that when you ask the cluster to run a query which includes a join, each node executing that join definitely has a local copy of the table you have distributed using ‘All’. As such, AWS Redshift does not have to copy the required data across the network from node to node to complete the query. If a particular node was tasked with completing part of a joined query and didn’t have a required table locally, it would have to get the data it needed across the network, negatively and significantly affecting query performance. The downside of using ‘All’ is that you have a copy of the table on every node in the cluster. That takes up space, increases the length of time it takes to use the ‘Copy’ command to upload data into Redshift, and ultimately means that you’ll need a larger cluster.

‘Even’ – Specifying ‘Even’ distribution spreads the table rows over all the nodes in the cluster, well, evenly! Queries involving that table are then distributed over the cluster, with each slice on each node working to provide the answer in parallel. With no joins involved, this is a good choice. When joins are involved, however, the matching rows from the tables being joined may not all be on the same node, and they then need to be redistributed over the network. Of course, if you join an ‘Even’ and an ‘All’ table together, no redistribution is required because the rows of the ‘All’ table are available everywhere.

‘Key’ – With a key distribution set, you specify a column to distribute on and then, cleverly, AWS Redshift ensures that all the rows with the same value of that key are placed on the same node. Key distribution across nodes is really important. Let’s say you had a table of costs, with a column called department code. If you set the table to be distributed across the cluster using the ‘department code’ column as the key, Redshift will ensure that all the costs for a given department are neatly placed onto a single given node.
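To make the three styles concrete, here is a minimal Python sketch of a toy placement model. It is not how Redshift is implemented: the four-node cluster, the `place_rows` and `nodes_holding` helpers, the sample cost rows and the use of CRC32 as the hash are all assumptions made for illustration. Real Redshift places rows on slices within nodes and uses its own internal hash function.

```python
import zlib
from collections import defaultdict

NODES = 4  # hypothetical cluster size, for illustration only


def place_rows(rows, style, key=None, nodes=NODES):
    """Toy model: return {node_index: [rows]} for one distribution style."""
    placement = defaultdict(list)
    for i, row in enumerate(rows):
        if style == "ALL":
            # Every node stores a full copy of the table.
            for n in range(nodes):
                placement[n].append(row)
        elif style == "EVEN":
            # Round-robin placement that ignores the row contents.
            placement[i % nodes].append(row)
        elif style == "KEY":
            # Rows with the same key value always hash to the same node.
            # CRC32 is a stand-in; Redshift's real hash is internal.
            n = zlib.crc32(str(row[key]).encode()) % nodes
            placement[n].append(row)
        else:
            raise ValueError(f"unknown distribution style: {style}")
    return dict(placement)


def nodes_holding(placement, key, value):
    """Set of nodes that hold at least one row where row[key] == value."""
    return {
        n for n, rows in placement.items()
        if any(r[key] == value for r in rows)
    }


# Hypothetical cost rows with a department code column.
costs = [
    {"department_code": d, "amount": a}
    for d, a in [("FIN", 100), ("HR", 40), ("FIN", 250),
                 ("IT", 75), ("HR", 60), ("FIN", 10)]
]

by_key = place_rows(costs, "KEY", key="department_code")
by_even = place_rows(costs, "EVEN")
by_all = place_rows(costs, "ALL")

# KEY: all FIN rows share one node, so a join on department_code stays local.
print("KEY  nodes with FIN:", nodes_holding(by_key, "department_code", "FIN"))
# EVEN: FIN rows are scattered, so a join may need rows moved between nodes.
print("EVEN nodes with FIN:", nodes_holding(by_even, "department_code", "FIN"))
# ALL: every node has everything, at the cost of storing the table NODES times.
print("ALL  rows stored:", sum(len(rows) for rows in by_all.values()))
```

In Redshift itself the style is declared when the table is created, using `DISTSTYLE ALL`, `DISTSTYLE EVEN`, or `DISTSTYLE KEY` together with `DISTKEY (column)` in the `CREATE TABLE` statement.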
