It is the fast, petabyte-scale data warehouse service which is designed for OLAP scenarios and is mostly used in high performance analysis and reporting of large datasets.
It gives you fast querying capabilities over structured data using standard SQL commands. It can connect well with ODBC and JDBC which further helps to integrates with various data loading, reporting, datamining, and analytics tool.
Clusters and Nodes:
A cluster is composed of a leader node and one or more compute nodes. A clients interacts directly with leader nodes and compute nodes are transparent to external applications.
Amazon RedShifts supports Six different nodes types and each can be of different CPU, memory and Storage and which are further grouped in to Dense compute or Dense Storage class.
A Dense Compute nodes types supports clusters up to 326TB using fast SSD where as Dense storage nodes supports clusters up to 2PB using large magnetic disks.
Following figure shows component of Amazon RedShift Dataware house cluster.
Each cluster contains one or more databases. User data for each table is distributed across the compute nodes. Your application or SQL client communicates with Amazon Redshift using standard JDBC or ODBC connections with the leader node, which in turn coordinates query execution with the compute nodes. Your application does not interact directly with the compute nodes.
The disk storage for a compute node is divided into a number of slices. The number of slices per node depends on the node size of the cluster and typically varies between 2 and 16. The nodes all participate in parallel query execution, working on data that is distributed as evenly as possible across the slices.
You can increase query performance by adding multiple nodes to a cluster. When you submit a query, Amazon Redshift distributes and executes the query in parallel across all of a cluster’s compute nodes. Amazon Redshift also spreads your table data across all compute nodes in a cluster based on a distribution strategy that you specify. This partitioning of data across multiple compute resources allows you to achieve high levels of performance.
Amazon Redshift allows you to resize a cluster to add storage and compute capacity over time as your needs evolve. You can also change the node type of a cluster and keep the overall size the same. Whenever you perform a resize operation, Amazon Redshift will create a new cluster and migrate data from the old cluster to the new one. During a resize operation, the database will become read-only until the operation is finished.
Table Design: Amazon Redshift supports one or more database and each database contains many tables which has rows and column and on which data is kept. CRETATE TABLE command is used to create table.
Data Types: Amazon Redshift supports numeric data types like: INTEGER, DECIMAL, DOUBLE. It supports text data types like CHAR, VARCHAR and date data types like DATE and TIMESTAMP.
Compression Encoding: Data compression is one of the main feature of Amazon RedShift. When we load data for first time in to empty table, Amazon RedShift automatically sample data and select the best compression techniques for each column or you can specifically mention encoding scheme on per column basis when we issue CREATE TABLE command.
One of the primary decisions when creating a table in Amazon Redshift is how to distribute the records across the nodes and slices in a cluster. You can configure the distribution style of a table to give Amazon Redshift hints as to how the data should be partitioned to best meet your query patterns. When you run a query, the optimizer shifts the rows to the compute nodes as needed to perform any joins and aggregates.
When creating a table, you can choose between one of three distribution styles: EVEN, KEY, or ALL.
- EVEN distributionThis is the default option and results in the data being distributed across the slices in a uniform fashion regardless of the data.
- KEY distributionWith KEY distribution, the rows are distributed according to the values in one column. The leader node will store matching values close together and increase query performance for joins.
- ALL distributionWith ALL, a full copy of the entire table is distributed to every node. This is useful for lookup tables and other large tables that are not updated frequently.
Another important decision to make during the creation of a table is whether to specify one or more columns as sort keys. Sorting enables efficient handling of range-restricted predicates. If a query uses a range-restricted predicate, the query processor can rapidly skip over large numbers of blocks during table scans.
The sort keys for a table can be either compound or interleaved. A compound sort key is more efficient when query predicates use a prefix, which is a subset of the sort key columns in order. An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
Amazon Redshift supports standard SQL commands like INSERT and UPDATE to create and modify records in a table. For bulk operations, however, Amazon Redshift provides the COPY command as a much more efficient alternative than repeatedly calling INSERT.
When loading data from Amazon S3, the COPY command can read from multiple files at the same time. Amazon Redshift can distribute the workload to the nodes and perform the load process in parallel. Instead of having one single large file with your data, you can enable parallel processing by having a cluster with multiple nodes and multiple input files.
After each bulk data load that modifies a significant amount of data, you will need to perform a VACUUM command to reorganize your data and reclaim space after deletes. It is also recommended to run an ANALYZE command to update table statistics.
To Query table, Amazon Redshift allows us to write SQL commands. For large Amazon Redshifts clusters which supports large numbers of users, WLM (Workload Management) is configured to queue and priorities Queries.
Similar to Amazon RDS, you can create point-in-time snapshots of your Amazon Redshift cluster. A snapshot can then be used to restore a copy or create a clone of your original Amazon Redshift cluster. Snapshots are durably stored internally in Amazon S3 by Amazon Redshift.
Amazon Redshift supports both automated snapshots and manual snapshots. With automated snapshots, Amazon Redshift will periodically take snapshots of your cluster and keep a copy for a configurable retention period. You can also perform manual snapshots and share them across regions or even with other AWS accounts. Manual snapshots are retained until you explicitly delete them.