a tour of amazon redshift

Amazon Redshift: A whirlwind tour

November 2013

Kel Graham, Data Architect

What is Redshift?


Data Warehouse Service

As the universe expands, the wavelength of radiation from objects moving away from an observer shifts towards the red end of the electromagnetic spectrum.

Redshift is a consequence of an expanding universe.

Fully managed

Redshift

Fast

Petabyte scale (1 PB = 1 billion MB)

Amazon Product

1/10 cost of traditional DW


Where does Redshift sit within the Amazon database product suite?

SimpleDB
• Non-relational
• Web-services interface
• Query flexibility
• Smaller workloads
• 10 GB hard limit
• 1 MB response size

DynamoDB
• NoSQL service
• High availability
• High scalability
• Runs off SSDs
• Provisioned throughput
• Integrates with Redshift

RDS (MySQL / Oracle / SQL Server)
• Relational database service
• Referential integrity
• DB-dependent feature-set (Multi-AZ)
• Provisioned throughput
• High availability
• Online Transaction Processing

Redshift (PostgreSQL base)
• High availability
• Data warehouse service
• Cluster architecture
• Relational database
• Horizontal scalability: add more nodes
• Analytics


What differentiates Redshift from, say, a MySQL RDS instance?

Redshift

Cluster Architecture

Columnar storage

Read optimised

No RI by design



(i) Cluster architecture

a) Clients connect via existing protocols (JDBC, ODBC) to the Leader Node.

b) Leader node develops a query plan and may generate and compile C++ code to be executed by the compute nodes

c) Leader node will distribute work across compute nodes using Distribution Keys (more later)

d) Compute nodes receive work from the leader node and may transmit data amongst themselves to answer the query

e) Leader aggregates the results and returns them to the client

f) Leader can distribute bulk data loads across compute nodes: I have loaded 3 GB of raw data (gzipped to 500 MB) on a single node in under 3 minutes

source: http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
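The leader/compute split in steps (c) to (e) can be pictured with a toy scatter-gather in Python. This is only a sketch of the idea, not Redshift's implementation: the function names, the (state, bedrooms) rows and the two-node layout are all made up for illustration, and the "nodes" run one after another rather than in parallel.

```python
from collections import Counter

def compute_node(rows):
    """A compute node aggregates only the rows it holds (hypothetical helper)."""
    partial = Counter()
    for state, bedrooms in rows:
        partial[state] += bedrooms
    return partial

def leader_node(node_data):
    """The leader hands work to each node, then merges the partial results."""
    final = Counter()
    for rows in node_data:
        # A real cluster would run these in parallel; here they run in turn.
        final.update(compute_node(rows))
    return dict(final)

# Two pretend compute nodes, each holding part of the table.
node_data = [
    [("NSW", 3), ("VIC", 2)],
    [("NSW", 4), ("QLD", 1)],
]
result = leader_node(node_data)
print(result)
```

Each node only ever touches its own rows; the leader sees just the small partial totals, which is what keeps the final aggregation step cheap.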



(ii) Column-store database

a) Relational databases tend to store data on a tuple-by-tuple (row-by-row) basis.

b) When querying the data, a row-store engine has to read whole rows, and therefore more blocks of data, discarding much of what it has just read in order to return only the columns being queried

c) A column-store stores the values of a column contiguously in the same block

d) Result: the number of IO operations involved in a query can be significantly reduced, dependent on the shape of the data

source: http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
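A rough back-of-envelope model of point (d), in Python. It assumes every value is the same size and that a block holds a fixed number of values, which is a simplification; the table shape (1 million rows, 20 columns, 2 columns queried) is invented purely to show the arithmetic.

```python
import math

def blocks_read_row_store(n_rows, n_cols, values_per_block):
    # Rows are stored whole, so every column of every row is read.
    return math.ceil(n_rows * n_cols / values_per_block)

def blocks_read_column_store(n_rows, cols_queried, values_per_block):
    # Only the blocks belonging to the queried columns are read.
    return cols_queried * math.ceil(n_rows / values_per_block)

# Hypothetical table: 1 million rows, 20 columns, query touches 2 columns.
row_io = blocks_read_row_store(1_000_000, 20, 1_000)
col_io = blocks_read_column_store(1_000_000, 2, 1_000)
print(row_io, col_io)  # 20000 2000
```

Under these assumptions the column store reads a tenth of the blocks, simply because the query touches a tenth of the columns. A query that selects every column would see no such saving.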



(iii) Optimised for Read performance

a) Contrast block sizes with other databases:

   • Default MySQL installs on ext3 file-systems sit on 4 KB file-system blocks (InnoDB's own default page size is 16 KB)

   • Default NTFS partitions use 4 KB blocks, so SQL Server on NTFS sits on 4 KB file-system blocks as well (SQL Server's own page size is 8 KB)

b) Redshift’s focus on Data Warehousing (and hence read optimisation) allows them to use a 1,024 KB block size

c) Under a column-store architecture each block holds the same kind of data, so datatype-specific compression enables even more data to be stored per block, further reducing disk space and IO

d) Reduced disk space and IO helps improve inter-node data sharing and replication, where compute nodes may redistribute data based on a table’s distribution key (more later on that)

All your blox are belong to me!
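To put numbers on the block-size contrast, a small Python calculation. The 1 GB column size is an arbitrary figure chosen for illustration, and the sketch ignores compression, which would shrink the Redshift figure further.

```python
def blocks_needed(data_kb, block_kb):
    # Ceiling division: a partly filled block still counts as a block.
    return -(-data_kb // block_kb)

data_kb = 1024 * 1024  # 1 GB of column data, expressed in KB

small_blocks = blocks_needed(data_kb, 4)      # 4 KB blocks
large_blocks = blocks_needed(data_kb, 1024)   # 1,024 KB blocks
print(small_blocks, large_blocks)  # 262144 1024
```

A full scan of that column is 1,024 large reads instead of 262,144 small ones, which is the trade a read-optimised warehouse is happy to make.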



(iv) No referential integrity

Ok, sounds g… WHAT?!!

• No primary key
• No foreign key
• No index support
• No sequences
• No user defined functions
• No stored procedures
• No common table expressions
• No exotic data types – no arrays, JSON, Geospatial types, etc.
• No ‘alter column’ syntax – drop and reload

Do tell Redshift about primary, foreign keys and column uniqueness. It won’t enforce them, but it will use these hints to better understand queries.


How does Redshift locate data?

The Sort Key

• Each table can have a single Sort Key – a compound key, made up of 1 to 400 columns from the table

• Redshift will store data on disk in Sort Key order – so think of it as the single clustered index for the table

• Sort keys should be selected based on how the table is used:
  • Columns that are frequently used to join to other tables should be included in the sort key
  • Date and timestamp columns that are used in filtering operations should be included

• Redshift stores metadata about each data block, including the min and max of each column value – using this, Redshift can skip entire blocks when answering a query

• After data loads or inserts, the ANALYZE command should be run
  • ANALYZE updates the table metadata that is used by the query planner – very important for column-based data storage and ongoing query performance
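The block-skipping idea (min and max kept per block, data stored in sort-key order) can be sketched in a few lines of Python. The block size of 10 values and the helper names are invented for the example; real Redshift blocks are far larger and the metadata layout is not shown here.

```python
def build_zone_map(sorted_values, block_size):
    """Split sorted data into blocks and record (min, max) for each block."""
    blocks = [sorted_values[i:i + block_size]
              for i in range(0, len(sorted_values), block_size)]
    zone_map = [(min(b), max(b)) for b in blocks]
    return blocks, zone_map

def range_query(blocks, zone_map, lo, hi):
    """Return matching values and how many blocks actually had to be read."""
    hits, blocks_read = [], 0
    for block, (bmin, bmax) in zip(blocks, zone_map):
        if bmax < lo or bmin > hi:
            continue  # metadata proves the block cannot match, so skip it
        blocks_read += 1
        hits.extend(v for v in block if lo <= v <= hi)
    return hits, blocks_read

values = list(range(1, 101))            # already in sort-key order
blocks, zone_map = build_zone_map(values, 10)
hits, blocks_read = range_query(blocks, zone_map, 42, 57)
print(blocks_read, "of", len(blocks), "blocks read")  # 2 of 10 blocks read
```

If the same values were stored unsorted, most blocks would span a wide min-to-max range and very few could be skipped, which is why the choice of sort key matters so much.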


How does Redshift locate data?

The Distribution Key

• Redshift will distribute and replicate data between compute nodes in order to get best use of the parallelism available in the cluster
• By default, data will be spread evenly across all compute nodes (EVEN distribution)
  • A node is further broken down into slices – one slice per CPU core
  • Each slice participates in the parallel execution of a job sent from the Leader node, so the even distribution of data across the nodes is vital to ensuring consistent query performance
  • If data is denormalised and does not participate in joins, then an EVEN distribution won’t be problematic
• Alternatively a Distribution key can be provided (KEY distribution)
  • The Distribution key is important, in that it helps define which data is kept together on a given node.
  • The objective is to choose a key that helps distribute data across a node’s slices, but not across the cluster’s nodes
  • Similarly to the Sort Key, the Distribution key is defined on a per-table basis, but unlike a Sort Key, the Distribution Key is comprised of only a single column
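A minimal Python sketch of the two distribution styles. It assumes KEY distribution behaves like "hash the key, pick a slice" and uses `zlib.crc32` purely as a stand-in hash; Redshift's actual hash function, the four-slice layout and the property/sales tables are all assumptions made for illustration.

```python
import zlib

def slice_for_key(key, n_slices):
    # crc32 is a stand-in hash, chosen because it is stable between runs.
    return zlib.crc32(str(key).encode("utf-8")) % n_slices

def distribute_by_key(rows, key_index, n_slices):
    """KEY style: rows with the same key value always land on the same slice."""
    slices = [[] for _ in range(n_slices)]
    for row in rows:
        slices[slice_for_key(row[key_index], n_slices)].append(row)
    return slices

def distribute_even(rows, n_slices):
    """EVEN style: round-robin, ignoring the row contents."""
    slices = [[] for _ in range(n_slices)]
    for i, row in enumerate(rows):
        slices[i % n_slices].append(row)
    return slices

# Hypothetical tables, both distributed on property_id (column 0).
properties = [(1, "NSW"), (2, "VIC"), (3, "QLD")]
sales = [(1, 500_000), (1, 520_000), (2, 310_000), (3, 275_000)]

prop_slices = distribute_by_key(properties, 0, 4)
sale_slices = distribute_by_key(sales, 0, 4)
even_slices = distribute_even(sales, 4)
```

Because both tables hash the same column, every sale sits on the same slice as its property, so a join on property_id needs no data shipped between slices. The round-robin version balances row counts perfectly but gives no such guarantee.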


What typical RDBMS features does Redshift have?

Features
• Transactions
• Reasonable number of windowing functions
  • Rank, First, Last, Lag, Sum, Nth and so on
• Most types of relational joins
  • Inner, Left, Right, Full, Cross
• Correlated sub-queries are supported, but only where the query planner can decorrelate them for performance (sub-queries during a join are a no-go)
• Views
• Excellent locking and concurrent write capabilities
  • Thanks PostgreSQL!
• Schema management
• Identity columns (auto_increment)

DataTypes (complete list):
• SmallInt
• Integer
• Bigint
• Decimal
• Real
• Double precision
• Boolean
• Char
• Varchar
• Date
• Timestamp


Other features?

• Close integration with S3 and DynamoDB
  • Our test instance was primed from S3:

    COPY <tableName> FROM 's3://bucket/file.csv.gz'
    CREDENTIALS '<aws-credentials>'
    IGNOREHEADER AS 1
    GZIP;

• COPY command is central to the import process – can load data in parallel, using what it knows about the structure of the target table to assign work to individual compute nodes

• UNLOAD will export data from a Redshift table out to an S3 bucket

• Excellent set of database system tables that allow one to monitor pretty much everything that’s going on:

  • Loads
  • Queries
  • Chatter between compute nodes
  • Sort and distribution keys


Other features (cont)?

• Column compression
  • Each column can have an optionally assigned compression algorithm, including:

• BYTEDICT – essentially a key-value lookup for up to 256 values. Useful for repeating data, such as “State” in a property record

• DELTA – stores the initial value of a column as per its data type, and then stores only the difference between each value and the one before it. Very useful for dates

• RUNLENGTH – stores the value of a column and the number of times the value is repeated. Useful when the data is stored consecutively – relevant for sort-key columns

• MOSTLY8/16/32 – stores most values in a narrower 8, 16 or 32-bit form, while outliers that do not fit are kept in their raw form
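Three of these encodings are easy to picture in plain Python. These are toy versions to show the idea only, not Redshift's on-disk formats; the function names are made up, and the delta sketch records the difference between consecutive values.

```python
def run_length_encode(values):
    """RUNLENGTH idea: store each value once with its repeat count."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return [tuple(pair) for pair in encoded]

def delta_encode(values):
    """DELTA idea: keep the first value, then only the step to each next value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def bytedict_encode(values):
    """BYTEDICT idea: replace each value with a small code from a dictionary."""
    dictionary, codes = {}, []
    for v in values:
        if v not in dictionary:
            if len(dictionary) == 256:
                raise ValueError("more than 256 distinct values")
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

print(run_length_encode(["NSW", "NSW", "NSW", "VIC", "VIC"]))  # [('NSW', 3), ('VIC', 2)]
print(delta_encode([100, 101, 104, 104]))                      # [100, 1, 3, 0]
print(bytedict_encode(["NSW", "VIC", "NSW"]))                  # ({'NSW': 0, 'VIC': 1}, [0, 1, 0])
```

Run-length encoding only pays off when equal values sit next to each other, which is why it pairs naturally with sort-key columns.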


Other features (cont)?

• Excellent monitoring and management console integration


How well does Redshift perform?

“Return the current list of all valid properties within a selected list of states and tell me what the current number of bedrooms, bathrooms, car spaces, land size, floor size and year built is.”

We ran some rudimentary queries over a realistic data set…

…and found that Redshift outperformed our existing database by a factor of 2.5 to 3.5.

Correct selection of a SORT KEY in Redshift is vital. Any filtering or joins on a non-sort-key column will result in a (slow) table scan. In our example, this reduced performance by 30%.

However, this is not a particularly useful comparison, as these were different machines with different hardware specifications.


Summary

Leader Node

Redshift

Column store

No referential integrity

Cluster Architecture

Compute Node

Column

Compression

Distribution Key

Fast

Massive Parallelism

Sort key

Relational

Accessible (JDBC, ODBC, S3)
