
TRANSCRIPT

Page 1: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Pavan Pothukuchi, Principal Product Manager, AWS

September 20, 2016

Deep Dive: Amazon Redshift for Big Data Analytics

Page 2: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Agenda

• Service Overview
• Best Practices
  • Schema / Table Design
  • Data Ingestion
  • Database Tuning
  • Migration

• Examples

Page 3: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Service Overview

Page 4: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Relational data warehouse

Massively parallel; petabyte scale

Fully managed

HDD and SSD platforms

$1,000/TB/year; starts at $0.25/hour

Amazon Redshift

A lot faster, a lot simpler, a lot cheaper

Page 5: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Selected Amazon Redshift customers

Page 6: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Amazon Redshift system architecture

Leader node
• SQL endpoint
• Stores metadata
• Coordinates query execution

Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3; load from Amazon DynamoDB, Amazon EMR, or SSH

Two hardware platforms
• Optimized for data processing
• DS2: HDD; scale from 2TB to 2PB
• DC1: SSD; scale from 160GB to 326TB

(Diagram: SQL clients/BI tools connect to the leader node via JDBC/ODBC; the leader node coordinates compute nodes — each with 128GB RAM, 16TB disk, and 16 cores — over a 10 GigE (HPC) network; ingestion, backup, and restore flow through S3 / EMR / DynamoDB / SSH.)

Page 7: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

A deeper look at compute node architecture

Each node contains multiple slices
• DS2 – 2 slices on XL, 16 on 8XL
• DC1 – 2 slices on L, 32 on 8XL

A slice can be thought of as a "virtual compute node"
• Unit of data partitioning
• Parallel query processing

Facts about slices:
• Each compute node has either 2, 16, or 32 slices
• Table rows are distributed to slices
• A slice processes only its own data

Page 8: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Amazon Redshift dramatically reduces I/O
(Column storage | Data compression | Zone maps)

ID  Age State Amount
123 20  CA    500
345 25  WA    250
678 40  FL    125
957 37  WA    375

• Calculating SUM(Amount) with row storage:
  – Need to read everything
  – Unnecessary I/O

Page 9: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Amazon Redshift dramatically reduces I/O
(Column storage | Data compression | Zone maps)

ID  Age State Amount
123 20  CA    500
345 25  WA    250
678 40  FL    125
957 37  WA    375

• Calculating SUM(Amount) with column storage:
  – Only scan the necessary blocks

Page 10: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Amazon Redshift dramatically reduces I/O
(Column storage | Data compression | Zone maps)

• Columnar compression
  – Effective due to like data
  – Reduces storage requirements
  – Reduces I/O

analyze compression orders;

 Table  | Column | Encoding
--------+--------+----------
 orders | id     | mostly32
 orders | age    | mostly32
 orders | state  | lzo
 orders | amount | mostly32

Page 11: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Amazon Redshift dramatically reduces I/O
(Column storage | Data compression | Zone maps)

• In-memory block metadata
• Contains per-block MIN and MAX values
• Effectively prunes blocks which don't contain data for a given query
• Minimizes unnecessary I/O

Page 12: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Best Practices: Schema Design

Page 13: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Data Distribution

Distribution style is a table property which dictates how that table's data is distributed throughout the cluster:
• KEY: Value is hashed; the same value goes to the same location (slice)
• ALL: Full table data goes to the first slice of every node
• EVEN: Round robin

Goals:
• Distribute data evenly for parallel processing
• Minimize data movement during query processing
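Declared at CREATE TABLE time, the three styles might look like the following sketch (table and column names are hypothetical, not from the deck):

```sql
-- KEY: rows hash on cust_id, so equal values land on the same slice
CREATE TABLE orders (
    order_id BIGINT,
    cust_id  BIGINT,
    amount   DECIMAL(12,2)
) DISTSTYLE KEY DISTKEY (cust_id);

-- ALL: a full copy of the table on the first slice of every node
CREATE TABLE regions (
    region_id INT,
    name      VARCHAR(64)
) DISTSTYLE ALL;

-- EVEN: round-robin assignment across all slices
CREATE TABLE clickstream (
    event_time TIMESTAMP,
    url        VARCHAR(2048)
) DISTSTYLE EVEN;
```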

(Diagram: under KEY, rows with the same key value land on the same slice; under ALL, a full copy of the table goes to every node; under EVEN, rows are assigned round-robin across slices 1–4 on nodes 1 and 2.)

Page 14: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

(Diagram: eight customer rows (ID, Gender, Name) are assigned round-robin to slices 1–4, two rows per slice — DISTSTYLE EVEN.)

Page 15: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

(Diagram: the same eight rows are routed through a hash function on the high-cardinality ID column, landing two rows on each of slices 1–4 — DISTSTYLE KEY with a good key gives even distribution.)

Page 16: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

(Diagram: hashing instead on the low-cardinality Gender column sends all five "M" rows to one slice and all three "F" rows to another, leaving two slices empty — DISTSTYLE KEY on a skewed column causes uneven distribution.)

Page 17: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

(Diagram: all eight rows are copied in full to every slice shown — DISTSTYLE ALL replicates the entire table.)

Page 18: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

(Diagram: CUSTOMERS and ORDERS are both distributed on CUST_ID, so each slice joins its local customer rows — e.g. CUST_IDs 101 and 306 on one slice, 292 and 209 on another — with the matching order rows and produces its portion of the RESULTS locally, with no data movement.)


Page 20: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Choosing a Distribution Style

KEY
• Large FACT tables
• Large or rapidly changing tables used in joins
• Localize columns used within aggregations

ALL
• Slowly changing data
• Reasonable size (i.e., a few million but not hundreds of millions of rows)
• No common distribution key for frequent joins
• Typical use case – joined dimension table without a common distribution key

EVEN
• Tables not frequently joined or aggregated
• Large tables without acceptable candidate keys
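The collocated-join pattern from the CUSTOMERS/ORDERS example earlier can be sketched as follows (hypothetical schema): distributing both tables on the join key lets each slice join its local rows without network redistribution.

```sql
-- Both tables distributed on cust_id: matching rows are collocated
CREATE TABLE customers (
    cust_id BIGINT DISTKEY,
    gender  CHAR(1),
    name    VARCHAR(64)
);

CREATE TABLE orders (
    order_id VARCHAR(8),
    cust_id  BIGINT DISTKEY,
    amount   INT
);

-- Each slice produces its part of the result locally
SELECT c.cust_id, c.gender, o.amount
FROM customers c
JOIN orders o USING (cust_id);
```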

Page 21: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Data Sorting

Goals
• Physically order rows of table data based on certain column(s)
• Optimize effectiveness of zone maps
• Enable MERGE JOIN operations

Impact
• Enables range-restricted scans (rrscans) to prune blocks by leveraging zone maps
• Overall reduction in block I/O

Achieved with the table property SORTKEY defined over one or more columns.

The optimal SORTKEY depends on:
• Query patterns
• Data profile
• Business requirements
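A minimal sketch of declaring a sort key on a time-series table (names are hypothetical):

```sql
-- Sort on the column most often used as a query predicate,
-- so zone maps can prune blocks on date-range filters
CREATE TABLE logs (
    log_date DATE,
    region   VARCHAR(32),
    message  VARCHAR(512)
) DISTSTYLE EVEN
  SORTKEY (log_date);
```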

Page 22: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Zone Maps

SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2013'

Unsorted table — block zone maps:
MIN: 01-JUNE-2013  MAX: 20-JUNE-2013  → READ
MIN: 08-JUNE-2013  MAX: 30-JUNE-2013  → READ
MIN: 12-JUNE-2013  MAX: 20-JUNE-2013
MIN: 02-JUNE-2013  MAX: 25-JUNE-2013  → READ
MIN: 06-JUNE-2013  MAX: 12-JUNE-2013  → READ

Sorted by date — block zone maps:
MIN: 01-JUNE-2013  MAX: 06-JUNE-2013
MIN: 07-JUNE-2013  MAX: 12-JUNE-2013  → READ
MIN: 13-JUNE-2013  MAX: 18-JUNE-2013
MIN: 19-JUNE-2013  MAX: 24-JUNE-2013
MIN: 25-JUNE-2013  MAX: 30-JUNE-2013

Page 23: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Single Column
• Table is sorted by 1 column

Date Region Country

2-JUN-2015 Oceania New Zealand

2-JUN-2015 Asia Singapore

2-JUN-2015 Africa Zaire

2-JUN-2015 Asia Hong Kong

3-JUN-2015 Europe Germany

3-JUN-2015 Asia Korea

[ SORTKEY ( date ) ]

Best for:
• Queries that use the 1st column (i.e., date) as primary filter
• Can speed up joins and group bys

Page 24: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Compound

Date Region Country

2-JUN-2015 Africa Zaire

2-JUN-2015 Asia Korea

2-JUN-2015 Asia Singapore

2-JUN-2015 Europe Germany

3-JUN-2015 Asia Hong Kong

3-JUN-2015 Asia Korea

[ SORTKEY COMPOUND ( date, region, country) ]

Best for:
• Queries that use the 1st column as primary filter, then other columns
• Can speed up joins and group bys

Page 25: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Interleaved
• Equal weight is given to each column

Date Region Country

2-JUN-2015 Africa Zaire

3-JUN-2015 Asia Singapore

2-JUN-2015 Asia Korea

2-JUN-2015 Europe Germany

3-JUN-2015 Asia Hong Kong

2-JUN-2015 Asia Korea

[ SORTKEY INTERLEAVED ( date, region, country) ]

Best for:
• Queries that use different columns in the filter
• Queries get faster the more columns are used in the filter

Page 26: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Choosing a SORTKEY

COMPOUND
• Most common
• Well defined filter criteria
• Time-series data

INTERLEAVED
• Edge cases
• Large tables (> billion rows)
• No common filter criteria
• Non time-series data

• Primarily as a query predicate (date, identifier, …)
• Optionally choose a column frequently used for aggregates
• Optionally choose same as distribution key column for most efficient joins (merge join)
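The two styles are declared explicitly at table creation; a sketch with hypothetical names:

```sql
-- COMPOUND: leading-column order matters (typical time-series choice)
CREATE TABLE events_c (
    event_date DATE,
    region     VARCHAR(32),
    country    VARCHAR(32)
) COMPOUND SORTKEY (event_date, region, country);

-- INTERLEAVED: equal weight to every sort column, for varied filters
CREATE TABLE events_i (
    event_date DATE,
    region     VARCHAR(32),
    country    VARCHAR(32)
) INTERLEAVED SORTKEY (event_date, region, country);
```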

Page 27: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Compressing Data

• COPY automatically analyzes and compresses data when loading into empty tables

• ANALYZE COMPRESSION checks existing tables and proposes optimal compression algorithms for each column

• Changing column encoding requires a table rebuild

Page 28: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Compressing Data

If you have a regular ETL process and you use temp tables or staging tables, turn off automatic compression

• Use ANALYZE COMPRESSION to determine the right encodings
• Bake those encodings into your DML
• Use CREATE TABLE … LIKE
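These steps combine roughly like this sketch (the bucket path and IAM role are hypothetical placeholders):

```sql
-- Staging table inherits column encodings (and dist/sort keys)
CREATE TABLE orders_staging (LIKE orders);

-- COMPUPDATE OFF skips the automatic compression analysis pass
COPY orders_staging
FROM 's3://my-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
COMPUPDATE OFF;
```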

Page 29: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Compressing Data

• From the zone maps we know:
  – Which block(s) contain the range
  – Which row offsets to scan
• Highly compressed sort keys mean:
  – Many rows per block
  – Large row offsets

Skip compression on just the leading column of the compound sortkey.
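In DDL terms this might look like the sketch below (hypothetical table; encodings chosen for illustration):

```sql
CREATE TABLE events (
    event_date DATE         ENCODE raw,      -- leading sortkey: uncompressed
    user_id    BIGINT       ENCODE mostly32, -- remaining columns compressed
    payload    VARCHAR(256) ENCODE lzo
) SORTKEY (event_date);
```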

Page 30: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Best Practices: Ingestion

Page 31: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Amazon Redshift Loading Data Overview

(Diagram: data flows into Amazon Redshift from the corporate data center — logs/files and source DBs over a VPN connection or AWS Direct Connect — and within the AWS cloud from Amazon S3 (multipart upload), Amazon DynamoDB, Amazon Elastic MapReduce, Amazon RDS, EC2 or on-premises hosts via SSH, and AWS Import/Export; Amazon Glacier serves as the archive tier. Choice of path scales with data volume.)

Page 32: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Parallelism is a function of load files

Each slice's query processors are able to load one file at a time:
• Streaming decompression
• Parse
• Distribute
• Write

A single input file means only one slice is ingesting data.

(Diagram: on a DS2.8XL compute node with slices 0–15, a single input file keeps only one slice busy — just 6.25% of slices are active, realizing only partial cluster usage.)

Page 33: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Maximize Throughput with Multiple Files

Use at least as many input files as there are slices in the cluster.

(Diagram: with 16 input files, all 16 slices of a DS2.8XL compute node are working, so you maximize throughput.)

COPY continues to scale linearly as you add additional nodes.
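A single COPY command loads every object sharing a key prefix, so the multi-file pattern can be sketched as (paths and role are hypothetical):

```sql
-- Objects named part-00 … part-15 under the prefix are loaded in
-- parallel, one file per slice on a 16-slice node
COPY orders
FROM 's3://my-bucket/orders/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP;
```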

Page 34: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

New feature: ALTER TABLE APPEND

ELT workloads typically "massage" or aggregate data in a staging table and then append it to the production table.

ALTER TABLE APPEND moves data from the staging table to the production table by manipulating metadata.

Much faster than INSERT INTO … SELECT, as data is not duplicated.
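A sketch of the statement (hypothetical table names; the staging table must have a compatible column layout):

```sql
-- Moves all rows from staging into production by metadata change only;
-- the staging table is left empty
ALTER TABLE orders APPEND FROM orders_staging;
```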

Page 35: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Best Practices: Performance Tuning

Page 36: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Optimizing a database for querying

• Periodically check your table status
  • Vacuum and analyze regularly
  • SVV_TABLE_INFO: missing statistics, table skew, uncompressed columns, unsorted data

• Check your cluster status
  • WLM queuing
  • Commit queuing
  • Database locks
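For example, a quick health check against SVV_TABLE_INFO might look like this sketch:

```sql
-- Tables most in need of maintenance: stale stats, skewed or
-- unsorted data, space used
SELECT "table", stats_off, skew_rows, unsorted, pct_used
FROM svv_table_info
ORDER BY unsorted DESC;
```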

Page 37: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Missing Statistics

• Amazon Redshift's query optimizer relies on up-to-date statistics

• Statistics are only necessary for the data you are accessing

• Updated stats are important on:
  • SORTKEY
  • DISTKEY
  • Columns in query predicates
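Statistics are refreshed with ANALYZE; a sketch with hypothetical table and column names:

```sql
-- Whole table
ANALYZE orders;

-- Or only the columns that matter (dist/sort keys, predicate columns)
ANALYZE orders (cust_id, order_date);
```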

Page 38: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Table Maintenance and Status

Table skew
• Unbalanced workload
• A query completes only as fast as the slowest slice completes
• Can cause skew inflight: temp data fills a single node, resulting in query failure

Unsorted table
• Sortkey is just a guide; data needs to actually be sorted
• VACUUM or deep copy to sort
• Scans against unsorted tables continue to benefit from zone maps: load sequential blocks
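Both maintenance paths can be sketched as follows (hypothetical table name):

```sql
-- Re-sort rows and reclaim space from deleted rows
VACUUM FULL orders;

-- Deep copy alternative: rewrite into a fresh, fully sorted table
CREATE TABLE orders_new (LIKE orders);
INSERT INTO orders_new SELECT * FROM orders;
DROP TABLE orders;
ALTER TABLE orders_new RENAME TO orders;
```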

Page 39: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Cluster Status: Commits and WLM

WLM queue
• Identify short/long-running queries and prioritize them
• Define multiple queues to route queries appropriately
• Default concurrency of 5
• Leverage wlm_apex_hourly to tune WLM based on peak concurrency requirements

Commit queue
• How long is your commit queue?
• Identify needless transactions
• Group dependent statements within a single transaction
• Offload operational workloads
• STL_COMMIT_STATS
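One way to inspect commit queuing via STL_COMMIT_STATS (a sketch; node -1 is the leader-node summary row):

```sql
SELECT xid, datediff(ms, startqueue, startwork) AS queue_ms
FROM stl_commit_stats
WHERE node = -1
ORDER BY queue_ms DESC
LIMIT 10;
```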

Page 40: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Cluster Status: Database Locks

• Database locks
  • Read locks, write locks, exclusive locks
  • Reads block exclusives
  • Writes block writes and exclusives
  • Exclusives block everything

• Ungranted locks block subsequent lock requests
• Exposed through SVV_TRANSACTIONS
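A sketch of inspecting lock state through SVV_TRANSACTIONS:

```sql
-- Ungranted requests (granted = false) are being blocked by the
-- transactions that hold conflicting locks on the same relation
SELECT txn_owner, txn_db, xid, pid, relation, lock_mode, granted
FROM svv_transactions
ORDER BY txn_start;
```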

Page 41: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Migration Considerations

Page 42: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Typical ETL/ELT on legacy data warehouse

• One file per table, maybe a few if too big
• Many updates ("massage" the data)
• Every job clears the data, then loads
• Count on primary key to block double loads
• High concurrency of load jobs
• Small table(s) to control the job stream

Page 43: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Two questions to ask

Why do you do what you do?
• Many times, users don't know

What is the customer need?
• Many times, needs do not match current practice
• You might benefit from adding other AWS services

Page 44: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

On Amazon Redshift

Updates are delete + insert of the row
• Deletes just mark rows for deletion

Blocks are immutable
• Minimum space used is one block per column, per slice

Commits are expensive
• 4 GB write on 8XL per node
• Mirrors the WHOLE dictionary
• Cluster-wide serialized

Page 45: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

On Amazon Redshift

• Not all aggregations are created equal
  • Pre-aggregation can help
  • Order in the group by matters

• Concurrency should be low for better throughput
  • A caching layer for dashboards is recommended
  • WLM parcels RAM to queries; use multiple queues for better control

Page 46: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Workload Management (WLM)

Concurrency and memory can now be changed dynamically.

You can have distinct values for load time and query time.

Use wlm_apex_hourly.sql to monitor "queue pressure".

Page 47: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

New Feature – WLM Queue Hopping

Page 48: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Query throughput vs. Concurrency

• Query throughput (QPM or QPH) is more representative of the end user experience than concurrency

• Several improvements over the last 6 months:
  • Commit improvements
  • Dynamic resource management
  • Query throughput doubled over the last 6 months

Page 49: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Resources

https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs
https://s3.amazonaws.com/chriz-webinar/webinar.zip

Admin scripts
Collection of utilities for running diagnostics on your cluster

Admin views
Collection of utilities for managing your cluster, generating schema DDL, etc.

ColumnEncodingUtility
Gives you the ability to apply optimal column encoding to an established schema with data already loaded

Page 50: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Q&A

If you want to learn more, register for our upcoming DevDay Austin:

Monday, October 24, 2016 – JW Marriott Austin
https://aws.amazon.com/events/devday-austin

Free, one-day developer event featuring tracks, labs, and workshops around Serverless, Containers, IoT, and Mobile

Page 51: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Appendix: Performance optimization examples

Page 52: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Use SORTKEYs to effectively prune blocks


Page 55: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Don’t compress initial SORTKEY column

Page 56: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Use compression encoding to reduce I/O

Page 57: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Choose a DISTKEY which avoids data skew

Page 58: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Ingest: Disable predictable compression analysis

Page 59: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Ingest: Load multiple files to match cluster slices

Page 60: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

VACUUM to physically remove deleted rows

Page 61: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

VACUUM to keep your tables sorted

Page 62: Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Gather statistics to assist the query planner