(BDT309) Delivering Results with Amazon Redshift, One Petabyte at a Time | AWS re:Invent 2014
DESCRIPTION
The Amazon Enterprise Data Warehouse team, responsible for data warehousing across all of Amazon's divisions, spent 2014 working with Amazon Redshift on its largest datasets, including web log traffic. The key goal of this project was to provide a viable, enterprise-grade solution that enabled full scans of 2 trillion rows in under an hour at load. Key to success was automating routine DW tasks that become complicated at scale: backfilling erroneous data, recalculating statistics, re-sorting daily additions, and so forth. In this session, we discuss the scale and performance of a 100-node, 1 PB Amazon Redshift cluster, and describe some of the technical aspects and best practices of running 100-node clusters in an enterprise environment.
TRANSCRIPT
Use Case                               Goal      Benchmark
Scan 2.25 trillion rows (15 months)    60 min    14 min
Load 5 billion rows (1 day)            60 min    10 min
Load 150 billion rows (30 days)        24 hours  9.75 hours
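The benchmark figures above imply some striking sustained rates. A quick arithmetic check (my calculation, not a number quoted in the talk):

```python
# Average throughput implied by the benchmark table above.
# These are back-of-the-envelope rates, not figures from the session.
def rows_per_second(rows, minutes):
    """Average rate in rows per second for a job of `rows` rows over `minutes` minutes."""
    return rows / (minutes * 60)

# Scan: 2.25 trillion rows in 14 minutes
scan_rate = rows_per_second(2.25e12, 14)
# Daily load: 5 billion rows in 10 minutes
load_rate = rows_per_second(5e9, 10)
# Monthly load: 150 billion rows in 9.75 hours
monthly_rate = rows_per_second(150e9, 9.75 * 60)

print(f"scan:         {scan_rate:,.0f} rows/sec")   # roughly 2.7 billion rows/sec
print(f"daily load:   {load_rate:,.0f} rows/sec")   # roughly 8.3 million rows/sec
print(f"monthly load: {monthly_rate:,.0f} rows/sec")
```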
– VACUUM is slow; physical partitions do not exist
  • Doesn't allow for parallel loads into the same table
• 15 concurrent queries
  – "Bad" queries can impact the entire cluster
– COMPUPDATE (samples the data) – fast but not optimal
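The routine the bullets above describe (load, then re-sort with VACUUM, then refresh statistics) can be sketched as SQL generation. This is a hypothetical helper, not code from the talk: the table and S3 path are placeholders and the COPY credentials clause is omitted, but COMPUPDATE OFF, STATUPDATE OFF, VACUUM, and ANALYZE are real Redshift syntax.

```python
# Hypothetical helper that emits the SQL for one daily load cycle on Redshift.
def daily_load_sql(table, s3_path):
    return [
        # Load without re-sampling compression encodings or table statistics
        # (encodings are assumed to be set up front; credentials clause omitted).
        f"COPY {table} FROM '{s3_path}' COMPUPDATE OFF STATUPDATE OFF;",
        # Re-sort the day's additions into the table's sort order.
        f"VACUUM {table};",
        # Recompute planner statistics explicitly rather than via sampling at load.
        f"ANALYZE {table};",
    ]

for stmt in daily_load_sql("weblog_traffic", "s3://example-bucket/2014-11-13/"):
    print(stmt)
```

Disabling COMPUPDATE and STATUPDATE keeps the load fast; the explicit VACUUM and ANALYZE afterwards are the steps whose durations appear in the node-count comparison later in this deck.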
[Chart: query speed distribution #1]
FASTER                86.35%
  GREATER THAN 15X    14.91%
  10X TO 15X          18.42%
  5X TO 10X           25.73%
  3X TO 5X            19.88%
  2X TO 3X             7.02%
  1X TO 2X             3.80%
SAME                   8.47%
SLOWER                 5.65%
  1X TO 2X             1.75%

[Chart: query speed distribution #2]
FASTER                14.85%
  3X TO 5X             0.56%
  2X TO 3X             3.64%
  1X TO 2X            10.64%
SAME                  19.05%
SLOWER                66.11%
  1X TO 2X            18.49%
  2X TO 3X             8.96%
  3X TO 5X             9.80%
  5X TO 10X           10.08%
  10X TO 15X           5.04%
  SLOWER THAN 15X     13.73%
                      40 8XL nodes   100 8XL nodes
Daily (6B)
  Vacuum              80 min         30 min
  Stats Collection    90 sec         50 sec
Monthly (150B)
  Vacuum (Deep Copy)  380 min        201 min
  Stats Collection    22 min         4 min
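The node-count comparison above can be checked against linear scaling: going from 40 to 100 nodes is a 2.5x increase, so a perfectly linear task would finish in 1/2.5 of the time. This is my arithmetic, not an analysis from the talk:

```python
# Observed speedup for each maintenance task vs. the 2.5x node-count increase
# (40 -> 100 8XL nodes). Timings copied from the table above.
ratio = 100 / 40  # 2.5x more nodes

timings = {
    "daily vacuum (min)":      (80, 30),
    "daily stats (sec)":       (90, 50),
    "monthly deep copy (min)": (380, 201),
    "monthly stats (min)":     (22, 4),
}

for task, (before, after) in timings.items():
    speedup = before / after
    kind = "super" if speedup > ratio else "sub"
    print(f"{task}: {speedup:.2f}x speedup ({kind}-linear vs {ratio}x nodes)")
```

The tasks split both ways: the daily vacuum and monthly stats collection sped up by more than 2.5x, while the daily stats collection and the deep-copy vacuum sped up by less.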
http://bit.ly/awsevals