Data & Analytics - Session 2 - Introducing Amazon Redshift

Steffen Krause, Technical Evangelist Introducing Amazon Redshift

Upload: amazon-web-services

Post on 05-Dec-2014

DESCRIPTION

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. This presentation gives an introduction to the service and its pricing before diving into how it delivers fast query performance on data sets ranging from hundreds of gigabytes to a petabyte or more. Speakers: Steffen Krause, Technical Evangelist, AWS; Padraic Mulligan, Architect and Lead Developer, and Mike McCarthy, CTO, SkillPages.

TRANSCRIPT

Page 1: Data & Analytics - Session 2 - Introducing Amazon Redshift

Steffen Krause, Technical Evangelist

Introducing Amazon Redshift

Page 2: Data & Analytics - Session 2 - Introducing Amazon Redshift

Data warehousing done the AWS way

• No upfront costs, pay as you go

• Really fast performance at a really low price

• Open and flexible with support for popular tools

• Easy to provision and scale up massively

Page 3: Data & Analytics - Session 2 - Introducing Amazon Redshift

We set out to build…

A fast and powerful, petabyte-scale data warehouse that is:

Delivered as a managed service

A Lot Faster

A Lot Cheaper

A Lot Simpler

Amazon Redshift

Page 4: Data & Analytics - Session 2 - Introducing Amazon Redshift

We’re off to a good start

Page 5: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift dramatically reduces I/O

 ID  | Age | State
-----+-----+-------
 123 | 20  | CA
 345 | 25  | WA
 678 | 40  | FL

[Figure labels: Row storage, Column storage, Scan Direction]
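The I/O saving on this slide can be illustrated with a toy model. The sketch below is not Redshift's storage engine; it only counts how many cells have to be read to answer a single-column question (the Age column) when the three sample rows are stored row by row versus column by column. The function names are hypothetical.

```python
# Toy model of row vs. column layout; illustrative only, not Redshift internals.
rows = [
    (123, 20, "CA"),
    (345, 25, "WA"),
    (678, 40, "FL"),
]


def row_store_scan(rows, column_index):
    """Row layout: every cell of every row is read to reach one column."""
    cells_read = 0
    values = []
    for row in rows:
        cells_read += len(row)  # the whole row comes off disk
        values.append(row[column_index])
    return values, cells_read


def to_columns(rows):
    """Pivot the row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def column_store_scan(columns, column_index):
    """Column layout: only the requested column is read."""
    column = columns[column_index]
    return list(column), len(column)


AGE = 1
row_values, row_cells = row_store_scan(rows, AGE)
col_values, col_cells = column_store_scan(to_columns(rows), AGE)
print(f"row layout read {row_cells} cells, column layout read {col_cells} cells")
```

Both scans return the same ages; the column layout simply touches fewer cells, and the gap widens as the table gets more columns.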

Page 6: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift automatically compresses your data

• Compression saves space and reduces disk I/O

• COPY automatically analyzes and compresses your data

– Samples data; selects best compression encoding

– Supports: byte dictionary, delta, mostly n, run length, text

• Customers see 4-8x space savings with real data

– 20x and higher possible based on data set

• ANALYZE COMPRESSION to see details

analyze compression listing;

 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw
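To give a feel for why columnar data compresses well, here is a minimal sketch of two of the encoding families named on this slide, run length and delta. These are textbook versions written for illustration; they do not reproduce Redshift's on-disk format, and the function names are hypothetical.

```python
# Textbook run-length and delta encodings; illustrative only.

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


def delta_encode(values):
    """Store the first value, then the difference to each previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    values = []
    for index, delta in enumerate(deltas):
        values.append(delta if index == 0 else values[-1] + delta)
    return values


states = ["CA", "CA", "CA", "WA", "FL", "FL"]
list_ids = [1000, 1001, 1003, 1006]
print(run_length_encode(states))
print(delta_encode(list_ids))
```

Run length pays off on sorted or low-cardinality columns, while delta suits steadily increasing values such as IDs, which fits the `listid` row in the output above.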

Page 7: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift architecture

• Leader Node – SQL endpoint

– Stores metadata

– Coordinates query execution

• Compute Nodes – Local, columnar storage

– Execute queries in parallel

– Load, backup, restore via Amazon S3

– Parallel load from Amazon DynamoDB

• Single node version available

[Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]
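The division of labour on this slide (a leader node that coordinates, compute nodes that execute in parallel) follows the usual scatter-gather pattern. The sketch below imitates that pattern with threads on one machine, assuming each "compute node" holds a slice of an age column; it is an analogy, not how Redshift is implemented, and all names are hypothetical.

```python
# Scatter-gather analogy for leader and compute nodes; illustrative only.
from concurrent.futures import ThreadPoolExecutor


def compute_node_partial(ages_slice):
    """Each compute node aggregates only its local slice of the data."""
    return sum(ages_slice), len(ages_slice)


def leader_average(node_slices):
    """The leader fans the work out, then merges the partial results."""
    with ThreadPoolExecutor(max_workers=len(node_slices)) as pool:
        partials = list(pool.map(compute_node_partial, node_slices))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


# Three hypothetical compute nodes, each holding part of the column.
node_slices = [[20, 25], [40], [31, 34]]
average_age = leader_average(node_slices)
print(average_age)
```

Only small partial results travel back to the leader, which is why the merge step stays cheap even when the slices are large.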

Page 8: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift runs on optimized hardware

HS1.8XL: 128 GB RAM, 16 Cores, 24 Spindles, 16 TB compressed user storage, 2 GB/sec scan rate

HS1.XL: 16 GB RAM, 2 Cores, 3 Spindles, 2 TB compressed customer storage

• Optimized for I/O intensive workloads

• High disk density

• Runs in HPC - fast network

• HS1.8XL available on Amazon EC2

Page 9: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup

• Restore

• Resize

[Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]

Page 10: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift lets you start small and grow big

Extra Large Node (HS1.XL) 3 spindles, 2 TB, 16 GB RAM, 2 cores

Single Node (2 TB)

Cluster 2-32 Nodes (4 TB – 64 TB)

Eight Extra Large Node (HS1.8XL) 24 spindles, 16 TB, 128 GB RAM, 16 cores, 10 GigE

Cluster 2-100 Nodes (32 TB – 1.6 PB)

Note: Nodes not to scale
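The capacity ranges on this slide are simply node count times storage per node. A small sketch, using only the figures shown above (2 TB per HS1.XL node, 16 TB per HS1.8XL node, and the stated cluster size limits); the dictionary layout and function name are hypothetical.

```python
# Cluster capacity from the slide's figures; helper names are illustrative.
NODE_TYPES = {
    "HS1.XL": {"tb_per_node": 2, "min_nodes": 2, "max_nodes": 32},
    "HS1.8XL": {"tb_per_node": 16, "min_nodes": 2, "max_nodes": 100},
}


def cluster_capacity_tb(node_type, node_count):
    """Return compressed storage in TB for a multi-node cluster."""
    spec = NODE_TYPES[node_type]
    if not spec["min_nodes"] <= node_count <= spec["max_nodes"]:
        raise ValueError(
            f"{node_type} clusters run {spec['min_nodes']}-{spec['max_nodes']} nodes"
        )
    return spec["tb_per_node"] * node_count


for node_type, spec in NODE_TYPES.items():
    smallest = cluster_capacity_tb(node_type, spec["min_nodes"])
    largest = cluster_capacity_tb(node_type, spec["max_nodes"])
    print(f"{node_type}: {smallest} TB to {largest} TB")
```

The largest HS1.8XL cluster comes out at 1,600 TB, which is the 1.6 PB quoted on the slide.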

Page 11: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift is priced to let you analyze all your data

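                    | Price Per Hour for HS1.XL Single Node | Effective Hourly Price Per TB | Effective Annual Price per TB
--------------------+---------------------------------------+-------------------------------+-------------------------------
 On-Demand          | $0.850                                | $0.425                        | $3,723
 1 Year Reservation | $0.500                                | $0.250                        | $2,190
 3 Year Reservation | $0.228                                | $0.114                        | $999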

Simple Pricing

Number of Nodes x Cost per Hour

No charge for Leader Node

No upfront costs

Pay as you go
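The per-TB columns in the pricing table can be reproduced from the hourly node price, assuming 2 TB per HS1.XL node (from the hardware slide) and 8,760 hours in a year. The sketch below does that arithmetic and also applies the "Number of Nodes x Cost per Hour" rule; the prices are the 2013 figures from the slide, and the function names are hypothetical.

```python
# Reproduces the slide's pricing arithmetic; prices are the slide's figures.
HOURS_PER_YEAR = 24 * 365  # 8,760
HS1_XL_TB = 2              # compressed storage per HS1.XL node

HOURLY_NODE_PRICE_USD = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(hourly_node_price, tb_per_node=HS1_XL_TB):
    """Return (hourly price per TB, annual price per TB)."""
    per_tb_hour = hourly_node_price / tb_per_node
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    return per_tb_hour, per_tb_year


def cluster_cost_per_hour(node_count, hourly_node_price):
    """Simple pricing: number of nodes times cost per hour."""
    return node_count * hourly_node_price


for plan, price in HOURLY_NODE_PRICE_USD.items():
    per_tb_hour, per_tb_year = effective_prices(price)
    print(f"{plan}: ${per_tb_hour:.3f} per TB-hour, ${per_tb_year:,.0f} per TB-year")
```

Rounded to whole dollars, the annual figures match the table: 3,723, 2,190, and 999.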

Page 12: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift is easy to use

• Provision in minutes

• Monitor query performance

• Point and click resize

• Built in security

• Automatic backups

Page 13: Data & Analytics - Session 2 - Introducing Amazon Redshift

Provision a data warehouse in minutes

Page 14: Data & Analytics - Session 2 - Introducing Amazon Redshift

Monitor query performance

Page 15: Data & Analytics - Session 2 - Introducing Amazon Redshift

Point and click resize

Page 16: Data & Analytics - Session 2 - Introducing Amazon Redshift

Resize your cluster while remaining online

• New target provisioned in the background

• Only charged for source cluster

Page 17: Data & Analytics - Session 2 - Introducing Amazon Redshift

Resize your cluster while remaining online

• Fully automated

– Data automatically redistributed

• Read only mode during resize

• Parallel node-to-node data copy

• Automatic DNS-based endpoint cutover

• Only charged for one cluster

Page 18: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift has security built-in

• SSL to secure data in transit

• Encryption to secure data at rest

– AES-256; hardware accelerated

– All blocks on disks and in Amazon S3 encrypted

• No direct access to compute nodes

• Amazon VPC support

[Diagram labels: JDBC/ODBC; Customer VPC; Internal VPC; 10 GigE (HPC); Ingestion, Backup, Restore]

Page 19: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift continuously backs up your data and recovers from failures

• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times

• Backups to Amazon S3 are continuous, automatic, and incremental

– Designed for eleven nines of durability

• Continuous monitoring and automated recovery from failures of drives and nodes

• Able to restore snapshots to any Availability Zone within a region

Page 20: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift integrates with multiple data sources

Amazon DynamoDB

Amazon Elastic MapReduce

Amazon Simple Storage Service (S3)

Amazon Elastic Compute Cloud (EC2)

AWS Storage Gateway Service

Corporate Data Center

Amazon Relational Database Service (RDS)

Amazon Redshift

More coming soon…

Page 21: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift provides multiple data loading options

• Upload to Amazon S3

• AWS Import/Export

• AWS Direct Connect

• Work with a partner

Data Integration

Systems Integrators

More coming soon…

Page 22: Data & Analytics - Session 2 - Introducing Amazon Redshift

Amazon Redshift works with your existing analysis tools

JDBC/ODBC

Amazon Redshift

More coming soon…

Page 23: Data & Analytics - Session 2 - Introducing Amazon Redshift
Page 24: Data & Analytics - Session 2 - Introducing Amazon Redshift

Customer Use Case

Mike McCarthy

CTO, SkillPages

Page 25: Data & Analytics - Session 2 - Introducing Amazon Redshift

One Place to Find Skilled People

Page 26: Data & Analytics - Session 2 - Introducing Amazon Redshift

Everyone Needs Skilled People

At Home

At Work

In Life

Repeatedly

Page 27: Data & Analytics - Session 2 - Introducing Amazon Redshift
Page 28: Data & Analytics - Session 2 - Introducing Amazon Redshift

[Chart: registered members by year, 2011 to 2013; labelled values 2 million and 15 million]

Page 29: Data & Analytics - Session 2 - Introducing Amazon Redshift

• 77 Instances: Reserved/Demand & Spot; add capacity as required; auto scale

• 3 Availability Zones: US East (Northern VA); planned for Multi Region

• 2.5+ Billion Relationships: our social graph models over 2.5 billion social relationships

• 10M+ Growth Increments: ready for an additional 10 million users at any point in time

• Tech Team of 21: 1 Data Analyst; total company size 37

• 150M+ Emails: >150,000,000 emails sent per month

Page 30: Data & Analytics - Session 2 - Introducing Amazon Redshift

21,000,000+ SKILLS ADDED BY MEMBERS

Page 31: Data & Analytics - Session 2 - Introducing Amazon Redshift

1,500,000+ NEW MEMBERS/MONTH

Page 32: Data & Analytics - Session 2 - Introducing Amazon Redshift

1,200,000,000+ SOCIAL CONNECTIONS IMPORTED

Page 33: Data & Analytics - Session 2 - Introducing Amazon Redshift

A NEW MEMBER EVERY 2 SECONDS

Page 34: Data & Analytics - Session 2 - Introducing Amazon Redshift

We Measure Everything!

Page 35: Data & Analytics - Session 2 - Introducing Amazon Redshift

Why Measure?

• Business Insights

• KPIs

• Campaign Management

• Behavioural Analysis

• Algorithm Improvements

• Performance Management

Best user experience

Page 36: Data & Analytics - Session 2 - Introducing Amazon Redshift

History with Redshift

• Amazon Customer since 2010

• Proprietary SQL Data Warehouse 2011

• Rapid Growth 2012

• Redshift Trials 2012

• Redshift Production DW 2013

Page 37: Data & Analytics - Session 2 - Introducing Amazon Redshift

Data Architecture

[Diagram labels, grouped in reading order]

User actions: Join via Facebook, Add a Skill Page, Invite Friends

Web Servers

Amazon S3: Raw Data (User Action Trace Events)

EMR Hive Scripts: Process Content

• Process log files with regular expressions to parse out the info we need.

• Process cookies into useful searchable data such as Session, UserId, API Security token.

• Filter surplus info like internal Varnish logging.

Amazon S3: Aggregated Data, Raw Events

Amazon Redshift

Data Analyst: Get Data via Internal Web, Excel, Tableau
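The processing step in this architecture (parse log lines with regular expressions, pull fields out of cookies, drop Varnish noise) can be sketched in a few lines. The slide says the real work is done by Hive scripts on EMR; the example below swaps in plain Python purely for illustration. The log layout, the sample lines, and every function name are assumptions, since the deck does not show the actual format.

```python
# Illustrative log parsing in plain Python; the log layout is hypothetical.
import re

# Assumed layout: timestamp, method, path, status, then a quoted cookie string.
LOG_PATTERN = re.compile(
    r'^(?P<timestamp>\S+) (?P<method>[A-Z]+) (?P<path>\S+) '
    r'(?P<status>\d{3}) "(?P<cookies>[^"]*)"$'
)


def parse_cookies(cookie_string):
    """Turn 'Session=abc; UserId=42' into a dictionary."""
    cookies = {}
    for part in cookie_string.split(";"):
        if "=" in part:
            name, value = part.strip().split("=", 1)
            cookies[name] = value
    return cookies


def parse_line(line):
    """Return a dict of useful fields, or None for noise and bad lines."""
    if "varnish" in line.lower():
        return None  # filter surplus internal Varnish logging
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    cookies = parse_cookies(match.group("cookies"))
    return {
        "timestamp": match.group("timestamp"),
        "path": match.group("path"),
        "status": int(match.group("status")),
        "session": cookies.get("Session"),
        "user_id": cookies.get("UserId"),
    }


sample_lines = [
    '2013-04-25T10:00:00Z GET /skills/plumber 200 "Session=abc123; UserId=42"',
    '2013-04-25T10:00:01Z GET /varnish-health 200 ""',
    "not a log line",
]
events = [event for event in map(parse_line, sample_lines) if event is not None]
print(events)
```

Only the first sample line survives: the second is dropped as Varnish traffic and the third does not match the assumed layout.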

Page 38: Data & Analytics - Session 2 - Introducing Amazon Redshift

EMR

• Heavy Lifting

• Log Parsing & Data Extraction

– Cookies

• Clickstream

• Directory Generation

• Network Processing

• Process 40GB+ Telemetry data daily

• Reserved & Spot Instances

Page 39: Data & Analytics - Session 2 - Introducing Amazon Redshift

Redshift Implementation

• High Storage Extra Large (XL) DW Node

– Growing from 2 x DW.HS1.XLARGE nodes

• Reservations

• ETL Activities

– Approx. 90 minutes including exports from RDBMS, copying to S3, loading stage tables, loading target tables, vacuuming and analysing tables

• Schema

• Compression

– Starting to use columnar compression

• Retention
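The Redshift-side part of the ETL run described on this slide (load stage tables, load target tables, vacuum, analyse) might look roughly like the statement sequence below. This is a sketch under assumptions: the table names, S3 prefix, delimiter, and gzip option are hypothetical, the credentials string is a placeholder, and the export from the RDBMS and the copy to S3 are assumed to have happened already. The function only builds the SQL text; it does not connect to a cluster.

```python
# Builds an illustrative Redshift load sequence; all names are hypothetical.

def build_load_statements(target_table, stage_table, s3_prefix, credentials):
    """Return the SQL statements for one stage-then-merge load, in order."""
    return [
        # Start from an empty stage table.
        f"TRUNCATE {stage_table};",
        # Parallel load of the exported files from Amazon S3.
        f"COPY {stage_table} FROM '{s3_prefix}' "
        f"CREDENTIALS '{credentials}' DELIMITER '|' GZIP;",
        # Move the staged rows into the target table.
        f"INSERT INTO {target_table} SELECT * FROM {stage_table};",
        # Re-sort rows and reclaim space, then refresh planner statistics.
        f"VACUUM {target_table};",
        f"ANALYZE {target_table};",
    ]


statements = build_load_statements(
    target_table="events",
    stage_table="stage_events",
    s3_prefix="s3://example-bucket/exports/events/",
    credentials="<placeholder-credentials>",
)
for statement in statements:
    print(statement)
```

Loading through a stage table keeps a failed COPY away from the target table, which is one common reason for the two-step approach the slide lists.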

Page 40: Data & Analytics - Session 2 - Introducing Amazon Redshift

DW Anatomy

 Dimension     | Purpose
---------------+------------------------------------------------------------------------------
 Users         | Analyse the composition of the user base
 Events        | Analyse significant actions that reflect user activity & behaviour
 Clickstream   | Analyse user browsing and landing events at a page level
 Email         | Click through and Cohort Analysis
 Notifications | Analyse user to user messaging – what users are mailing what users and when.
 Sessions      | Traffic & Visit Analysis
 Skills        | Analyse Skills by Classification and User Context
 Opportunities | Analyse Opportunities by Classification, User Context and response rate.
 Search        | Analyse and quantify the characteristics of each search made on the platform.

Page 41: Data & Analytics - Session 2 - Introducing Amazon Redshift

Performance

Page 42: Data & Analytics - Session 2 - Introducing Amazon Redshift

Accessing Data

• Consumers

– Tableau

– Excel/PowerPivot

• Technical Team

– SQL Workbench

Driver: JDBC for PostgreSQL 8.xx

Page 43: Data & Analytics - Session 2 - Introducing Amazon Redshift

Data Visualisation

Page 44: Data & Analytics - Session 2 - Introducing Amazon Redshift

Redshift - Nice to haves

• Ability to load LZO files from S3

• Additional analytical functions e.g. MEDIAN

• Hierarchies

• ETL tool working with S3, many database vendors

Page 45: Data & Analytics - Session 2 - Introducing Amazon Redshift

Why Redshift works for SkillPages

• Scale - MPP

• Performance

• Columnar

• Platform Integration

• S3, Dynamo

• Operational Advantages

• Ease of Access

• Cost

Page 46: Data & Analytics - Session 2 - Introducing Amazon Redshift

Thank you!

Customer Use Case

Mike McCarthy

CTO, SkillPages

Page 47: Data & Analytics - Session 2 - Introducing Amazon Redshift

Resources & Questions

• Steffen Krause | @AWS_Aktuell

• http://aws.amazon.com/redshift

• https://aws.amazon.com/marketplace/redshift/

• https://www.jaspersoft.com/webinar-AWS-Agile-Reporting-and-Analytics-in-the-Cloud