
2013 AWS Worldwide Public Sector Summit Washington, D.C.

EMR for Fun and for Profit

Ben Butler | Sr. Manager, Big Data

butlerb@amazon.com | @bensbutler


Overview

1. What is big data?
2. What is AWS Elastic MapReduce?
3. What data is available?
4. How to use AWS EMR to support my agency's mission?

What is Big Data?

So what is it? Big data is when your data sets become so large that you have to start innovating around how to collect, store, organize, analyze, and share them.

Challenges start at relatively small volumes: from roughly 100 GB, through TB, up to 1,000-PB scale.

Unconstrained data growth

• 95% of the 1.2 zettabytes of data in the digital universe is unstructured
• 70% of this is user-generated content
• Unstructured data growth is explosive, with compound annual growth (CAGR) estimated at 62% from 2008 to 2012 (source: IDC)

Where does it come from?

• Web sites: blogs, reviews, emails, pictures
• Social graphs: Facebook, LinkedIn, contacts
• Application server logs: web sites, games
• Sensor data: weather, water, smart grids
• Images/videos: traffic and security cameras
• Twitter: 50M tweets/day, growing 1,400% per year

Why AWS and big data? Innovation across compute and storage: Amazon S3, Amazon DynamoDB, Amazon Redshift, Spot Instances, HPC, and EMR.

How do you get your slice of it?

• AWS Direct Connect: dedicated, low-latency bandwidth
• Queuing: highly scalable event buffering
• AWS Storage Gateway: sync local storage to the cloud
• AWS Import/Export: physical media shipping
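For the queuing option, here is a minimal sketch of event buffering with Amazon SQS using today's boto3 SDK (this 2013 deck predates boto3); the queue name and payload are hypothetical:

import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.create_queue(QueueName="click-events")["QueueUrl"]

# Producers buffer events into the queue as fast as they arrive...
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps({"user": "john.smith", "action": "click"}),
)

# ...and consumers drain it at their own pace (long polling, batches of 10).
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10,
                           WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])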

Where do you put your slice of it?

• Amazon Relational Database Service (RDS): fully managed database (MySQL, Oracle, MSSQL)
• Amazon DynamoDB: NoSQL, schema-less, provisioned-throughput database
• Amazon S3: object datastore, up to 5 TB per object, 99.999999999% durability
• Amazon SimpleDB: NoSQL, schema-less, for smaller datasets
• Amazon Glacier: long-term cold storage from $0.01 per GB/month, 99.999999999% durability

How quickly do you need to read it? Balance scale, price, and performance:

• Amazon DynamoDB (single-digit-millisecond reads): social-scale applications, provisioned throughput performance, flexible consistency models
• Amazon S3 (tens to hundreds of milliseconds): any object, any app; objects up to 5 TB in size; 99.999999999% durability
• Amazon Glacier (retrieval in under 5 hours): media and asset archives; extremely low cost; S3 levels of durability

All of these operate at any scale, with effectively unlimited data.
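As a concrete illustration of provisioned throughput and flexible consistency, a minimal DynamoDB sketch with today's boto3 SDK (the deck predates it); the table and attribute names are hypothetical:

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Provisioned throughput: you declare read/write capacity up front.
dynamodb.create_table(
    TableName="ClickEvents",
    KeySchema=[{"AttributeName": "UserId", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "UserId", "AttributeType": "S"}],
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 50},
)

# Flexible consistency: reads are eventually consistent by default;
# pass ConsistentRead=True for a strongly consistent (costlier) read.
item = dynamodb.get_item(
    TableName="ClickEvents",
    Key={"UserId": {"S": "john.smith"}},
    ConsistentRead=True,
)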

Data has gravity, and inertia at volume: it is easier to move applications to the data than to move the data to the applications. (Source: http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/)

Bring compute capacity to the data:

"Very large dataset seeks strong & consistent compute for short-term relationship, possibly longer."

Flexible compute resources, on demand

Amazon Elastic Compute Cloud (EC2) is the basic unit of compute capacity, with vertical scaling across a range of CPU, memory, and local disk options: 17 instance types, from micro through cluster compute to SSD-backed, starting from $0.02/hr.

Feature summary:

• Flexible: run Windows or Linux distributions
• Scalable: wide range of instance types, from micro to cluster compute
• Machine images: configurations can be saved as machine images (AMIs) from which new instances can be created
• Full control: full root or administrator rights
• VM Import/Export: import and export VM images to transfer configurations in and out of EC2
• Monitoring: publishes metrics to CloudWatch
• Inexpensive: On-Demand, Reserved, and Spot instance types
• Secure: full firewall control via security groups
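A minimal sketch of launching instances from a saved AMI with boto3; the AMI ID, key pair, and security group are hypothetical placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch two instances from a machine image (AMI), firewalled by a
# security group; the same call scales from one instance to thousands.
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",     # hypothetical AMI
    InstanceType="m5.large",
    MinCount=2,
    MaxCount=2,
    KeyName="my-keypair",
    SecurityGroupIds=["sg-0123456789abcdef0"],
)
for inst in resp["Instances"]:
    print(inst["InstanceId"], inst["State"]["Name"])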

Elastic capacity as you need it

Common workload patterns: on and off, fast growth, variable peaks, and predictable peaks. Fixed, traditional IT capacity either over-provisions (waste) or under-provisions (customer dissatisfaction); elastic cloud capacity tracks your actual IT needs over time.

From one instance… to thousands.

Why AWS and big data? Innovation across compute and storage: S3, DynamoDB, Redshift, Spot, HPC, and, up next, EMR.


AWS EMR – Elastic MapReduce

• A key tool in the toolbox for "Big Data" challenges
• Makes analytics processes possible that previously were not feasible
• Cost-effective when leveraged with the EC2 Spot market
• Broad ecosystem of tools to handle specific use cases

What is EMR? Amazon Elastic MapReduce is Hadoop-as-a-service: a massively parallel MapReduce engine and a cost-effective AWS wrapper, integrated with AWS services and with the Hadoop tool ecosystem.

Consider a very large click log (e.g., TBs) containing lots of actions by John Smith:

1. Split the log into many small pieces.
2. Process the pieces in an EMR cluster.
3. Aggregate the results from all the nodes.
4. Out comes what John Smith did: insight in a fraction of the time.
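A minimal sketch of the split/process/aggregate idea as a pair of Hadoop Streaming scripts; the tab-separated user/action log layout is an assumption, not something the deck specifies:

#!/usr/bin/env python
# mapper.py: the "process each piece" step. Assumes (hypothetically)
# tab-separated lines of user<TAB>action<TAB>...
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        # Emit one user/action pair per click; Hadoop's shuffle groups by user.
        print("%s\t%s" % (fields[0], fields[1]))

#!/usr/bin/env python
# reducer.py: the "aggregate the results" step. Streaming input arrives
# sorted by key, so all of one user's actions are contiguous.
import sys

current_user, actions = None, []
for line in sys.stdin:
    user, action = line.rstrip("\n").split("\t", 1)
    if user != current_user and current_user is not None:
        print("%s\t%s" % (current_user, ",".join(actions)))
        actions = []
    current_user = user
    actions.append(action)
if current_user is not None:
    print("%s\t%s" % (current_user, ",".join(actions)))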

How does it work?

1. Put the data into S3 (or HDFS).
2. Launch your Amazon EMR cluster, choosing the Hadoop distribution, how many nodes, the node type (hi-CPU, hi-memory, etc.), and Hadoop apps (Hive, Pig, HBase).
3. Get the results.
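A minimal sketch of those three steps with the boto3 SDK (the deck predates boto3; in 2013 you would have used the boto library or the EMR CLI). Bucket names, node counts, and instance types are hypothetical:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Step 2: launch the cluster, choosing node count, node type, and apps.
response = emr.run_job_flow(
    Name="click-log-analysis",
    ReleaseLabel="emr-5.36.0",             # modern EMR; 2013 used AmiVersion
    Applications=[{"Name": "Hive"}, {"Name": "Pig"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate when done
    },
    Steps=[{
        "Name": "aggregate-user-actions",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-bucket/mapper.py,s3://my-bucket/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", "s3://my-bucket/click-logs/",   # step 1: data in S3
                "-output", "s3://my-bucket/results/",     # step 3: results
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
cluster_id = response["JobFlowId"]
print("Launched cluster", cluster_id)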

How does it work? You can easily resize the Amazon EMR cluster while it runs.
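A sketch of resizing, again with boto3; the cluster ID is the hypothetical one returned by the launch call above:

import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"   # hypothetical, from the launch call

# Find the core instance group and grow it to 20 nodes.
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
core = next(g for g in groups if g["InstanceGroupType"] == "CORE")
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{"InstanceGroupId": core["Id"], "InstanceCount": 20}],
)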

How does it work? Use Spot nodes to save time and money.
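One common pattern is to add Spot capacity as a task instance group, since task nodes hold no HDFS data and losing a Spot node is therefore safe. A boto3 sketch with a hypothetical cluster ID and bid price:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",        # hypothetical cluster ID
    InstanceGroups=[{
        "Name": "spot-task-group",
        "InstanceRole": "TASK",          # no HDFS on task nodes
        "InstanceType": "m5.xlarge",
        "InstanceCount": 20,
        "Market": "SPOT",
        "BidPrice": "0.10",              # hypothetical bid, USD per hour
    }],
)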

How does it work? Launch parallel clusters against the same data source, each tuned for its own workload.

How does it work? When the work is complete, you can terminate the cluster (and stop paying).
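A boto3 sketch of explicit termination (alternatively, launching with KeepJobFlowAliveWhenNoSteps=False, as above, terminates the cluster automatically once its steps finish):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Billing for the cluster's instances stops once termination completes.
emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])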

How does it work? You can store everything in HDFS (local disk); High Storage nodes provide 48 TB per node. For extra security, launch the Amazon EMR cluster in a Virtual Private Cloud.

How does it work? Anatomy of an Amazon EMR cluster:

• Start an Amazon EMR cluster using the AWS Management Console or the AWS Command Line Interface tools.
• A master instance group is created that controls the cluster (it runs MySQL).
• A core instance group is created for the life of the cluster.
• Core instances run the DataNode and TaskTracker daemons (HDFS).
• Optional task instances can be added or subtracted to perform work.
• Amazon S3 can be used as the underlying file system for input/output data.
• The master node coordinates the distribution of work and manages cluster state.
• Core and task instances read and write to Amazon S3.
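The same anatomy can be expressed directly in a boto3 launch configuration; a sketch with hypothetical sizes and types (the MASTER and CORE groups live for the cluster's life, while the TASK group can be resized at any time):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="cluster-anatomy-demo",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Core nodes run the HDFS DataNodes; they last the cluster's life.
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
            # Task nodes hold no HDFS data, so this group can shrink or grow.
            {"Name": "task", "InstanceRole": "TASK",
             "InstanceType": "m5.xlarge", "InstanceCount": 8,
             "Market": "SPOT", "BidPrice": "0.10"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)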

Inside the cluster, each Amazon EC2 instance runs map tasks over its input files; the shuffle phase redistributes the intermediate data across instances, and reduce tasks then write the output files (with intermediate data held in HDFS).

The ecosystem around Amazon EMR: analytics languages such as Pig run on the cluster, while data management spans HDFS, Amazon S3, Amazon DynamoDB, and Amazon RDS. AWS Data Pipeline moves data between Amazon EMR, Amazon RDS, Amazon S3, Amazon DynamoDB, and Amazon Redshift.
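To tie the pieces together, a sketch of submitting a Pig script stored in S3 as a step on a running cluster (boto3; the script path and cluster ID are hypothetical, and the exact pig-script invocation varies by EMR release, so check the documentation for yours):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "pig-report",
        "ActionOnFailure": "CONTINUE",   # keep the cluster if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["pig-script", "--run-pig-script",
                     "--args", "-f", "s3://my-bucket/report.pig"],
        },
    }],
)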

Amazon’s Public Data Sets

2013 AWS Worldwide Public Sector Summit

Amazon Public Data Sets

2013 AWS Worldwide Public Sector Summit

• 270+ TB and growing dataset, hosted for free in

AWS cloud

• Researchers no longer need massive on-

premises storage and compute

• Collaboration revolution: not just shared data but

“executable papers”

1000 Genomes Project
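Public data sets can be read straight out of S3 with no credentials; a boto3 sketch against the public 1000 Genomes bucket (bucket name believed correct, but verify against the data set's current documentation):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned requests: public data sets require no AWS account keys.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="1000genomes", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])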

Using EMR for your Mission

Cloud Democratizes Big Data/HPC

• Lack of constraints leads to new usage models
• Gives control back to individual development teams
• Fail-fast (and fail-cheap) opens up an exploratory style
• Many customers create 100s of Amazon EMR clusters per day: a classic bursty workload, perfect for the cloud
• Big data / HPC clusters are themselves parallelized resources
• Can you build a faster on-premises cluster? Yes, but…
• …it is usually a shared, contended resource; in the cloud, each user or workgroup gets their own cluster
• …the cloud is often the fastest platform measured by "MTTJC" (Mean Time To Job Completion)

Cloud and Big Data/HPC

• No-obligation use allows for experimentation, prototypes, and operational/business pilots
• Faster time from inception of idea to solution
• Provides a platform that can scale to meet the massive needs of large data sets
• Bottom line: enables experimentation and innovation without large capital investments, and improves ROI for Big Data projects
• http://aws.amazon.com/hpc-applications/

Try the tutorial: aws.amazon.com/articles/2855

Find out more: aws.amazon.com/big-data

Thank you!

Ben Butler, Sr. Manager, AWS Marketing
