
2013 AWS Worldwide Public Sector Summit Washington, D.C.

EMR for Fun and for Profit

Ben Butler | Sr. Manager, Big Data

butlerb@amazon.com | @bensbutler


Overview

1. What is big data?
2. What is AWS Elastic MapReduce?
3. What data is available?
4. How to use AWS EMR to support my agency's mission?

What is Big Data?

So what is it? Big data is when your data sets become so large that you have to start innovating around how to collect, store, organize, analyze, and share them.

Challenges start at relatively small volumes: from roughly 100 GB, through TB, up to 1,000-PB scale.

Unconstrained data growth

• 95% of the 1.2 zettabytes of data in the digital universe is unstructured
• 70% of this is user-generated content
• Unstructured data growth is explosive, with compound annual growth (CAGR) estimated at 62% from 2008 to 2012 (source: IDC)

Where does it come from?

• Web sites: blogs, reviews, emails, pictures
• Social graphs: Facebook, LinkedIn, contacts
• Application server logs: web sites, games
• Sensor data: weather, water, smart grids
• Images/videos: traffic and security cameras
• Twitter: 50M tweets/day, growing 1,400% per year

Why AWS and big data? Innovation across compute and storage: Amazon S3, Amazon DynamoDB, Amazon Redshift, Spot Instances, HPC, and EMR.

How do you get your slice of it?

• AWS Direct Connect: dedicated, low-latency bandwidth
• Queuing: highly scalable event buffering
• AWS Storage Gateway: sync local storage to the cloud
• AWS Import/Export: physical media shipping
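For the queuing option, here is a minimal sketch of event buffering with Amazon SQS using today's boto3 SDK (this 2013 deck predates boto3); the queue name and payload are hypothetical:

import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.create_queue(QueueName="click-events")["QueueUrl"]

# Producers buffer events into the queue as fast as they arrive...
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps({"user": "john.smith", "action": "click"}),
)

# ...and consumers drain it at their own pace (long polling, batches of 10).
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10,
                           WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])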

Where do you put your slice of it?

• Amazon Relational Database Service (RDS): fully managed database (MySQL, Oracle, MSSQL)
• Amazon DynamoDB: NoSQL, schema-less, provisioned-throughput database
• Amazon S3: object datastore, up to 5 TB per object, 99.999999999% durability
• Amazon SimpleDB: NoSQL, schema-less, for smaller datasets
• Amazon Glacier: long-term cold storage from $0.01 per GB/month, 99.999999999% durability

How quickly do you need to read it? Balance scale, price, and performance:

• Amazon DynamoDB (single-digit-millisecond reads): social-scale applications, provisioned throughput performance, flexible consistency models
• Amazon S3 (tens to hundreds of milliseconds): any object, any app; objects up to 5 TB in size; 99.999999999% durability
• Amazon Glacier (retrieval in under 5 hours): media and asset archives; extremely low cost; S3 levels of durability

All of these operate at any scale, with effectively unlimited data.
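As a concrete illustration of provisioned throughput and flexible consistency, a minimal DynamoDB sketch with today's boto3 SDK (the deck predates it); the table and attribute names are hypothetical:

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Provisioned throughput: you declare read/write capacity up front.
dynamodb.create_table(
    TableName="ClickEvents",
    KeySchema=[{"AttributeName": "UserId", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "UserId", "AttributeType": "S"}],
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 50},
)

# Flexible consistency: reads are eventually consistent by default;
# pass ConsistentRead=True for a strongly consistent (costlier) read.
item = dynamodb.get_item(
    TableName="ClickEvents",
    Key={"UserId": {"S": "john.smith"}},
    ConsistentRead=True,
)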

Data has gravity, and inertia at volume: it is easier to move applications to the data than to move the data to the applications. (Source: http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/)

Bring compute capacity to the data:

"Very large dataset seeks strong & consistent compute for short-term relationship, possibly longer."

Flexible compute resources, on demand

Amazon Elastic Compute Cloud (EC2) is the basic unit of compute capacity, with vertical scaling across a range of CPU, memory, and local disk options: 17 instance types, from micro through cluster compute to SSD-backed, starting from $0.02/hr.

Feature summary:

• Flexible: run Windows or Linux distributions
• Scalable: wide range of instance types, from micro to cluster compute
• Machine images: configurations can be saved as machine images (AMIs) from which new instances can be created
• Full control: full root or administrator rights
• VM Import/Export: import and export VM images to transfer configurations in and out of EC2
• Monitoring: publishes metrics to CloudWatch
• Inexpensive: On-Demand, Reserved, and Spot instance types
• Secure: full firewall control via security groups
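A minimal sketch of launching instances from a saved AMI with boto3; the AMI ID, key pair, and security group are hypothetical placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch two instances from a machine image (AMI), firewalled by a
# security group; the same call scales from one instance to thousands.
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",     # hypothetical AMI
    InstanceType="m5.large",
    MinCount=2,
    MaxCount=2,
    KeyName="my-keypair",
    SecurityGroupIds=["sg-0123456789abcdef0"],
)
for inst in resp["Instances"]:
    print(inst["InstanceId"], inst["State"]["Name"])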

Elastic capacity as you need it

Common workload patterns: on and off, fast growth, variable peaks, and predictable peaks. Fixed, traditional IT capacity either over-provisions (waste) or under-provisions (customer dissatisfaction); elastic cloud capacity tracks your actual IT needs over time.

From one instance… to thousands.

Why AWS and big data? Innovation across compute and storage: S3, DynamoDB, Redshift, Spot, HPC, and, up next, EMR.


AWS EMR – Elastic MapReduce

• A key tool in the toolbox for "Big Data" challenges
• Makes analytics processes possible that previously were not feasible
• Cost-effective when leveraged with the EC2 Spot market
• Broad ecosystem of tools to handle specific use cases

What is EMR? Amazon Elastic MapReduce is Hadoop-as-a-service: a massively parallel MapReduce engine and a cost-effective AWS wrapper, integrated with AWS services and with the Hadoop tool ecosystem.

Consider a very large click log (e.g., TBs) containing lots of actions by John Smith:

1. Split the log into many small pieces.
2. Process the pieces in an EMR cluster.
3. Aggregate the results from all the nodes.
4. Out comes what John Smith did: insight in a fraction of the time.
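A minimal sketch of the split/process/aggregate idea as a pair of Hadoop Streaming scripts; the tab-separated user/action log layout is an assumption, not something the deck specifies:

#!/usr/bin/env python
# mapper.py: the "process each piece" step. Assumes (hypothetically)
# tab-separated lines of user<TAB>action<TAB>...
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        # Emit one user/action pair per click; Hadoop's shuffle groups by user.
        print("%s\t%s" % (fields[0], fields[1]))

#!/usr/bin/env python
# reducer.py: the "aggregate the results" step. Streaming input arrives
# sorted by key, so all of one user's actions are contiguous.
import sys

current_user, actions = None, []
for line in sys.stdin:
    user, action = line.rstrip("\n").split("\t", 1)
    if user != current_user and current_user is not None:
        print("%s\t%s" % (current_user, ",".join(actions)))
        actions = []
    current_user = user
    actions.append(action)
if current_user is not None:
    print("%s\t%s" % (current_user, ",".join(actions)))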

How does it work?

1. Put the data into S3 (or HDFS).
2. Launch your Amazon EMR cluster, choosing the Hadoop distribution, how many nodes, the node type (hi-CPU, hi-memory, etc.), and Hadoop apps (Hive, Pig, HBase).
3. Get the results.
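A minimal sketch of those three steps with the boto3 SDK (the deck predates boto3; in 2013 you would have used the boto library or the EMR CLI). Bucket names, node counts, and instance types are hypothetical:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Step 2: launch the cluster, choosing node count, node type, and apps.
response = emr.run_job_flow(
    Name="click-log-analysis",
    ReleaseLabel="emr-5.36.0",             # modern EMR; 2013 used AmiVersion
    Applications=[{"Name": "Hive"}, {"Name": "Pig"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate when done
    },
    Steps=[{
        "Name": "aggregate-user-actions",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-bucket/mapper.py,s3://my-bucket/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", "s3://my-bucket/click-logs/",   # step 1: data in S3
                "-output", "s3://my-bucket/results/",     # step 3: results
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
cluster_id = response["JobFlowId"]
print("Launched cluster", cluster_id)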

How does it work? You can easily resize the Amazon EMR cluster while it runs.
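A sketch of resizing, again with boto3; the cluster ID is the hypothetical one returned by the launch call above:

import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"   # hypothetical, from the launch call

# Find the core instance group and grow it to 20 nodes.
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
core = next(g for g in groups if g["InstanceGroupType"] == "CORE")
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{"InstanceGroupId": core["Id"], "InstanceCount": 20}],
)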

How does it work? Use Spot nodes to save time and money.
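One common pattern is to add Spot capacity as a task instance group, since task nodes hold no HDFS data and losing a Spot node is therefore safe. A boto3 sketch with a hypothetical cluster ID and bid price:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",        # hypothetical cluster ID
    InstanceGroups=[{
        "Name": "spot-task-group",
        "InstanceRole": "TASK",          # no HDFS on task nodes
        "InstanceType": "m5.xlarge",
        "InstanceCount": 20,
        "Market": "SPOT",
        "BidPrice": "0.10",              # hypothetical bid, USD per hour
    }],
)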

How does it work? Launch parallel clusters against the same data source, each tuned for its own workload.

How does it work? When the work is complete, you can terminate the cluster (and stop paying).
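A boto3 sketch of explicit termination (alternatively, launching with KeepJobFlowAliveWhenNoSteps=False, as above, terminates the cluster automatically once its steps finish):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Billing for the cluster's instances stops once termination completes.
emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])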

How does it work? You can store everything in HDFS (local disk); High Storage nodes provide 48 TB per node. For extra security, launch the Amazon EMR cluster in a Virtual Private Cloud.

How does it work? Anatomy of an Amazon EMR cluster:

• Start an Amazon EMR cluster using the AWS Management Console or the AWS Command Line Interface tools.
• A master instance group is created that controls the cluster (it runs MySQL).
• A core instance group is created for the life of the cluster.
• Core instances run the DataNode and TaskTracker daemons (HDFS).
• Optional task instances can be added or subtracted to perform work.
• Amazon S3 can be used as the underlying file system for input/output data.
• The master node coordinates the distribution of work and manages cluster state.
• Core and task instances read and write to Amazon S3.
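The same anatomy can be expressed directly in a boto3 launch configuration; a sketch with hypothetical sizes and types (the MASTER and CORE groups live for the cluster's life, while the TASK group can be resized at any time):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="cluster-anatomy-demo",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Core nodes run the HDFS DataNodes; they last the cluster's life.
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
            # Task nodes hold no HDFS data, so this group can shrink or grow.
            {"Name": "task", "InstanceRole": "TASK",
             "InstanceType": "m5.xlarge", "InstanceCount": 8,
             "Market": "SPOT", "BidPrice": "0.10"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)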

Inside the cluster, each Amazon EC2 instance runs map tasks over its input files; the shuffle phase redistributes the intermediate data across instances, and reduce tasks then write the output files (with intermediate data held in HDFS).

The ecosystem around Amazon EMR: analytics languages such as Pig run on the cluster, while data management spans HDFS, Amazon S3, Amazon DynamoDB, and Amazon RDS. AWS Data Pipeline moves data between Amazon EMR, Amazon RDS, Amazon S3, Amazon DynamoDB, and Amazon Redshift.
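To tie the pieces together, a sketch of submitting a Pig script stored in S3 as a step on a running cluster (boto3; the script path and cluster ID are hypothetical, and the exact pig-script invocation varies by EMR release, so check the documentation for yours):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "pig-report",
        "ActionOnFailure": "CONTINUE",   # keep the cluster if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["pig-script", "--run-pig-script",
                     "--args", "-f", "s3://my-bucket/report.pig"],
        },
    }],
)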

Amazon’s Public Data Sets

2013 AWS Worldwide Public Sector Summit

Amazon Public Data Sets

2013 AWS Worldwide Public Sector Summit

• 270+ TB and growing dataset, hosted for free in

AWS cloud

• Researchers no longer need massive on-

premises storage and compute

• Collaboration revolution: not just shared data but

“executable papers”

1000 Genomes Project
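Public data sets can be read straight out of S3 with no credentials; a boto3 sketch against the public 1000 Genomes bucket (bucket name believed correct, but verify against the data set's current documentation):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned requests: public data sets require no AWS account keys.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="1000genomes", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])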

Using EMR for your Mission

Cloud Democratizes Big Data/HPC

• Lack of constraints leads to new usage models
• Gives control back to individual development teams
• Fail-fast (and fail-cheap) opens up an exploratory style
• Many customers create 100s of Amazon EMR clusters per day: a classic bursty workload, perfect for the cloud
• Big data / HPC clusters are themselves parallelized resources
• Can you build a faster on-premises cluster? Yes, but…
• …it is usually a shared, contended resource; in the cloud, each user or workgroup gets their own cluster
• …the cloud is often the fastest platform measured by "MTTJC" (Mean Time To Job Completion)

Cloud and Big Data/HPC

• No-obligation use allows for experimentation, prototypes, and operational/business pilots
• Faster time from inception of idea to solution
• Provides a platform that can scale to meet the massive needs of large data sets
• Bottom line: enables experimentation and innovation without large capital investments, and improves ROI for Big Data projects
• http://aws.amazon.com/hpc-applications/

Try the tutorial: aws.amazon.com/articles/2855

Find out more: aws.amazon.com/big-data

Thank you!

Ben Butler, Sr. Manager, AWS Marketing
