Big Data with Amazon EMR - Pop-up Loft Tel Aviv
TRANSCRIPT
Big Data with Amazon EMR
Jonathan Fritz, Sr. Product Manager, AWS
Hadoop 2
• Storage: S3, HDFS
• YARN: Cluster Resource Management
• Batch: MapReduce
• Interactive: Tez
• In Memory: Spark
• Applications: Pig, Hive, Cascading, Mahout, Giraph, HBase, Presto, Impala

Hadoop 1
• Storage: S3, HDFS
• Batch: MapReduce
• Applications
Amazon EMR
Making it easy, secure, and cost-effective to run data processing frameworks on the AWS cloud
Amazon EMR
• Managed platform
• Hadoop MapReduce, Spark, Presto,
and more
• Launch clusters in minutes
• Apache Bigtop based distribution
• Leverage the elasticity of the cloud
• Added security features
• Pay by the hour and save with Spot
• Flexibility to customize
• Programmable Infrastructure
What do I need to build a cluster?
1. Choose instances
2. Choose your software
3. Choose your access method
Cluster composition
• Master Node: NameNode (HDFS), ResourceManager (YARN), and other components
• Core Instance Group: HDFS DataNode, YARN Node Manager
• Task Instance Groups: YARN Node Manager
Choice of multiple instances
• CPU: c3 and c4 families (machine learning)
• Memory: m2 and r3 families (in-memory: Spark & Presto)
• Disk/IO: d2 and i2 families (large HDFS)
• General: m1, m3, and m4 families (batch processing)
Or add EBS volumes if you need additional on-cluster storage.
Hadoop applications available in EMR
Or, use bootstrap actions to install arbitrary
applications on your cluster!
Choose your software - Quick Create
Choose your software – Advanced Options
Configuration API for custom configs
[
{
"Classification": "core-site",
"Properties": {
"hadoop.security.groups.cache.secs": "250"
}
},
{
"Classification": "mapred-site",
"Properties": {
"mapred.tasktracker.map.tasks.maximum": "2",
"mapreduce.map.sort.spill.percent": "90",
"mapreduce.tasktracker.reduce.tasks.maximum": "5"
}
}
]
Use the AWS CLI to easily create clusters:
aws emr create-cluster \
  --release-label emr-4.3.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
Or use your favorite SDK for programmatic provisioning:
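For instance, with the Python SDK (boto3), the same cluster can be described as a `run_job_flow` request. This is a minimal sketch, assuming the default EMR IAM roles (EMR_DefaultRole, EMR_EC2_DefaultRole) already exist in the account:

```python
# Sketch only: build a run_job_flow request matching the CLI example above.
# Assumes the default EMR roles (EMR_DefaultRole, EMR_EC2_DefaultRole)
# already exist in the account.
def build_run_job_flow_request(name="demo-cluster"):
    """Return keyword arguments for emr_client.run_job_flow()."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.3.0",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceCount": 1,
                 "InstanceType": "m3.xlarge"},
                {"InstanceRole": "CORE", "InstanceCount": 2,
                 "InstanceType": "m3.xlarge"},
            ],
            # Keep the cluster alive after steps finish (long-running cluster)
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }

# Usage (needs AWS credentials; not executed here):
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# response = emr.run_job_flow(**build_run_job_flow_request("my-cluster"))
```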
Use Amazon EMR to
separate your
compute and storage.
On premises: compute and storage grow together
• Tightly coupled
• Storage grows along with compute
• Compute requirements vary
On premises: Underutilized or scarce resources
[Chart: cluster utilization over time, showing a steady state, weekly peaks, and re-processing spikes]
On premises: Contention for same resources
[Diagram: compute-bound and memory-bound workloads contending on the same cluster]
Separation of resources creates data silos
[Diagram: each team (e.g., Team A) runs its own isolated cluster]
On premises: Replication adds to cost
• HDFS needs 3x replication
• Multi-data-center DR adds further copies
Use Amazon EMR to
separate your
compute and storage.
EMR can process data from many sources
• Hadoop Distributed File System (HDFS)
• Amazon S3 (EMRFS)
• Amazon DynamoDB, Redshift, Aurora, RDS
• Amazon Kinesis
• Other applications running in your architecture (Kafka, Elasticsearch, etc.)
Amazon S3 is your persistent data store
• 11 9s of durability
• $0.03/GB/month in US-East
• Lifecycle policies
• Available across AZs
• Easy access
The EMR Filesystem (EMRFS)
• Allows you to leverage S3 as a file-system for Hadoop
• Streams data directly from S3
• Cluster still uses local disk/HDFS for intermediates
• Better read/write performance and error handling than
open source components
• Optional consistent view for consistent list
• Support for encryption
• Fast listing of objects
Going from HDFS to S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'
Going from HDFS to S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
Benefit 1: Switch off clusters
Auto-terminate clusters after job completion
You can build a pipeline
Submit jobs using:
- EMR Step API
- Oozie
- SSH directly
- Genie (Gateway)
- OSS workflow tools (e.g., Luigi)
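As a sketch of the Step API route with boto3, the step below runs a Hive script through command-runner.jar (available on EMR 4.x+ release labels); the S3 script path and cluster ID are hypothetical placeholders:

```python
# Sketch only: an EMR Step definition that runs a Hive script from S3
# via command-runner.jar (EMR 4.x+ step launcher).
# The S3 path and cluster ID below are hypothetical placeholders.
def build_hive_step(script_s3_path):
    """Return a Step definition for emr_client.add_job_flow_steps()."""
    return {
        "Name": "hive-query",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", script_s3_path],
        },
    }

# Usage (needs AWS credentials; not executed here):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXX",
#                        Steps=[build_hive_step("s3://my-bucket/query.hql")])
```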
You can use Amazon Data Pipeline
[Pipeline: input data → EMR transforms unstructured data to structured → push to S3 → ingest into Redshift]
Run transient or long-running clusters
Benefit 2: Resize your cluster to match
workload requirements
Resize using the Console, CLI, or API
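Programmatically, a resize is a single API call. A boto3 sketch, where the instance group ID is a placeholder (in practice you would look it up with `emr_client.list_instance_groups`):

```python
# Sketch only: resize a running cluster's instance group by changing
# its instance count. The instance group ID is a placeholder; look it
# up with emr_client.list_instance_groups(ClusterId=...) in practice.
def build_resize_request(instance_group_id, new_count):
    """Return keyword arguments for emr_client.modify_instance_groups()."""
    if new_count < 0:
        raise ValueError("instance count must be non-negative")
    return {
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ]
    }

# Usage (needs AWS credentials; not executed here):
# import boto3
# emr = boto3.client("emr")
# emr.modify_instance_groups(**build_resize_request("ig-EXAMPLE", 10))
```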
Save costs with EC2 Spot instances
[Chart: Spot bid price vs. on-demand (OD) price]
Spot integration
aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
The Spot Bid Advisor
Spot Integration with EMR
• Can provision instances from the Spot Market
• Replaces a spot instance in case of interruption
• Impact of interruption
• Master Node – can lose the cluster
• Core Node – can lose data stored in HDFS
• Task Nodes – can lose the task (but it will be rerun elsewhere)
Scale up with Spot Instances
10 node cluster running for 14 hours
Cost = 1.0 * 10 * 14 = $140
Resize Nodes with Spot Instances
Add 10 more nodes on Spot
Resize Nodes with Spot Instances
20 node cluster running for 7 hours
Cost = 1.0 * 10 * 7 = $70 (on-demand)
     + 0.5 * 10 * 7 = $35 (Spot)
Total = $105
Resize Nodes with Spot Instances
50% less run-time (14 hours → 7 hours)
25% less cost ($140 → $105)
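The arithmetic on these slides checks out; a quick script using the slides' illustrative prices ($1.00/hour on-demand, $0.50/hour Spot, not real EC2 prices):

```python
# Check the cost comparison from the slides.
# Prices are the slides' illustrative numbers, not real EC2 prices.
OD_PRICE = 1.00    # on-demand, $ per instance-hour
SPOT_PRICE = 0.50  # Spot, $ per instance-hour

# Baseline: 10 on-demand nodes running for 14 hours
baseline = OD_PRICE * 10 * 14

# Resized: 10 on-demand + 10 Spot nodes running for 7 hours
resized = OD_PRICE * 10 * 7 + SPOT_PRICE * 10 * 7

time_saved = 1 - 7 / 14
cost_saved = 1 - resized / baseline
print(baseline, resized, time_saved, cost_saved)
# → 140.0 105.0 0.5 0.25 (50% less run-time, 25% less cost)
```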
Intelligent scale down
Intelligent scale down – HDFS
Effectively utilize clusters
Benefit 3: Logical separation of jobs
[Diagram: a production cluster (Hive, Pig, Cascading) and an ad-hoc Presto cluster, both reading from Amazon S3]
Benefit 4: Disaster recovery built-in
[Diagram: Clusters 1-4 spread across two Availability Zones, all reading from Amazon S3, with the Hive Metastore in Amazon RDS]
S3 as a data-lake
Nate Summons, Principal Architect - NASDAQ
Monitoring with CloudWatch (or Ganglia)
EMR logging to S3 makes logs easily available
EMR Security Overview

Security fundamentals
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Query string authentication
• SSL endpoints

Encryption
• Server-side encryption (SSE-S3)
• Server-side encryption with KMS-provided keys (coming soon)
• Client-side encryption

Compliance
• Bucket access logs
• Lifecycle management policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes
Networking: VPC private subnets
• Use Amazon S3 Endpoints for
connectivity to S3
• Use Managed NAT for connectivity to
other services or the Internet
• Control the traffic using Security Groups
• ElasticMapReduce-Master-Private
• ElasticMapReduce-Slave-Private
• ElasticMapReduce-ServiceAccess
Access Control: IAM Users and Roles
• IAM policies for access to the Amazon EMR service (IAM users or federated users)
  • AmazonElasticMapReduceFullAccess
  • AmazonElasticMapReduceReadOnlyAccess
• IAM roles for the Amazon EMR cluster
  • Service role (AmazonElasticMapReduceRole): allowable actions for the Amazon EMR service, like creating EC2 instances
  • Instance profile (AmazonElasticMapReduceforEC2Role): for applications that run on Amazon EMR, like access to Amazon S3 for EMRFS on your cluster
Data at Rest: S3 client-side encryption
[Diagram: EMRFS enabled for Amazon S3 client-side encryption; Amazon S3 encryption clients write client-side encrypted objects to Amazon S3; keys come from a key vendor (AWS KMS or your custom key vendor)]
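Using the configuration API shown earlier, EMRFS client-side encryption with a KMS key can be switched on through the emrfs-site classification. A sketch, where the key ARN is a placeholder and the property names follow the EMR 4.x EMRFS documentation:

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.cse.enabled": "true",
      "fs.s3.cse.encryptionMaterialsProvider": "com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider",
      "fs.s3.cse.kms.keyId": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"
    }
  }
]
```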
Data at Rest: HDFS Transparent Encryption
• HDFS encryption zones, each with its own encryption zone key
• Each file gets a unique data encryption key (DEK), stored in encrypted form (EDEK)
• End-to-end (at-rest and in-transit) when data is written to an encryption zone
• Uses Hadoop KMS with the Java
Cryptography Extension KeyStore
(JCEKS)
• Easy to configure in EMR config object
Available with Amazon EMR 4.1.0+ release
Data in Transit – Depends!
• Hadoop
• HDFS Data Transfer Protocol (dfs.encrypt.data.transfer)
• Hadoop RPC (hadoop.rpc.protection)
• MapReduce
• SSL for encrypted shuffle
• Spark
• SSL for Akka and HTTP (for broadcast and file server)
• SASL for the block transfer service
Customer Stories
AOL’s Spot Use Case: restate 6 months of
historical data
• 10 Availability Zones
• 550 EMR clusters
• 24,000 Spot EC2 instances
[Chart: timing comparison, in-house vs. AWS]
FINRA saves money with comparable performance using Hive on Tez with S3
Using EMR and cloud capacity for ETL
Bridging on-prem and EMR for easy ETL
Twitter (Answers) uses EMR as the batch layer
in their Lambda architecture
Using EMR for batch, streaming, and ad hoc
SmartNews
Nasdaq: data lake architecture diagram
Optimizing data warehousing costs with S3 and EMR
Jonathan Fritz, Sr. Product Manager, [email protected]