Big Data with Amazon EMR - Pop-up Loft Tel Aviv

Big Data with Amazon EMR Jonathan Fritz Sr. Product Manager, AWS

Upload: amazon-web-services

Post on 13-Apr-2017


Category:

Technology



TRANSCRIPT

Page 1: Big data with amazon EMR - Pop-up Loft Tel Aviv

Big Data with Amazon EMR

Jonathan Fritz

Sr. Product Manager, AWS

Page 2: Big data with amazon EMR - Pop-up Loft Tel Aviv

Hadoop 1

Applications

Batch: MapReduce

Storage: S3, HDFS

Hadoop 2

Applications: Pig, Hive, Cascading, Mahout, Giraph

HBase, Presto, Impala

Batch: MapReduce

Interactive: Tez

In Memory: Spark

YARN: Cluster Resource Management

Storage: S3, HDFS

Page 3: Big data with amazon EMR - Pop-up Loft Tel Aviv

Amazon EMR

Making it easy, secure and cost-effective to run data processing frameworks on the AWS cloud

Page 4: Big data with amazon EMR - Pop-up Loft Tel Aviv

Amazon EMR

• Managed platform

• Hadoop MapReduce, Spark, Presto, and more

• Launch clusters in minutes

• Apache Bigtop based distribution

• Leverage the elasticity of the cloud

• Added security features

• Pay by the hour and save with Spot

• Flexibility to customize

• Programmable Infrastructure

Page 5: Big data with amazon EMR - Pop-up Loft Tel Aviv

What do I need to build a cluster?

1. Choose instances

2. Choose your software

3. Choose your access method

Page 6: Big data with amazon EMR - Pop-up Loft Tel Aviv

Cluster composition

Master Node: NameNode (HDFS), ResourceManager (YARN), and other components

Core Instance Group: HDFS DataNode, YARN NodeManager

Task Instance Groups: YARN NodeManager

Page 7: Big data with amazon EMR - Pop-up Loft Tel Aviv

Choice of multiple instances

General: m1 family, m3 family, m4 family

CPU: c3 family, c4 family

Memory: m2 family, r3 family

Disk/IO: d2 family, i2 family

Workloads: Machine Learning, Batch Processing, In-memory (Spark & Presto), Large HDFS

Or add EBS volumes if you need additional on-cluster storage.

Page 8: Big data with amazon EMR - Pop-up Loft Tel Aviv

Hadoop applications available in EMR

Or, use bootstrap actions to install arbitrary applications on your cluster!

Page 9: Big data with amazon EMR - Pop-up Loft Tel Aviv

Choose your software - Quick Create

Page 10: Big data with amazon EMR - Pop-up Loft Tel Aviv

Choose your software – Advanced Options

Page 11: Big data with amazon EMR - Pop-up Loft Tel Aviv

Configuration API for custom configs

[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]
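The configuration list on this slide can also be generated from code rather than written by hand. The following is a minimal Python sketch under stated assumptions: the helper names `classification` and `to_json` are hypothetical, and the resulting JSON is assumed to be saved to a file and supplied at cluster creation (for example through the CLI's `--configurations` option).

```python
import json


def classification(name, properties):
    """Return one configuration object in the shape shown on the slide."""
    # The slide shows every property value as a string, so coerce them here.
    return {
        "Classification": name,
        "Properties": {key: str(value) for key, value in properties.items()},
    }


# Same classifications and properties as on the slide.
configurations = [
    classification("core-site", {"hadoop.security.groups.cache.secs": 250}),
    classification("mapred-site", {
        "mapred.tasktracker.map.tasks.maximum": 2,
        "mapreduce.map.sort.spill.percent": 90,
        "mapreduce.tasktracker.reduce.tasks.maximum": 5,
    }),
]


def to_json(configs):
    """Serialize the configuration list as indented JSON text."""
    return json.dumps(configs, indent=2)


if __name__ == "__main__":
    print(to_json(configurations))
```

Building the list in code keeps the property values uniform (all strings) and makes it easy to vary a setting per environment.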

Page 12: Big data with amazon EMR - Pop-up Loft Tel Aviv

Use the AWS CLI to easily create clusters:

aws emr create-cluster \
  --release-label emr-4.3.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

Or use your favorite SDK for programmatic provisioning:
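One way this might look with the AWS SDK for Python (boto3) is sketched below. It mirrors the CLI example (release `emr-4.3.0`, one master and two core `m3.xlarge` nodes). The cluster name, the region, and the role names `EMR_DefaultRole` and `EMR_EC2_DefaultRole` are assumptions, and the helper functions are hypothetical; only `run_job_flow` is an SDK call.

```python
def build_cluster_request(name="example-cluster", release_label="emr-4.3.0"):
    """Build keyword arguments for the EMR RunJobFlow API call."""
    return {
        "Name": name,  # assumed name, not from the slides
        "ReleaseLabel": release_label,
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m3.xlarge",
                 "InstanceCount": 2},
            ],
            # Keep the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed role names; substitute the roles defined in your account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Send the request to EMR and return the new cluster (job flow) ID."""
    import boto3  # imported lazily so the request can be built without the SDK

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**request)
    return response["JobFlowId"]
```

Calling `launch_cluster(build_cluster_request())` would start a real, billable cluster, so the request is built separately to allow inspection first.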

Page 13: Big data with amazon EMR - Pop-up Loft Tel Aviv

Use Amazon EMR to separate your compute and storage.

Page 14: Big data with amazon EMR - Pop-up Loft Tel Aviv

On premises: compute and storage grow together

Tightly coupled

Storage grows along with compute

Compute requirements vary

Page 15: Big data with amazon EMR - Pop-up Loft Tel Aviv

On premises: Underutilized or scarce resources

[Chart labels: Steady state, Weekly peaks, Re-processing]

Page 16: Big data with amazon EMR - Pop-up Loft Tel Aviv

On premises: Contention for same resources

Compute bound

Memory bound

Page 17: Big data with amazon EMR - Pop-up Loft Tel Aviv

Separation of resources creates data silos

Team A

Page 18: Big data with amazon EMR - Pop-up Loft Tel Aviv

On premises: Replication adds to cost

3x

HDFS needs 3x

Multi-Data Center DR

Page 19: Big data with amazon EMR - Pop-up Loft Tel Aviv

Use Amazon EMR to separate your compute and storage.

Page 20: Big data with amazon EMR - Pop-up Loft Tel Aviv

EMR can process data from many sources

• Hadoop Distributed File System (HDFS)

• Amazon S3 (EMRFS)

• Amazon DynamoDB, Redshift, Aurora, RDS

• Amazon Kinesis

• Other applications running in your architecture (Kafka, Elasticsearch, etc.)

Page 21: Big data with amazon EMR - Pop-up Loft Tel Aviv

Amazon S3 is your persistent data store

11 9’s of durability

$0.03 / GB / Month in US-East

Life Cycle Policies

Available across AZs

Easy access

Amazon S3

Page 22: Big data with amazon EMR - Pop-up Loft Tel Aviv

The EMR Filesystem (EMRFS)

• Allows you to leverage S3 as a file-system for Hadoop

• Streams data directly from S3

• Cluster still uses local disk/HDFS for intermediates

• Better read/write performance and error handling than open source components

• Optional consistent view for consistent listing

• Support for encryption

• Fast listing of objects

Page 23: Big data with amazon EMR - Pop-up Loft Tel Aviv

Going from HDFS to S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
-- WITH SERDEPROPERTIES (...) clause not shown on the slide
LOCATION 'samples/pig-apache/input/'

Page 24: Big data with amazon EMR - Pop-up Loft Tel Aviv

Going from HDFS to S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
-- WITH SERDEPROPERTIES (...) clause not shown on the slide
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'

Page 25: Big data with amazon EMR - Pop-up Loft Tel Aviv

Benefit 1: Switch off clusters

Amazon S3

Page 26: Big data with amazon EMR - Pop-up Loft Tel Aviv

Auto-terminate clusters after job completion

Page 27: Big data with amazon EMR - Pop-up Loft Tel Aviv

You can build a pipeline

Submit jobs using:

- EMR Step API

- Oozie

- SSH directly

- Genie (Gateway)

- OSS workflow tools (e.g., Luigi)
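As an illustration of the first option, a step can be described as a small data structure and submitted through the SDK. This is a sketch with boto3, assuming a Spark job: the step name, the script location `s3://example-bucket/jobs/etl.py`, and the helper functions are hypothetical, while `add_job_flow_steps` and `command-runner.jar` are the SDK call and the EMR-provided jar used to run commands as steps.

```python
def build_spark_step(name, script_uri, action_on_failure="CONTINUE"):
    """Describe one EMR step that runs spark-submit on the cluster."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar executes the given command on the master node.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def submit_steps(cluster_id, steps, region="us-east-1"):
    """Submit the steps to a running cluster and return their step IDs."""
    import boto3  # imported lazily so steps can be built without the SDK

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
    return response["StepIds"]


# Hypothetical script location, for illustration only.
example_step = build_spark_step("nightly-etl", "s3://example-bucket/jobs/etl.py")
```

Steps run in order on the cluster, so a pipeline can be expressed as a list of such dictionaries.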

Page 28: Big data with amazon EMR - Pop-up Loft Tel Aviv

You can use AWS Data Pipeline

Input data → Use EMR to transform unstructured to structured data → Push to S3 → Ingest into Redshift

Page 29: Big data with amazon EMR - Pop-up Loft Tel Aviv

Run transient or long-running clusters

Page 30: Big data with amazon EMR - Pop-up Loft Tel Aviv

Benefit 2: Resize your cluster to match workload requirements

Page 31: Big data with amazon EMR - Pop-up Loft Tel Aviv

Resize using the Console, CLI, or API
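For the API route, a resize amounts to setting a new instance count on an existing instance group. The sketch below uses boto3's `modify_instance_groups`; the cluster ID and instance group ID are placeholders, and the helper functions are hypothetical.

```python
def build_resize_request(cluster_id, instance_group_id, new_count):
    """Build keyword arguments that set an instance group to new_count nodes."""
    if new_count < 0:
        raise ValueError("instance count cannot be negative")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count},
        ],
    }


def resize(request, region="us-east-1"):
    """Apply the resize request to a running cluster."""
    import boto3  # imported lazily so the request can be built without the SDK

    emr = boto3.client("emr", region_name=region)
    emr.modify_instance_groups(**request)


# Placeholder identifiers, for illustration only.
example_request = build_resize_request("j-EXAMPLE", "ig-EXAMPLE", 20)
```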

Page 32: Big data with amazon EMR - Pop-up Loft Tel Aviv

Save costs with EC2 Spot instances

Bid Price

On-Demand (OD) Price

Page 33: Big data with amazon EMR - Pop-up Loft Tel Aviv

Spot integration

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3

Page 34: Big data with amazon EMR - Pop-up Loft Tel Aviv

The Spot Bid Advisor

Page 35: Big data with amazon EMR - Pop-up Loft Tel Aviv

Spot Integration with EMR

• Can provision instances from the Spot Market

• Replaces a Spot Instance in case of interruption

• Impact of interruption

  • Master Node – Can lose the cluster

  • Core Node – Can lose data stored in HDFS

  • Task Nodes – lose the task (but the task will run elsewhere)

Page 36: Big data with amazon EMR - Pop-up Loft Tel Aviv

Scale up with Spot Instances

10 node cluster running for 14 hours

Cost = 1.0 * 10 * 14 = $140

Page 37: Big data with amazon EMR - Pop-up Loft Tel Aviv

Resize Nodes with Spot Instances

Add 10 more nodes on Spot

Page 38: Big data with amazon EMR - Pop-up Loft Tel Aviv

Resize Nodes with Spot Instances

20 node cluster running for 7 hours

On-Demand cost = 1.0 * 10 * 7 = $70

Spot cost = 0.5 * 10 * 7 = $35

Total = $105

Page 39: Big data with amazon EMR - Pop-up Loft Tel Aviv

Resize Nodes with Spot Instances

50% less run time (14 → 7 hours)

25% less cost ($140 → $105)
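The arithmetic on the preceding slides can be checked with a few lines of Python. This sketch takes the slide's figures as given ($1.00 per node-hour On-Demand, $0.50 per node-hour on Spot) and assumes, as the slides do, that doubling the node count halves the run time; the function and variable names are illustrative.

```python
def cluster_cost(nodes, hours, hourly_rate):
    """Cost of running `nodes` instances for `hours` at `hourly_rate` dollars."""
    return nodes * hours * hourly_rate


ON_DEMAND_RATE = 1.0  # dollars per node-hour, from the slide
SPOT_RATE = 0.5       # dollars per node-hour, from the slide

# Baseline: 10 On-Demand nodes for 14 hours.
baseline = cluster_cost(10, 14, ON_DEMAND_RATE)

# Resized: 10 On-Demand nodes plus 10 Spot nodes, all for 7 hours.
resized = cluster_cost(10, 7, ON_DEMAND_RATE) + cluster_cost(10, 7, SPOT_RATE)

cost_saving = (baseline - resized) / baseline  # fraction of cost saved
runtime_saving = (14 - 7) / 14                 # fraction of run time saved

if __name__ == "__main__":
    print(f"baseline=${baseline:.0f} resized=${resized:.0f}")
    print(f"cost saving={cost_saving:.0%} run-time saving={runtime_saving:.0%}")
```

Real jobs rarely scale perfectly and Spot prices vary, so the saving in practice depends on both.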

Page 40: Big data with amazon EMR - Pop-up Loft Tel Aviv

Intelligent scale down

Page 41: Big data with amazon EMR - Pop-up Loft Tel Aviv

Intelligent scale down – HDFS

Page 42: Big data with amazon EMR - Pop-up Loft Tel Aviv

Effectively utilize clusters


Page 43: Big data with amazon EMR - Pop-up Loft Tel Aviv

Benefit 3: Logical separation of jobs

Prod: Hive, Pig, Cascading

Ad-Hoc: Presto

Amazon S3

Page 44: Big data with amazon EMR - Pop-up Loft Tel Aviv

Benefit 4: Disaster recovery built-in

Cluster 1 Cluster 2

Cluster 3 Cluster 4

Amazon S3

Availability Zone Availability Zone

Hive Metastore in Amazon RDS

Page 45: Big data with amazon EMR - Pop-up Loft Tel Aviv

S3 as a data-lake

Nate Summons, Principal Architect - NASDAQ

Page 46: Big data with amazon EMR - Pop-up Loft Tel Aviv

Monitoring with CloudWatch (or Ganglia)

Page 47: Big data with amazon EMR - Pop-up Loft Tel Aviv

EMR logging to S3 makes logs easily available

Page 48: Big data with amazon EMR - Pop-up Loft Tel Aviv

EMR Security Overview

Page 49: Big data with amazon EMR - Pop-up Loft Tel Aviv

Security Fundamentals

• Identity and Access Management (IAM) policies

• Bucket policies

• Access Control Lists (ACLs)

• Query string authentication

Encryption

• SSL endpoints

• Server Side Encryption (SSE-S3)

• Server Side Encryption with KMS provided keys (coming soon)

• Client-side Encryption

Compliance

• Bucket access logs

• Lifecycle Management Policies

• Access Control Lists (ACLs)

• Versioning & MFA deletes

Page 50: Big data with amazon EMR - Pop-up Loft Tel Aviv

Networking: VPC private subnets

• Use Amazon S3 Endpoints for connectivity to S3

• Use Managed NAT for connectivity to other services or the Internet

• Control the traffic using Security Groups

  • ElasticMapReduce-Master-Private

  • ElasticMapReduce-Slave-Private

  • ElasticMapReduce-ServiceAccess

Page 51: Big data with amazon EMR - Pop-up Loft Tel Aviv

Access Control: IAM Users and Roles

• IAM policies for access to the Amazon EMR service (IAM users or federated users)

  • AmazonElasticMapReduceFullAccess

  • AmazonElasticMapReduceReadOnlyAccess

• IAM policies for the Amazon EMR cluster

  • Service role (AmazonElasticMapReduceRole): allowable actions for the Amazon EMR service, like creating EC2 instances.

  • Instance profile (AmazonElasticMapReduceforEC2Role): permissions for applications that run on Amazon EMR, like access to Amazon S3 for EMRFS on your cluster.

Page 52: Big data with amazon EMR - Pop-up Loft Tel Aviv

Data at Rest: S3 client-side encryption

Amazon S3

Amazon S3 encryption clients

EMRFS enabled for Amazon S3 client-side encryption

Key vendor (AWS KMS or your custom key vendor)

(client-side encrypted objects)

Page 53: Big data with amazon EMR - Pop-up Loft Tel Aviv

Data at Rest: HDFS Transparent Encryption

• HDFS encryption zones - encryption zone key

• Each file - unique data encryption key (DEK), which is encrypted (EDEK)

• End-to-end (at-rest and in-transit) when data is written to an encryption zone

• Uses Hadoop KMS with the Java Cryptography Extension KeyStore (JCEKS)

• Easy to configure in EMR config object

Available with Amazon EMR 4.1.0+ release

Page 54: Big data with amazon EMR - Pop-up Loft Tel Aviv

Data in Transit – Depends!

• Hadoop

  • HDFS Data Transfer Protocol (dfs.encrypt.data.transfer)

  • Hadoop RPC (hadoop.rpc.protection)

• MapReduce

  • SSL for encrypted shuffle

• Spark

  • SSL for Akka and HTTP (for broadcast and file server)

  • SASL for the block transfer service
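The two Hadoop properties named on this slide could be set through the same configuration-object format used elsewhere in the deck. The sketch below is an assumption-laden illustration: the function name is hypothetical, the values `privacy` and `true` are the standard Hadoop settings that turn encryption on, and placing them in the `core-site` and `hdfs-site` classifications follows where Hadoop normally keeps these properties. A working setup typically needs more than these two settings, and the MapReduce and Spark options are not covered here.

```python
import json


def in_transit_configurations():
    """Configuration objects enabling Hadoop RPC and HDFS transfer encryption."""
    return [
        {
            "Classification": "core-site",
            # "privacy" asks for authentication, integrity, and encryption.
            "Properties": {"hadoop.rpc.protection": "privacy"},
        },
        {
            "Classification": "hdfs-site",
            # Encrypt block data sent over the HDFS data transfer protocol.
            "Properties": {"dfs.encrypt.data.transfer": "true"},
        },
    ]


if __name__ == "__main__":
    print(json.dumps(in_transit_configurations(), indent=2))
```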

Page 55: Big data with amazon EMR - Pop-up Loft Tel Aviv

Customer Stories

Page 56: Big data with amazon EMR - Pop-up Loft Tel Aviv

AOL’s Spot Use Case: restate 6 months of historical data

10 Availability Zones

550 EMR Clusters

24,000 Spot EC2 Instances

[Chart: Timing Comparison, In-House vs. AWS]

Page 57: Big data with amazon EMR - Pop-up Loft Tel Aviv

FINRA saves money with comparable performance with Hive on Tez with S3

Page 58: Big data with amazon EMR - Pop-up Loft Tel Aviv

Using EMR and cloud capacity for ETL

Page 59: Big data with amazon EMR - Pop-up Loft Tel Aviv

Bridging on-prem and EMR for easy ETL

Page 60: Big data with amazon EMR - Pop-up Loft Tel Aviv
Page 61: Big data with amazon EMR - Pop-up Loft Tel Aviv

Twitter (Answers) uses EMR as the batch layer in their Lambda architecture

Page 62: Big data with amazon EMR - Pop-up Loft Tel Aviv

Using EMR for batch, streaming, and ad hoc

SmartNews

Page 63: Big data with amazon EMR - Pop-up Loft Tel Aviv

Nasdaq: data lake architecture diagram

Optimizing data warehousing costs with S3 and EMR

Page 64: Big data with amazon EMR - Pop-up Loft Tel Aviv

Jonathan Fritz

Sr. Product Manager