(BDT305) Amazon EMR Deep Dive and Best Practices


TRANSCRIPT

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Rahul Pathak, AWS

Scott Donaldson, FINRA

Clayton Kovar, FINRA

October 2015

Amazon EMR Deep Dive

& Best Practices

BDT305

What to expect from the session

• Update on the latest Amazon EMR release

• Information on advanced capabilities of Amazon EMR

• Tips for lowering your Amazon EMR costs

• Deep dive into how FINRA uses Amazon EMR and Amazon S3 as their multi-petabyte data warehouse

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Rahul Pathak, Sr. Mgr. Amazon EMR

(@rahulpathak)

October 2015

Amazon EMR Deep Dive & Best Practices

Amazon EMR

• Managed clusters for Hadoop, Spark, Presto, or any other applications in the Apache/Hadoop stack

• Integrated with the AWS platform via EMRFS – connectors for Amazon S3, Amazon DynamoDB, Amazon Kinesis, Amazon Redshift, and AWS KMS

• Secure, with support for AWS IAM roles, AWS KMS, S3 client-side encryption, Hadoop transparent encryption, and Amazon VPC; HIPAA-eligible

• Built in support for resizing clusters and integrated with the Amazon EC2 spot market to help lower costs

New Features

EMR Release 4.1

• Hadoop KMS with transparent HDFS encryption support

• Spark 1.5, Zeppelin 0.6

• Presto 0.119, Airpal

• Hive, Oozie, Hue 3.7.1

• Simple APIs for launch and configuration
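For reference, a minimal AWS CLI sketch of launching a release 4.1 cluster with a few applications; the cluster name, key pair, and instance settings are placeholders, not values from the talk:

aws emr create-cluster \
  --name "bdt305-demo" \
  --release-label emr-4.1.0 \
  --applications Name=Hadoop Name=Hive Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles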

Intelligent Resize

• Incrementally scale up based on available capacity

• Wait for work to complete before resizing down

• Can scale core nodes and HDFS as well as task nodes
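A hedged sketch of the resize call behind this behavior: changing the instance count of an instance group with the AWS CLI. The instance group ID and target count are placeholders; EMR adds nodes as capacity becomes available and waits for work to complete before shrinking.

aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=10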

Leverage Amazon S3 with

EMR File System (EMRFS)

Amazon S3 as your persistent data store

• Separate compute and storage

• Resize and shut down Amazon EMR clusters with no data loss

• Point multiple Amazon EMR clusters at the same data in Amazon S3

• Easily evolve your analytic infrastructure as technology evolves

[Diagram: multiple EMR clusters reading the same data in Amazon S3]

EMRFS makes it easier to use Amazon S3

• Read-after-write consistency

• Very fast list operations

• Error handling options

• Support for Amazon S3 encryption

• Transparent to applications: s3://

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "…"
)
LOCATION 'samples/pig-apache/input/'

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "…"
)
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'

Consistent view and fast listing using the optional EMRFS metadata layer

[Diagram: EMRFS metadata stored in Amazon DynamoDB alongside data in Amazon S3]

• List and read-after-write consistency
• Faster list operations

Number of objects    Without consistent view    With consistent view
1,000,000            147.72                     29.70
100,000              12.70                      3.69

*Tested using a single-node cluster with an m3.xlarge instance.
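One way to turn the consistent view on at launch, sketched with the AWS CLI; the retry settings shown are illustrative and the cluster details are placeholders:

aws emr create-cluster \
  --name "emrfs-consistent-view" \
  --release-label emr-4.1.0 \
  --applications Name=Hive \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles \
  --emrfs Consistent=true,RetryCount=5,RetryPeriod=30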

EMRFS client-side encryption

[Diagram: EMRFS enabled for Amazon S3 client-side encryption. Amazon S3 encryption clients talk to a key vendor (AWS KMS or your custom key vendor), and Amazon S3 stores the client-side encrypted objects.]
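A hedged sketch of enabling EMRFS client-side encryption with an AWS KMS-managed key at cluster launch, assuming the --emrfs encryption shorthand in the AWS CLI of this era; the key ID and cluster details are placeholders:

aws emr create-cluster \
  --name "emrfs-cse-kms" \
  --release-label emr-4.1.0 \
  --applications Name=Hive \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles \
  --emrfs Encryption=ClientSide,ProviderType=KMS,KMSKeyId=<your-kms-key-id>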

HDFS is still there if you need it

• Iterative workloads
  • If you're processing the same dataset more than once
  • Consider using Spark & RDDs for this too
• Disk I/O intensive workloads
• Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing (see the sketch below)
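A minimal sketch of that copy step, assuming placeholder bucket and paths; s3-dist-cp runs on the cluster (as an EMR step or from the master node) and stages the S3 data into HDFS before the iterative job:

s3-dist-cp \
  --src s3://my-bucket/input/ \
  --dest hdfs:///staging/input/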

Optimizations for storage

File formats

Row oriented
• Text files
• Sequence files
• Writable objects
• Avro data files (described by a schema)

Columnar formats
• Optimized Row Columnar (ORC)
• Parquet

[Diagram: a logical table stored in row-oriented vs. column-oriented layouts]

Factors to consider

• Processing and query tools: Hive, Impala, and Presto
• Evolution of schema: Avro for the schema and Parquet for storage
• File format "splittability": avoid JSON/XML files; they are not splittable, so each file must be handled as a single record
• Encryption requirements

File sizes

Avoid small files

• Anything smaller than 100MB

Each mapper is a single JVM

• CPU time is required to spawn JVMs/mappers

Fewer files, sized close to the block size, mean
• fewer calls to S3
• fewer network/HDFS requests

Dealing with small files

Reduce the HDFS block size, e.g. to 1 MB (default is 128 MB)
• --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"

Better: use S3DistCp to combine smaller files (see the sketch below)
• S3DistCp takes a pattern and a target path, and combines smaller input files into larger ones
• Supply a target size and a compression codec
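A hedged example of that compaction, with placeholder bucket, prefixes, and grouping regex; files whose names match the capture group are concatenated into roughly 128 MiB gzip-compressed outputs:

s3-dist-cp \
  --src s3://my-bucket/small-files/ \
  --dest s3://my-bucket/compacted/ \
  --groupBy '.*(part-).*' \
  --targetSize 128 \
  --outputCodec gz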

Compression

Always compress data files on Amazon S3
• Reduces network traffic between Amazon S3 and Amazon EMR
• Speeds up your job

Compress mapper and reducer output (see the settings sketch after the table below)

Amazon EMR compresses inter-node traffic with LZO on Hadoop 1 and Snappy on Hadoop 2

Choosing the right compression
• Time sensitive? Faster codecs are a better choice
• Large amounts of data? Use space-efficient codecs
• Combined workload? Use gzip

Algorithm        Splittable?   Compression ratio   Compress + decompress speed
gzip (DEFLATE)   No            High                Medium
bzip2            Yes           Very high           Slow
LZO              Yes           Low                 Fast
Snappy           No            Low                 Very fast
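A sketch of the Hive session settings that enable intermediate and final output compression with Snappy; these are stock Hive/Hadoop 2 property names, so verify them against your EMR release:

SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;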

Cost saving tips for Amazon EMR

Use S3 as your persistent data store; query it using Presto, Hive, Spark, etc.

Only pay for compute when you need it

Use Amazon EC2 Spot Instances to save >80%

Use Amazon EC2 Reserved Instances for steady workloads

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Scott Donaldson, Senior Director

Clayton Kovar, Principal Architect

EMR & Interactive Analytics

EMR is ubiquitous in our architecture

[Architecture diagram: EMR clusters appear throughout the pipeline: normalization ETL clusters (EMR), optimization ETL clusters (EMR), batch analytic clusters (EMR), interactive query clusters (EMR), and an ad hoc query cluster (EMR). Data providers feed auto-scaled EC2 data ingestion services, which land data in the source of truth (S3) and query-optimized storage (S3). Shared data services include a shared metastore (RDS), reference data (RDS), data catalog & lineage services, and cluster management & workflow services on auto-scaled EC2. Auto-scaled EC2 analytics applications and data marts (Amazon Redshift) serve users.]

It starts with the data

S3 is your durable system of record
• Separate your compute and storage
• Shut down your cluster when not in use
• Share data among multiple clusters
• Fault tolerance and disaster recovery
• Use EMRFS for consistent view

Partition your data for performance
• Optimize for your query use cases and access patterns
• Larger files (>256 MB) are more efficient
• Compact small files into files >100 MB

FINRA Data Manager orchestrates data between storage and compute clusters
• Unified catalog
• Manage EMR clusters
• Track usage & lineage
• Job orchestration

File formats & compression

Text format for archival copies on S3 & Amazon Glacier
• Select the compression algorithm for the best fit; we wanted high compression for the archive copy

Select a row or columnar format for performance
• Row oriented: Sequence or Avro
• Columnar: ORC, Parquet, RCFile

Columnar benefits:
• Predicate pushdown
• Skip unwanted columns
• Serve multiple query engines: Hive, Presto, Spark

Avoid bloated formats with repetitive markup (e.g. XML)

Our partition and query strategy

[Diagram: data arrives partitioned by processing date into symbol groups 1-100, while users query by symbol & firm, by symbol only, or by firm only. On-time data (processing date = event date) accounts for 99.97% of all records; the small late-data partition is scanned for every query.]

Example hive table creation

create external table if not exists NEW_ORDERS (…)
partitioned by (EVENT_DT DATE, HASH_PRTN_NB SMALLINT)
stored as orc
location 's3://reinvent/new_orders/'
tblproperties ("orc.compress"="SNAPPY");

alter table NEW_ORDERS add if not exists
partition (event_dt='2015-10-08', hash_prtn_nb=0)
  location 's3://reinvent/new_orders/event_dt=2015-10-08/hash_prtn_nb=0/'
…
partition (event_dt='2015-10-08', hash_prtn_nb=1000)
  location 's3://reinvent/new_orders/event_dt=2015-10-08/hash_prtn_nb=1000/'
;

Each record's hash partition number is calculated as
((pmod(hash(symbol), 100) * 10) + pmod(firm, 10))

This made Hive on EMR/S3 competitive.

Partitions are great, but beware…

select … from NEW_ORDERS
where EVENT_DT between '2015-10-06' and '2015-10-09'
  and FIRM = 12345
  and (pmod(HASH_PRTN_NB, 10) = pmod(12345, 10)
       or HASH_PRTN_NB = 1000)  -- 1000 is always read

Wrapping HASH_PRTN_NB in pmod() prevents Hive from issuing a targeted query against the metastore, so millions of partitions are returned for pruning.

Optimized query with enumeration

select … from NEW_ORDERS
where EVENT_DT >= '2015-10-06' and EVENT_DT <= '2015-10-09'
  and FIRM = 12345
  and (HASH_PRTN_NB = 5 or HASH_PRTN_NB = 15
       … or HASH_PRTN_NB = 985 or HASH_PRTN_NB = 995
       or HASH_PRTN_NB = 1000)  -- 1000 is always read

Using an IN clause was insufficient to avoid the pruning issue

Explicitly enumerating all partitions vastly improved query planning time

Data security

Required to encrypt all data, both at rest and in transit

S3 server-side encryption was evaluated and determined to be suitable for the purpose

Encrypt ephemeral storage on Master, Core, and Task nodes: use a custom bootstrap action that sets up LUKS with a random, memory-only key

Task nodes don't have HDFS, but mapper and reducer temporary files also need to be encrypted

Lose the server, lose the data – Remember S3 is our source of truth

Use security groups to ensure only the client applications connect to the Master node

Hive authentication/authorization was not necessary for our usage scenarios

Evaluating transparent encryption (Hadoop 2.6+) in HDFS

Selection of the fittest

HDFS was cost prohibitive for our use cases
• Need 30 D2.8XLs just to store two of our tables: ~$1.5M/yr on HDFS vs. ~$120K/yr on S3
• Need 90 D2.8XLs to store all queryable data: ~$4.5M/yr on HDFS vs. ~$360K/yr on S3

Data locality is desirable but not practical at our scale; EMR & S3 with partitioned data is a great fit
• Tuned queries & data structures on S3 take roughly 2x as long as they would on HDFS under perfect locality conditions
• Localize data into HDFS on core nodes using S3DistCp if making 3 or more passes

Consider tiered storage (see the sketch below)
• External tables in Hive can blend some partitions in HDFS with others in S3
• Introduces operational complexity for partition maintenance
• Doesn't play well with a shared metastore across multiple clusters
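To illustrate the tiered layout, a hedged sketch reusing the NEW_ORDERS table from the earlier slide: one hot partition pinned to HDFS while an older partition stays on S3 (both paths are illustrative, not FINRA's):

alter table NEW_ORDERS add if not exists
partition (event_dt='2015-10-08', hash_prtn_nb=0)
  location 'hdfs:///hot/new_orders/event_dt=2015-10-08/hash_prtn_nb=0/'
partition (event_dt='2015-09-08', hash_prtn_nb=0)
  location 's3://reinvent/new_orders/event_dt=2015-09-08/hash_prtn_nb=0/';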

Darwin rules: Adaptation

Take advantage of new instance types

Find the right instance type(s) for your workload

Prefer a smaller cluster of larger nodes: e.g. 4XL

With millions of partitions, more memory is needed for the Master node (HS2)

Use CLI-based scripts rather than the console: infrastructure is code

Node type      Before             After
Master         1 x R3.4XL         1 x R3.2XL
Core           40 x M3.2XL        10 x C3.4XL
Task (peak)    100 x M3.2XL       35 x C3.4XL

Beat the incumbent

Right size your cluster

Transient use cases: ETL and batch analytics
• Size the cluster to complete within ten minutes of an hour boundary to optimize $$
• Use Spot when you have a flexible SLA to save $$
• Use On-Demand or Reserved Instances to meet SLAs at a predictable cost

Always-on use case: interactive analytics
• Size Core based on HDFS needs (statistics, logging, etc.)
• Reserve Master and Core nodes
• Resize the number of Task nodes as demand changes
• Use Spot on Task nodes to save $$ (see the sketch below)
• Keep a Core-to-Task ratio of about 1:5 to avoid bottlenecks
• Consider bidding Spot above the On-Demand price to ensure greater stability
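A hedged sketch of adding Spot-priced task capacity to a running cluster with the AWS CLI; the cluster ID, instance type, count, and bid price are placeholders:

aws emr add-instance-groups \
  --cluster-id j-XXXXXXXXXXXXX \
  --instance-groups InstanceGroupType=TASK,InstanceType=c3.4xlarge,InstanceCount=35,BidPrice=0.50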

One metastore to rule them all

Consider creating a shared Hive metastore service
• Fault tolerance & DR with Multi-AZ RDS
• Offload metastore hydration of tables and partitions so transient clusters initialize faster
  • Millions of partitions/day can take >7 min/day per table
• Avoid duplicative effort by separate development teams
• Separate metastores are needed for Hive 0.13.1, Hive 1.0, and Presto; however, you can locate them all on a single RDS instance
• Utilize FINRA Data Management services to orchestrate metastore updates: register new tables and partitions as the data arrives via notifications (a configuration sketch follows)
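A hedged sketch of pointing a cluster at an external RDS-hosted metastore through the hive-site configuration classification at launch; the endpoint, database name, and credentials are placeholders, and the driver class should be checked against your EMR release:

aws emr create-cluster \
  --name "shared-metastore" \
  --release-label emr-4.1.0 \
  --applications Name=Hive \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles \
  --configurations '[{"Classification":"hive-site","Properties":{
      "javax.jdo.option.ConnectionURL":"jdbc:mysql://metastore.example.com:3306/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName":"org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName":"hive",
      "javax.jdo.option.ConnectionPassword":"<password>"}}]'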

Monitor, learn, and optimize

Utilize workload management: Fair Scheduler

Refactor your code as necessary to remove bottlenecks

Optimize transient clusters: size them so the workload finishes within 10 minutes of an hour boundary

Set hive.mapred.reduce.tasks.speculative.execution=false when writing to external tables in S3 via MapReduce

Use broadcast joins when joining small tables (SET hive.auto.convert.join=true).

The EMR Step API works for simple job queuing; use Oozie for more complex jobs

The impact

Removed obstacles

“Before, data analysis of this magnitude required intervention from the technology team.”

Lowered the cost of curiosity

“Analysts are able to quickly obtain a full picture of what happens to an order over time,

helping to inform decision making as to whether a rule violation has occurred.”

Elasticity allows us to process years of data in days as opposed to months, and to save money by using the Spot market

Separately optimize batch and interactive workloads without compromise

Increased teams' delivery velocity

Recap

Use Amazon S3 as your durable system of record

Use transient clusters as much as possible

Resize clusters and use the Spot market to manage capacity, performance, & cost more efficiently

Move to new instance families to take advantage of performance improvements

Monitor to determine when to resize or change instance types

Share a persistent Hive metastore in RDS among multiple EMR clusters

Be prepared to switch your query engine or execution framework in the future

Budget time to experiment with new tools & engines at a scale that wasn't possible before

Related sessions

BDT208 - A Technical Introduction to Amazon Elastic MapReduce
Thursday, Oct 8, 12:15 PM - 1:15 PM, Titian 2201B

BDT303 - Running Spark & Presto on the Netflix Big Data Platform
Thursday, Oct 8, 11:00 AM - 12:00 PM, Palazzo F

BDT309 - Best Practices for Apache Spark on Amazon EMR
Thursday, Oct 8, 5:30 PM - 6:30 PM, Palazzo F

BDT314 - Big Data/Analytics on Amazon EMR & Amazon Redshift
Thursday, Oct 8, 1:30 PM - 2:30 PM, Palazzo F

Remember to complete your evaluations!

Thank you!