TRANSCRIPT
Deep Dive – Amazon Elastic MapReduce
Ian Meyers, Solution Architect – Amazon Web Services
Guest Speakers: Ian McDonald & James Aley, SwiftKey ([email protected], [email protected])
Agenda
Amazon Elastic MapReduce (EMR)
Leverage Amazon Simple Storage Service (S3) with the Amazon EMR File System (EMRFS)
Design patterns and optimizations
SwiftKey
Amazon Elastic MapReduce
Why Amazon EMR?
Easy to use: launch a cluster in minutes
Low cost: pay an hourly rate
Elastic: easily add or remove capacity
Reliable: spend less time monitoring
Secure: manage firewalls
Flexible: control the cluster
Easy to deploy
AWS Management Console, Command Line Interface, or the Amazon EMR API with your favorite SDK.
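As a sketch of what a CLI launch looks like, the following builds a `create-cluster` command; the cluster name, key pair, log bucket, and AMI version are illustrative placeholders, not values from the talk. The command is printed rather than executed so the sketch runs without AWS credentials.

```shell
# Hypothetical example: launch a small Amazon EMR cluster from the AWS CLI.
# All names (DemoCluster, MyKeyPair, my-bucket) are placeholders.
CMD="aws emr create-cluster \
  --name DemoCluster \
  --ami-version 3.8.0 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=MyKeyPair \
  --log-uri s3://my-bucket/emr-logs/"

# Print the command instead of running it; use eval "$CMD" to launch for real.
echo "$CMD"
```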
Easy to monitor and debug
Integrated with Amazon CloudWatch: monitor cluster, node, I/O, and Hadoop 1 & 2 processes.
Amazon S3 and HDFS Browser
Query Editor
Job Browser
Choose your instance types
General purpose: m1, m3 families
CPU intensive: c3 family, cc2.8xlarge
Memory intensive: m2, r3 families
Disk/IO intensive: d2, i2 families
Match the family to the workload (ETL, ML, Spark, HDFS), and try different configurations to find the optimal cost/performance balance.
Resizable clusters
Easy to add and remove compute capacity on your cluster. Match compute demands with cluster sizing.
Easy to use Spot Instances
Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing
On-Demand Instances for core nodes: standard Amazon EC2 on-demand pricing
Meet your SLA at predictable cost, or exceed your SLA at lower cost
Use bootstrap actions to install applications…
https://github.com/awslabs/emr-bootstrap-actions
…or to configure Hadoop
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--keyword-config-file (merge values in the new config into the existing one)
--keyword-key-value (override the values provided)
Configuration File Name   Keyword   File Name Shortcut   Key-Value Pair Shortcut
core-site.xml             core      C                    c
hdfs-site.xml             hdfs      H                    h
mapred-site.xml           mapred    M                    m
yarn-site.xml             yarn      Y                    y
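As a hedged sketch, the key-value shortcuts might be used like this (-m writes to mapred-site.xml, -y to yarn-site.xml); the property values here are illustrative, not recommendations from the talk:

```shell
# Hypothetical example: configure Hadoop at bootstrap using the shortcut
# letters from the table above. Values are placeholders for illustration.
ARGS="--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args -m,mapred.tasktracker.map.tasks.maximum=4,-y,yarn.nodemanager.resource.memory-mb=11520"

# Print the flags you would append to an aws emr create-cluster call.
echo "$ARGS"
```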
Read data directly into Hive, Apache Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams
No intermediate data persistence required
Simple way to introduce real-time sources into batch-oriented systems
Multi-application support and automatic checkpointing
Amazon EMR Integration with Amazon Kinesis
Leverage Amazon S3
Amazon S3 as your persistent data store
Amazon S3: designed for 99.999999999% durability. Separate compute and storage.
Resize and shut down Amazon EMR clusters with no data loss. Point multiple Amazon EMR clusters at the same data in Amazon S3.
EMRFS makes it easier to leverage Amazon S3
Better performance and error handling options
Transparent to applications: just read/write to "s3://"
Consistent view: consistent list and read-after-write for new puts
Support for Amazon S3 server-side and client-side encryption
Faster listing of large prefixes via EMRFS metadata
EMRFS support for Amazon S3 client-side encryption
[Diagram: EMRFS enabled for Amazon S3 client-side encryption. Amazon S3 encryption clients sit between the cluster and Amazon S3, which holds the client-side encrypted objects; keys come from a key vendor (AWS KMS or your custom key vendor).]
EMRFS metadata in Amazon DynamoDB
List and read-after-write consistency
Faster list operations
Number of objects   Without Consistent View   With Consistent View
1,000,000           147.72                    29.70
100,000             12.70                     3.69
Fast listing of Amazon S3 objects using EMRFS metadata
*Tested using a single-node cluster with an m3.xlarge instance.
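As a sketch, consistent view can be switched on when the cluster is created; the retry values below are illustrative, not settings recommended in the talk:

```shell
# Hypothetical example: enable EMRFS consistent view at cluster creation.
# Instance type/count and retry settings are placeholders.
CMD="aws emr create-cluster \
  --name ConsistentViewDemo \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --emrfs Consistent=true,RetryCount=5,RetryPeriod=30"

# Print the command instead of running it.
echo "$CMD"
```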
Optimize to leverage HDFS
Iterative workloads: if you're processing the same dataset more than once
Disk I/O intensive workloads
Persist data on Amazon S3 and use S3DistCp to copy to HDFS for processing.
Amazon EMR – Design Patterns
Amazon EMR example #1: Batch processing
GBs of logs pushed to Amazon S3 hourly
Daily Amazon EMR cluster using Hive to process data
Input and output stored in Amazon S3
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
Amazon EMR example #2: HBase Server Cluster
Data pushed to Amazon S3
Daily Amazon EMR cluster to Extract, Transform, and Load (ETL) data into the database
24/7 Amazon EMR cluster running HBase holds the last 2 years' worth of data
Front-end service uses the HBase cluster to power a dashboard with high concurrency
Amazon EMR example #3: Interactive query
TBs of logs sent daily
Logs stored in Amazon S3
Amazon EMR cluster using Presto for ad hoc analysis of the entire log set
Interactive query using Presto on a multi-petabyte warehouse
http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
Optimizations for Storage
File formats
Row oriented: text files, sequence files (Writable objects), Avro data files (described by a schema)
Columnar: Optimized Row Columnar (ORC), Parquet
[Diagram: a logical table laid out row oriented vs. column oriented]
Choosing the right file format
Processing and query tools: Hive, Pig, Impala, Presto, Spark
Evolution of schema: use Avro for schema evolution and ORC/Parquet for storage
File format "splittability": avoid JSON/XML records that span newlines, because the default split delimiter is \n
Compression: block, file, or internal
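To see why records with embedded newlines break splitting, this small local demo (no EMR needed) counts what a line-oriented reader would treat as records:

```shell
# One logical JSON record containing a literal newline looks like TWO records
# to any line-oriented input format, since the default delimiter is \n.
printf '{"msg": "line one\nline two"}\n' > record.json
echo "Records a line-oriented splitter sees: $(wc -l < record.json | tr -d ' ')"
# prints: Records a line-oriented splitter sees: 2
rm -f record.json
```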
File sizes
Avoid small files: avoid anything smaller than 100 MB
Each process is a single Java Virtual Machine (JVM), and CPU time is required to spawn JVMs
Fewer files, matching closely to the block size, mean fewer calls to Amazon S3 and fewer network/HDFS requests
Dealing with small files
You *can* reduce the HDFS block size (e.g., to 1 MB; the default is 128 MB):
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
Instead use S3DistCp to combine small files together
S3DistCp takes a pattern and target path to combine smaller input files into larger ones Supply a target size and compression codec
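A hedged sketch of that consolidation step; the bucket paths and the --groupBy regex are placeholders, and --targetSize is in MB:

```shell
# Hypothetical example: combine many small log files into ~128 MB gzip files
# with S3DistCp. Paths, regex, and jar location are placeholders.
CMD="hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --src s3://my-bucket/small-files/ \
  --dest s3://my-bucket/combined/ \
  --groupBy '.*/(\w+)/.*\.log' \
  --targetSize 128 \
  --outputCodec gz"

# Print the command instead of running it on a cluster.
echo "$CMD"
```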
Compression
Always compress data files on Amazon S3: this reduces network traffic between Amazon S3 and Amazon EMR and speeds up your job
Compress mapper and reducer output: Amazon EMR compresses internode traffic with LZO on Hadoop 1 and Snappy on Hadoop 2
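As a sketch, mapper-output compression can be set through the same configure-hadoop bootstrap action shown earlier (-m writes to mapred-site.xml); the choice of Snappy here is illustrative:

```shell
# Hypothetical example: enable Snappy compression for intermediate (mapper)
# output via configure-hadoop. Property names are the Hadoop 2 names.
ARGS="--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args -m,mapreduce.map.output.compress=true,\
-m,mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec"

# Print the flags you would append to an aws emr create-cluster call.
echo "$ARGS"
```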
Choosing the right compression
Time sensitive: faster compression is a better choice (Snappy). Large amounts of data: use space-efficient compression (Gzip). Combined workload: use LZO.
Algorithm        Splittable?        Compression Ratio   Compress/Decompress Speed
Gzip (DEFLATE)   No                 High                Medium
bzip2            Yes                Very high           Slow
LZO              Yes (if indexed)   Low                 Fast
Snappy           No                 Low                 Very fast
Cost-saving tips
Use Amazon S3 as your persistent data store (only pay for compute when you need it!)
Use Amazon EC2 Spot Instances (especially with task nodes) to save 80 percent or more on the Amazon EC2 cost
Use Amazon EC2 Reserved Instances if you have steady workloads
Create CloudWatch alerts to notify you if a cluster is underutilized so that you can shut it down (e.g., Mappers running == 0 for more than N hours)
Contact your account manager about custom pricing options, if you are spending more than $10K per month on Amazon EMR.
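A sketch of the underutilization alert, assuming the Hadoop 1 metric name RunningMapTasks in the AWS/ElasticMapReduce namespace; the cluster ID, SNS topic ARN, and thresholds are placeholders:

```shell
# Hypothetical example: alarm when a cluster has had zero running mappers
# for three consecutive hours. All identifiers are placeholders.
CMD="aws cloudwatch put-metric-alarm \
  --alarm-name emr-idle-cluster \
  --namespace AWS/ElasticMapReduce \
  --metric-name RunningMapTasks \
  --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXX \
  --statistic Maximum --period 3600 --evaluation-periods 3 \
  --threshold 0 --comparison-operator LessThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:notify-me"

# Print the command instead of running it.
echo "$CMD"
```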
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Next Generation Analytics Ian McDonald, Director of IT
SwiftKey Twitter: @imcdnzl
What is SwiftKey?
Architecture
Data Capture Architecture
ETL Architecture
Cascalog
• Cascalog is an open source Clojure library implemented using Cascading
• Used instead of tools like Hive or Pig
• Write a few lines of Clojure and end up with an EMR job
Parquet (Apache project)
• Developed by Cloudera and Twitter
• Efficient compression and encoding on Hadoop / EMR
• Used for storing and processing our data
Things we’ve learnt
Lessons
• Get on top of serialisation: don't just stick with JSON / Gzip
• Many small files in S3 are painful: rebuild into fewer, bigger files
• Use Spot Instances for EMR (except the master node)
• Experiment with different instance types to find the best speed / cost effectiveness
What’s Next
Apache Spark
• Easier / faster than Hadoop or database queries
• Processes in RAM
• Works directly against S3 data
• Available on EMR
• Not necessarily great for big joins
LONDON