Deep Dive: Amazon Elastic MapReduce
TRANSCRIPT
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Deep Dive: Amazon EMR
Rahul Pathak—Sr. Mgr. Amazon EMR (@rahulpathak)
Jason Timmes—AVP Software Development, Nasdaq
Why Amazon EMR?
Easy to use: launch a cluster in minutes
Low cost: pay an hourly rate
Elastic: easily add or remove capacity
Reliable: spend less time monitoring
Secure: managed firewalls
Flexible: you control the cluster
Easy to deploy
Use the AWS Management Console or the command line, or use the EMR API with your favorite SDK.
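For example, here is a minimal sketch of launching a cluster through the EMR API with the Python SDK (boto3); the cluster name, instance types, and counts are illustrative placeholders, and the roles are the EMR defaults:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small, long-running cluster with Hive installed.
response = emr.run_job_flow(
    Name="demo-cluster",                      # placeholder name
    ReleaseLabel="emr-4.0.0",                 # the 4.0 line discussed later in this talk
    Applications=[{"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # stay up after steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])                  # cluster ID, e.g. j-...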
Easy to monitor and debug
Integrated with Amazon CloudWatch: monitor the cluster, nodes, and I/O, and debug jobs from the console.
Try different configurations to find your optimal architecture
Choose your instance types to match your workload: batch processing, machine learning, Spark and interactive analysis, or large HDFS.
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge
• Memory: m2 family, r3 family
• Disk/IO: d2 family, i2 family
• General purpose: m1 family, m3 family
Resizable clusters
Easy to add and remove compute capacity on your cluster; match compute demands with cluster sizing.
Easy to use Spot Instances
• On-demand for core nodes: standard Amazon EC2 on-demand pricing; meet your SLA at predictable cost
• Spot for task nodes: up to 90% off EC2 on-demand pricing; exceed your SLA at lower cost
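A sketch of adding a Spot task group to a running cluster with boto3; the cluster ID, instance count, and bid price are illustrative:

import boto3

emr = boto3.client("emr")

# Add task nodes on the Spot market (values are placeholders).
emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",
    InstanceGroups=[{
        "Name": "spot-task-nodes",
        "InstanceRole": "TASK",       # task nodes hold no HDFS data,
        "InstanceType": "m3.xlarge",  # so losing Spot capacity is safe
        "InstanceCount": 10,
        "Market": "SPOT",
        "BidPrice": "0.08",           # max hourly bid in USD
    }],
)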
Amazon EMR integration with Amazon Kinesis
• Read data directly into Hive, Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams
• No intermediate data persistence required
• A simple way to introduce real-time sources into batch-oriented systems
• Multi-application support and automatic checkpointing
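As a sketch of the Hive side of this integration, EMR's Kinesis storage handler lets a Hive table read a stream directly; the column layout and stream name below are made up for illustration:

# Hive DDL (run in the Hive shell on EMR), held here as a Python string.
# The storage handler class is EMR's documented Kinesis connector; the
# table columns and stream name are illustrative.
KINESIS_TABLE_DDL = """
CREATE TABLE apache_log (host STRING, request STRING, status INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES ("kinesis.stream.name" = "AccessLogStream");
"""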
The Hadoop ecosystem can run in Amazon EMR
For example, Hue runs on the cluster over Amazon S3 and HDFS, providing a query editor and a job browser.
Leverage Amazon S3 with EMRFS
Amazon S3 as your persistent data store:
• Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at the same data in Amazon S3
EMRFS makes it easier to use Amazon S3
• Read-after-write consistency
• Very fast list operations
• Error handling options
• Support for Amazon S3 encryption
• Transparent to applications: s3://
EMRFS client-side encryption
(Diagram: EMRFS enabled for Amazon S3 client-side encryption; Amazon S3 encryption clients fetch keys from a key vendor, either AWS KMS or your custom key vendor, and store client-side encrypted objects in Amazon S3.)
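On EMR 4.x this can be switched on through the emrfs-site configuration classification; a sketch, where the KMS key ARN is a placeholder:

# Configuration passed to run_job_flow(..., Configurations=EMRFS_CSE).
# The fs.s3.cse.* properties are EMRFS client-side-encryption settings;
# the key ARN below is a placeholder for your own AWS KMS key.
EMRFS_CSE = [{
    "Classification": "emrfs-site",
    "Properties": {
        "fs.s3.cse.enabled": "true",
        "fs.s3.cse.encryptionMaterialsProvider":
            "com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider",
        "fs.s3.cse.kms.keyId": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE",
    },
}]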
HDFS is still there if you need it
• Iterative workloads
– If you’re processing the same dataset more than once
– Consider using Spark & RDDs for this too
• Disk I/O intensive workloads
• Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing
Amazon EMR—Design Patterns
EMR example #1: Batch processing
GBs of logs pushed to S3 hourly; a daily EMR cluster uses Hive to process the data; input and output are stored in S3.
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
EMR example #2: Long-running cluster
Data pushed to S3; a daily EMR cluster ETLs the data into a database. A 24/7 EMR cluster running HBase holds the last 2 years of data, and a front-end service uses the HBase cluster to power a dashboard with high concurrency.
EMR example #3: Interactive query
TBs of logs sent daily; logs stored in Amazon S3; Hive Metastore on Amazon EMR. Interactive query using Presto on a multi-petabyte warehouse.
http://nflx.it/1dO7Pnt
EMR example #4: Streaming data processing
TBs of logs sent daily; logs stored in Amazon Kinesis and processed by consumers such as the Amazon Kinesis Client Library, AWS Lambda, Amazon EMR, and applications on Amazon EC2.
Optimizations for storage
File formats
• Row oriented
– Text files
– Sequence files (writable objects)
– Avro data files (described by a schema)
• Columnar
– Optimized row columnar (ORC)
– Parquet
(Diagram: the same logical table laid out row oriented vs. column oriented.)
Choosing the right file format
• Processing and query tools: Hive, Impala, and Presto
• Evolution of schema: Avro for schema, Parquet for storage
• File format "splittability": avoid monolithic JSON/XML files; use them as individual records instead
• Compression: block or file
File sizes
• Avoid small files
– Anything smaller than 100 MB
• Each mapper is a single JVM
– CPU time is required to spawn JVMs/mappers
• Fewer files, matching closely to block size
– Fewer calls to S3
– Fewer network/HDFS requests
Dealing with small files
• Reduce HDFS block size, e.g. 1 MB (default is 128 MB):
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
• Better: use S3DistCp to combine smaller files together (see the sketch below)
– S3DistCp takes a pattern and target path to combine smaller input files into larger ones
– Supply a target size and compression codec
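A sketch of submitting an S3DistCp step with boto3; the bucket names, grouping pattern, and cluster ID are illustrative:

import boto3

emr = boto3.client("emr")

# Combine many small log files into ~128 MB Snappy-compressed files on HDFS.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "combine-small-files",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar",
            "Args": [
                "--src", "s3://my-bucket/small-files/",
                "--dest", "hdfs:///data/combined/",
                "--groupBy", ".*/(\\w+)/.*",  # files in the same group get merged
                "--targetSize", "128",         # target output size in MB
                "--outputCodec", "snappy",
            ],
        },
    }],
)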
Compression
• Always compress data files on Amazon S3
– Reduces network traffic between Amazon S3 and Amazon EMR
– Speeds up your job
• Compress mapper and reducer output
Amazon EMR compresses inter-node traffic with LZO on Hadoop 1 and Snappy on Hadoop 2.
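As a sketch of the "compress mapper and reducer output" advice, on EMR 4.x these are Hadoop properties set via the mapred-site classification; Snappy is one reasonable codec choice here:

# Configuration passed to run_job_flow(..., Configurations=COMPRESSION).
COMPRESSION = [{
    "Classification": "mapred-site",
    "Properties": {
        # Compress intermediate map output to cut shuffle traffic.
        "mapreduce.map.output.compress": "true",
        "mapreduce.map.output.compress.codec":
            "org.apache.hadoop.io.compress.SnappyCodec",
        # Compress final job output written to HDFS/S3.
        "mapreduce.output.fileoutputformat.compress": "true",
    },
}]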
Choosing the right compression
• For time-sensitive jobs, faster compression is a better choice
• For large amounts of data, use a space-efficient compression
• For combined workloads, use gzip

Algorithm        Splittable?  Compression ratio  Compress + decompress speed
Gzip (DEFLATE)   No           High               Medium
bzip2            Yes          Very high          Slow
LZO              Yes          Low                Fast
Snappy           No           Low                Very fast
Cost saving tips for Amazon EMR
• Use S3 as your persistent data store; query it using Presto, Hive, Spark, etc.
• Only pay for compute when you need it
• Use Amazon EC2 Spot Instances to save > 80%
• Use Amazon EC2 Reserved Instances for steady workloads
• Use Amazon CloudWatch alerts to notify you if a cluster is underutilized, then shut it down. E.g. 0 mappers running for > N hours
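A sketch of such an alert using EMR's IsIdle CloudWatch metric with boto3; the cluster ID and SNS topic ARN are placeholders:

import boto3

cw = boto3.client("cloudwatch")

# Notify an SNS topic when a cluster has been idle for 4 consecutive hours.
cw.put_metric_alarm(
    AlarmName="emr-cluster-idle",
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",                      # 1 while no work is running
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],
    Statistic="Average",
    Period=3600,                              # one-hour periods
    EvaluationPeriods=4,
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)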
OPTIMIZING DATA WAREHOUSING COSTS WITH S3 AND EMR (July 9, 2015)
Jason Timmes, AVP of Software Development; Nate Sammons, Principal Architect
We make the world's capital markets move faster, more efficient, and more transparent.
• Public company in the S&P 500
• Develop and run markets globally in all asset classes
• We provide technology, trading, intelligence, and listing services
• Intense operational focus on efficiency and competitiveness
We provide the infrastructure, tools, and strategic insight to help our customers navigate the complexity of global capital markets and realize their capital ambitions.
Get to know us: we have uniquely transformed our business from predominantly a U.S. equities exchange to a global provider of corporate, trading, technology, and information solutions.
• Leading index provider with 41,000+ indexes across asset classes and geographies
• Over 10,000 corporate clients in 60 countries
• Our technology powers over 70 marketplaces, regulators, CSDs, and clearinghouses in over 50 countries
• 100+ data product offerings supporting 2.5+ million investment professionals and users in 98 countries
• 26 markets, 3 clearing houses, 5 central securities depositories
• Lists more than 3,500 companies in 35 countries, representing more than $8.8 trillion in total market value
IGNITE YOUR AMBITION
CURRENT STATE
• Replaced on-premises data warehouse with Amazon Redshift in 2014
• On-premises warehouse held 1 year of data
• Amazon Redshift migration yielded 57% cost savings over the legacy solution
• Business now wants more than 1 year of data in the warehouse
• Currently have Jan. 2014 to present in Amazon Redshift (18 dw1.8xl nodes)
THE PROBLEM
• Historical year archives are queried much less frequently
• Amazon Redshift is at a very competitive price point, but for rarely needed data it's still expensive
• Need a solution that stores historical data once (cheaply) and can apply elastic compute directly to that data only as needed
REQUIREMENTS
• Decouple storage and compute resources
• All data is encrypted in flight and at rest
• SQL interface to all archived data
• Parallel ingest into Amazon Redshift and the new (historical years) warehouse
• Transparent usage-based billing to internal departments (clients)
• Isolate workloads between clients (no resource contention)
• Ideally, deliver a platform that can enable different kinds of compute paradigms over the same data (SQL, stream processing, etc.)
ARCHITECTURE DIAGRAM
(Architecture diagram shown on the slide.)
SEPARATE STORAGE & COMPUTE RESOURCES
• Single source of truth: highly durable, encrypted files in S3
– Avoids read hotspots; very high concurrent read capability
• Multiple compute clusters isolate workloads for users
– Each client runs their own EMR clusters in their own Amazon VPCs
– No contention on compute resources, or even IP address ranges
– Scale compute up/down as needed (and manage costs)
• Cost allocation using multiple AWS accounts (one per internal budget)
– S3/Hive Metastore costs are the only shared infrastructure; extremely cheap
– Consolidated billing makes cost allocation easy and fully transparent
• Run multiple query layers and experiment with new projects
– Spark, Drill, etc.
DATA SECURITY & ENCRYPTION
Current state:
• Nasdaq KMS for encryption keys
• Key hierarchy uses MySQL, rooted in an HSM cluster
• Using the S3 "Encryption Materials Provider" interface
Future state:
• Working with InfoSec to evaluate AWS KMS
• Nasdaq KMS keys rooted in AWS KMS
• Move the MySQL DB to Amazon Aurora (encrypted with AWS KMS)
• AWS KMS support in AWS services is growing
ENCRYPTED DATA ACCESS MONITORING/CONTROL
(Diagram shown on the slide.)
S3 FILE FORMAT: PARQUET
• Parquet file format: http://parquet.apache.org
• Self-describing columnar file format
• Supports nested structures (Dremel "record shredding" algorithm)
• Emerging standard data format for Hadoop
– Supported by Presto, Spark, Drill, Hive, Impala, etc.
PARQUET VS. ORC
• Evaluated Parquet and ORC (competing open columnar formats)
• ORC encrypted performance is currently a problem
– ~15x slower vs. unencrypted (94% slower)
– 8 CPUs on 2 nodes: ~900 MB/sec unencrypted vs. ~60 MB/sec encrypted
• Encrypted Parquet is only ~27% slower vs. unencrypted
– Parquet: ~100 MB/sec from S3 per CPU core (encrypted)
• DATE field support in Parquet is not ready yet
– DATE + Parquet not supported in Presto at this time
– Currently writing DATEs as INTs, e.g. 2015-06-22 => 20150622
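A minimal sketch of that DATE-to-INT encoding in Python; the helper name is hypothetical:

from datetime import date

def date_to_int(d: date) -> int:
    # Encode a calendar date as the integer YYYYMMDD (hypothetical helper
    # mirroring the workaround described above).
    return d.year * 10000 + d.month * 100 + d.day

assert date_to_int(date(2015, 6, 22)) == 20150622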
PARQUET DATA CONVERSION & MANAGEMENT
• Generic schema management system
– Supports multiple versions of a named table schema
• Data conversion API
– Simple Java API to read/write records in Parquet
– Encodes the schema version and data transformations in metadata
– Automatically converts DATE to INT for now; migrate to a new native-DATE schema version when available
• Parquet doesn't really handle "ALTER TABLE…" migrations
– Schema changes require building new Parquet files (via EMR MapReduce jobs)
• CSV to Parquet conversion tools (see the sketch below)
– Given a JSON schema definition and CSV, produce Parquet files
– Writes files into a Hive-compliant directory structure
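Nasdaq's conversion tooling is a Java API; as a stand-in, here is a hedged Python sketch of the same idea using pandas and pyarrow, with hypothetical file and column names. write_to_dataset produces the Hive-style key=value directory layout mentioned above:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read a CSV extract (file and column names are hypothetical).
df = pd.read_csv("trades.csv", parse_dates=["trade_date"])

# Encode DATE as INT (YYYYMMDD) until native DATE support lands.
df["trade_date"] = df["trade_date"].dt.strftime("%Y%m%d").astype(int)

# Write Parquet (Snappy-compressed by default) into a Hive-compliant,
# partitioned directory layout.
pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="warehouse/trades",   # e.g. later synced to s3://...
    partition_cols=["trade_date"],
)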
SQL QUERY LAYER: PRESTO
• A distributed SQL engine developed by Facebook: https://prestodb.io
• Used by Facebook, Netflix, Airbnb, and others
• Uses the Hive Metastore to impose table definitions on S3 files; reads schema definitions from Hive and data files from S3 or HDFS
• Recent enterprise support and contributions from Teradata: http://www.teradata.com/Presto
• Data encryption support added by Nasdaq; working on contributing this back to Presto through GitHub
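A hedged sketch of querying Presto from Python via the PyHive client (not part of the talk); the host and table are illustrative, and port 8889 is EMR's usual Presto coordinator port:

from pyhive import presto

# Connect to the Presto coordinator on the EMR master (placeholder host).
conn = presto.connect(host="emr-master.example.internal", port=8889)
cur = conn.cursor()

# Query a Hive-Metastore-defined table whose files live in S3.
cur.execute("SELECT symbol, count(*) AS trades FROM trades GROUP BY symbol")
for row in cur.fetchall():
    print(row)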
STREAM PROCESSING LAYER: SPARK
• One of our clients wants to test stream processing of order data
• Spark supports Parquet
• Files are transparently decrypted before reaching the Spark data processing layer
• Needs to be tested, but should "just work"
• We're hoping to support more Hadoop projects today and in the future
DATA INGEST
• Daily ingest into Amazon Redshift will also write data to S3
– Averaging ~6 B rows/day currently (1.9 TB/day uncompressed)
• Build Parquet files directly from the CSV files loaded into Amazon Redshift
• Limit Amazon Redshift to a 1-year rolling window of data
– Effectively caps Amazon Redshift costs
• Future enhancement: use Presto as a unified SQL access layer to Amazon Redshift and EMR/S3
New Features
Amazon EMR is now HIPAA-eligible
Planned for next month: EMR 4.0
• Directly configure Hadoop application settings
• Hadoop 2.6, Spark 1.4.0, Hive 1.0, Pig 0.14
• Standard ports and paths (Apache Bigtop standards)
• Quickly configure and launch clusters for standard use cases
• Use the configurations parameter to directly change default settings for Hadoop ecosystem applications, instead of using the "configure-hadoop" bootstrap action (see the sketch below)
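A sketch of the EMR 4.0 style with boto3, mirroring the configure-hadoop bootstrap-action example from earlier in this talk; names and values are illustrative:

import boto3

emr = boto3.client("emr")

# EMR 4.0: pass settings declaratively instead of via bootstrap actions.
emr.run_job_flow(
    Name="emr-4-demo",
    ReleaseLabel="emr-4.0.0",
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    Configurations=[{
        "Classification": "hdfs-site",
        "Properties": {"dfs.blocksize": "1048576"},  # 1 MB, as in the earlier example
    }],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)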