Big Data with Amazon EMR - Pop-up Loft Tel Aviv
TRANSCRIPT
Big Data with Amazon EMR
Jonathan Fritz, Sr. Product Manager, AWS
Hadoop 2
• Storage: S3, HDFS
• YARN: Cluster Resource Management
• Batch: MapReduce
• Interactive: Tez
• In Memory: Spark
• Applications: Pig, Hive, Cascading, Mahout, Giraph, HBase, Presto, Impala

Hadoop 1
• Storage: S3, HDFS
• Batch: MapReduce
• Applications
Amazon EMR
Making it easy, secure, and cost-effective to run data processing frameworks on the AWS cloud
Amazon EMR
• Managed platform
• Hadoop MapReduce, Spark, Presto,
and more
• Launch clusters in minutes
• Apache Bigtop based distribution
• Leverage the elasticity of the cloud
• Added security features
• Pay by the hour and save with Spot
• Flexibility to customize
• Programmable Infrastructure
What do I need to build a cluster?
1. Choose instances
2. Choose your software
3. Choose your access method
Cluster composition
• Master Node: NameNode (HDFS), ResourceManager (YARN), and other components
• Core Instance Group: HDFS DataNode, YARN Node Manager
• Task Instance Groups: YARN Node Manager
Choice of multiple instances
• CPU: c3 and c4 families (machine learning)
• Memory: m2 and r3 families (in-memory: Spark & Presto)
• Disk/IO: d2 and i2 families (large HDFS)
• General: m1, m3, and m4 families (batch processing)
Or add EBS volumes if you need additional on-cluster storage.
Hadoop applications available in EMR
Or, use bootstrap actions to install arbitrary
applications on your cluster!
Choose your software - Quick Create
Choose your software – Advanced Options
Configuration API for custom configs
[
{
"Classification": "core-site",
"Properties": {
"hadoop.security.groups.cache.secs": "250"
}
},
{
"Classification": "mapred-site",
"Properties": {
"mapred.tasktracker.map.tasks.maximum": "2",
"mapreduce.map.sort.spill.percent": "90",
"mapreduce.tasktracker.reduce.tasks.maximum": "5"
}
}
]
Use the AWS CLI to easily create clusters:
aws emr create-cluster \
  --release-label emr-4.3.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
Or use your favorite SDK for programmatic provisioning:
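For instance, with the Python SDK (boto3), the same cluster can be described as a `run_job_flow` request. This is a minimal sketch, assuming the default EMR IAM roles (EMR_DefaultRole, EMR_EC2_DefaultRole) already exist in the account:

```python
# Sketch only: build a run_job_flow request matching the CLI example above.
# Assumes the default EMR roles (EMR_DefaultRole, EMR_EC2_DefaultRole)
# already exist in the account.
def build_run_job_flow_request(name="demo-cluster"):
    """Return keyword arguments for emr_client.run_job_flow()."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.3.0",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceCount": 1,
                 "InstanceType": "m3.xlarge"},
                {"InstanceRole": "CORE", "InstanceCount": 2,
                 "InstanceType": "m3.xlarge"},
            ],
            # Keep the cluster alive after steps finish (long-running cluster)
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }

# Usage (needs AWS credentials; not executed here):
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# response = emr.run_job_flow(**build_run_job_flow_request("my-cluster"))
```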
Use Amazon EMR to
separate your
compute and storage.
On premises: compute and storage grow together
• Tightly coupled
• Storage grows along with compute
• Compute requirements vary
On premises: Underutilized or scarce resources
[Chart: cluster utilization over time, showing a steady state, weekly peaks, and re-processing spikes]
On premises: Contention for same resources
[Diagram: compute-bound and memory-bound workloads contending on the same cluster]
Separation of resources creates data silos
[Diagram: each team (e.g., Team A) runs its own isolated cluster]
On premises: Replication adds to cost
• HDFS needs 3x replication
• Multi-data-center DR adds further copies
Use Amazon EMR to
separate your
compute and storage.
EMR can process data from many sources
• Hadoop Distributed File System (HDFS)
• Amazon S3 (EMRFS)
• Amazon DynamoDB, Redshift, Aurora, RDS
• Amazon Kinesis
• Other applications running in your architecture (Kafka, Elasticsearch, etc.)
Amazon S3 is your persistent data store
• 11 9s of durability
• $0.03/GB/month in US-East
• Lifecycle policies
• Available across AZs
• Easy access
The EMR Filesystem (EMRFS)
• Allows you to leverage S3 as a file-system for Hadoop
• Streams data directly from S3
• Cluster still uses local disk/HDFS for intermediates
• Better read/write performance and error handling than
open source components
• Optional consistent view for consistent list
• Support for encryption
• Fast listing of objects
Going from HDFS to S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'
Going from HDFS to S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
Benefit 1: Switch off clusters
Auto-terminate clusters after job completion
You can build a pipeline
Submit jobs using:
- EMR Step API
- Oozie
- SSH directly
- Genie (Gateway)
- OSS workflow tools (e.g., Luigi)
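As a sketch of the Step API route with boto3, the step below runs a Hive script through command-runner.jar (available on EMR 4.x+ release labels); the S3 script path and cluster ID are hypothetical placeholders:

```python
# Sketch only: an EMR Step definition that runs a Hive script from S3
# via command-runner.jar (EMR 4.x+ step launcher).
# The S3 path and cluster ID below are hypothetical placeholders.
def build_hive_step(script_s3_path):
    """Return a Step definition for emr_client.add_job_flow_steps()."""
    return {
        "Name": "hive-query",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", script_s3_path],
        },
    }

# Usage (needs AWS credentials; not executed here):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXX",
#                        Steps=[build_hive_step("s3://my-bucket/query.hql")])
```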
You can use Amazon Data Pipeline
[Pipeline: input data → EMR transforms unstructured data to structured → push to S3 → ingest into Redshift]
Run transient or long-running clusters
Benefit 2: Resize your cluster to match
workload requirements
Resize using the Console, CLI, or API
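Programmatically, a resize is a single API call. A boto3 sketch, where the instance group ID is a placeholder (in practice you would look it up with `emr_client.list_instance_groups`):

```python
# Sketch only: resize a running cluster's instance group by changing
# its instance count. The instance group ID is a placeholder; look it
# up with emr_client.list_instance_groups(ClusterId=...) in practice.
def build_resize_request(instance_group_id, new_count):
    """Return keyword arguments for emr_client.modify_instance_groups()."""
    if new_count < 0:
        raise ValueError("instance count must be non-negative")
    return {
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ]
    }

# Usage (needs AWS credentials; not executed here):
# import boto3
# emr = boto3.client("emr")
# emr.modify_instance_groups(**build_resize_request("ig-EXAMPLE", 10))
```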
Save costs with EC2 Spot instances
[Chart: Spot bid price vs. on-demand (OD) price]
Spot integration
aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
The Spot Bid Advisor
Spot Integration with EMR
• Can provision instances from the Spot Market
• Replaces a spot instance in case of interruption
• Impact of interruption
• Master Node – can lose the cluster
• Core Node – can lose data stored in HDFS
• Task Nodes – can lose the task (but it will be rerun elsewhere)
Scale up with Spot Instances
10 node cluster running for 14 hours
Cost = 1.0 * 10 * 14 = $140
Resize Nodes with Spot Instances
Add 10 more nodes on Spot
Resize Nodes with Spot Instances
20 node cluster running for 7 hours
Cost = 1.0 * 10 * 7 = $70 (on-demand)
     + 0.5 * 10 * 7 = $35 (Spot)
Total = $105
Resize Nodes with Spot Instances
50% less run-time (14 hours → 7 hours)
25% less cost ($140 → $105)
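The arithmetic on these slides checks out; a quick script using the slides' illustrative prices ($1.00/hour on-demand, $0.50/hour Spot, not real EC2 prices):

```python
# Check the cost comparison from the slides.
# Prices are the slides' illustrative numbers, not real EC2 prices.
OD_PRICE = 1.00    # on-demand, $ per instance-hour
SPOT_PRICE = 0.50  # Spot, $ per instance-hour

# Baseline: 10 on-demand nodes running for 14 hours
baseline = OD_PRICE * 10 * 14

# Resized: 10 on-demand + 10 Spot nodes running for 7 hours
resized = OD_PRICE * 10 * 7 + SPOT_PRICE * 10 * 7

time_saved = 1 - 7 / 14
cost_saved = 1 - resized / baseline
print(baseline, resized, time_saved, cost_saved)
# → 140.0 105.0 0.5 0.25 (50% less run-time, 25% less cost)
```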
Intelligent scale down
Intelligent scale down – HDFS
Effectively utilize clusters
Benefit 3: Logical separation of jobs
[Diagram: a production cluster (Hive, Pig, Cascading) and an ad-hoc Presto cluster, both reading from Amazon S3]
Benefit 4: Disaster recovery built-in
[Diagram: Clusters 1-4 spread across two Availability Zones, all reading from Amazon S3, with the Hive Metastore in Amazon RDS]
S3 as a data-lake
Nate Summons, Principal Architect - NASDAQ
Monitoring with CloudWatch (or Ganglia)
EMR logging to S3 makes logs easily available
EMR Security Overview

Security fundamentals
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Query string authentication
• SSL endpoints

Encryption
• Server-side encryption (SSE-S3)
• Server-side encryption with KMS-provided keys (coming soon)
• Client-side encryption

Compliance
• Bucket access logs
• Lifecycle management policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes
Networking: VPC private subnets
• Use Amazon S3 Endpoints for
connectivity to S3
• Use Managed NAT for connectivity to
other services or the Internet
• Control the traffic using Security Groups
• ElasticMapReduce-Master-Private
• ElasticMapReduce-Slave-Private
• ElasticMapReduce-ServiceAccess
Access Control: IAM Users and Roles
• IAM policies for access to the Amazon EMR service (IAM users or federated users)
  • AmazonElasticMapReduceFullAccess
  • AmazonElasticMapReduceReadOnlyAccess
• IAM roles for the Amazon EMR cluster
  • Service role (AmazonElasticMapReduceRole): allowable actions for the Amazon EMR service, like creating EC2 instances
  • Instance profile (AmazonElasticMapReduceforEC2Role): for applications that run on Amazon EMR, like access to Amazon S3 for EMRFS on your cluster
Data at Rest: S3 client-side encryption
[Diagram: EMRFS enabled for Amazon S3 client-side encryption; Amazon S3 encryption clients write client-side encrypted objects to Amazon S3; keys come from a key vendor (AWS KMS or your custom key vendor)]
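Using the configuration API shown earlier, EMRFS client-side encryption with a KMS key can be switched on through the emrfs-site classification. A sketch, where the key ARN is a placeholder and the property names follow the EMR 4.x EMRFS documentation:

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.cse.enabled": "true",
      "fs.s3.cse.encryptionMaterialsProvider": "com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider",
      "fs.s3.cse.kms.keyId": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"
    }
  }
]
```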
Data at Rest: HDFS Transparent Encryption
• HDFS encryption zones, each with its own encryption zone key
• Each file gets a unique data encryption key (DEK), stored in encrypted form (EDEK)
• End-to-end (at-rest and in-transit) when data is written to an encryption zone
• Uses Hadoop KMS with the Java
Cryptography Extension KeyStore
(JCEKS)
• Easy to configure in EMR config object
Available with Amazon EMR 4.1.0+ release
Data in Transit – Depends!
• Hadoop
• HDFS Data Transfer Protocol (dfs.encrypt.data.transfer)
• Hadoop RPC (hadoop.rpc.protection)
• MapReduce
• SSL for encrypted shuffle
• Spark
• SSL for Akka and HTTP (for broadcast and file server)
• SASL for the block transfer service
Customer Stories
AOL’s Spot Use Case: restate 6 months of
historical data
• 10 Availability Zones
• 550 EMR clusters
• 24,000 Spot EC2 instances
[Chart: timing comparison, in-house vs. AWS]
FINRA saves money with comparable performance using Hive on Tez with S3
Using EMR and cloud capacity for ETL
Bridging on-prem and EMR for easy ETL
Twitter (Answers) uses EMR as the batch layer
in their Lambda architecture
Using EMR for batch, streaming, and ad hoc
SmartNews
Nasdaq: data lake architecture diagram
Optimizing data warehousing costs with S3 and EMR
Jonathan Fritz, Sr. Product Manager, [email protected]