Masterclass Live: Amazon EMR
![Page 1: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/1.jpg)
Abhishek Sinha – Sr. Product Manager
@abysinha
Amazon EMR
![Page 2: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/2.jpg)
Amazon EMR
Making it easy, secure and cost-effective to run
data processing frameworks on the AWS cloud
![Page 3: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/3.jpg)
Amazon EMR
• Managed platform
• Hadoop MapReduce, Spark, Presto,
and more
• Launch clusters in minutes
• Apache Bigtop based distribution
• Leverage the elasticity of the cloud
• Added security features
• Pay by the hour and save with Spot
• Flexibility to customize
• Programmable Infrastructure
![Page 4: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/4.jpg)
What do I need to build a cluster?
1. Choose instances
2. Choose your software
3. Choose your access method
![Page 5: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/5.jpg)
Cluster composition
Master Node: NameNode (HDFS), ResourceManager (YARN), and other components
Core Instance Group: HDFS DataNode, YARN NodeManager
Task Instance Groups: YARN NodeManager
![Page 6: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/6.jpg)
Choice of multiple instances
CPU: c3 family, c4 family
Memory: m2 family, r3 family
Disk/IO: d2 family, i2 family
General: m1 family, m3 family, m4 family
Typical workloads: machine learning, batch processing, in-memory (Spark & Presto), large HDFS
Or add EBS volumes if you need additional on-cluster storage.
![Page 7: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/7.jpg)
Hadoop applications available in EMR
Or, use bootstrap actions to install arbitrary
applications on your cluster!
![Page 8: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/8.jpg)
Choose your software - Quick Create
![Page 9: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/9.jpg)
Choose your software – Advanced Options
![Page 10: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/10.jpg)
Configuration API for custom configs
[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]
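A minimal sketch of generating such a configuration list programmatically rather than writing the JSON by hand. The helper name `classification` is hypothetical; the classifications and properties are the ones shown on the slide, and values are converted to strings to match the slide's JSON.

```python
import json


def classification(name, properties):
    """Build one configuration object (hypothetical helper).

    Property values are coerced to strings because the slide's JSON
    uses string values throughout.
    """
    return {
        "Classification": name,
        "Properties": {key: str(value) for key, value in properties.items()},
    }


configurations = [
    classification("core-site", {"hadoop.security.groups.cache.secs": 250}),
    classification("mapred-site", {
        "mapred.tasktracker.map.tasks.maximum": 2,
        "mapreduce.map.sort.spill.percent": 90,
        "mapreduce.tasktracker.reduce.tasks.maximum": 5,
    }),
]

# Serialize to JSON text that could be saved to a file and supplied at cluster creation.
configurations_json = json.dumps(configurations, indent=2)
print(configurations_json)
```

Keeping the properties in code makes it easier to review changes and reuse the same settings across clusters.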
![Page 11: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/11.jpg)
Use the AWS CLI to easily create clusters:
aws emr create-cluster \
  --release-label emr-4.3.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
Or use your favorite SDK for programmatic provisioning:
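As one possible illustration with the AWS SDK for Python (boto3), the sketch below mirrors the CLI example: one m3.xlarge master and two m3.xlarge core nodes on emr-4.3.0. The cluster name, region, and the use of the default role names `EMR_DefaultRole` and `EMR_EC2_DefaultRole` are assumptions, not taken from the slides.

```python
def build_cluster_request(name="masterclass-demo"):
    """Return keyword arguments for boto3's EMR run_job_flow call.

    The cluster name and role names are assumptions for illustration.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.3.0",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 2},
            ],
            # Keep the cluster running after any steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed instance profile name
        "ServiceRole": "EMR_DefaultRole",      # assumed service role name
    }


def create_cluster(region="us-east-1"):
    """Create the cluster and return its ID (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_cluster_request())
    return response["JobFlowId"]
```

Separating the request builder from the API call lets the request be inspected or unit-tested without touching an AWS account.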
![Page 12: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/12.jpg)
Use Amazon EMR to
separate your
compute and storage.
![Page 13: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/13.jpg)
On premises: compute and storage grow together
Tightly coupled
Storage grows along with
compute
Compute requirements vary
![Page 14: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/14.jpg)
On premises: Underutilized or scarce resources
[Chart labels: Re-processing, Weekly peaks, Steady state]
![Page 15: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/15.jpg)
On premises: Contention for same resources
Compute bound
Memory bound
![Page 16: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/16.jpg)
Separation of resources creates data silos
Team A
![Page 17: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/17.jpg)
On premises: Replication adds to cost
3x
HDFS needs 3x
Multi-Data Center DR
![Page 18: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/18.jpg)
Use Amazon EMR to
separate your
compute and storage.
![Page 19: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/19.jpg)
EMR can process data from many sources
• Hadoop Distributed File
System (HDFS)
• Amazon S3 (EMRFS)
• Amazon DynamoDB, Redshift, Aurora, RDS
• Amazon Kinesis
• Other applications running in your architecture (Kafka, Elasticsearch, etc.)
![Page 20: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/20.jpg)
Amazon S3 is your persistent data store
11 9’s of durability
$0.03 / GB / Month in US-East
Life Cycle Policies
Available across AZs
Easy access
Amazon S3
![Page 21: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/21.jpg)
The EMR Filesystem (EMRFS)
• Allows you to leverage S3 as a file-system for Hadoop
• Streams data directly from S3
• Cluster still uses local disk/HDFS for intermediates
• Better read/write performance and error handling than
open source components
• Optional consistent view for consistent listing
• Support for encryption
• Fast listing of objects
![Page 22: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/22.jpg)
Going from HDFS to S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/';
![Page 23: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/23.jpg)
Going from HDFS to S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/';
![Page 24: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/24.jpg)
Benefit 1: Switch off clusters
Amazon S3
![Page 25: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/25.jpg)
Auto-terminate clusters after job completion
![Page 26: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/26.jpg)
You can build a pipeline
Submit jobs using:
- EMR Step API
- Oozie
- SSH directly
- Genie (Gateway)
- OSS workflow tools (e.g. Luigi)
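As a rough sketch of the first option, the EMR Step API can be called through boto3's `add_job_flow_steps`. The step name and script URI below are hypothetical, and running `spark-submit` through `command-runner.jar` assumes an EMR 4.x or later release.

```python
def build_spark_step(script_uri, name="example-spark-step"):
    """Describe a step that runs spark-submit on the cluster.

    script_uri and name are placeholders chosen for illustration.
    """
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # leave the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def submit_step(cluster_id, step, region="us-east-1"):
    """Submit the step and return its ID (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


step = build_spark_step("s3://my-bucket/jobs/etl.py")  # hypothetical bucket and script
```

Steps run in order on the cluster, which is what makes this API a simple building block for pipelines.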
![Page 27: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/27.jpg)
You can use AWS Data Pipeline
Input data
Use EMR to transform unstructured to structured data
Push to S3
Ingest into Redshift
![Page 28: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/28.jpg)
Run transient or long-running clusters
![Page 29: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/29.jpg)
Benefit 2: Resize your cluster to match
workload requirements
![Page 30: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/30.jpg)
Resize using the Console, CLI, or API
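For the API route, a minimal sketch using boto3's `modify_instance_groups` might look like the following. The cluster ID and instance group ID are placeholders; real IDs would come from the console or a prior describe/list call.

```python
def build_resize_request(cluster_id, instance_group_id, new_count):
    """Return keyword arguments for boto3's EMR modify_instance_groups call."""
    if new_count < 0:
        raise ValueError("new_count must not be negative")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count},
        ],
    }


def resize(request, region="us-east-1"):
    """Apply the resize (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    boto3.client("emr", region_name=region).modify_instance_groups(**request)


# Placeholder IDs, for illustration only.
request = build_resize_request("j-EXAMPLECLUSTER", "ig-EXAMPLEGROUP", 20)
```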
![Page 31: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/31.jpg)
Save costs with EC2 Spot instances
Bid price
On-Demand (OD) price
![Page 32: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/32.jpg)
Spot integration
aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
![Page 33: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/33.jpg)
The Spot Bid Advisor
![Page 34: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/34.jpg)
Spot Integration with EMR
• Can provision instances from the Spot Market
• Replaces a spot instance in case of interruption
• Impact of interruption
• Master Node – Can lose the cluster
• Core Node – Can lose data stored in HDFS
• Task Nodes – lose the task (but the task will run elsewhere)
![Page 35: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/35.jpg)
Scale up with Spot Instances
10 node cluster running for 14 hours
Cost = 1.0 * 10 * 14 = $140
![Page 36: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/36.jpg)
Resize Nodes with Spot Instances
Add 10 more nodes on Spot
![Page 37: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/37.jpg)
Resize Nodes with Spot Instances
20 node cluster running for 7 hours
On-Demand cost = 1.0 * 10 * 7 = $70
Spot cost = 0.5 * 10 * 7 = $35
Total = $105
![Page 38: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/38.jpg)
Resize Nodes with Spot Instances
50% less run-time (14 → 7 hours)
25% less cost ($140 → $105)
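The arithmetic on these slides can be reproduced with a short sketch. It uses the slides' figures ($1.00 per node-hour On-Demand, $0.50 per node-hour on Spot) and the slides' implicit assumption that doubling the node count halves the run-time; the function name is hypothetical.

```python
def cluster_cost(groups, hours):
    """Total cost for a list of (node_count, hourly_price) groups run for `hours`."""
    return sum(count * price * hours for count, price in groups)


# 10 On-Demand nodes for 14 hours.
baseline = cluster_cost([(10, 1.0)], hours=14)

# 10 On-Demand nodes plus 10 Spot nodes for 7 hours.
resized = cluster_cost([(10, 1.0), (10, 0.5)], hours=7)

runtime_saving = 1 - 7 / 14          # fraction of run-time saved
cost_saving = 1 - resized / baseline  # fraction of cost saved

print(f"baseline=${baseline:.0f} resized=${resized:.0f}")
print(f"run-time saved: {runtime_saving:.0%}, cost saved: {cost_saving:.0%}")
```

Real jobs rarely scale perfectly linearly, so the savings should be treated as an upper bound for a given price ratio.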
![Page 39: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/39.jpg)
Intelligent scale down
![Page 40: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/40.jpg)
Intelligent scale down – HDFS
![Page 41: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/41.jpg)
Effectively utilize clusters
![Page 42: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/42.jpg)
Benefit 3: Logical separation of jobs
Prod: Hive, Pig, Cascading
Ad-Hoc: Presto
Amazon S3
![Page 43: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/43.jpg)
Benefit 4: Disaster recovery built-in
Cluster 1 Cluster 2
Cluster 3 Cluster 4
Amazon S3
Availability Zone Availability Zone
Hive Metastore in
Amazon RDS
![Page 44: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/44.jpg)
S3 as a data-lake
Nate Summons, Principal Architect - NASDAQ
![Page 45: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/45.jpg)
Monitoring with CloudWatch (or Ganglia)
![Page 46: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/46.jpg)
EMR logging to S3 makes logs easily available
![Page 47: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/47.jpg)
![Page 48: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/48.jpg)
Spark moves at interactive speed
[Diagram: DAG of RDDs A through F across Stages 1 to 3, with map, filter, groupBy, and join operations; legend: RDD, cached partition]
• Massively parallel
• Uses DAGs instead of map-reduce for execution
• Minimizes I/O by storing data in RDDs in memory
• Partitioning-aware to avoid network-intensive shuffle
![Page 49: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/49.jpg)
Spark components to match your use case
![Page 50: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/50.jpg)
Spark speaks your language
![Page 51: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/51.jpg)
Use DataFrames to easily interact with data
• Distributed collection of data organized in columns
• An extension of the existing RDD API
• Optimized for query execution
![Page 52: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/52.jpg)
Easily create DataFrames from many formats
RDD
Additional libraries for Spark SQL Data Sources
at spark-packages.org
![Page 53: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/53.jpg)
Load data with the Spark SQL Data Sources API
Additional libraries at spark-packages.org
![Page 54: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/54.jpg)
Sample DataFrame manipulations
![Page 55: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/55.jpg)
Use DataFrames for machine learning
• Spark ML libraries (replacing MLlib) use DataFrames as input/output for models
• Create ML pipelines with a variety of distributed algorithms
![Page 56: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/56.jpg)
Create DataFrames on streaming data
• Access data in Spark Streaming DStream
• Create SQLContext on the SparkContext used for Spark
Streaming application for ad hoc queries
• Incorporate DataFrame in Spark Streaming application
• Checkpointing streaming jobs
![Page 57: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/57.jpg)
Spark Pipeline
![Page 58: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/58.jpg)
Use R to interact with DataFrames
• SparkR package for using R to manipulate DataFrames
• Create SparkR applications or interactively use the SparkR
shell (no Zeppelin support yet - ZEPPELIN-156)
• Comparable performance to Python and Scala
DataFrames
![Page 59: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/59.jpg)
![Page 60: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/60.jpg)
Amazon EMR runs Spark on YARN
• Dynamically share and centrally configure
the same pool of cluster resources across
engines
• Schedulers for categorizing, isolating, and
prioritizing workloads
• Choose the number of executors to use, or
allow YARN to choose (dynamic allocation)
• Kerberos authentication
Storage: S3, HDFS
YARN: Cluster Resource Management
Batch: MapReduce
In Memory: Spark
Applications: Pig, Hive, Cascading, Spark Streaming, Spark SQL
![Page 61: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/61.jpg)
Inside Spark Executor on YARN
Max Container size on node
Executor Memory Overhead - Off-heap memory (VM overheads, interned strings, etc.)
spark.yarn.executor.memoryOverhead = executorMemory * 0.10
Executor Container
Memory
Overhead
Config File: spark-defaults.conf
![Page 62: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/62.jpg)
Inside Spark Executor on YARN
Max Container size on node
Spark executor memory - Amount of memory to use per executor process
spark.executor.memory
Executor Container
Memory
Overhead
Spark Executor Memory
Config File: spark-defaults.conf
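To make the relationship between executor memory, overhead, and container size concrete, here is a small sketch. The 10% factor comes from the slides; the 384 MB floor is the documented minimum for `spark.yarn.executor.memoryOverhead` in Spark 1.x and is not shown on the slides. The function name is hypothetical, and YARN's rounding of requests to its allocation increment is not modeled.

```python
MIN_OVERHEAD_MB = 384  # Spark 1.x minimum overhead; an addition not shown on the slides


def yarn_container_mb(executor_memory_mb, overhead_factor=0.10):
    """Approximate memory requested from YARN for one executor container.

    Container request = spark.executor.memory + memory overhead, where the
    overhead is the larger of the floor and executor memory * overhead_factor.
    """
    overhead = max(MIN_OVERHEAD_MB, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead


print(yarn_container_mb(8192))  # 8 GiB executor
print(yarn_container_mb(2048))  # small executor, where the floor applies
```

The result must fit within the maximum container size on the node, otherwise YARN cannot schedule the executor.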
![Page 63: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/63.jpg)
Inside Spark Executor on YARN
Max Container size on node
Shuffle Memory Fraction – pre-Spark 1.6
Executor Container
Memory
Overhead
Spark Executor Memory
Shuffle memoryFraction (default: 0.2)
![Page 64: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/64.jpg)
Inside Spark Executor on YARN
Max Container size on node
Storage Memory Fraction - pre-Spark 1.6
Executor Container
Memory
Overhead
Spark Executor Memory
Shuffle memoryFraction
Storage memoryFraction (default: 0.6)
![Page 65: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/65.jpg)
Inside Spark Executor on YARN
Max Container size on node
In Spark 1.6+, Spark automatically balances the amount of memory for execution
and cached data.
Executor Container
Memory
Overhead
Spark Executor Memory
Execution / Cache
Default: 0.6
![Page 66: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/66.jpg)
Dynamic Allocation on YARN
Scaling up on executors
- Request when you want the job to complete faster
- Idle resources on cluster
- Exponential increase in executors over time
New Default beginning EMR 4.4
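The exponential ramp-up can be illustrated with a small simulation. This is a simplified model, not Spark's implementation: it assumes one request round per backlog timeout, a batch size that doubles each round (1, 2, 4, 8, ...), and a cap at the target executor count. The function name is hypothetical.

```python
def executor_ramp(target, initial=0):
    """Return the total executor count after each request round.

    Each round requests twice as many new executors as the previous one,
    starting at 1, until the target is reached.
    """
    total, batch, history = initial, 1, []
    while total < target:
        total = min(target, total + batch)  # never exceed the target
        history.append(total)
        batch *= 2
    return history


print(executor_ramp(17))
```

Starting small avoids over-requesting for short jobs, while doubling lets long, backlogged jobs reach a large executor count in a few rounds.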
![Page 67: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/67.jpg)
Dynamic allocation setup
Optional
| Property | Value |
| --- | --- |
| spark.dynamicAllocation.enabled | true |
| spark.shuffle.service.enabled | true |
| spark.dynamicAllocation.minExecutors | 5 |
| spark.dynamicAllocation.maxExecutors | 17 |
| spark.dynamicAllocation.initialExecutors | 0 |
| spark.dynamicAllocation.executorIdleTimeout | 60s |
| spark.dynamicAllocation.schedulerBacklogTimeout | 5s |
| spark.dynamicAllocation.sustainedSchedulerBacklogTimeout | 5s |
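One way to apply such settings to a single job is through `--conf` flags on `spark-submit`. The sketch below builds that command line from a dictionary; the values are copied from the slide as an illustration rather than a recommendation and should be checked against the Spark version in use. The script name `my_job.py` and the helper name are hypothetical.

```python
# Values copied from the slide for illustration only.
DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "5",
    "spark.dynamicAllocation.maxExecutors": "17",
    "spark.dynamicAllocation.initialExecutors": "0",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    "spark.dynamicAllocation.schedulerBacklogTimeout": "5s",
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "5s",
}


def to_spark_submit_args(properties):
    """Turn a property dictionary into a flat list of --conf key=value arguments."""
    args = []
    for key, value in properties.items():
        args += ["--conf", f"{key}={value}"]
    return args


# my_job.py is a placeholder application script.
command = ["spark-submit"] + to_spark_submit_args(DYNAMIC_ALLOCATION) + ["my_job.py"]
print(" ".join(command))
```

Cluster-wide defaults would instead go into spark-defaults.conf, so per-job flags are best reserved for overrides.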
![Page 68: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/68.jpg)
Compress your input data set
• Always compress Data Files on Amazon S3
• Reduces storage cost
• Reduces bandwidth between Amazon S3 and Amazon
EMR, which can speed up bandwidth constrained jobs
![Page 69: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/69.jpg)
Compressions
Compression Types:
– Some are fast BUT offer less space reduction
– Some are space efficient BUT Slower
– Some are splittable and some are not

| Algorithm | % Space Remaining | Encoding Speed | Decoding Speed |
| --- | --- | --- | --- |
| GZIP | 13% | 21 MB/s | 118 MB/s |
| LZO | 20% | 135 MB/s | 410 MB/s |
| Snappy | 22% | 172 MB/s | 409 MB/s |
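To compare the trade-offs in the table for a given data set, a rough estimate can be computed as below. The figures are the ones from the table; treating the speeds as megabytes of uncompressed data per second on a single thread is an assumption, and the function name is hypothetical.

```python
# (fraction of space remaining, encode MB/s, decode MB/s), from the table above.
CODECS = {
    "GZIP": (0.13, 21, 118),
    "LZO": (0.20, 135, 410),
    "Snappy": (0.22, 172, 409),
}


def estimate(codec, raw_mb):
    """Estimate compressed size and single-threaded encode/decode time.

    Assumes the table's speeds are measured against uncompressed data.
    """
    remaining, encode_mb_s, decode_mb_s = CODECS[codec]
    return {
        "compressed_mb": round(raw_mb * remaining, 2),
        "encode_seconds": round(raw_mb / encode_mb_s, 2),
        "decode_seconds": round(raw_mb / decode_mb_s, 2),
    }


for name in CODECS:
    print(name, estimate(name, raw_mb=1000))
```

For 1000 MB of input, GZIP stores the least data but takes the longest to encode, while LZO and Snappy trade some space for much faster encoding and decoding.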
![Page 70: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/70.jpg)
Data Serialization
• Data is serialized when cached or shuffled
Default: Java serializer
• Kryo serialization (up to 10x faster than Java serialization)
• Does not support all Serializable types
• Register the class in advance
Usage: Set in SparkConf
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
![Page 71: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/71.jpg)
Running Spark on
Amazon EMR
![Page 72: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/72.jpg)
Focus on deriving insights from your data
instead of manually configuring clusters
Easy to install and configure Spark
Secured
Spark submit, Oozie or use Zeppelin UI
Quickly add and remove capacity
Hourly, reserved, or EC2 Spot pricing
Use S3 to decouple compute and storage
![Page 73: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/73.jpg)
Launch the latest Spark version
Spark 1.6.1 is the current version on EMR.
< 3 week cadence with latest open source release
![Page 74: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/74.jpg)
Create a fully configured cluster in minutes
AWS Management
Console
AWS Command Line
Interface (CLI)
Or use an AWS SDK directly with the Amazon EMR API
![Page 75: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/75.jpg)
Or easily change your settings
![Page 76: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/76.jpg)
Many storage layers to choose from
Amazon DynamoDB: EMR-DynamoDB connector
Amazon RDS: JDBC Data Source w/ Spark SQL
Amazon Kinesis: Streaming data connectors
Elasticsearch: Elasticsearch connector
Amazon Redshift: Spark-Redshift connector
Amazon S3: EMR File System (EMRFS)
Amazon EMR
![Page 77: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/77.jpg)
Decouple compute and storage by using S3
as your data layer
HDFS
S3 is designed for 11 9's of durability and is massively scalable
EC2 Instance Memory
Amazon S3
Amazon EMR
![Page 78: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/78.jpg)
Easy to run your Spark workloads
Amazon EMR Step API
SSH to master node and use spark-submit, Oozie, or Zeppelin
Submit a Spark application
Amazon EMR
![Page 79: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/79.jpg)
Customer use cases
![Page 80: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/80.jpg)
Some of our customers running Spark on EMR
![Page 81: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/81.jpg)
![Page 82: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/82.jpg)
Integration Pattern – ETL with Spark
Amazon EMR
Amazon S3
HDFS
Read Unstructured Data
Write Structured
Extract
Load from HDFS
Store Output Data
![Page 83: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/83.jpg)
Integration Pattern – Tumbling Window Reporting
Amazon EMR
Amazon Kinesis
Streaming Input
HDFS
Tumbling/Fixed Window Aggregation
Periodic Output
Amazon Redshift
COPY from EMR
Or checkpoint to S3 and use the Lambda loader app
![Page 84: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/84.jpg)
EMR Security Overview
![Page 85: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/85.jpg)
Security Fundamentals
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Query string authentication
Encryption
• SSL endpoints
• Server Side Encryption (SSE-S3)
• Server Side Encryption with KMS provided keys (coming soon)
• Client-side Encryption
Compliance
• Bucket access logs
• Lifecycle Management Policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes
![Page 86: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/86.jpg)
Networking: VPC private subnets
• Use Amazon S3 Endpoints for
connectivity to S3
• Use Managed NAT for connectivity to
other services or the Internet
• Control the traffic using Security Groups
• ElasticMapReduce-Master-Private
• ElasticMapReduce-Slave-Private
• ElasticMapReduce-ServiceAccess
![Page 87: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/87.jpg)
Access Control: IAM Users and Roles
• IAM Policies for access to Amazon EMR service (IAM users or federated users)
• AmazonElasticMapReduceFullAccess
• AmazonElasticMapReduceReadOnlyAccess
• IAM Policies for Amazon EMR cluster
• Service role (AmazonElasticMapReduceRole) - Allowable actions for Amazon EMR service, like creating EC2 instances.
• Instance profile (AmazonElasticMapReduceforEC2Role) - Applications that run on Amazon EMR, like access to Amazon S3 for EMRFS on your cluster.
![Page 88: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/88.jpg)
Data at Rest: S3 client-side encryption
Amazon S3
Amazon S3 encryption clients
EMRFS enabled for Amazon S3 client-side encryption
Key vendor (AWS KMS or your custom key vendor)
(client-side encrypted objects)
![Page 89: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/89.jpg)
Customer Stories
![Page 90: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/90.jpg)
AOL’s Spot Use Case: restate 6 months of
historical data
10 Availability Zones
550 EMR Clusters
24,000 Spot EC2 Instances
[Chart: Timing Comparison, In-House vs. AWS]
![Page 91: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/91.jpg)
OUR CLOUD ARCHITECTURE
![Page 92: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/92.jpg)
FINRA saves money with comparable
performance with Hive on Tez with S3
![Page 93: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/93.jpg)
Using EMR and cloud capacity for ETL
![Page 94: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/94.jpg)
Bridging on-prem and EMR for easy ETL
![Page 95: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/95.jpg)
![Page 96: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/96.jpg)
Twitter (Answers) uses EMR as the batch layer
in their Lambda architecture
![Page 97: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/97.jpg)
Using EMR for batch, streaming, and ad hoc
SmartNews
![Page 98: Masterclass Live: Amazon EMR](https://reader031.vdocument.in/reader031/viewer/2022030317/587110231a28abac6d8b58db/html5/thumbnails/98.jpg)
Nasdaq: data lake architecture diagram
Optimizing data warehousing costs with S3 and EMR