Hadoop in the cloud with AWS' EMR
DESCRIPTION
A quick introduction to, and walkthrough of, the AWS Elastic MapReduce (EMR) service. Part of a larger course at http://bit.ly/get-hadoop
TRANSCRIPT
Hadoop in the Cloud: AWS Elastic MapReduce
• What is EMR?
• How does EMR compare to Hadoop?
• Use cases
EMR is an AWS Service
• An AWS review is helpful for understanding EMR
• Infiniteskills offers a course!
– http://bit.ly/learn-aws
• AWS is constantly changing and evolving
http://aws.amazon.com/documentation/elasticmapreduce/
EMR Overview
• Abstracts out cluster setup & management
– Integrated provisioning, tooling, debugging, monitoring
– AWS constantly tuning and optimizing
– Failed nodes automatically re-provisioned by AWS
• Reduced costs
– Clusters shut down automatically by default
– Excellent for sporadic MapReduce needs
• Integration with AWS
– Leverage cost-effective EC2 instances for processing, S3 for storage
– Monitoring done via CloudWatch
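As a rough illustration of the "shut down automatically" point, the sketch below builds the keyword arguments that boto3's `run_job_flow` call accepts for a short-lived cluster. The helper name, bucket, release label, and instance types are placeholders chosen for this example, not values from the course; the role names are the usual EMR defaults and are assumed to exist in the account. Nothing is sent to AWS here.

```python
def build_transient_cluster_request(name, log_uri, worker_count=2):
    """Return run_job_flow keyword arguments for a short-lived cluster.

    Hypothetical helper for illustration; all concrete values are placeholders.
    """
    return {
        "Name": name,
        "LogUri": log_uri,                  # cluster logs are written to S3
        "ReleaseLabel": "emr-5.36.0",       # placeholder release label
        "Instances": {
            "MasterInstanceType": "m5.xlarge",   # placeholder instance type
            "SlaveInstanceType": "m5.xlarge",    # placeholder instance type
            "InstanceCount": 1 + worker_count,   # one master plus the workers
            # False asks EMR to terminate the cluster once no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",    # assumed default EC2 role
        "ServiceRole": "EMR_DefaultRole",        # assumed default service role
    }


request = build_transient_cluster_request(
    "nightly-report", "s3://example-bucket/emr-logs/"
)

# With AWS credentials configured, and a "Steps" list added so the cluster
# has work to do, the request could be submitted with:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as a plain dictionary makes it easy to inspect or version before any cluster is actually started.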
EMR Architecture
[Diagram: Master Instance Group (EC2); Core Instance Group (EC2 instances with HDFS); Task Instance Group (EC2 instances); S3]
• Master group controls cluster
• Core group runs DataNode & TaskTracker daemons
• Task group runs tasks
– Can be added & removed
• S3 can be used for data input / output
• Master group coordinates core + task activities and manages cluster state
• Core + task instances read / write to / from S3
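The three instance groups above map onto the `InstanceGroups` list in boto3's `run_job_flow` request. The sketch below is a minimal, hypothetical helper that builds that list; the group names, instance type, and counts are assumptions for illustration. The task group is optional, which mirrors the point that task instances can be added and removed.

```python
def build_instance_groups(core_count, task_count=0, instance_type="m5.xlarge"):
    """Build an InstanceGroups list for an EMR cluster request.

    Hypothetical helper; instance_type is a placeholder value.
    """
    groups = [
        # One master instance coordinates the cluster.
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core instances store HDFS data and run tasks.
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task instances only run tasks, so the group is optional.
        groups.append(
            {"Name": "Task", "InstanceRole": "TASK",
             "InstanceType": instance_type, "InstanceCount": task_count}
        )
    return groups


groups = build_instance_groups(core_count=2, task_count=4)
roles = [group["InstanceRole"] for group in groups]
print(roles)
```

The resulting list would go under `Instances["InstanceGroups"]` in the request, in place of the simpler master/slave instance type fields.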
EMR AWS Integration
• Datastore pull / push to
– RDS
– DynamoDB
– S3
• Derived data can be stored in Redshift
– Via AWS Data Pipeline
– Further post-processing
• Data can be pre-processed with Kinesis
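For the S3 case in the list above, a job step can read its input from one S3 location and write its output to another. The sketch below builds a Hadoop streaming step in the shape boto3 expects for the `Steps` parameter. It assumes a release where `command-runner.jar` is available (older AMI-based versions are configured differently); the helper name, bucket paths, and the `cat` / `wc` commands are placeholders for this example.

```python
def build_streaming_step(name, mapper, reducer, input_uri, output_uri):
    """Describe a Hadoop streaming step that reads from and writes to S3.

    Hypothetical helper; assumes command-runner.jar exists on the cluster.
    """
    return {
        "Name": name,
        # Stop the cluster if the step fails, so it does not keep billing.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-mapper", mapper,
                "-reducer", reducer,
                "-input", input_uri,     # data pulled from S3
                "-output", output_uri,   # results pushed back to S3
            ],
        },
    }


step = build_streaming_step(
    "line-count",
    mapper="cat",
    reducer="wc",
    input_uri="s3://example-bucket/input/",
    output_uri="s3://example-bucket/output/",
)
print(step["HadoopJarStep"]["Args"])
```

A list of such steps would be passed as `Steps=[step]` when creating the cluster.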
What you give up with EMR
• Control
– Always 2-3 months behind Hadoop releases
– Cannot use CDH or HDP releases (although MapR is supported)
• Speed (if you’re not an AWS customer)
• Vendor lock-in
EMR Use Cases
• Already an AWS customer
– Lots of data in S3 / DynamoDB / RDS
• Sporadic MapReduce needs
• Hadoop proofs of concept
• Ease of use
– Seamless, near-infinite scale
– Simple administration
Hadoop in the Cloud: AWS Elastic MapReduce
• What is EMR?
• How does EMR compare to Hadoop?
• Benefits & downsides
• Use cases