Amazon Elastic MapReduce - Ian Meyers
London Hadoop User Group
Deep experience in building and operating global web scale systems
About Amazon Web Services
How did Amazon… get into cloud computing?
Utility computing
On demand | Pay as you go | Uniform | Available
Compute | Storage | Database | Networking | Messaging | Workflow | DNS | Load Balancing | CDN | Backup | Security | Scaling | Monitoring
No Up-Front Capital Expense
Pay Only for What You Use
Self-Service Infrastructure
Easily Scale Up and Down
Improve Agility & Time-to-Market
Low Cost
Cloud computing benefits
Traditional IT capacity vs. Elastic capacity
[Chart: Capacity vs. Time against "Your IT needs": fixed Traditional IT capacity either overshoots demand (WASTE) or undershoots it (CUSTOMER DISSATISFACTION), while Elastic capacity tracks demand]
Workload patterns suited to elastic capacity: On and Off | Fast Growth | Variable peaks | Predictable peaks
[Chart: Number of EC2 Instances, 4/12/2008 to 4/20/2008]
"Techcrunched" after the launch of a Facebook modification: from a steady state of ~40 instances, EC2 scaled to a peak of 5,000 instances - 40 servers to 5,000 in 3 days
Compute Storage
AWS Global Infrastructure
Database
App Services
Deployment & Administration
Networking
Global Infrastructure
Regions: US-EAST (Virginia), US-WEST (N. California), US-WEST (Oregon), EU-WEST (Ireland), ASIA PAC (Tokyo), ASIA PAC (Singapore), ASIA PAC (Sydney), SOUTH AMERICA (Sao Paulo), GOV CLOUD
Availability Zones within each Region
Customer Needs
• Store Any Amount of Data – Without Capacity Planning
• Perform Complex Analysis on Any Data – Scale on Demand
• Store Data Securely
• Decrease Time to Market – Build Environments Quickly
• Reduce Costs – Reduce Capital Expenditure
• Enable Global Reach
Ingestion | Integration
Elastic Block Store
High performance block storage device
1GB to 1TB in size; mount as drives to instances, with snapshot/cloning functionality
Simple Storage Service
Highly scalable object storage for the internet
1 byte to 5TB in size; 99.999999999% durability
Availability: 99.99%
Durability: 99.999999999%
Paradigm: Object store - a web store, not a file system
Consistency: No single points of failure; eventually consistent
Performance: Very fast
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.095/GB/month
Typical use case: Write once, read many
Limits: 100 buckets, unlimited storage, 5TB objects
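As a back-of-envelope check on the list price above, a minimal sketch; the stored volume is illustrative, and request and data-transfer charges are ignored:

```python
# Estimate a monthly S3 storage bill at the slide's list price of
# $0.095/GB/month. Volume is illustrative; request and data transfer
# charges are billed separately and not modelled here.
PRICE_PER_GB_MONTH = 0.095

def monthly_storage_cost(gigabytes: float) -> float:
    return gigabytes * PRICE_PER_GB_MONTH

# e.g. a 2 TB dataset held for one month:
print(monthly_storage_cost(2048))  # 2048 * 0.095 = 194.56
```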
Objects in S3
[Chart: Total Number of Objects Stored in Amazon S3, Q4 2006 to Q4 2012: 14 Billion, 40 Billion, 102 Billion, 262 Billion, 762 Billion, 1.3 Trillion]
Peak Requests: 830,000+ per second
Glacier Long term object archive
Extremely low cost per gigabyte 99.999999999% durability
Durability: 99.999999999%
Paradigm: Archive store - designed for archival, not a file system; Vaults & Archives
Retrieval: 3-5 hour retrieval time
Performance: Configurable - low
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.011/GB/month
Typical use case: Write once, read infrequently (< 10% / month)
Simple Storage Service Highly scalable object storage
1 byte to 5TB in size 99.999999999% durability
Glacier Long term object archive
Extremely low cost per gigabyte 99.999999999% durability
Storage Lifecycle Integration
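The lifecycle integration above can be pictured as an S3 bucket lifecycle rule that transitions aging objects to Glacier. A minimal sketch; the bucket prefix, rule name, and 30-day threshold are illustrative assumptions, and with boto3 a dict of this shape could be passed to `put_bucket_lifecycle_configuration`:

```python
# Sketch of an S3 lifecycle rule moving objects under a prefix to
# Glacier after 30 days. Rule name, prefix, and threshold are
# illustrative; no AWS call is made here.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-old-logs",       # hypothetical rule name
            "Filter": {"Prefix": "logs/"},  # hypothetical key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "GLACIER"}
            ],
        }
    ]
}

rule = lifecycle_configuration["Rules"][0]
print(rule["Transitions"][0]["StorageClass"])  # GLACIER
```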
Structured Data Management
Database
Relational Database Service: Managed Oracle, MySQL & SQL Server
DynamoDB: Managed NoSQL Database
Amazon Redshift: Massively Parallel Petabyte Scale Data Warehouse
Database
Relational Database Service
• Database-as-a-Service: no need to install or manage database instances
• Scalable and fault tolerant configurations
• Integration with Data Pipeline
Database
DynamoDB
• Provisioned throughput NoSQL database
• Fast, predictable, configurable performance
• Fully distributed, fault tolerant HA architecture
• Integration with EMR & Hive
Database
Redshift
• Managed Massively Parallel Petabyte Scale Data Warehouse
• Streaming Backup/Restore to S3
• Extensive Security
• Scales from 2 TB to 1.6 PB
Unstructured Data …
Parallel ETL
Elastic MapReduce
• Managed, elastic Hadoop cluster
• Integrates with S3 & DynamoDB
• Leverage Hive & Pig analytics scripts
• Support for Spot Instances
• Integrated HBase NoSQL Database
Application Services
Elastic MapReduce
• AWS Web Console • Command Line
elastic-mapreduce --create --key-pair micro --region eu-west-1 --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://meyersi-ire/EMR/log
Launching Clusters
• Enabling Tools
elastic-mapreduce --create --key-pair micro --region eu-west-1 --name IanMM-Test1 --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://meyersi-ire/EMR/log
Launching Clusters
• Hadoop Configuration Bootstrap Action
elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-s,dfs.block.size=1048576" --key-pair micro --region eu-west-1 --name IanMM-Test-3 --instance-group core --instance-count 2 --instance-type m2.4xlarge --instance-group task --instance-count 2 --instance-type m2.4xlarge --alive --pig-interactive --hive-interactive --log-uri s3n://meyersi-ire/EMR/log
Launching Clusters
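The configure-hadoop bootstrap action above takes comma-separated flag/value pairs. One way to picture the `-s,key=value` convention is as a tiny parser; this is a sketch of the argument format only, not EMR's actual implementation:

```python
# Parse configure-hadoop style --args strings such as
# "-s,dfs.block.size=1048576" into {flag: {key: value}} overrides.
# A sketch of the argument convention, not EMR's own parser.
def parse_bootstrap_args(args: str) -> dict:
    overrides = {}
    tokens = args.split(",")
    # tokens alternate: a flag ("-s" = site config), then "key=value"
    for flag, pair in zip(tokens[::2], tokens[1::2]):
        key, value = pair.split("=", 1)
        overrides.setdefault(flag, {})[key] = value
    return overrides

print(parse_bootstrap_args("-s,dfs.block.size=1048576"))
# {'-s': {'dfs.block.size': '1048576'}}
```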
• Input Datanode: an S3 bucket, RDS table, EMR Hive table, etc.
• Activity: a data aggregation, manipulation, or copy that runs on a user-configured schedule.
• Output Datanode: supports all the same data sources as the input datanode, but the input and output types don't have to match.
Amazon Data Pipeline
Output: S3 file; Path: s3://trend-data/#{year-month-day}.csv
Activity: EMR Transform; Hive Query: user-metrics.hql; Frequency: Daily
Input: RDS Table; Table: User-Demographics; SQL Precondition: "Select last_update from table" > #{YY-MM-DD}
Input: DynamoDB Table; Table: User-Event-Data-#{year-month}
Success Notification: [email protected]; Failure Notification: emr-[email protected]; Delay Notification: emr-[email protected]
Orchestration with Data Pipeline
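The input/activity/output wiring above can be sketched as a plain data structure. The values are the illustrative ones from the slide, and the dict layout is a simplification rather than Data Pipeline's real JSON schema:

```python
# Sketch of the pipeline on this slide as a plain dict: two inputs,
# one EMR/Hive activity, one S3 output. Layout is a simplification,
# not Data Pipeline's actual pipeline-definition schema.
pipeline = {
    "inputs": [
        {"type": "RDS Table", "table": "User-Demographics",
         "precondition": '"Select last_update from table" > #{YY-MM-DD}'},
        {"type": "DynamoDB Table", "table": "User-Event-Data-#{year-month}"},
    ],
    "activity": {"type": "EMR Transform",
                 "hive_query": "user-metrics.hql",
                 "frequency": "Daily"},
    "output": {"type": "S3 file",
               "path": "s3://trend-data/#{year-month-day}.csv"},
}

# Input and output datanode types do not have to match:
print(pipeline["inputs"][0]["type"], "->", pipeline["output"]["type"])
```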
Analytics Pipeline
Redshift
S3
RDS
EMR
Data Pipeline
…collect & store
…orchestrate
…process & analyse
Dynamo DB
Benefits only possible in the Cloud
Pay as you Go | Lower Overall Costs | Stop Guessing Capacity | Agility / Speed / Innovation | Avoid Undifferentiated Heavy Lifting | Go Global in Minutes
Cloud: ✔ on all six. "Private Cloud" / On Premises: ✗ on all six.
Agility & Global Reach at the Core of EMR
Ease of Operation
Compute Infrastructure
Hadoop Configuration | Local Disk | Operating System Config
HDFS
Networking
Hive | Pig | HBase
User Defined Software Installation
• Multiple Hadoop Distributions: Open Source & MapR
• Clusters Launched with 1 Command, Up in 5 Minutes
• Hard Partitioned per Customer on CPU, Memory and Disk
• Dynamic Cluster Resizing
• In any of 8 Regions around the Globe
Lower Overall Costs
Cheaper | Spot Market Management
Lower TCO
June 2013 Study by Accenture Technology Labs, Not Sponsored or Funded by Amazon: "Accenture assessed the price-performance ratio between bare-metal Hadoop clusters and Hadoop-as-a-Service on Amazon Web Services… [and] revealed that Hadoop-as-a-Service offers better price-performance ratio…" http://www.accenture.com/us-en/Pages/insight-hadoop-deployment-comparison.aspx
• Spot allows customers to bid on unused EC2 capacity
• Spot price based on supply/demand of instance types in an Availability Zone
• Customers are fulfilled when their bid price is higher than the Spot Price
• Instances will be interrupted when the Spot price exceeds the bid price
Spot 101 - What are Spot Instances
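The fulfillment rule above is simple to state in code. A toy sketch of the behaviour only (all prices invented; ties treated as still running):

```python
# Toy model of the Spot rules above: a request runs while the bid is
# at or above the market Spot price and is interrupted once the Spot
# price exceeds the bid. Prices are invented for illustration.
def spot_running(bid: float, spot_price: float) -> bool:
    return bid >= spot_price

bid = 0.40
for spot in [0.25, 0.35, 0.45, 0.30]:
    state = "running" if spot_running(bid, spot) else "interrupted"
    print(f"spot={spot:.2f} -> {state}")
```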
elastic-mapreduce --add-instance-group TASK --instance-count 100 --bid-price .4
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption.
Scenario #1 - Job Flow Duration: 14 Hours. Cost without Spot: 4 instances * 14 hrs * $0.50 = $28
Scenario #2 - Job Flow Duration: 7 Hours. Cost with Spot: 4 instances * 7 hrs * $0.50 = $14, plus 5 Spot instances * 7 hrs * $0.25 = $8.75; Total = $22.75
Time Savings: 50% | Cost Savings: ~20%
Other EMR + Spot Use Cases: run the entire cluster on Spot for the biggest cost savings; reduce the cost of application testing
Reducing Hadoop Costs with Spot
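The arithmetic behind the two scenarios can be reproduced in a few lines, using only the figures from the slide:

```python
# Reproduce the slide's cost comparison: 4 on-demand instances at
# $0.50/hr, optionally joined by 5 Spot task instances at $0.25/hr
# that halve the job flow duration.
ON_DEMAND, SPOT = 0.50, 0.25

# Scenario 1: on-demand only, 14-hour job flow
cost_1 = 4 * 14 * ON_DEMAND                    # $28.00

# Scenario 2: add 5 Spot instances, job finishes in 7 hours
cost_2 = 4 * 7 * ON_DEMAND + 5 * 7 * SPOT      # $14.00 + $8.75 = $22.75

print(cost_1, cost_2)                          # 28.0 22.75
print(f"cost savings: {1 - cost_2 / cost_1:.0%}")   # roughly 20%
```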
Stop Guessing Capacity
Dynamic Clusters
Extend on-premise environments…
with Amazon VPC…
Populate as demand dictates…
Connect over dedicated links…
And turn it off when you are done
EMR is Hadoop…
…cheaper, easier, and more agile
What’s New?
• MapR M7 Introduction: Optimised for HBase Clusters; Failure Recovery; Point-in-Time Recovery Snapshotting; Low Latency Hadoop Optimisations; HBase Mirroring; NFS + HDFS; MapR M5 Price Drop
• Support for Pig 0.11.1: RANK, CUBE & ROLLUP capability; Groovy UDFs; Support for Guava Functions; Performance Improvements
• Spark/Shark Bootstrap Action: In-Memory Hadoop; Spark Scripting (similar to Pig); Shark Shell with Hive Interoperability