best practices and performance tuning on amazon elastic

48
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amo Abeyaratne Big Data and Analytics Consultant ANZ 12.04.2016 Best Practices and Performance Tuning on Amazon Elastic MapReduce Michael Hanisch Solutions Architect

Upload: votram

Post on 14-Feb-2017

225 views

Category:

Documents


6 download

TRANSCRIPT

Page 1: Best Practices and Performance Tuning on Amazon Elastic

©  2016,  Amazon  Web  Services,  Inc.  or  its  Affiliates.  All  rights  reserved.

Amo  AbeyaratneBig  Data  and  Analytics  Consultant  ANZ

12.04.2016

Best  Practices  and  Performance  Tuning  on  Amazon  Elastic  MapReduce

Michael  HanischSolutions  Architect

Page 2: Best Practices and Performance Tuning on Amazon Elastic

Challenge:  Data  is  Everywhere

-­ Sensors-­ Devices-­ Logs-­ Operations-­ …

Page 3: Best Practices and Performance Tuning on Amazon Elastic

Size:  Growing  in  Petabytes

Migrate  &  Collect

Store  &Transform

Process  &  Analyze

Visualize  &  Predict

“If every byte was a word, and you take a second to read a word, it will take you 32 million years to read a whole Petabyte”

Page 4: Best Practices and Performance Tuning on Amazon Elastic

How  do  we  deal  with  it?

Page 5: Best Practices and Performance Tuning on Amazon Elastic

Strategy:  Divide  and  Conquer

Hadoop Amazon  EMR

A  Managed  Hadoop  Framework  in  the  Cloud

Hadoop  on  EC2

Managing  on  your  own  can  be  a    LOT of  work

Hadoop  has  to  scale?

Page 6: Best Practices and Performance Tuning on Amazon Elastic

Tool:  Amazon  EMR

Page 7: Best Practices and Performance Tuning on Amazon Elastic

Agenda

Why  EMR?

Well  Architected  EMR  Design  for  Production

Challenge:  Data  is  EverywhereSize:  Growing  in  PBsStrategy:  Divide  &  ConquerTool:  Amazon  EMR

Page 8: Best Practices and Performance Tuning on Amazon Elastic

Why  EMR?

Page 9: Best Practices and Performance Tuning on Amazon Elastic

Why  EMR?

Automated Decoupled Elastic

Integrated Low-­costCurrent

Page 10: Best Practices and Performance Tuning on Amazon Elastic

Why  EMR?  :  Automation

EC2  Provisioning Cluster  Setup Hadoop  Configuration

Installing  ApplicationsJob  submissionMonitoring  and  Failure  Handling

Page 11: Best Practices and Performance Tuning on Amazon Elastic

Why  EMR?  :  Decoupled  Architecture

Separate  compute  and  storage

Resize  and  shutdown  with  no  data  loss

Point  multiple  clusters  at  the  same  data  on  

Amazon  S3

Easily  evolve  infrastructure  as  technology  evolves

HDFS  for  iterative  and  disk  I/O  intensive  

workloadsSave  with  spot  and  reserved  instances

Page 12: Best Practices and Performance Tuning on Amazon Elastic

Why  EMR?:  Elastic

Intelligent  resize:  Wait  for  work  to  finish  before  stopping  instances

Core  nodes  for  scaling  HDFS

Task  nodes  for  scaling  processing  

power

Use  instance  groups:  to  manage  different  instance  types  in  the  

same  cluster

Page 13: Best Practices and Performance Tuning on Amazon Elastic

Why  EMR?:  Current

Application Open source  release  

EMR  release

Spark  1.5 September  9,  2015 September  2015

Spark  1.5.2 November  9,  2015 November 2015

Spark  1.6 January  4,  2016 January  2016

Spark 1.6.1 March 9,  2016 April  2016

Page 14: Best Practices and Performance Tuning on Amazon Elastic

Why  EMR?:  Integration  with  other  services

Amazon  Kinesis

Amazon  S3

AWS  Data  Pipeline

Amazon  DynamoDB

Amazon  Redshift

Amazon  KMS

IAM  for  Authentication

Page 15: Best Practices and Performance Tuning on Amazon Elastic

Why  EMR?: Low-­cost

Spot  instances:  Bid  for  unused  EC2s  at  up  to  90%  less  price

Transient  clusters:  Terminate  the  cluster  when  not  in  use

Reserved  instances:  For  persistent  

clusters,  make  use  of  EC2  reserved  

instances  to  save  up  to  50%

Page 16: Best Practices and Performance Tuning on Amazon Elastic

Agenda

Why  EMR?

Well  Architected  EMR  Design  for  Production

Challenge:  Data  is  EverywhereSize:  Growing  in  PBsStrategy:  Divide  &  ConquerTool:  Amazon  EMR

-­Automated-­Decoupled-­Elastic-­Integrated-­Current-­Cost-­efficient

Page 17: Best Practices and Performance Tuning on Amazon Elastic

Well-­Architected  Amazon  EMR

Page 18: Best Practices and Performance Tuning on Amazon Elastic

Well  Architected  EMR:  Design  for  Production

SecurityReliabilityPerformance Cost  efficiency

Page 19: Best Practices and Performance Tuning on Amazon Elastic

Performance  Efficiency

Choice  of  Instance  Type

Choice  of  Storage Choice  of  application  or  framework

Page 20: Best Practices and Performance Tuning on Amazon Elastic

Performance:  Choice  of  instance  type  -­ Master

Less  than  50  nodes?

Heavy  networkI/O

M3.xlarge

YES

No

C3  family  or  R3  with  Enhanced  networking

YES

M3.2xlarge  or  larger

Page 21: Best Practices and Performance Tuning on Amazon Elastic

Performance:  Choice  of  instance  type  -­ Workers

Compute Memory Storage

Machine  Learning

C1  FamilyC3  FamilyCC1.4xlargeCC2.8xlarge

M2  FamilyR3  FamilyCr1.8xlarge

Interactive  Analysis

D2  FamilyI2  Family

Large  HDFS

General

Batch  Process

M3  FamilyM1  Family

Page 22: Best Practices and Performance Tuning on Amazon Elastic

How  many  nodes  do  I  need?

Page 23: Best Practices and Performance Tuning on Amazon Elastic

Performance:  Sizing

-­ How  many  tasks  will  you  have  to  execute?-­ Based  on  data  size,  number  of  files/splits-­ Based  on  the  job  you  are  evaluating

-­ When  do  you  need  to  get  done?-­ How  much  parallelism  do  you  need?-­ How  many  tasks  can  each  node  process  in  parallel?

Page 24: Best Practices and Performance Tuning on Amazon Elastic

Performance:  Sizing

-­ Which  other  jobs  are  you  running?

-­ How  much  data  do  you  need  to  store  locally?-­ Which  files  are  reused  3x  or  more?

Page 25: Best Practices and Performance Tuning on Amazon Elastic

Performance:  Sizing

Guidelines:

-­ Size  based  on  HDFS  storage  first  if  needed-­ Add  enough  (task)  nodes  to  handle  processing-­ Do  not  add  more  than  5  tasks  nodes  per  core  node-­ Prefer  smaller  clusters    of  larger  machines

Page 26: Best Practices and Performance Tuning on Amazon Elastic

Performance: Tune  it  – e.g.  for  Spark

Enable  Dynamic  Allocation  of  executors

Executor  MemoryExecutor  CoresYARN  containers

Driver  MemoryDeployment  mode  on  YARN  (Cluster  |  Client?)

Page 27: Best Practices and Performance Tuning on Amazon Elastic

EMRFS  (S3) HDFS

Performance:  Choice  of  storage

-­ Ability  to  decouple  -­ Reliable  and  durable-­ Cost  efficient-­ Works  well  for  jobs  that  read  a  

dataset  once  per  run.

-­ Need  a  persistent  cluster-­ Reliability  is  configurable.  But  need  

multiple  nodes  to  achieve  replication  factor.

-­ Great  for  jobs  with  iterative  reads  on  the  same  dataset  like  machine  learning  

Combine  with  S3DistCp and  move  data  from  S3  once  to  HDFS  for  iterative  workloads

- Instancestorage

- AmazonEBS

Page 28: Best Practices and Performance Tuning on Amazon Elastic

Performance:  S3  vs  HDFS  at  Netflix

http://techblog.netflix.com/2014/10/using-­presto-­in-­our-­big-­data-­platform.html

Page 29: Best Practices and Performance Tuning on Amazon Elastic

S3:  Range  GET  – Data  ‘locality’  doesn’t  matter?  

GET  Range  128-­192MB

GET  Range  0-­64MB

GET  Range  64-­128MB

GET  Range  (n-­64)-­nMBEMR  worker  nodes

S3  object  (use  larger  files)

Page 30: Best Practices and Performance Tuning on Amazon Elastic

S3:  Avoid  sequential  key  names

UTF-­8  Binary  ordering

var/ Server/

BS

VWAmo/

100/s 100/s 100/sS3  partition S3  partition S3  partition

Page 31: Best Practices and Performance Tuning on Amazon Elastic
Page 32: Best Practices and Performance Tuning on Amazon Elastic

S3:  Avoid  sequential  key  names

Page 33: Best Practices and Performance Tuning on Amazon Elastic

1.2TB/Day  logs30TB  /Day  data  250  Hadoop  Jobs 75Billion  transactions/Day  

5  Petabytes  of  Data

S3:  Real  world  heavy  EMRFS  users

10+PB  Data  Warehouseon  Amazon  S3>  1PB  read  each  day

Page 34: Best Practices and Performance Tuning on Amazon Elastic

Performance:  Compression

Always  compress  data  on  S3  to  reduce  bandwidth  &  cost.

Page 35: Best Practices and Performance Tuning on Amazon Elastic

Performance: Choice  of  Framework

Count  these  words

Count  =  1 These  =  1 Words  =  1

-­ Embarrassingly  parallel? -­ Can  be  optimized  with  a  DAG?

A

B

C

D

ECount  =  1These  =  1Words  =  1

Page 36: Best Practices and Performance Tuning on Amazon Elastic

Reliability

Externalize  Hive  Metastore

-­ Store  your  metadata  outside  the  cluster  on  RDS

-­ Multi-­AZ  RDS  cluster  will  give  you  HA

Data  and  Applications  on  S3

-­ Source  of  truth  for  data

-­ Store  config &  applications  on  S3,  too

-­ Consistent  View

Automate

-­ Bootstraps-­ Config options-­ Cloudformation

Page 37: Best Practices and Performance Tuning on Amazon Elastic

Reliability: Hive  Metastore on  RDS  MySQL

[

{

"Classification": "hive-site",

"Properties": {

"javax.jdo.option.ConnectionURL": "jdbc:mysql:\/\/emr-hive-metastore.cauttwbz9zri.us-east-1.rds.amazonaws.com:3306\/hive?createDatabaseIfNotExist=true",

"javax.jdo.option.ConnectionUserName": "admin",

"javax.jdo.option.ConnectionPassword": "Passw0rd!"

},

"configurations" : []

}

]

Config file  lives  on  s3://<bucketname>/hive-­meta-­config.json

Page 38: Best Practices and Performance Tuning on Amazon Elastic

Security

Encryption

-­ Server  side-­ Client  side-­ HDFS  Transparent

-­ RPC  with  SSL-­ File  system  with  LUKS

IAM  Roles

-­ Secure  Integration  with  AWS  services

VPC

-­ Private  Subnets-­ S3  endpoints-­ NAT

Page 39: Best Practices and Performance Tuning on Amazon Elastic

Security

Encryption

-­ Server  side-­ Client  side-­ HDFS  Transparent

-­ RPC  with  SSL-­ File  system  with  LUKS

S3  Server-­side  encryption  (SSE):-­ Using  S3-­managed  keys  – or  –-­ Using  KMS-­managed  keys  (SSE-­KMS)

-­ Configured  at  cluster  start  time

Page 40: Best Practices and Performance Tuning on Amazon Elastic

Security:  Server-­Side  Encryption  (SSE-­KMS)

Amazon  S3  Bucket

AWS  KMS

EMRFS  with  Client-­side  Encryption

0utput  writes  via  EMRFS  with  Client-­side  Encryption  enabled

Amazon  S3  Bucket

Page 41: Best Practices and Performance Tuning on Amazon Elastic

Security

Encryption

-­ Server  side-­ Client  side-­ HDFS  Transparent

-­ RPC  with  SSL-­ File  system  with  LUKS

S3  Server-­side  encryption  (SSE-­KMS):1. Define  Customer  Master  Key  in  KMS2. Create  key  policy  that  allows  EMR’s  

instance  profile  /  role  to  use  this  key3. Configure  cluster  with  CMK  KeyID or  Alias

Page 42: Best Practices and Performance Tuning on Amazon Elastic

Security: Sample  Key  Policy

{ "Sid": "Allow use of the key",

"Effect": "Allow",

"Principal": {

"AWS": [ "arn:aws:iam::XXXXXXXXXXXX:role/EMR_DefaultRole"]

},

"Action": [ "kms:Encrypt",

"kms:Decrypt",

"kms:ReEncrypt*",

"kms:GenerateDataKey*",

"kms:DescribeKey" ],

"Resource": "*”,

"Condition": {

"ForAnyValue:StringLike": {

"kms:EncryptionContext:aws:s3:arn": [

"arn:aws:s3:::bucket1/*",

"arn:aws:s3:::bucket2/*" ] } }

}

[…]

Page 43: Best Practices and Performance Tuning on Amazon Elastic

Security:  Enable  SSE-­KMS  with  selected  CMK

[ { "Classification":"emrfs-site",

"Properties": {

"fs.s3.enableServerSideEncryption": "true",

"fs.s3.serverSideEncryption.kms.keyId":"a4567b8-9900-12ab-1234-123a45678901"

} } ]

Page 44: Best Practices and Performance Tuning on Amazon Elastic

Security

Encryption

-­ Server  side-­ Client  side-­ HDFS  Transparent

-­ RPC  with  SSL-­ File  system  with  LUKS

Client-­side  encryption:-­ Using  custom  key  materials  provider-­ Configured  at  cluster  start

Page 45: Best Practices and Performance Tuning on Amazon Elastic

Security:  End  to  End  Encryption

Amazon  S3  Bucket

AWS  KMSAWS  S3  SDKAmazonS3EncryptionClient()

Encrypted  Object

EMRFS  with  Client-­side  Encryption

HDFS  transparent  encryption  with  Hadoop  KMS

spark.ssl.enabled hadoop.rpc.protecti onhadoop.ssl.enabl edmapreduce.shuffle.ssl.enabled

0utput  writes  via  EMRFS  with  Client-­side  Encryption  enabled

Amazon  S3  Bucket

LUKS  with  bootstrap  action  for  local  file  systems

Page 46: Best Practices and Performance Tuning on Amazon Elastic

Cost  Efficiency

Spot  instances:  Bid  for  unused  EC2s  at  up  to  90%  less  price

Transient  clusters:  Terminate  the  cluster  when  not  in  use

Reserved  instances:  For  persistent  clusters,  make  use  of  EC2  reserved  instances  to  save  up  to  50%

Matching  Supply  and  Demand• Is  the  cluster  big  enough?• Can  we  make  it  transient?• Monitor  the  usage  with  Ganglia  and  Amazon  CloudWatch alarms

Using  cost-­effective  resources• S3  instead  of  HDFS  for  larger  datasets?

• Taking  advantage  of  Spot  and  Reserved  instances?

Optimize  over  time• Monitor  and  watch  out  for  new  instance  types,  features  that  may  reduce  cost.

Page 47: Best Practices and Performance Tuning on Amazon Elastic

Agenda

Why  EMR?

Well  Architected  EMR  Design  for  Production

Challenge:  Data  is  EverywhereSize:  Growing  in  PBsStrategy:  Divide  &  ConquerTool:  Amazon  EMR

-­Automated-­Decoupled-­Elastic-­Integrated-­Current-­Cost-­efficient

-­Performance  tuning-­Reliability  measures-­Security  facts-­Cost  efficiency

Page 48: Best Practices and Performance Tuning on Amazon Elastic

Thank  you!