Making Fly Parviz Deyhim http://bit.ly/sparkemr

Upload: koby-lessley

Post on 14-Dec-2015


TRANSCRIPT

Page 1: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Making Fly

Parviz Deyhim

http://bit.ly/sparkemr

Page 2: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Obsession

Scalable, Highly-available and Secure

Page 4: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Amazon Elastic MapReduce

Page 5: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

What is EMR?

Map-Reduce engine

Integrated with tools

Hadoop-as-a-service

Massively parallel

Cost effective AWS wrapper

Integrated to AWS services

Page 6: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

HDFS

Amazon EMR

Integration

Page 7: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

HDFS

Integration

Amazon EMR

Amazon S3, Amazon DynamoDB

Page 8: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

HDFS

Integration, Analytics languages, Data management

Amazon EMR

Amazon S3, Amazon DynamoDB

Page 9: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

HDFS

Integration, Analytics languages, Data management

Amazon EMR, Amazon RDS

Amazon S3, Amazon DynamoDB

Page 10: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

HDFS

Integration, Analytics languages, Data management

Amazon Redshift

Amazon EMR, Amazon RDS

Amazon S3, Amazon DynamoDB

Data Pipeline

Page 12: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Amazon EMR Concepts

• Master Node

• Core Nodes

• Task Nodes

Page 13: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Core Nodes

Master Node

Master instance group

Amazon EMR cluster

Core instance group

HDFS HDFS

DataNode (HDFS)

Page 14: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Core Nodes

Master Node

Master instance group

Amazon EMR cluster

Core instance group

HDFS HDFS

Can Add Core Nodes:

More CPU

More Memory

More HDFS Space

HDFS

Page 15: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Core Nodes

Master Node

Master instance group

Amazon EMR cluster

Core instance group

HDFS HDFS

Can’t remove core nodes:

HDFS corruption

HDFS

Page 16: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Task Nodes

Master Node

Master instance group

Amazon EMR cluster

Core instance group

HDFS HDFS

No HDFS

Provides compute resources:

CPU

Memory

Page 17: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Task Nodes

Master Node

Master instance group

Amazon EMR cluster

Core instance group

HDFS HDFS

Can add and remove task nodes

Page 18: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark On Amazon EMR

Page 19: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Bootstrap Actions

• Ability to run or install additional packages/software on EMR nodes

• Simple bash script stored on S3

• Script gets executed during node/instance boot time

• Script gets executed on every node that gets added to the cluster
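As a rough illustration of the bullets above, the sketch below builds the bootstrap-action entry that an EMR cluster request would carry. The field names follow the `BootstrapActions` structure accepted by boto3's `run_job_flow`; the action name, bucket, and script path are hypothetical placeholders, not values from the talk.

```python
def bootstrap_action(name, script_s3_path, args=None):
    """Build one bootstrap-action entry for an EMR cluster request.

    EMR fetches the script from S3 and runs it on each node at boot,
    including nodes that join the cluster later.
    """
    if not script_s3_path.startswith("s3://"):
        raise ValueError("bootstrap scripts must be stored on S3")
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,
            "Args": list(args or []),
        },
    }


# Hypothetical script location; replace with your own bucket and key.
install_spark = bootstrap_action(
    "Install Spark",
    "s3://example-bucket/bootstrap/install-spark.sh",
)
```

The resulting dict would go into the `BootstrapActions` list of the cluster request; the bash script itself lives in S3 and is not shown here.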

Page 20: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark on Amazon EMR

• Bootstrap action installs Spark on EMR nodes

• Currently on Spark 0.7.3; upgrading to 0.8 very soon

http://bit.ly/sparkemr

Page 21: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Why Spark on Amazon EMR?

• Deploy small and large Spark clusters in minutes

• EMR handles node recovery in case of failures

• Integration with the EC2 Spot Market, Amazon Redshift, AWS Data Pipeline, Amazon CloudWatch, etc.

Page 22: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Why Spark on Amazon EMR?

• Shipping Spark logs to S3 for debugging

– Define S3 bucket at cluster deploy time

Page 23: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark on EC2 Spot Market

• Bid on unused EC2 capacity

• Spark is memory hungry

• Bid on large-memory instances at a fraction of the on-demand cost

Page 24: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark on EC2 Spot Market

1TB Memory Cluster

Instance Type   # of nodes   On-demand Cost   Spot Cost   Cost/GB of Memory Using Spot
M1.xlarge       63           $31/h            $4.41/h     0.44c/GB/h
CC2.8xlarge     16           $39/h            $4.64/h     0.46c/GB/h
M2.4xlarge      15           $24/h            $2.25/h     0.22c/GB/h
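The per-GB figures in the table above can be reproduced with a little arithmetic. The sketch below assumes the "1TB" cluster is treated as 1000 GB of memory, which is an assumption that happens to match the slide's numbers; the function names are illustrative.

```python
def spot_cents_per_gb_hour(spot_dollars_per_hour, cluster_memory_gb=1000):
    """Convert a cluster's hourly spot price into cents per GB of memory per hour."""
    return spot_dollars_per_hour * 100 / cluster_memory_gb


def spot_discount(on_demand_dollars_per_hour, spot_dollars_per_hour):
    """Fraction of the on-demand price saved by running on spot."""
    return 1 - spot_dollars_per_hour / on_demand_dollars_per_hour


# Hourly cluster prices (on-demand, spot) copied from the table.
clusters = {
    "M1.xlarge": (31.0, 4.41),
    "CC2.8xlarge": (39.0, 4.64),
    "M2.4xlarge": (24.0, 2.25),
}

for name, (on_demand, spot) in clusters.items():
    print(
        f"{name}: {spot_cents_per_gb_hour(spot):.3f} c/GB/h, "
        f"{spot_discount(on_demand, spot):.0%} below on-demand"
    )
```

Spot prices fluctuate, so these ratios describe the snapshot in the table rather than a guaranteed saving.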

Page 25: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark on EC2 Spot Market

Master Node

Master instance group

Amazon EMR cluster

HDFS HDFS

32GB Memory

• Launch initial Spark cluster with core nodes

• HDFS to store and checkpoint RDDs

Page 26: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark on EC2 Spot Market

Master Node

Master instance group

Amazon EMR cluster

HDFS HDFS

32GB Memory 256GB Memory

• Add Task nodes in spot market to increase memory capacity
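A minimal sketch of what such a spot task group could look like as a request payload. The keys follow the instance-group structure that boto3's `add_instance_groups` accepts; the group name, instance type, count, and bid price are hypothetical values chosen for illustration only.

```python
def spot_task_group(instance_type, instance_count, bid_price_dollars):
    """Describe a task instance group that bids on the EC2 spot market."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "Name": "spark-task-spot",  # hypothetical group name
        "InstanceRole": "TASK",     # task nodes add CPU and memory but no HDFS
        "Market": "SPOT",
        "BidPrice": f"{bid_price_dollars:.2f}",  # the bid is passed as a string
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
    }


# Illustrative values only: four large-memory nodes at a $0.25/h bid.
group = spot_task_group("m2.4xlarge", 4, 0.25)
```

Because task nodes hold no HDFS blocks, losing them to a spot price spike costs compute capacity but not stored data.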

Page 27: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark on EC2 Spot Market

Master Node

Master instance group

Amazon EMR cluster

HDFS HDFS

32GB Memory 256GB Memory

Amazon S3

• Create RDDs from HDFS or Amazon S3 with:

sc.textFile OR

sc.sequenceFile

• Run Computation on RDDs

Page 28: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark on EC2 Spot Market

Master Node

Master instance group

Amazon EMR cluster

HDFS HDFS

32GB Memory 256GB Memory

• Save the resulting RDDs to HDFS or S3 with:

rdd.saveAsSequenceFile OR

rdd.saveAsObjectFile

Amazon S3


Page 29: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark on EC2 Spot Market

Master Node

Master instance group

Amazon EMR cluster

HDFS HDFS

32GB Memory

• Shut down task nodes when your job is done

Page 30: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Elastic Spark With Amazon EMR

Page 31: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Autoscaling Spark

Master Node

Amazon EMR cluster

HDFS HDFS

32GB Memory

Page 32: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Autoscaling Spark

Master Node

Amazon EMR cluster

HDFS HDFS

32GB Memory 256GB Memory

Page 33: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Elastic Spark

• When to Scale?

– Depends on your job

• CPU-bound or memory-intensive?

– Probably both for Spark jobs

• Use CPU/memory utilization metrics to decide when to scale

Page 34: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Amazon EMR CloudWatch Metrics

• EMR integrates with CloudWatch

• Provides many metrics

• Examples:

– Load metrics

– HDFS metrics

– S3 metrics

Page 35: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Amazon EMR CloudWatch Metrics

Page 36: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Basics on CloudWatch Metrics

• Pick any CloudWatch metric

• Pick a threshold whose breach you want to be notified about

• Set up CloudWatch alarms based on your thresholds

• Receive notifications in the form of:

– Email

– SNS

– HTTP API call
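To make the threshold idea concrete, here is a simplified model of how an alarm decides its state. It is a sketch, not CloudWatch's actual implementation: it assumes a "greater than threshold" comparison and requires the most recent N datapoints to all breach; the function and parameter names are illustrative.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return the alarm state for a series of metric datapoints.

    "ALARM" if the most recent `evaluation_periods` datapoints all exceed
    `threshold`, "INSUFFICIENT_DATA" if there are too few datapoints,
    otherwise "OK".
    """
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"
```

Requiring several consecutive breaches keeps a single noisy datapoint from triggering a notification.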

Page 37: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Basics on CloudWatch Metrics

Monitor With CloudWatch

Receive Email Notification

Take Manual Action Such As Adding More Task Nodes

Page 38: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Basics on CloudWatch Metrics

Monitor With CloudWatch

HTTP API Calls

Take Automated Actions

Page 39: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark Autoscaling Based on Load

• Set up a CloudWatch alarm on the EMR “TotalLoad” metric

• Receive an email/SNS/HTTP notification

• Add more worker nodes by adding EMR task nodes
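The automated variant of these steps could be handled by a small function on the receiving end of the notification. The sketch below assumes the notification body is the JSON document CloudWatch publishes to SNS, which carries a `NewStateValue` field; the alarm name, step size, and node cap are hypothetical.

```python
import json


def task_nodes_after_alarm(sns_message, current_count, step=2, max_count=20):
    """Return the task-node count to request after an alarm notification.

    Only a transition into the ALARM state grows the cluster; any other
    state leaves the count unchanged. Growth is capped at `max_count`.
    """
    alarm = json.loads(sns_message)
    if alarm.get("NewStateValue") != "ALARM":
        return current_count
    return min(current_count + step, max_count)


# Trimmed notification body; the alarm name is a made-up example.
message = json.dumps(
    {"AlarmName": "emr-total-load-high", "NewStateValue": "ALARM"}
)
new_count = task_nodes_after_alarm(message, current_count=4)
```

Applying the new count would mean resizing the task instance group through the EMR API, which is left out here. Only task groups should be shrunk again afterwards, since removing core nodes risks HDFS corruption.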

Page 40: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark Autoscaling Based on Memory

• Spark needs memory

– Lots of it!

• How to scale based on the memory usage?

Page 41: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark Metrics

• Spark 0.8 provides cluster metrics

• Source and Sink topology

Page 42: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark Metrics

• Spark Metric Sources (metrics.properties):

worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource

driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource

executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource

Page 43: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark Metrics

• Spark Metric Sinks (metrics.properties):

• Package: org.apache.spark.metrics.sink

ConsoleSink

JmxSink

CsvSink

GangliaSink
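Sinks are attached in `metrics.properties` with keys of the form `<instance>.sink.<name>.<option>`. The helper below is an illustrative sketch that renders those lines; the helper name is hypothetical, and the `period` and `unit` options shown are the ones used for the console sink in Spark's configuration template.

```python
def sink_properties(instance, sink_name, sink_class, **options):
    """Render metrics.properties lines that attach one sink to one instance.

    `instance` is a Spark metrics instance such as "worker", "driver",
    "executor", or "*" for all of them.
    """
    prefix = f"{instance}.sink.{sink_name}"
    lines = [f"{prefix}.class={sink_class}"]
    lines += [f"{prefix}.{key}={value}" for key, value in sorted(options.items())]
    return "\n".join(lines)


# Report every instance's metrics to the console every 10 seconds.
console = sink_properties(
    "*",
    "console",
    "org.apache.spark.metrics.sink.ConsoleSink",
    period=10,
    unit="seconds",
)
print(console)
```

A custom sink would be wired in the same way by pointing the `class` key at its fully qualified class name.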

Page 44: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark Metrics

• Spark Metric Sinks (metrics.properties):

CloudwatchSink

Page 45: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark Metrics & CloudWatch

Page 46: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark Metrics & CloudWatch

• Monitor Spark metrics with CloudWatch

• Set up CloudWatch alarms and get notified when any metric reaches your threshold

• Example: if JvmHeapUsed > 20G

• Receive the notification and take manual or automated action

Page 47: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark Streaming and Amazon Kinesis

Page 48: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Amazon Kinesis

Kinesis

Page 49: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Amazon Kinesis

• CreateStream

– Creates a new Data Stream within the Kinesis Service

• PutRecord

– Adds new records to a Kinesis Stream

• DescribeStream

– Provides metadata about the Stream, including name, status, Shards, etc.

• GetNextRecord

– Fetches the next record for processing by user business logic

• MergeShard / SplitShard

– Scales the Stream up/down

• DeleteStream

– Deletes the Stream

Page 50: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Amazon Kinesis

Kinesis

Page 51: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Spark Streaming and Amazon Kinesis

• Spark Streaming Kinesis Receiver

• Extends NetworkReceiver

• Creates a single Receiver per shard and reads from Kinesis

Page 52: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

Misc.

• New AWS instances provide enhanced networking in VPC

– C3

– I2

• Higher PPS (packets per second)

• Less jitter

• Great CPU power

• Suitable for Spark: serialization and shuffle

Page 53: Making Fly Parviz Deyhim . Obsession Scalable, Highly-available and Secure

What Would You Like To See On Spark From Amazon?

Send Feedback To:

Parviz [email protected]

@pdeyhim