Field Notes from Expeditions in the Cloud (Matt Wood, Amazon Web Services)

Upload: spark-summit

Post on 13-Aug-2015

FIELD NOTES

DR. MATT WOOD

GM, PRODUCT STRATEGY

FROM EXPEDITIONS IN THE CLOUD

GOOD MORNING

MATTHEW@AMAZON.COM

@MZA

THANK YOU

TECHNOLOGY ADOPTION

“WHAT IS CLOUD COMPUTING?”

“WHAT IS BIG DATA?”

VOLUME VELOCITY

VARIABILITY

CONSTRAINTS

BOXED IN

CONSTRAINTS

PRODUCTIVITY

Quick provisioning

Resource fit

Designed for iteration

Canonical data stores

Streaming data

Task clusters

SPARK WITH AMAZON EMR

Production workloads on AWS

Security event streaming

Machine learning & ad targeting

Web analytics

Revenue forecasting

Ad targeting & recommendations

Personalization

Predictive marketing

App search

S3

MLlib on Spark

Clicks, videos, interactives

S3

Customer segments

Forecasting

Ad sales

Map user to ads

Recommendation Spark cluster

Clicks, videos, interactives

Kafka

Every click

Node.js app

Kinesis

Click stream

Elastic Beanstalk

Spark Streaming

EMR cluster

5-minute windows

100 MB

S3

JSON and CSV

EMR

Elasticsearch

Redshift
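
The five-minute windowing step in the click stream pipeline above can be illustrated in plain Python. This is only a sketch of the tumbling-window idea, not the Spark Streaming API and not the pipeline shown on the slide; the function names and the sample click events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 5 * 60  # five-minute windows, as on the slide


def window_start(ts, size=WINDOW_SECONDS):
    """Round a timezone-aware timestamp down to the start of its window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % size, tz=timezone.utc)


def group_into_windows(events):
    """Group (timestamp, payload) click events by tumbling window."""
    windows = defaultdict(list)
    for ts, payload in events:
        windows[window_start(ts)].append(payload)
    return dict(windows)


if __name__ == "__main__":
    # Hypothetical click events; a real stream would arrive from Kinesis.
    clicks = [
        (datetime(2015, 8, 13, 9, 1, 30, tzinfo=timezone.utc), {"page": "/home"}),
        (datetime(2015, 8, 13, 9, 4, 59, tzinfo=timezone.utc), {"page": "/video"}),
        (datetime(2015, 8, 13, 9, 5, 0, tzinfo=timezone.utc), {"page": "/home"}),
    ]
    for start, batch in sorted(group_into_windows(clicks).items()):
        print(start.isoformat(), len(batch), "events")
```

Each window's batch would then be serialized (the slide mentions JSON and CSV) and written out as one object, which is roughly what a per-window output step in a streaming job does.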

S3

Ad impressions & clicks

24/7 Spark cluster

Impression RDD

Interactive dashboard

Revenue forecast

Click stream logs

Batch Spark clusters

Redshift

Data exploration and testing

PRODUCTIVITY

Quick provisioning

Designed for iteration

Resource fit

SPARK ON EMR

New

Provision and scale managed Spark clusters, as first class citizens

RAPID PROVISIONING OF ELASTIC CLUSTERS

Provision new clusters in minutes

High memory, high CPU, high IO instances

Access cluster instances directly

Clusters run within a VPC

Add or remove capacity on running clusters
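
As a rough illustration of the provisioning points above, the sketch below builds a request in the shape accepted by the EMR RunJobFlow API (for example via boto3's `run_job_flow`). It does not call AWS. The cluster name, subnet ID, instance types, release label, and role names are placeholder assumptions, not values from the talk.

```python
def spark_cluster_request(name, subnet_id, core_nodes,
                          instance_type="r3.2xlarge"):
    """Build a RunJobFlow-style request for a Spark cluster inside a VPC.

    All concrete values are illustrative assumptions.
    """
    instance_groups = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",
            "InstanceType": "m3.xlarge",  # assumed master size
            "InstanceCount": 1,
            "Market": "ON_DEMAND",
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,  # e.g. a high-memory type
            "InstanceCount": core_nodes,
            "Market": "ON_DEMAND",
        },
    ]
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.0.0",  # assumed release label
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": instance_groups,
            "Ec2SubnetId": subnet_id,  # places the cluster in a VPC subnet
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = spark_cluster_request("spark-demo", "subnet-00000000", core_nodes=4)
    for group in request["Instances"]["InstanceGroups"]:
        print(group["InstanceRole"], group["InstanceType"], group["InstanceCount"])
```

With credentials configured, a dictionary like this could be passed as keyword arguments to an EMR client; changing `core_nodes` or `instance_type` is how the same template would be fitted to a memory-, CPU-, or IO-heavy workload.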

DIRECT ACCESS TO DATA ON S3

Access objects directly on Amazon S3

Server-side and client-side encryption with customer controlled keys

Multiple clusters can access canonical data in S3

No need to copy or manage the data on the cluster

Mix S3 and HDFS on a cluster

INTEGRATION WITH THE SPOT MARKET

Bid on underutilized capacity on EC2

“Name your price” clusters

Very low cost at high scale

Lowest cost for time-insensitive workloads

Also on-demand and reserved capacity pricing
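
The cost argument for Spot capacity comes down to simple arithmetic, sketched below. The node count, duration, and hourly prices are made-up numbers chosen for illustration; they are not AWS prices and do not come from the talk.

```python
def cluster_cost(nodes, hours, price_per_node_hour):
    """Total cost of running `nodes` instances for `hours` at a flat hourly price."""
    return nodes * hours * price_per_node_hour


def spot_savings(nodes, hours, on_demand_price, spot_price):
    """Return (absolute saving, fractional saving) of Spot versus on-demand."""
    on_demand = cluster_cost(nodes, hours, on_demand_price)
    spot = cluster_cost(nodes, hours, spot_price)
    return on_demand - spot, 1 - spot / on_demand


if __name__ == "__main__":
    # Hypothetical: 100 task nodes for 10 hours, 0.40 vs 0.10 per node-hour.
    saved, fraction = spot_savings(100, 10, on_demand_price=0.40, spot_price=0.10)
    print(f"saved {saved:.2f} ({fraction:.0%})")
```

The sketch assumes a constant Spot price and no interruptions; in practice Spot capacity can be reclaimed, which is why the slide pairs it with time-insensitive workloads.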

SPARK ON EMR

New

No additional cost. Available today.

aws.amazon.com/emr/spark

THANK YOU

MATTHEW@AMAZON.COM

@MZA