AWS Gov Cloud Summit II

Analyzing Big Data with AWS

Peter Sirota, General Manager, Amazon Elastic MapReduce @petersirota

What is Big Data?

• Computer generated data

– Application server logs (web sites, games)

– Sensor data (weather, water, smart grids)

– Images/videos (traffic, security cameras)

• Human generated data

– Twitter “Firehose” (50 million tweets/day, 1,400% growth per year)

– Blogs/Reviews/Emails/Pictures

• Social graphs – Facebook, LinkedIn, contacts

Big Data is full of valuable, unanswered questions!

Why is Big Data Hard (and Getting Harder)?

• Data Volume

– Unconstrained growth

– Current systems don’t scale

• Data Structure

– Need to consolidate data from multiple data sources in multiple formats across multiple businesses

• Changing Data Requirements

– Faster response times on fresher data

– Sampling is not good enough

– Increasing complexity of analytics

– Users demand inexpensive experimentation

We need tools built specifically for Big Data!

Innovation #1:

Apache Hadoop

The MapReduce computational paradigm

Open source, scalable, fault‐tolerant, distributed system

Hadoop lowers the cost of developing a distributed system for data processing
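To make the MapReduce paradigm concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are illustrative assumptions. A Hadoop job follows the same map, shuffle, and reduce phases, spread across many machines.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (key, value) pair for each word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle, and reduce in sequence over the input records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input records, standing in for application server log lines
sample_records = ["GET /home", "GET /search", "POST /search"]
word_counts = run_job(sample_records)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the work can be split across nodes without coordination, which is what lets the approach scale out.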

Innovation #2:

Amazon Elastic Compute Cloud (EC2)

“provides resizable compute capacity in the cloud.”

Amazon EC2 lowers the cost of operating a distributed system for data processing

Amazon Elastic MapReduce =

Amazon EC2 + Hadoop

Elastic MapReduce applications

• Targeted advertising / Clickstream analysis

• Security: anti-virus, fraud detection, image recognition

• Pattern matching / Recommendations

• Data warehousing / BI

• Bio-informatics (Genome analysis)

• Financial simulation (Monte Carlo simulation)

• File processing (resize jpegs, video encoding)

• Web indexing

Clickstream Analysis – Razorfish

• Big Box Retailer came to Razorfish

3.5 billion records

71 million unique cookies

1.7 million targeted ads required per day

Problem: Improve Return on Ad Spend (ROAS)

Targeted Ad (1.7 million per day)

User recently purchased a sports movie and is searching for video games

• Lots of experimentation, but the final design was a 100-node on-demand Elastic MapReduce cluster running Hadoop

Processing time dropped from 2+ days to 8 hours (with lots more data)

Increased Return On Ad Spend by 500%

• Etsy: world’s largest handmade marketplace

– 8.9 million items

– 1 billion page views per month

– $320 million in gross merchandise sales (GMS) in 2010

• Easy to ‘backfill’ and run experiments: just boot up a cluster with 100, 500, or 1000 nodes

[Diagram: Production DB snapshots, Web event logs, ETL – Step 1, ETL – Step 2, Jobs]

Recommendations

The Taste Test

http://www.etsy.com/tastetest

Recommendations

etsy.com/gifts

Gift Ideas for Facebook Friends

Yelp’s Business Generates a Lot of Data

400 GB of logs per day

~12 terabytes per month

They Frequently Analyze this Data to Power Key Features of their Site

Autocomplete Search

Recommendations

Automatic spelling corrections

Let’s take a look at how this works

1) Load log file data for six months of user search history into Amazon S3

Search ID    Search Text    Final Selection
12423451     westen         Westin
14235235     wisten         Westin
54332232     westenn        Westin

2) Spin up a 200-node cluster of virtual servers in the cloud

[Diagram: log files in Amazon S3; Hadoop cluster on Amazon EMR]

3) 200 nodes simultaneously analyze this data, looking for common misspellings

… this takes a few hours

4) New common misspellings and suggestions are loaded back into S3

5) When the job is done, the cluster is shut down. Yelp only pays for the time they used.
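The analysis in step 3 can be pictured as a count over (search text, final selection) pairs. The sketch below is an assumption about the general shape of such a job, not Yelp's actual code: it runs in a single process, uses the three sample rows from step 1, and the function names and the `min_count` threshold are hypothetical.

```python
from collections import Counter, defaultdict

# (search_id, search_text, final_selection) rows, as in the step 1 sample
search_log = [
    (12423451, "westen", "Westin"),
    (14235235, "wisten", "Westin"),
    (54332232, "westenn", "Westin"),
]

def map_search(row):
    """Map: emit ((typed text, selection), 1) when the two differ."""
    _search_id, text, selection = row
    if text.lower() != selection.lower():
        yield (text.lower(), selection), 1

def reduce_counts(pairs):
    """Reduce: sum the counts for each (typed text, selection) pair."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals

def best_suggestions(totals, min_count=1):
    """For each misspelling, keep the selection users chose most often."""
    by_text = defaultdict(Counter)
    for (text, selection), count in totals.items():
        by_text[text][selection] += count
    result = {}
    for text, counter in by_text.items():
        selection, count = counter.most_common(1)[0]
        if count >= min_count:  # hypothetical noise threshold
            result[text] = selection
    return result

pairs = (pair for row in search_log for pair in map_search(row))
suggestions = best_suggestions(reduce_counts(pairs))
```

On six months of logs, a higher `min_count` would filter out one-off typos so that only common misspellings are written back to S3.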

Each of their 80 developers can do this whenever they have a big data problem to analyze

250 clusters spun up and down every week

Data size

• Global reach

• Native app for almost every smartphone, SMS, web, mobile-web

• 10M+ users, 15M+ venues, ~1B check-ins

• Terabytes of log data

Stack

Application Stack

• Scala/Liftweb: API Machines, WWW Machines, Batch Jobs

• Scala: Application code

• Mongo/Postgres/Flat Files: Databases, Logs

Data Stack

• Amazon S3: Database Dumps, Log Files

• Hadoop: Elastic Map Reduce

• Hive/Ruby/Mahout: Analytics Dashboard, Map Reduce Jobs

Between the stacks: mongoexport, postgres dump, Flume

Computing venue-to-venue similarity

• Spin up a 40-node cluster

• Submit Ruby streaming job

– Invert User x Venue matrix

– Grab Co-occurrences

– Compute similarity

• Spin down cluster

• Load data to app server
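The three sub-steps of the streaming job can be sketched in a few lines. This is an illustration under stated assumptions: it is written in Python rather than Ruby, it runs in one process instead of as a streaming job, the check-in data and venue names are made up, and cosine similarity over binary visit vectors is an assumed measure, since the slide does not say which one is used.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical (user, venue) check-in pairs
checkins = [
    ("u1", "venue_a"), ("u1", "venue_b"),
    ("u2", "venue_a"), ("u2", "venue_b"),
    ("u3", "venue_a"), ("u3", "venue_c"),
]

def invert(pairs):
    """Build both views of the user x venue matrix as sets."""
    venues_by_user = defaultdict(set)
    users_by_venue = defaultdict(set)
    for user, venue in pairs:
        venues_by_user[user].add(venue)
        users_by_venue[venue].add(user)
    return venues_by_user, users_by_venue

def cooccurrences(venues_by_user):
    """Count how many users checked in at both venues of each pair."""
    counts = defaultdict(int)
    for venues in venues_by_user.values():
        for a, b in combinations(sorted(venues), 2):
            counts[(a, b)] += 1
    return counts

def similarity(counts, users_by_venue):
    """Cosine similarity over binary visit vectors (an assumed measure)."""
    return {
        (a, b): n / sqrt(len(users_by_venue[a]) * len(users_by_venue[b]))
        for (a, b), n in counts.items()
    }

venues_by_user, users_by_venue = invert(checkins)
venue_similarity = similarity(cooccurrences(venues_by_user), users_by_venue)
```

Venue pairs with no shared visitors never appear in the co-occurrence counts, so the result stays sparse even when the number of venues is large.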

Who is checking in?

[Charts: distribution by gender (Female, Male) and by age (0 to 80)]

What are people doing?

Where are our users?

When do people go to a place?

[Chart: check-ins from Thursday through Sunday at Gorilla Coffee, Gray's Papaya, and Amorino]

Why are people checking in?

• Explore their city, discover new places

• Find friends, meet up

• Save with local deals

• Get insider tips on venues

• Personal analytics, diary

• Follow brands and celebrities

• Earn points, badges, gamification of life

• The list grows…

Thousands of customers use EMR

RDBMS vs. MapReduce/Hadoop

• RDBMS

– Predefined schema

– Strategic data placement for query tuning

– Exploits indexes for fast retrieval

– SQL only

– Doesn’t scale linearly

• MapReduce/Hadoop

– No schema is required

– Random data placement

– Fast scan of the entire dataset

– Uniform query performance

– Scales linearly for reads and writes

– Supports many languages, including SQL

Complementary technologies

AWS Data Warehousing Architecture

Elastic Data Warehouse

• Customize cluster size to support varying resource needs (e.g. query support during the day versus batch processing overnight)

• Reduce costs by increasing server utilization

• Improve performance during high usage periods

[Diagram: Data Warehouse (Steady State), expand to 25 instances, Data Warehouse (Batch Processing), shrink to 9 instances, Data Warehouse (Steady State)]
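The utilization argument can be checked with simple arithmetic: compare the instance-hours of a cluster fixed at peak size with one that resizes. The 9 and 25 instance counts come from the slide. The 8-hour batch window is a hypothetical assumption, as are the names used in this sketch.

```python
STEADY_INSTANCES = 9      # from the slide: "Shrink to 9 instances"
BATCH_INSTANCES = 25      # from the slide: "Expand to 25 instances"
BATCH_HOURS_PER_DAY = 8   # hypothetical overnight batch window

def daily_instance_hours(schedule):
    """Sum instance-hours over (instance_count, hours) segments of one day."""
    return sum(instances * hours for instances, hours in schedule)

# A cluster sized for the batch peak all day long
fixed = daily_instance_hours([(BATCH_INSTANCES, 24)])

# A cluster that expands for the batch window and shrinks afterwards
elastic = daily_instance_hours([
    (STEADY_INSTANCES, 24 - BATCH_HOURS_PER_DAY),
    (BATCH_INSTANCES, BATCH_HOURS_PER_DAY),
])

reduction = 1 - elastic / fixed
```

Under these assumptions the elastic cluster uses 344 instance-hours per day against 600 for the fixed one, a reduction of roughly 43%. The actual figure depends on how long the batch window really is.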

Reducing Costs with Spot Instances

Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption

Scenario #1: Cost without Spot

Job flow duration: 14 hours

4 instances * 14 hrs * $0.50 = $28

Scenario #2: Cost with Spot

Job flow duration: 7 hours

4 instances * 7 hrs * $0.50 = $14

+ 5 instances * 7 hrs * $0.25 = $8.75

Total = $22.75

Time Savings: 50% Cost Savings: ~19%

Other EMR + Spot Use Cases

• Run entire cluster on Spot for biggest cost savings

• Reduce the cost of application testing
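The two scenarios can be recomputed from the stated instance counts, durations, and hourly rates. The helper name `job_cost` and the constant names are assumptions made for this sketch.

```python
def job_cost(groups, hours):
    """Total cost for instance groups given as (instance_count, hourly_rate)."""
    return sum(count * hours * rate for count, rate in groups)

ON_DEMAND_RATE = 0.50  # dollars per instance-hour, from the slide
SPOT_RATE = 0.25       # dollars per instance-hour, from the slide

# Scenario 1: four On-Demand instances for 14 hours
without_spot = job_cost([(4, ON_DEMAND_RATE)], hours=14)

# Scenario 2: the same four On-Demand instances plus five Spot instances,
# finishing in 7 hours
with_spot = job_cost([(4, ON_DEMAND_RATE), (5, SPOT_RATE)], hours=7)

cost_savings = 1 - with_spot / without_spot
time_savings = 1 - 7 / 14
```

With these inputs, 4 * 7 * $0.50 comes to $14, so the Spot scenario totals $22.75 against $28, a cost saving of about 19% alongside the 50% time saving.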

Big Data Ecosystem And Tools

We have a rapidly growing ecosystem

• Business Intelligence – MicroStrategy, Pentaho

• Analytics – Datameer, Karmasphere, Quest

• Open source – Ganglia, Squirrel SQL

Thank You!!

http://aws.amazon.com/elasticmapreduce/