Analyzing Big Data with AWS

Post on 12-Sep-2021

Page 1: Analyzing Big Data with AWS

AWS Gov Cloud Summit II

Analyzing Big Data with AWS

Peter Sirota, General Manager, Amazon Elastic MapReduce @petersirota

Page 2: Analyzing Big Data with AWS

What is Big Data?

Page 3: Analyzing Big Data with AWS

• Computer generated data

– Application server logs (web sites, games)

– Sensor data (weather, water, smart grids)

– Images/videos (traffic, security cameras)

Page 4: Analyzing Big Data with AWS

• Human generated data

– Twitter “Firehose” (50 million tweets/day, 1,400% growth per year)

– Blogs/Reviews/Emails/Pictures

• Social graphs – Facebook, LinkedIn, contacts

Page 5: Analyzing Big Data with AWS

Big Data is full of valuable, unanswered questions!

Page 6: Analyzing Big Data with AWS

Why is Big Data Hard (and Getting Harder)?

Page 7: Analyzing Big Data with AWS

Why is Big Data Hard (and Getting Harder)?

• Data Volume

– Unconstrained growth

– Current systems don’t scale

Page 8: Analyzing Big Data with AWS

Why is Big Data Hard (and Getting Harder)?

• Data Structure

– Need to consolidate data from multiple data sources in multiple formats across multiple businesses

Page 9: Analyzing Big Data with AWS

Why is Big Data Hard (and Getting Harder)?

• Changing Data Requirements

– Faster response times on fresher data

– Sampling is not good enough

– Increasing complexity of analytics

– Users demand inexpensive experimentation

Page 10: Analyzing Big Data with AWS

We need tools built specifically for Big Data!

Page 11: Analyzing Big Data with AWS

Innovation #1: Apache Hadoop

The MapReduce computational paradigm

Open source, scalable, fault‐tolerant, distributed system

Hadoop lowers the cost of developing a distributed system for data processing
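
As a rough illustration of the MapReduce paradigm named on this slide, here is a minimal single-process Python sketch. It is not Hadoop itself: the word-count task, the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are all assumptions chosen for illustration. Hadoop runs the same three phases, but distributed across many machines with fault tolerance.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit (key, value) pairs for one input record; here (word, 1)."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key; here a simple sum."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in one process (illustrative only)."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input
counts = run_job(["big data is big", "data is valuable"])
print(counts)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread over as many nodes as are available, which is what makes the model a good fit for resizable clusters.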

Page 12: Analyzing Big Data with AWS

Innovation #2: Amazon Elastic Compute Cloud (EC2)

“provides resizable compute capacity in the cloud.”

Amazon EC2 lowers the cost of operating a distributed system for data processing

Page 13: Analyzing Big Data with AWS

Amazon Elastic MapReduce = Amazon EC2 + Hadoop

Page 14: Analyzing Big Data with AWS

Elastic MapReduce applications

• Targeted advertising / Clickstream analysis

• Security: anti-virus, fraud detection, image recognition

• Pattern matching / Recommendations

• Data warehousing / BI

• Bio-informatics (Genome analysis)

• Financial simulation (Monte Carlo simulation)

• File processing (resize jpegs, video encoding)

• Web indexing

Page 15: Analyzing Big Data with AWS

Clickstream Analysis – Razorfish

• A big-box retailer came to Razorfish

– 3.5 billion records

– 71 million unique cookies

– 1.7 million targeted ads required per day

Problem: Improve Return on Ad Spend (ROAS)

Page 16: Analyzing Big Data with AWS

Clickstream Analysis – Razorfish

Targeted Ad (1.7 million per day)

User recently purchased a sports movie and is searching for video games

Page 17: Analyzing Big Data with AWS

Clickstream Analysis – Razorfish

• Lots of experimentation, but the final design was a 100-node on-demand Elastic MapReduce cluster running Hadoop

Page 18: Analyzing Big Data with AWS

Clickstream Analysis – Razorfish

Processing time dropped from 2+ days to 8 hours (with lots more data)

Page 19: Analyzing Big Data with AWS

Clickstream Analysis – Razorfish

Increased Return on Ad Spend by 500%

Page 20: Analyzing Big Data with AWS

Etsy

• World’s largest handmade marketplace

– 8.9 million items

– 1 billion page views per month

– $320 million in gross merchandise sales (GMS) in 2010

Page 21: Analyzing Big Data with AWS

• Easy to ‘backfill’ and run experiments: just boot up a cluster with 100, 500, or 1,000 nodes

(Diagram: production DB snapshots and web event logs feed ETL step 1 and ETL step 2, followed by three jobs)

Page 22: Analyzing Big Data with AWS

Recommendations

The Taste Test: http://www.etsy.com/tastetest

Page 23: Analyzing Big Data with AWS

Recommendations

etsy.com/gifts

Gift Ideas for Facebook Friends

Page 24: Analyzing Big Data with AWS

Page 25: Analyzing Big Data with AWS

Yelp’s Business Generates a Lot of Data

400 GB of logs per day (~12 terabytes per month)

Page 26: Analyzing Big Data with AWS

They Frequently Analyze this Data to Power Key Features of their Site

Page 27: Analyzing Big Data with AWS

Autocomplete Search

Page 28: Analyzing Big Data with AWS

Recommendations

Page 29: Analyzing Big Data with AWS

Automatic spelling corrections

Page 30: Analyzing Big Data with AWS

Automatic spelling corrections

Let’s take a look at how this works

Page 31: Analyzing Big Data with AWS

1) Load log file data for six months of user search history into Amazon S3

Search ID   Search Text   Final Selection
12423451    westen        Westin
14235235    wisten        Westin
54332232    westenn       Westin

Page 32: Analyzing Big Data with AWS

2) Spin up a 200-node cluster of virtual servers in the cloud

(Diagram: log files in Amazon S3; Hadoop cluster in Amazon EMR)

Page 33: Analyzing Big Data with AWS

3) 200 nodes simultaneously analyze this data looking for common misspellings … this takes a few hours

(Diagram: Amazon S3; Hadoop cluster in Amazon EMR)

Page 34: Analyzing Big Data with AWS

4) New common misspellings and suggestions are loaded back into S3

(Diagram: log files in Amazon S3; Hadoop cluster in Amazon EMR)

Page 35: Analyzing Big Data with AWS

5) When the job is done, the cluster is shut down. Yelp only pays for the time they used.

(Diagram: log files in Amazon S3; Amazon EMR)
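
The analysis in step 3 can be sketched as a map step and a reduce step. The slides do not show Yelp's actual job, so everything below is an assumption for illustration: the function names, the `min_count` threshold, the in-memory grouping that stands in for Hadoop's shuffle, and the last sample row. The first three rows come from the search-history table on the step 1 slide.

```python
from collections import Counter, defaultdict


def map_search(row):
    """Map: emit (typed text, chosen result) when the two differ."""
    _search_id, search_text, final_selection = row
    typed = search_text.strip().lower()
    chosen = final_selection.strip()
    if typed != chosen.lower():
        yield typed, chosen


def reduce_corrections(typed, selections, min_count=1):
    """Reduce: keep the most frequently chosen result for one typed string."""
    best, count = Counter(selections).most_common(1)[0]
    return (typed, best) if count >= min_count else None


def suggest(rows, min_count=1):
    """Group map output by key (stand-in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for row in rows:
        for typed, chosen in map_search(row):
            grouped[typed].append(chosen)
    results = (reduce_corrections(t, s, min_count) for t, s in grouped.items())
    return dict(r for r in results if r is not None)


rows = [
    ("12423451", "westen", "Westin"),
    ("14235235", "wisten", "Westin"),
    ("54332232", "westenn", "Westin"),
    ("99999999", "Westin", "Westin"),  # hypothetical row: typed correctly, so ignored
]
suggestions = suggest(rows)
print(suggestions)
```

On real logs a higher `min_count` would likely be needed so that one-off selections are not treated as spelling corrections.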

Page 36: Analyzing Big Data with AWS

Each of their 80 developers can do this whenever they have a big data problem to analyze

250 clusters spun up and down every week

(Diagram: log file data)

Page 37: Analyzing Big Data with AWS

Page 38: Analyzing Big Data with AWS

Data size

• Global reach

• Native app for almost every smartphone, SMS, web, mobile-web

• 10M+ users, 15M+ venues, ~1B check-ins

• Terabytes of log data

Page 39: Analyzing Big Data with AWS

Stack

Application Stack

– Scala/Liftweb: API machines, WWW machines, batch jobs

– Scala application code

– Mongo/Postgres/flat files: databases, logs

Data Stack

– Amazon S3: database dumps, log files

– Hadoop on Elastic MapReduce

– Hive/Ruby/Mahout: analytics dashboard, MapReduce jobs

Transfer tools shown between the two stacks: mongoexport, postgres dump, Flume

Page 40: Analyzing Big Data with AWS

Computing venue-to-venue similarity

• Spin up a 40-node cluster

• Submit Ruby streaming job

– Invert User x Venue matrix

– Grab Co-occurrences

– Compute similarity

• Spin down cluster

• Load data to app server
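
The three sub-steps of the streaming job can be sketched in plain Python. The slide does not name the similarity measure, and the job itself is described as a Ruby streaming job, so this is only an assumed stand-in: cosine similarity over binary check-in vectors is my choice, the function name `venue_similarity` is hypothetical, and the users in the sample data are made up (the venue names are borrowed from a later slide).

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def venue_similarity(checkins):
    """Compute venue-to-venue similarity from (user, venue) check-in pairs."""
    # Invert the data: group venues by user, and users by venue.
    venues_by_user = defaultdict(set)
    users_by_venue = defaultdict(set)
    for user, venue in checkins:
        venues_by_user[user].add(venue)
        users_by_venue[venue].add(user)

    # Grab co-occurrences: count users who checked in at both venues of a pair.
    cooccurrences = defaultdict(int)
    for venues in venues_by_user.values():
        for a, b in combinations(sorted(venues), 2):
            cooccurrences[(a, b)] += 1

    # Compute similarity: cosine over binary vectors (an assumed measure).
    return {
        (a, b): n / sqrt(len(users_by_venue[a]) * len(users_by_venue[b]))
        for (a, b), n in cooccurrences.items()
    }


# Hypothetical check-ins
checkins = [
    ("u1", "Gorilla Coffee"), ("u1", "Amorino"),
    ("u2", "Gorilla Coffee"), ("u2", "Amorino"),
    ("u3", "Gorilla Coffee"), ("u3", "Gray's Papaya"),
]
similarity = venue_similarity(checkins)
for pair, score in sorted(similarity.items()):
    print(pair, round(score, 3))
```

Pairs of venues with no shared visitors never appear in the result, which keeps the output sparse enough to load onto an app server.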

Page 41: Analyzing Big Data with AWS

Who is checking in?

(Charts: distribution by gender, Female vs. Male, on a 0 to 0.6 scale; distribution by age, 0 to 80)

Page 42: Analyzing Big Data with AWS

What are people doing?

Page 43: Analyzing Big Data with AWS

Where are our users?

Page 44: Analyzing Big Data with AWS

When do people go to a place?

(Chart: Gorilla Coffee, Gray's Papaya, and Amorino, plotted from Thursday to Sunday)

Page 45: Analyzing Big Data with AWS

Why are people checking in?

• Explore their city, discover new places

• Find friends, meet up

• Save with local deals

• Get insider tips on venues

• Personal analytics, diary

• Follow brands and celebrities

• Earn points, badges, gamification of life

• The list grows…

Page 46: Analyzing Big Data with AWS

Thousands of customers use EMR

Page 47: Analyzing Big Data with AWS

RDBMS vs. MapReduce/Hadoop

• RDBMS

– Predefined schema

– Strategic data placement for query tuning

– Exploit indexes for fast retrieval

– SQL only

– Doesn’t scale linearly

• MapReduce/Hadoop

– No schema is required

– Random data placement

– Fast scan of the entire dataset

– Uniform query performance

– Linearly scales for reads and writes

– Supports many languages including SQL

Complementary technologies

Page 48: Analyzing Big Data with AWS

AWS Data Warehousing Architecture

Page 49: Analyzing Big Data with AWS

Elastic Data Warehouse

• Customize cluster size to support varying resource needs (e.g. query support during the day versus batch processing overnight)

• Reduce costs by increasing server utilization

• Improve performance during high usage periods

(Diagram: data warehouse in steady state, expand to 25 instances for batch processing, then shrink to 9 instances for steady state)
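
The resize pattern on this slide can be sketched as a simple schedule. The instance counts (9 and 25) come from the slide; the batch window of 22:00 to 06:00 and the function name `target_instance_count` are assumptions for illustration, and the sketch only computes the target size rather than calling any AWS API.

```python
STEADY_STATE_INSTANCES = 9   # from the slide
BATCH_INSTANCES = 25         # from the slide


def target_instance_count(hour, batch_start=22, batch_end=6):
    """Return the desired cluster size for an hour of the day (0-23).

    The default batch window (22:00 to 06:00, wrapping past midnight) is an
    assumed example, not a value from the slides.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    in_batch_window = hour >= batch_start or hour < batch_end
    return BATCH_INSTANCES if in_batch_window else STEADY_STATE_INSTANCES


# Instance-hours per day with resizing versus always running at peak size.
elastic_hours = sum(target_instance_count(h) for h in range(24))
fixed_hours = 24 * BATCH_INSTANCES
print(elastic_hours, fixed_hours)
```

Under these assumed hours, resizing uses 344 instance-hours per day compared with 600 for a cluster kept at 25 instances around the clock.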

Page 50: Analyzing Big Data with AWS

Reducing Costs with Spot Instances

Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption

Scenario #1: Cost without Spot

Job flow duration: 14 hours

4 instances * 14 hrs * $0.50 = $28

Scenario #2: Cost with Spot

Job flow duration: 7 hours

4 instances * 7 hrs * $0.50 = $14

5 instances * 7 hrs * $0.25 = $8.75

Total = $22.75

Time Savings: 50%

Cost Savings: ~19%

Other EMR + Spot Use Cases

• Run entire cluster on Spot for biggest cost savings

• Reduce the cost of application testing
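
The two scenarios can be recomputed from the instance counts, durations, and hourly rates given on the slide. This is a small arithmetic sketch; the function name `job_cost` and the tuple layout are assumptions for illustration.

```python
def job_cost(groups):
    """Sum cost over (instance_count, hours, hourly_rate_usd) groups."""
    return sum(count * hours * rate for count, hours, rate in groups)


# Scenario #1: 4 On-Demand instances for 14 hours at $0.50/hr
on_demand_only = job_cost([(4, 14, 0.50)])

# Scenario #2: 4 On-Demand at $0.50/hr plus 5 Spot at $0.25/hr, all for 7 hours
with_spot = job_cost([(4, 7, 0.50), (5, 7, 0.25)])

cost_savings = 1 - with_spot / on_demand_only
time_savings = 1 - 7 / 14

print(f"Without Spot: ${on_demand_only:.2f}")   # $28.00
print(f"With Spot:    ${with_spot:.2f}")        # $22.75
print(f"Cost savings: {cost_savings:.1%}")      # 18.8%
print(f"Time savings: {time_savings:.0%}")      # 50%
```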

Page 51: Analyzing Big Data with AWS

Big Data Ecosystem And Tools

We have a rapidly growing ecosystem

• Business Intelligence – MicroStrategy, Pentaho

• Analytics – Datameer, Karmasphere, Quest

• Open source – Ganglia, Squirrel SQL

Page 52: Analyzing Big Data with AWS

Thank You!!

http://aws.amazon.com/elasticmapreduce/