Matchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eHarmony

Steve Kuo, Software Architect Joshua Tuberville, Software Architect

Goal

>  Leverage EC2 and Hadoop to scale data-intensive problems that are exceeding the limits of our data center and data warehouse environment.

Agenda

>  Background and Motivation
>  Hadoop Overview
>  Hadoop at eHarmony
>  Architecture
>  Tools and Performance
>  Roadblocks
>  Future Directions
>  Summary
>  Questions

About eHarmony

>  Online subscription-based matchmaking service
>  Launched in 2000 using compatibility models
>  Available in the United States, Canada, Australia and the United Kingdom
>  On average, 236 members in the US marry every day
>  More than 20 million registered users

About eHarmony

>  Models are based on decades of research and clinical experience in psychology
>  Variety of user attributes
    - Demographic
    - Psychographic
    - Behavioral

Model Creation

[Diagram with the labels Research, Matches, Matching Process Models, Scorer, Scores and User Attributes. The same diagram is repeated under each of the following four slide titles.]

Matching

Predictive Model Scores

Match Outcomes Feedback

Model Validation and Creation

Scorer Requirements

>  Tens of GB of matches, scores and constantly changing user features are archived daily
>  TB of data currently archived and growing
>  Must handle 10x our current user base
>  All possible matches = O(n^2) problem
>  Support a growing set of models that may be
    - arbitrarily complex
    - computationally and I/O expensive
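To make the O(n^2) requirement concrete, the sketch below counts unordered user pairs for the current user base and for ten times that base. It is only an upper bound under a simplifying assumption: every user is treated as a candidate for every other user, which ignores whatever filtering the real matching process applies. The function name `possible_matches` is hypothetical; the 20 million figure comes from the "About eHarmony" slide.

```python
def possible_matches(n_users: int) -> int:
    """Number of unordered user pairs among n_users: n * (n - 1) / 2."""
    return n_users * (n_users - 1) // 2

current = 20_000_000   # "more than 20 million registered users" (from the slides)
future = 10 * current  # the 10x growth requirement

# Ratio of candidate pairs after 10x user growth to candidate pairs today.
growth = possible_matches(future) / possible_matches(current)

print(f"pairs today:        {possible_matches(current):.3e}")
print(f"pairs at 10x users: {possible_matches(future):.3e}")
print(f"growth factor:      {growth:.1f}")
```

Ten times the users means roughly a hundred times the candidate pairs, which is why the scorer cannot simply be given ten times the hardware.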

Scaling Challenges

>  Current architecture is multi-tiered with a relational back-end
>  Scoring is DB join intensive
>  Data need constant archiving
    - Matches, match scores, user attributes at time of match creation
    - Model validation is done at a later time across many days
>  Need a non-DB solution

>  Open source Java implementation of Google’s MapReduce paper
    - Created by Doug Cutting
>  Top-level Apache project
    - Yahoo is a major sponsor
>  Distributes work across vast amounts of data
>  Hadoop Distributed File System (HDFS) provides reliability through replication
>  Automatic re-execution on failure/distribution
>  Scale horizontally on commodity hardware

>  Simple Storage Service (S3) provides cheap unlimited storage
    - Storage: $0.15/GB (first 50 TB/month)
    - Transfer: $0.17/GB (first 10 TB/month)
>  Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand

    Instance type   Price        Cores   Memory   OS
    C1-medium       $0.20/hour   2       1.7 GB   32-bit
    C1-xlarge       $0.80/hour   8       7 GB     64-bit

>  Elastic MapReduce is a hosted Hadoop framework on top of EC2 and S3
>  It’s in beta and US only
>  Pricing is in addition to EC2 and S3

    Instance type   EC2          Elastic MapReduce
    C1-medium       $0.20/hour   $0.03/hour
    C1-xlarge       $0.80/hour   $0.12/hour
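The hourly prices above can be turned into a rough per-run estimate. This is a minimal sketch, assuming billing per whole instance-hour and ignoring S3 storage and transfer charges. The function name `run_cost` and the 49-node, 3-hour example are illustrative assumptions, not the presenters' actual configuration or billing.

```python
# Hourly prices copied from the slides (USD per instance-hour).
EC2_PRICE = {"c1-medium": 0.20, "c1-xlarge": 0.80}
EMR_SURCHARGE = {"c1-medium": 0.03, "c1-xlarge": 0.12}

def run_cost(instance_type: str, nodes: int, hours: int, use_emr: bool = False) -> float:
    """Estimated compute cost in USD for one run; storage and transfer are excluded."""
    rate = EC2_PRICE[instance_type]
    if use_emr:
        # Elastic MapReduce is charged on top of the plain EC2 rate.
        rate += EMR_SURCHARGE[instance_type]
    return round(rate * nodes * hours, 2)

# Hypothetical run: 49 c1-medium nodes for 3 hours, without and with Elastic MapReduce.
print(run_cost("c1-medium", nodes=49, hours=3))
print(run_cost("c1-medium", nodes=49, hours=3, use_emr=True))
```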

How MapReduce Works

>  Applications are modeled as a series of maps and reductions
>  Example: Word Count
    - Counts the frequency of words
    - Modeled as one Map and one Reduce
    - Data as key -> values

WordCount: Input

    I will not eat them here nor there I will not eat them anywhere

WordCount: Pass to Mappers

    [Diagram: the input is passed to Map1 and Map2]

WordCount: Perform Map

    Map1 output: I->1, will->1, not->1, eat->1, them->1, ...
    Map2 output: I->1, will->1, not->1, eat->1, green->1, ...

WordCount: Shuffle and Sort

    am->1, and->1, anywhere->1, eat->1, eat->1, ...
    nor->1, not->1, not->1, not->1, not->1, Sam->1, ...

WordCount: Pass to Reducers

    Reduce1 input: am->1, and->1, anywhere->1, eat->1, eat->1, ...
    Reduce2 input: nor->1, not->1, not->1, not->1, not->1, Sam->1, ...

WordCount: Perform Reduce

    Reduce1 output: am,1  and,1  anywhere,1  eat,4  eggs,1  ...
    Reduce2 output: nor,1  not,4  Sam,1  them,3  there,1  ...

Hadoop Benefits

>  Mapper and Reducer are written by you
>  Hadoop provides
    - Parallelization
    - Shuffle and sort

Real World MapReduce

>  Complex applications use a series of MapReduces
>  Values from one step can become keys for another
>  Match Scoring Example
    - Join match data and user A attributes into one line
    - Join above with user B attributes and calculate the match score
    - Group match scores by user

Join: Map

    Match input:                     Adam Betty, Carlos Debbie, Adam June, ...
    Map1 output (matches):           Adam -> Adam Betty, Carlos -> Carlos Debbie, Adam -> Adam June, ...
    Map2 output (user attributes):   Adam -> 1, Betty -> 2, Carlos -> 3, Debbie -> 4, Fred -> 5, June -> 6, ...

Join: Shuffle and Sort

    Adam -> Adam Betty, Adam -> 1, Adam -> Adam June, ...
    Carlos -> Carlos Debbie, Carlos -> 3, ...

Join: Reduce

    Reduce1 output: Adam -> Adam Betty 1, Adam -> Adam June 1, ...
    Reduce2 output: Carlos -> Carlos Debbie 3, ...

Join and Score: Map

    Input:      Adam Betty 1, Carlos Debbie 3, Adam June 1, ...
    Map output: Betty -> Adam Betty 1, Debbie -> Carlos Debbie 3, June -> Adam June 1, ...

Join and Score: Shuffle and Sort

    Betty -> Adam Betty 1, Betty -> 2, Debbie -> Carlos Debbie 3, Debbie -> 4, ...
    June -> Adam June 1, June -> 6, ...

Join and Score: Reduce

    Reduce1 input:  Betty -> Adam Betty 1, Betty -> 2, June -> Adam June 1, June -> 6, ...
    Reduce1 output: Adam Betty -> 1 2, Adam June -> 1 6, ...
    Reduce2 input:  Debbie -> Carlos Debbie 3, Debbie -> 4, ...
    Reduce2 output: Carlos Debbie -> 3 4, ...

Join and Score: Reduce (Optimized)

    Reduce1 output: Adam Betty -> 3, Adam June -> 7, ...
    Reduce2 output: Carlos Debbie -> 7, ...

Combine: Map

    Map1 input:  Adam Betty 3, Adam June 7, ...
    Map1 output: Adam -> Adam Betty 3, Adam -> Adam June 7, ...
    Map2 input:  Carlos Debbie 7, ...
    Map2 output: Carlos -> Carlos Debbie 7, ...

Combine: Shuffle and Sort

    Adam -> Adam Betty 3, Adam -> Adam June 7, ...
    Carlos -> Carlos Debbie 7, ...

Combine: Reduce

    Reduce1 output: Adam: Betty=3 June=7 ...
    Reduce2 output: Carlos: Debbie=7 ...
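The three chained jobs (join, join and score, combine) can be simulated in memory to check the walk-through. This is a sketch in Python rather than the Hadoop Java API, and every helper name (`map_reduce`, `tag`, `join_mapper`, `join_reducer`) is hypothetical. The score is assumed to be the sum of the two users' attribute values, which is consistent with the toy numbers on the slides but is not the real scoring model.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """One in-memory MapReduce pass: map, group by key, then reduce each sorted key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]

def tag(kind, rows):
    """Mark each row with its source so a reducer can tell matches from attributes."""
    return [(kind, row) for row in rows]

def join_mapper(key_index):
    """Key matches by the user at key_index; key attribute rows by user name."""
    def mapper(record):
        kind, row = record
        if kind == "match":
            yield row[key_index], ("match", row)
        else:
            user, value = row
            yield user, ("attr", value)
    return mapper

def join_reducer(combine):
    """Attach the keyed user's attribute to every match that shares the key."""
    def reducer(user, values):
        attrs = [v for kind, v in values if kind == "attr"]
        for kind, row in values:
            if kind == "match" and attrs:
                yield combine(row, attrs[0])
    return reducer

matches = [("Adam", "Betty"), ("Carlos", "Debbie"), ("Adam", "June")]
attributes = [("Adam", 1), ("Betty", 2), ("Carlos", 3), ("Debbie", 4), ("Fred", 5), ("June", 6)]

# Step 1: join each match with user A's attribute.
joined = map_reduce(tag("match", matches) + tag("attr", attributes),
                    join_mapper(0), join_reducer(lambda row, a: row + (a,)))

# Step 2: join with user B's attribute and score (assumed toy score: sum of both attributes).
scored = map_reduce(tag("match", joined) + tag("attr", attributes),
                    join_mapper(1),
                    join_reducer(lambda row, b: (row[0], row[1], row[2] + b)))

# Step 3: group the scores by user A.
by_user = map_reduce(scored,
                     lambda row: [(row[0], (row[1], row[2]))],
                     lambda user, pairs: [(user, dict(pairs))])
print(by_user)
```

Note how the value produced by step 1 (the match row) is re-keyed by user B in step 2, which is the "values from one step can become keys for another" point from the slide.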

Overview

Hadoop Run

[Diagram repeated over seven slides, one per stage of a run:]

>  Data Warehouse Unload
>  Upload to S3
>  Start EC2 Cluster
>  MapReduce Jobs
>  Stop EC2 Cluster
>  Download from S3
>  Store Result

AWS Elastic MapReduce

>  Only need to think in terms of Elastic MapReduce job flow
>  EC2 cluster is managed for you behind the scenes
>  Each job flow has one or more steps
>  Each step is a Hadoop MapReduce process
>  Each step can read and write data directly from and to S3
>  Based on Hadoop 0.18.3

Elastic MapReduce for eHarmony

>  Vastly simplifies our scripts
    - No need to explicitly allocate, start and shut down EC2 instances
    - Individual jobs were managed by a remote script running on the master node (no longer required)
    - Jobs are arranged into a job flow, created with a single command
    - Status of a job flow and all its steps is accessible through a REST service

Simplified Job Control

>  Before Elastic MapReduce, we had to explicitly:
    - Allocate cluster
    - Verify cluster
    - Push application to cluster
    - Run a control script on the master
    - Kick off each job step on the master
    - Create and detect a job completion token
    - Shut the cluster down
>  Over 150 lines of script just for this
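The manual control flow listed above amounts to running stages in order, stopping at the first failure, and still shutting the cluster down. A minimal sketch of that shape follows. The presenters' scripts were bash and Ruby, so this Python version and all of its names (`run_stages`, `ok`, `broken`) are illustrative assumptions, with placeholder actions standing in for the real steps.

```python
def run_stages(stages, cleanup):
    """Run (name, action) pairs in order, stop at the first failure, always call cleanup."""
    completed = []
    error = None
    try:
        for name, action in stages:
            try:
                action()
            except Exception as exc:
                error = f"{name} failed: {exc}"
                break
            completed.append(name)
    finally:
        # Shutting the cluster down even after a failure avoids paying for idle nodes.
        cleanup()
    return completed, error

def ok():
    """Placeholder action standing in for a real script step."""

def broken():
    raise RuntimeError("cluster did not respond")

shutdowns = []
stages = [
    ("allocate cluster", ok),
    ("verify cluster", broken),
    ("push application", ok),
]
completed, error = run_stages(stages, cleanup=lambda: shutdowns.append("cluster shut down"))
print(completed, error, shutdowns)
```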

Simplified Job Control

>  With Elastic MapReduce we can do all this with a single local command
>  Uses jar and conf files stored on S3

    #{ELASTIC_MR_UTIL} --create --name #{JOB_NAME} \
      --num-instances #{NUM_INSTANCES} --instance-type #{INSTANCE_TYPE} \
      --key_pair #{KEY_PAIR_NAME} --log-uri #{SCORER_LOG_BUCKET_URL} \
      --jar #{SCORER_JAR_S3_PATH} --main-class #{MR_JAVA_PACKAGE}.join.JoinJob \
        --arg -xconf --arg #{MASTER_CONF_DIR}/join-config.xml \
      --jar #{SCORER_JAR_S3_PATH} --main-class #{MR_JAVA_PACKAGE}.scorer.ScorerJob \
        --arg -xconf --arg #{MASTER_CONF_DIR}/scorer-config.xml \
      --jar #{SCORER_JAR_S3_PATH} --main-class #{MR_JAVA_PACKAGE}.combiner.CombinerJob \
        --arg -xconf --arg #{MASTER_CONF_DIR}/combiner-config-#{TARGET_ENV}.xml

Development Environment

>  Cheap to set up on Amazon
>  Quick setup
    - Number of servers is controlled by a config variable
>  Identical to production
>  Separate development account recommended

Testing Demo

>  Test Hadoop jobs in IDE
>  A Hadoop cluster is not required
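A mapper can be tested without a cluster because it is just a function from an input record to key/value pairs. The sketch below illustrates that idea with Python's `unittest`. It is not the presenters' test code (their jobs are Java), and the mapper `score_mapper` and its "userA userB score" line format are hypothetical.

```python
import unittest

def score_mapper(line):
    """Hypothetical mapper: parse 'userA userB score' and key the score by userA."""
    user_a, user_b, score = line.split()
    return user_a, (user_b, int(score))

class ScoreMapperTest(unittest.TestCase):
    def test_keys_score_by_first_user(self):
        self.assertEqual(score_mapper("Adam Betty 3"), ("Adam", ("Betty", 3)))

    def test_rejects_malformed_line(self):
        # A line with a missing field cannot be unpacked into three values.
        with self.assertRaises(ValueError):
            score_mapper("Adam Betty")

# Run the tests in-process, the way an IDE test runner would.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ScoreMapperTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```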

Unit Test Demo

EC2, S3 and Hadoop Monitoring Tools

>  ElasticFox for EC2
>  Hadoop provides web pages for job and disk monitoring
>  Tim Kay’s AWS command line tool for S3
>  Plus many more

ElasticFox – Manage EC2 Cluster

Hadoop JobTracker – Monitor Jobs

Hadoop DFS – Monitor Disk Usage

AWS Management Console

>  Useful for Elastic MapReduce
    - Terminate job flow
    - Track execution
>  Also monitors EC2

AWS Management Console: Elastic MapReduce (beta)

AWS Management Console: EC2

Hadoop on EC2 Performance

[Bar chart: run time in minutes on Hadoop 0.18.0 for four cluster configurations (c1-medium, 24 nodes; c1-xlarge, 24 nodes; c1-medium, 49 nodes; c1-xlarge, 49 nodes). Bar labels: 331, 212, 144, 100.]

Total Execution Time

Our Cost

>  Average EC2 and S3 cost
    - Each run is 2 to 3 hours
    - $1200/month for EC2
    - $100/month for S3
>  Projected in-house cost
    - $5000/month for a local cluster of 50 nodes running 24/7
    - A new company needs to add data center and operations personnel expenses

Process Control is Non-Trivial

>  Scripts galore
>  One bash script for each stage
>  Moving to Ruby
    - Good exception handling and control structures
    - More productive

Design for Failure

>  The overall process depends on the success of each stage
>  Every stage is unreliable
>  Hadoop master node is a single point of failure; slave nodes are fault tolerant
>  Need to build retry logic to handle failures

Design for Failure: S3 Web Service

>  S3 web service can time out
>  Add logic to validate that a file is correctly uploaded to or downloaded from S3
>  We retry once on failure
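The validate-and-retry-once policy can be sketched as a small wrapper around any transfer function. This is an illustration only: it assumes validation by comparing MD5 digests of the bytes before and after the transfer, which the slides do not specify, and `transfer_with_retry` and `flaky_transfer` are hypothetical names rather than part of any S3 client.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def transfer_with_retry(transfer, data: bytes, retries: int = 1) -> bytes:
    """Call transfer(data); retry on timeout or checksum mismatch, up to `retries` extra attempts."""
    expected = md5_hex(data)
    last_error = None
    for _ in range(retries + 1):
        try:
            received = transfer(data)
        except TimeoutError as error:
            last_error = error
            continue
        if md5_hex(received) == expected:
            return received
        last_error = ValueError("checksum mismatch after transfer")
    raise RuntimeError(f"transfer failed after {retries + 1} attempts") from last_error

# Stand-in for a flaky S3 call: times out once, then succeeds.
calls = {"count": 0}

def flaky_transfer(data: bytes) -> bytes:
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("simulated S3 timeout")
    return data

result = transfer_with_retry(flaky_transfer, b"match scores")
print(result, calls["count"])
```

With the default `retries=1` the wrapper gives up after the second failed attempt, which mirrors the "retry once" policy on the slide.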

Design for Failure: EC2 Web Service

>  On very rare occasions, EC2 would allocate fewer slave nodes than requested, usually one fewer
>  On even rarer occasions, no node is allocated
>  Right now the script fails and sends an alert email
>  Consider changing the script to have it continue if the number of missing slave nodes is small
>  Or reallocate the missing nodes
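The options being considered (continue when only a few nodes are missing, otherwise reallocate, and abort when nothing was allocated) can be written down as a small decision function. This is a sketch of one possible policy, not the presenters' script. The name `allocation_decision` and the default tolerance of one missing node are assumptions.

```python
def allocation_decision(requested: int, allocated: int, max_missing: int = 1) -> str:
    """Return 'proceed', 'reallocate' or 'abort' for a cluster start-up result."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    if allocated <= 0:
        return "abort"       # no node came up: fail and alert, as the current script does
    missing = requested - allocated
    if missing <= max_missing:
        return "proceed"     # full cluster, or short by a tolerable number of nodes
    return "reallocate"      # too many nodes missing: request the shortfall again

for allocated in (49, 48, 40, 0):
    print(allocated, allocation_decision(requested=49, allocated=allocated))
```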

Design for Failure: Elastic MapReduce (beta)

>  Provisioning of the servers is not yet stable
    - Frequently failed in the first two weeks of the beta program
    - It blocks until provisioning is complete
    - Only billed for job execution time
>  It’s handy to terminate a hung job with the Amazon Web Services Management Console
>  Amazon recognizes this shortcoming and is working on a solution

Data Shuffle

>  We spend as much time getting data out of the DW, uploading to and downloading from S3, and creating the local store as we do running Hadoop
>  Hadoop and EC2 scale nicely
>  New scaling challenge is to reduce the time spent on data shuffling and error recovery

Handle More Modeling Data

>  Data warehouse unloads additional dimensions of user data into separate files for performance reasons
>  Currently user and match data are joined into one row, which is not scalable
>  Each model does not use all data
    - Partial joins
    - Partial scores
    - Combine partial scores into the final score
    - Load the delta of user data between runs

Data Analysis on the Cloud

>  Daily reporting: use Hadoop instead of depending on the data warehouse
    - Median/mean score per user
    - Median/mean number of matches per user, country, zip code
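A per-user mean and median report fits the same group-by-key shape as the earlier jobs: key each score by user, then reduce each group to its statistics. The sketch below does this in memory with Python's `statistics` module. The function name `score_stats_per_user` and the sample scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, median

def score_stats_per_user(scored_matches):
    """Group (user, score) pairs by user, then reduce each group to its mean and median."""
    groups = defaultdict(list)
    for user, score in scored_matches:   # map and shuffle: key every score by its user
        groups[user].append(score)
    return {user: {"mean": mean(scores), "median": median(scores)}   # reduce
            for user, scores in groups.items()}

# Made-up sample data: (user, match score) pairs.
scored_matches = [("Adam", 3), ("Adam", 7), ("Adam", 8), ("Carlos", 7)]
stats = score_stats_per_user(scored_matches)
print(stats)
```

Unlike the mean, the median needs all of a user's scores at once, so in a real job each reducer would have to hold one user's scores in memory.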

Data Analysis on the Cloud with Pig

>  Our success opens Hadoop to other teams
>  We provide expertise on Hadoop
>  Data analysis with Pig
    - High-level language on top of Hadoop: think SQL for RDBMS
    - Hadoop subproject
    - Still not friendly to non-engineers for troubleshooting
    - Slower than hand-rolled MapReduce
>  Yet to evaluate Hive, Cascading, others

Data Analysis with R

>  Data analysis with R
    - Excellent environment for statistical computing and graphical presentation
    - Open source implementation of S
    - Supports a plugin framework
    - New packages are continually being developed by an active community
    - Limitation is that all data must be in memory
    - Starting to investigate RHIPE, which aims to integrate R and Hadoop

Lessons Learned

>  Reliability is the biggest challenge
>  Getting data out of the DW is difficult and time consuming
>  Controlling the process with shell scripts is a hassle
>  Dev tools are really easy to work with and just work right out of the box
>  Standard Hadoop AMI worked great
>  Easy to write unit tests for MapReduce
>  Hadoop community support is great
>  EC2/S3/EMR are cost effective

Acknowledgements

>  The figure of 236 members marrying a day is based on a survey conducted by Harris Interactive Research
>  Hadoop is an Apache project
>  S3 and EC2 are part of Amazon Web Services
>  R’s URL is http://www.r-project.org/
>  RHIPE’s URL is http://ml.stat.purdue.edu/rhipe

Steve Kuo, Software Architect Joshua Tuberville, Software Architect eHarmony