eharmony in the cloud

eHarmony in Cloud Subtitle Brian Ko

Upload: craig-dickson

Post on 22-May-2015


Category:

Technology



DESCRIPTION

This is a lightning presentation given by Brian Ko, a member of my development team. It is a recap of a presentation he attended at JavaOne 2009.

TRANSCRIPT

Page 1: eHarmony in the Cloud

eHarmony in Cloud

Subtitle

Brian Ko

Page 2: eHarmony in the Cloud

eHarmony

• Online subscription-based matchmaking service

• Available in the United States, Canada, Australia, and the United Kingdom.

• On average, 236 members in the US marry every day.

• More than 20 million registered users.


Page 3: eHarmony in the Cloud

Why Cloud?

• Problem exceeds the limits of the data center and data warehouse environment.

• Leverage EC2 and Hadoop to scale data processing


Page 4: eHarmony in the Cloud

Finding Matches


• Model Creation

Page 5: eHarmony in the Cloud

Finding Matches

• Matching


Page 6: eHarmony in the Cloud

Finding Matches

• Predictive Model Scores


Page 7: eHarmony in the Cloud

Requirement

• All the matches, scores, and user information should be archived daily

• Ready for 10X growth

• Possible O(n²) problem

• Need to support a set of models that keeps growing more complex
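To see why 10X growth and a quadratic problem interact badly, here is a minimal sketch. It assumes the worst case, in which every pair of users is a candidate match; the slides do not say how candidates are actually enumerated, and the user count is hypothetical.

```python
def candidate_pairs(n):
    """Number of distinct user pairs if every pair must be considered."""
    return n * (n - 1) // 2


# Hypothetical user count, chosen only to illustrate the growth rate.
base_users = 1_000
grown_users = 10 * base_users  # the "10X growth" requirement

ratio = candidate_pairs(grown_users) / candidate_pairs(base_users)
print(f"10x the users means roughly {ratio:.0f}x the pairs to score")
```

Under that assumption, ten times the users means about a hundred times the scoring work, which is why the requirement is hard to meet by scaling a single database vertically.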


Page 8: eHarmony in the Cloud

Challenge

• Current architecture is multi-tiered with a relational back-end

• Scoring is DB join intensive

• Data need constant archiving

– Matches, match scores, user attributes at time of match creation

– Model validation is done at a later time across many days

• Need a non-DB solution


Page 9: eHarmony in the Cloud

Solution

• Hadoop: open source Java implementation of Google’s MapReduce framework

– Distributes work across vast amounts of data

– Hadoop Distributed File System (HDFS) provides reliability through replication

– Automatic re-execution on failure/distribution

– Scale horizontally on commodity hardware


Page 10: eHarmony in the Cloud

Slide 9

• Simple Storage Service (S3) provides cheap unlimited storage.

• Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand.


Page 11: eHarmony in the Cloud

MapReduce

• A large server farm can use MapReduce to process a huge dataset.

• Map step

– Master node takes the input

– Chops it up into smaller sub-problems

– Distributes those to worker nodes

• Reduce step

– Master node takes the answers to all the sub-problems

– Combines them in a way to get the output
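The two steps above can be imitated on a single machine. The sketch below is not Hadoop: it uses a thread pool as a stand-in for worker nodes, and the function names (`split_input`, `worker`, `run`) and the summing task are hypothetical, chosen only to show the split, distribute, and combine shape.

```python
from concurrent.futures import ThreadPoolExecutor


def split_input(records, num_chunks):
    """Master: chop the input into roughly equal sub-problems."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def worker(chunk):
    """Worker: solve one sub-problem (here, sum the numbers in a chunk)."""
    return sum(chunk)


def run(records, num_workers=4):
    # Map step: split the input and hand each chunk to a worker.
    chunks = split_input(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_answers = list(pool.map(worker, chunks))
    # Reduce step: combine the partial answers into the final output.
    return sum(partial_answers)


total = run(list(range(1, 101)))
print(total)
```

In a real cluster the chunks live on different machines and a failed worker's chunk is re-executed elsewhere; none of that is modeled here.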


Page 12: eHarmony in the Cloud

Why Hadoop

• Mapper and Reducer are written by you

• Hadoop provides

– Parallelization

– Shuffle and sort
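The division of labor above can be sketched in plain Python. Here `mapper` and `reducer` are the parts "written by you", while `run_job` is a toy stand-in for what the framework provides (shuffle and sort only; parallelization is omitted). The record layout and the averaging task are hypothetical and are not taken from the eHarmony jobs.

```python
from itertools import groupby
from operator import itemgetter


def mapper(record):
    """User code: emit (key, value) pairs for one input record."""
    user_id, score = record
    yield user_id, score


def reducer(key, values):
    """User code: combine all values that share a key."""
    values = list(values)
    return key, sum(values) / len(values)


def run_job(records, mapper, reducer):
    """Toy framework: run mappers, shuffle and sort by key, run reducers."""
    pairs = [kv for record in records for kv in mapper(record)]
    pairs.sort(key=itemgetter(0))  # shuffle and sort
    return [
        reducer(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


# Hypothetical (user_id, match_score) records.
records = [("u2", 80), ("u1", 90), ("u2", 60), ("u1", 70)]
result = run_job(records, mapper, reducer)
print(result)
```

Because the grouping is done by the framework, the reducer never has to search for the values that belong to its key.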


Page 13: eHarmony in the Cloud

Actual Process

• Upload to S3 and start EC2 Cluster


Page 14: eHarmony in the Cloud

Actual Process

• Process and archive


Page 15: eHarmony in the Cloud

Amazon Elastic MapReduce

• It is a web service

• EC2 cluster is managed for you behind the scenes

• Starts Hadoop implementation of the MapReduce framework on Amazon EC2

• Each step can read and write data directly from and to S3

• Based on Hadoop 0.18.3


Page 16: eHarmony in the Cloud

Elastic MapReduce

• No need to explicitly allocate, start and shut down EC2 instances

• Individual jobs were managed by a remote script running on the master node (no longer required)

• Jobs are arranged into a job flow, created with a single command

• Status of a job flow and all its steps is accessible through a REST service


Page 17: eHarmony in the Cloud

Before Elastic MapReduce

• Allocate/Verify cluster

• Push application to cluster

• Run a control script on the master

• Kick off each job step on the master

• Create and detect a job completion token

• Shut the cluster down


Page 18: eHarmony in the Cloud

After Elastic MapReduce

• With Elastic MapReduce we can do all this with a single local command

• Uses jar and conf files stored on S3

• Various monitoring tools for EC2 and S3 are provided


Page 19: eHarmony in the Cloud

Development Environment

• Cheap to set up on Amazon

• Quick setup - Number of servers is controlled by a config variable

• Identical to production

• Separate development account recommended


Page 20: eHarmony in the Cloud

Cost comparison

• Average EC2 and S3 Cost

– Each run is 2 to 3 hours

– $1200/month for EC2

– $100/month for S3

• Projected in-house cost

– $5000/month for a local cluster of 50 nodes running 24/7

– A new company needs to add data center and operation personnel expense
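The comparison above comes down to simple arithmetic. The sketch below uses only the monthly figures from the slide; it leaves out the data center and personnel expense, which the slide mentions but does not quantify.

```python
# Monthly figures as stated on the slide (USD).
ec2_monthly = 1200
s3_monthly = 100
in_house_monthly = 5000
in_house_nodes = 50

cloud_monthly = ec2_monthly + s3_monthly
savings = in_house_monthly - cloud_monthly
cloud_fraction = cloud_monthly / in_house_monthly
in_house_per_node = in_house_monthly / in_house_nodes

print(f"Cloud: ${cloud_monthly}/month, in-house: ${in_house_monthly}/month")
print(f"Difference: ${savings}/month (cloud is {cloud_fraction:.0%} of in-house)")
print(f"In-house cost per node: ${in_house_per_node:.0f}/month")
```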


Page 21: eHarmony in the Cloud

Summary

• Dev tools are easy to work with and work right out of the box

• Standard Hadoop AMI worked great

• Easy to write unit tests for MapReduce

• Hadoop community support is great.

• EC2/S3/EMR are cost effective

Page 22: eHarmony in the Cloud

The End

5 minutes of question time

starts now!

Page 23: eHarmony in the Cloud

Questions

4 minutes left!

Page 24: eHarmony in the Cloud

Questions

3 minutes left!

Page 25: eHarmony in the Cloud

Questions

2 minutes left!

Page 26: eHarmony in the Cloud

Questions

1 minute left!

Page 27: eHarmony in the Cloud

Questions

30 seconds left!

Page 28: eHarmony in the Cloud

Questions

TIME IS UP!