Identifying and Incorporating Latencies in Distributed Data Mining Algorithms Michael Sevilla

Post on 25-Dec-2015

Applicability of Mahout for Large Data Sets

Michael Sevilla

What is Mahout?

• Distributed machine learning libraries
  – “scalable to reasonably large data sets”
  – Runs on Hadoop

http://heureka.blogetery.com/

The Data: Million Song Data Set

• Large Data Set
  – 1,019,318 users
  – 384,546 MSD songs
  – 48,373,586 (user, song, count)

• Kaggle Competition: offline evaluation
  – Predict songs a user will listen to using
    • Training: 1M user listening history
    • Validation: 110K users

• “Martin L” blogged his methodology + results
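The counts above imply a very sparse user-by-song matrix. A minimal sketch of the arithmetic, using only the three numbers from the slide (the variable names are illustrative):

```python
# Counts taken from the data set slide.
num_users = 1_019_318
num_songs = 384_546
num_triples = 48_373_586

# Number of cells in a dense user-by-song matrix.
total_cells = num_users * num_songs

# Fraction of cells that actually hold a play count.
density = num_triples / total_cells

# Average number of distinct songs per user.
songs_per_user = num_triples / num_users

print(f"cells: {total_cells:,}")
print(f"density: {density:.4%}")
print(f"songs per user: {songs_per_user:.1f}")
```

Only about 0.012% of the cells are filled (roughly 47 songs per user), which is why a sparse representation is the natural choice for this data.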

Motivations

• Can Mahout easily be modified?
• Can Mahout perform well for this workload?
• Can Mahout produce accurate results?
• Can Mahout work ‘out of box’?

• Hypothesis: 22 machines + Mahout > 1 guy

What kind of Recommender?

• Format: <userID, songID, count>
• Users interacting with items
• Users express preferences towards items

• We can use Collaborative Filtering

Collaborative Filtering

• Predicts preference of user towards an item
• Constructs a Top-N-Recommendation

1. Parse input training data
2. Create user-item-matrix
3. Predict missing entries

Mahout has item-based Collaborative Filtering jobs!
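The three steps can be sketched in a few lines of Python. This is an illustrative toy, not Mahout's implementation: the tab-separated input format, the cosine similarity, and all function names are assumptions made for the example.

```python
from collections import defaultdict
from math import sqrt

def parse(lines):
    """Step 1: turn 'user<TAB>song<TAB>count' lines into (user, song, count) tuples."""
    for line in lines:
        user, song, count = line.split("\t")
        yield user, song, int(count)

def build_matrix(triples):
    """Step 2: sparse user-item matrix stored as {user: {song: count}}."""
    matrix = defaultdict(dict)
    for user, song, count in triples:
        matrix[user][song] = count
    return matrix

def item_vectors(matrix):
    """Transpose to {song: {user: count}} so songs can be compared."""
    items = defaultdict(dict)
    for user, songs in matrix.items():
        for song, count in songs.items():
            items[song][user] = count
    return items

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_n(matrix, user, n):
    """Step 3: score every unheard song by its similarity to the user's songs."""
    items = item_vectors(matrix)
    heard = matrix[user]
    scores = {}
    for candidate, vec in items.items():
        if candidate in heard:
            continue
        scores[candidate] = sum(cosine(vec, items[s]) * c for s, c in heard.items())
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _ in ranked[:n]]

# Hypothetical sample data in <userID, songID, count> form.
sample = ["u1\ts1\t3", "u1\ts2\t1", "u2\ts1\t2", "u2\ts3\t5", "u3\ts2\t4"]
matrix = build_matrix(parse(sample))
print(top_n(matrix, "u3", 2))
```

The dict-of-dicts layout stores only the observed (user, song) pairs, so memory grows with the number of triples rather than with users × songs.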

CAN MAHOUT EASILY BE MODIFIED?

Martin’s Code

• Methodology: similarity vector of history
  – Sparse-matrix
    • COLISTEN(i, j) – listeners who listened to i and j
  – Sum similarities for each song user x listens to
• The code: all Python
  – Parse: 27 lines of code (l.o.c)
  – Create Matrix: 46 l.o.c
  – Predict: 45 l.o.c
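A rough sketch of the co-listen idea follows. It is not Martin's actual code: using the raw COLISTEN count as the similarity, the tie-breaking rule, and every name below are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def colisten_counts(history):
    """COLISTEN(i, j): number of users who listened to both song i and song j."""
    counts = defaultdict(lambda: defaultdict(int))
    for songs in history.values():
        for i, j in combinations(sorted(songs), 2):
            counts[i][j] += 1
            counts[j][i] += 1
    return counts

def recommend(history, counts, user, n):
    """Sum the co-listen counts over every song the user has heard, keep the top n."""
    heard = history[user]
    scores = defaultdict(int)
    for song in heard:
        for other, c in counts[song].items():
            if other not in heard:
                scores[other] += c
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _ in ranked[:n]]

# Hypothetical listening history: {user: set of songs}.
history = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "d"},
    "u4": {"a"},
}
counts = colisten_counts(history)
print(counts["a"]["b"])
print(recommend(history, counts, "u4", 2))
```

Only song pairs that were actually co-listened get an entry, so the nested dict behaves like the sparse matrix mentioned on the slide.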

Mahout’s Code

• Methodology:
  – No Idea…

• The code: all Java
  – Poorly commented
  – 14 *.java files
  – Many directories
    • ~/mahout/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
  – RecommenderJob.java: 284 lines of code (l.o.c)
  – SimilarityMatrixRowWrapperMapper.java: 47 l.o.c
  – UserVectorSplitterMapper.java: 138 l.o.c

CAN MAHOUT EASILY BE MODIFIED?

NO

CAN MAHOUT PERFORM WELL FOR THIS WORKLOAD?

Martin’s Code

• Performance on 86MB:
  – Parse data: 10 minutes
  – Make Matrix: 22 minutes
  – Predict songs for 11000 users: 1 hour, 18 minutes

• Did not test scalability

$/ python convertToNumbers.py
$/ python colisten.py
$/ python predict_colisten.py

Mahout’s Code

• Performance on 86MB:
  – Parse Time: 10 minutes
  – Total Time: 25 minutes

• Tested scalability
  – 64MB, 128MB, 256MB, 1GB, 2GB, 3GB

Mahout’s Code

• Total Time: ~12m, 43m, 1hr, 2hr, 4hr, >5hr …

10 Nodes Failed

Mahout’s Code

• Prepare Jobs (parse): seconds - minutes

Mahout’s Code

• Recommend Jobs (predict): seconds - minutes

Mahout’s Code

• Create Matrix Jobs: minutes - hours

CAN MAHOUT PERFORM WELL FOR THIS WORKLOAD?

NO

CAN MAHOUT PRODUCE ACCURATE RESULTS?

Training Set

• Kaggle Million Song Subset: 110K users
  – User 2: 16 entries – took out 8
  – User 16: 32 entries – took out 8
  – User 17: 25 entries – took out 8

Martin’s Code

[Figure: evaluation metric and per-user results for User 2, User 16, and User 17, where Q is the number of queries]

Mahout’s Code

[Figure: evaluation metric and per-user results for User 2, User 16, and User 17, where Q is the number of queries]
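The metric formula did not survive in this transcript. The phrase "where Q is the number of queries" matches the usual definition of mean average precision (MAP), so the sketch below assumes MAP; that choice, the truncation parameter k, and the sample data are all assumptions rather than something the slides confirm. In this setting each user would be one query, and the 8 withheld songs would be the relevant set.

```python
def average_precision(recommended, relevant, k):
    """Average precision of one ranked list, truncated at k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, song in enumerate(recommended[:k], start=1):
        if song in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(len(relevant), k)

def mean_average_precision(queries, k):
    """Mean of the per-query average precision over all Q queries."""
    q = len(queries)
    if q == 0:
        return 0.0
    return sum(average_precision(rec, rel, k) for rec, rel in queries) / q

# Hypothetical (recommended list, withheld songs) pairs, one per user.
queries = [
    (["s1", "s2", "s3"], {"s1", "s3"}),  # hits at ranks 1 and 3
    (["s4", "s5", "s6"], {"s5"}),        # hit at rank 2
]
print(mean_average_precision(queries, k=3))
```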

CAN MAHOUT PRODUCE ACCURATE RESULTS?

YES

CAN MAHOUT WORK ‘OUT OF BOX’?

YES… but not well

Conclusion

• Mahout did not scale well
• Mahout was not easy to learn
• Mahout was not easily modifiable

• For performance and efficiency, it is better to
  – Understand the data set
  – Understand data mining
  – Understand the methodology