Collaborative Filtering in Map/Reduce
Ole-Martin Mørk - Open AdExchange
Tuesday, 14 September 2010

Vision
• Learn that Map/Reduce is simple
• Learn that Map/Reduce may be powerful
• Collaborative Filtering is fun!

Agenda
• Map/Reduce
• Collaborative Filtering
• Collaborative Filtering with Map/Reduce
• Amazon Elastic MapReduce

Map/Reduce

Map/Reduce
• Very scalable programming model
• Inspired by map and reduce from functional programming
• Everything is based on key/value

6 phases
• Reader
• Map
• Partition
• Comparison
• Reduce
• Writer
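
The six phases can be sketched as a tiny in-memory pipeline. This is only an illustration under stated assumptions: `MiniMapReduce` and its `run` method are hypothetical names, keys are fixed to `String` and values to `Int` to keep it short, and a real framework runs the partitions on separate machines and writes one output file per reducer.

```scala
object MiniMapReduce {
  def run(
      lines: Seq[String],
      mapFn: (Int, String) => Seq[(String, Int)],
      reduceFn: (String, Seq[Int]) => (String, Int),
      reducers: Int
  ): Seq[String] = {
    // 1. Reader: turn raw input into key/value records (line number, line)
    val records = lines.zipWithIndex.map { case (line, n) => (n, line) }
    // 2. Map: every record yields zero or more key/value pairs
    val mapped = records.flatMap { case (n, line) => mapFn(n, line) }
    // 3. Partition: choose a reducer for each key
    val partitions = mapped.groupBy { case (key, _) =>
      java.lang.Math.floorMod(key.hashCode, reducers)
    }
    val reduced = partitions.values.toSeq.flatMap { part =>
      // 4. Comparison: sort by key so that equal keys end up together
      val grouped = part.groupBy(_._1).toSeq.sortBy(_._1)
      // 5. Reduce: called once per key with all values for that key
      grouped.map { case (key, pairs) => reduceFn(key, pairs.map(_._2)) }
    }
    // 6. Writer: format the results (sorted here only to make the sketch deterministic)
    reduced.sortBy(_._1).map { case (key, value) => s"$key\t$value" }
  }
}
```

Calling `run` with a map function that emits the first letter of each word and a reduce function that sums the values counts how many words start with each letter.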

Map

functional map
List("hello", "dude").map { x => x.substring(0, 1) }

Map/Reduce map
• Input is key/value
• Output is key/value

Simple Example, Map
• Count occurrences of words in a document
• Input is: <linenumber>, <content of line>
• For each word on the line, the output is <word>, <count>
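
A minimal sketch of such a map function, written as plain Scala without any framework. The object name `WordCountMap`, the lower-casing, and the rule for splitting a line into words are assumptions made for the example; each word is emitted with a count of 1.

```scala
object WordCountMap {
  // Input: <line number>, <content of line>. Output: one <word>, 1 pair per word.
  def map(lineNumber: Long, line: String): Seq[(String, Int)] =
    line.toLowerCase
      .split("\\W+")       // split on anything that is not a letter, digit or underscore
      .filter(_.nonEmpty)  // a line starting with punctuation yields an empty first token
      .toSeq
      .map(word => (word, 1))
}
```

The line number is part of the input key but is not needed for counting, so the function ignores it.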

Map

Reduce

functional reduce
val sum = List(32, 40, 23).reduceLeft { _ + _ }

Map/Reduce reduce
• Input is key/list of values
• Output is key/value

Simple Example, Reduce
• Reduce input is <word>, <counts>
• For each value we increase the count
• Output is <word>, <sum of counts>
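
The matching reduce step could look roughly like this, again as plain Scala rather than framework code. `WordCountReduce` is a hypothetical name; the loop mirrors the "for each value we increase the count" bullet.

```scala
object WordCountReduce {
  // Input: <word>, <counts>. Output: <word>, <sum of counts>.
  def reduce(word: String, counts: Iterable[Int]): (String, Int) = {
    var sum = 0
    for (count <- counts) sum += count // for each value we increase the count
    (word, sum)
  }
}
```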

Reduce

Collaborative Filtering

Amazon

Last.fm

Sceneami.com

User based
• Useful when we have
• Small number of users
• High correlation between users
• Data that changes often

Item based
• Useful for big sites such as Amazon
• Small overlap between users
• Mostly static data

My dream application
[Figure: application mock-up listing presentations ("Pattern Matching in Scala", "Euclidean Distance"), each with a Rating and a Match value]

Euclidean Distance
• Alf's presentations: 1, 25, 56, 57, 58, 98 (6)
• Kari's presentations: 2, 25, 98, 99 (4)
• Presentations in common: 25 and 98 (2)
• Unmatched presentations: (6 - 2) + (4 - 2) = 6
• Distance score: 1/(1 + sqrt(6)) ≈ 0.29
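
The same arithmetic as a small function, assuming each user is represented by the set of presentation numbers they attended. `Similarity.score` is a hypothetical name. With attendance treated as 0/1 per presentation, the number of unmatched presentations is the squared Euclidean distance between two users, which is why its square root appears in the score.

```scala
object Similarity {
  // Returns 1.0 for identical sets and moves towards 0 as the sets diverge.
  def score(a: Set[Int], b: Set[Int]): Double = {
    val matched = a.intersect(b).size
    val unmatched = (a.size - matched) + (b.size - matched)
    1.0 / (1.0 + math.sqrt(unmatched.toDouble))
  }
}
```

For Alf and Kari this gives 1/(1 + sqrt(6)), which rounds to 0.29.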

Recommended sessions
• Me: 1, 2, 5, 6, 7
• Kate (0.31): 5, 6, 8, 9
• Paul (0.41): 1, 2, 4, 5, 6
• Mary (0.31): 1, 5, 8, 9
• Recommended: 8 (0.62), 9 (0.62), 4 (0.41)
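
The scores in the list above are consistent with one simple rule: a session I have not attended is scored by the sum of the similarity scores of the users who did attend it (8 and 9 get 0.31 + 0.31, 4 gets 0.41). The slides do not spell the rule out, so the sketch below is an inference from those numbers; `Recommender.recommend` is a hypothetical name.

```scala
object Recommender {
  // others: (similarity to me, sessions attended) for every other user
  def recommend(mine: Set[Int], others: Seq[(Double, Set[Int])]): Seq[(Int, Double)] = {
    // One (session, similarity) pair for every session I have not attended myself
    val candidates = for {
      (similarity, sessions) <- others
      session <- sessions.toSeq if !mine.contains(session)
    } yield (session, similarity)

    candidates
      .groupBy(_._1)
      .toSeq
      .map { case (session, hits) => (session, hits.map(_._2).sum) }
      // Highest score first; equal scores are ordered by session number
      .sortWith { (x, y) => x._2 > y._2 || (x._2 == y._2 && x._1 < y._1) }
  }
}
```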

Demo

More Map/Reduce

Several iterations
Iteration 1
Iteration 2
Iteration 3

Several iterations
Iteration 3
Iteration 1 Iteration 2

Partitioning
[Figure: the keys Paul, Mary, Kate, Lea, Jeff and Ali are partitioned across two reducers; one receives Jeff, Kate and Mary, the other Ali, Lea and Paul]

Comparison
[Figure: records keyed by presentation (Pres 1: Paul, Lea, Ali; Pres 2: Jeff, Mary, Kate) are compared and grouped by key; one reducer receives all Pres 2 records, the other all Pres 1 records]

Guidelines
• Never access external sources during computation
• Your functions should be small and fast
• You might not have all the data available in a single map or reduce call

Hadoop
• Hadoop reuses objects, so remember to clone them if you plan to keep them
• You can read and write any object that implements org.apache.hadoop.io.WritableComparable
• write(DataOutput)
• readFields(DataInput)
• compareTo(Object)
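
To show the shape of these three methods without pulling in Hadoop, the sketch below uses only the JDK. `PresentationUser` is a hypothetical key type invented for the example; in a real job it would extend `org.apache.hadoop.io.WritableComparable[PresentationUser]` instead of plain `Comparable`, with the same method bodies. The `copyOf` helper is one possible way to take an independent copy when the framework reuses an instance.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{DataInput, DataInputStream, DataOutput, DataOutputStream}

// Hypothetical composite key: a presentation id paired with a user name
class PresentationUser(var presentation: Int, var user: String)
    extends Comparable[PresentationUser] {

  // Hadoop creates Writable instances reflectively, so a no-argument constructor is needed
  def this() = this(0, "")

  // Serialise the fields in a fixed order
  def write(out: DataOutput): Unit = {
    out.writeInt(presentation)
    out.writeUTF(user)
  }

  // Overwrite this (possibly reused) instance with the fields read back in the same order
  def readFields(in: DataInput): Unit = {
    presentation = in.readInt()
    user = in.readUTF()
  }

  // Order by presentation first, then by user name
  override def compareTo(other: PresentationUser): Int = {
    val byPresentation = Integer.compare(presentation, other.presentation)
    if (byPresentation != 0) byPresentation else user.compareTo(other.user)
  }
}

object PresentationUser {
  // Round-trips through write/readFields to produce an independent copy
  def copyOf(original: PresentationUser): PresentationUser = {
    val bytes = new ByteArrayOutputStream()
    original.write(new DataOutputStream(bytes))
    val copy = new PresentationUser()
    copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))
    copy
  }
}
```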

Collaborative Filtering, the Map/Reduce way

Overview
• Create an application that recommends JavaZone presentations.
• Overall goal: Scalable performance
• 4 iterations
• Reading input from text file

Iteration 1
• Map input: <user>, <presentations>
• Map output: <presentation>, <user>
• Reduce output: <presentation>, <userList>
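
Iteration 1 inverts the input from "presentations per user" to "users per presentation". A plain Scala sketch of that step follows; the names are hypothetical, and `run` is only an in-memory stand-in for the grouping that the framework performs between map and reduce.

```scala
object Iteration1 {
  // Map: <user>, <presentations> -> one <presentation>, <user> pair per presentation
  def map(user: String, presentations: Seq[Int]): Seq[(Int, String)] =
    presentations.map(p => (p, user))

  // Reduce: <presentation>, <users> -> <presentation>, <userList>
  def reduce(presentation: Int, users: Seq[String]): (Int, Seq[String]) =
    (presentation, users.sorted)

  // In-memory stand-in for the framework: map everything, group by key, reduce each group
  def run(input: Seq[(String, Seq[Int])]): Map[Int, Seq[String]] =
    input
      .flatMap { case (user, presentations) => map(user, presentations) }
      .groupBy(_._1)
      .map { case (presentation, pairs) => reduce(presentation, pairs.map(_._2)) }
}
```

With Alf and Kari from the Euclidean Distance slide as input, presentations 25 and 98 each end up with the user list Alf, Kari.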

Iteration 2
• Map input: <presentation>, <userList>
• Map output: <user>, <userList>
• Reduce input: <user>, <list of userList>
• Reduce output: <userTuple>, <match count>
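
One way to read Iteration 2: the map step hands every user the full list of users who attended the same presentation, and the reduce step counts how often each other user shows up in those lists. The sketch below follows that reading in plain Scala. The names are hypothetical, and emitting each pair only once (with the alphabetically first user on the left) is an assumption.

```scala
object Iteration2 {
  // Map: <presentation>, <userList> -> <user>, <userList> for every user in the list
  def map(presentation: Int, users: Seq[String]): Seq[(String, Seq[String])] =
    users.map(user => (user, users))

  // Reduce: <user>, <list of userList> -> <user tuple>, <match count>
  def reduce(user: String, userLists: Seq[Seq[String]]): Seq[((String, String), Int)] =
    userLists.flatten
      .filter(other => user.compareTo(other) < 0) // skip myself and pairs emitted elsewhere
      .groupBy(identity)
      .toSeq
      .map { case (other, hits) => ((user, other), hits.size) }
      .sortBy(_._1)
}
```

For Alf, whose lists contain Kari twice (presentations 25 and 98), the reduce step emits the tuple (Alf, Kari) with a match count of 2.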

Iteration 3
• Map input: <userTuple>, <match count>
• Map output: <userTuple>, <diff>
• Map output: <userTuple reversed>, <diff>
• Reduce output: <user>, <similaruser>

Iteration 4
• Map input: <user>, <similaruser>
• Map output: <user>, <presentation with score>
• Reduce output: <user>, <presentations>

Demo

Map/Reduce on EC2

Elastic Map/Reduce
• Same code
• Same input
• Different configuration

Upload files
s3cmd.rb put oax-jz10:jar/oax-jz10.jar target/oax.jz10.jar
s3cmd.rb put oax-jz10:input/data.txt data.txt

Create job flow
elastic-mapreduce --create --alive --log-uri s3n://oax-jz10/log

Register iterations
elastic-mapreduce --jobflow j-1NLAIW45QUN4B --jar s3n://oax-jz10/jar/oax-jz10.jar --arg com.openadex.pres.iterations.Iteration1 --arg s3n://oax-jz10/input --arg s3n://oax-jz10/output1

Download output
s3cmd.rb get oax-jz10:output4/part-00000 out

Demo

Summary
• Map/Reduce may be simple
• Map/Reduce can be really powerful
• Collaborative filtering is fun :-)

Thank you
Ole-Martin Mørk
[email protected]/olemartin
del.icio.us/olemartin/jz10
All images are licensed under Creative Commons. See http://bit.ly/mr-photos for details.