Collaborative Filtering in Map/Reduce
Ole-Martin Mørk - Open AdExchange
Tuesday, 14 September 2010
Vision
• Learn that Map/Reduce is simple
• Learn that Map/Reduce may be powerful
• Collaborative Filtering is fun!
Agenda
• Map/Reduce
• Collaborative Filtering
• Collaborative Filtering with Map/Reduce
• Amazon Elastic MapReduce
Map/Reduce
• Very scalable algorithm
• Inspired by map and reduce from functional programming
• Everything is based on key/value pairs
6 phases
• Reader
• Map
• Partition
• Comparison
• Reduce
• Writer
Map
Functional map
List("hello", "dude").map { x => x.substring(0, 1) }
Map/Reduce map
• Input is key/value
• Output is key/value
Simple Example, Map
• Count occurrences of words in a document
• Input is: <line number>, <content of line>
• For each word on the line, the output is <word>, <count>
Map
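The map step of the word count example can be sketched in plain Scala, without Hadoop. This is a minimal illustration, not the code shown in the talk: the object name `WordCountMapSketch`, the lower-casing, and the choice to emit a count of 1 per occurrence are assumptions.

```scala
object WordCountMapSketch {
  // Map step: input is (line number, line content), output is one
  // (word, 1) pair per word occurrence. The line number is not used.
  // Assumption: words are lower-cased and split on non-word characters
  // (Java's \W, so only ASCII letters, digits and underscore count as word characters).
  def map(lineNumber: Long, line: String): List[(String, Int)] =
    line.toLowerCase
      .split("\\W+")
      .filter(_.nonEmpty) // a leading separator produces one empty token
      .map(word => (word, 1))
      .toList
}
```

Calling `WordCountMapSketch.map(1L, "Hello dude, hello")` yields three pairs, two of them for "hello". Adding them up is left to the reduce step.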
Reduce
Functional reduce
val sum = List(32, 40, 23).reduceLeft { _ + _ }
Map/Reduce reduce
• Input is key/list of values
• Output is key/value
Simple Example, Reduce
• Reduce input is <word>, <list of counts>
• For each value in the list, we increase the count
• Output is <word>, <sum of counts>
Reduce
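The complete word count, map plus reduce, can be simulated in memory as follows. It is a sketch under assumptions: `WordCountSketch` and `run` are illustrative names, and the `groupBy` call stands in for the grouping that the framework performs between the map and reduce steps.

```scala
object WordCountSketch {
  // Map: one (word, 1) pair per word on the line (whitespace-separated).
  def map(lineNumber: Long, line: String): List[(String, Int)] =
    line.split("\\s+").toList.filter(_.nonEmpty).map(word => (word, 1))

  // Reduce: input is a word and the list of its counts;
  // each value increases the running count.
  def reduce(word: String, counts: List[Int]): (String, Int) =
    (word, counts.foldLeft(0)(_ + _))

  // In-memory stand-in for the framework: map every line, collect the
  // values of each key, then reduce each key once.
  def run(lines: List[String]): Map[String, Int] = {
    val mapped = lines.zipWithIndex.flatMap { case (line, i) => map(i.toLong, line) }
    val grouped = mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2)) }
    grouped.map { case (w, counts) => reduce(w, counts) }
  }
}
```

In a real cluster the grouped keys are spread over several reducers; here everything runs in one process to keep the data flow visible.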
Collaborative Filtering
Amazon
Last.fm
Sceneami.com
User based
• Useful when we have
• Small number of users
• High correlation between users
• Data that changes often
Item based
• Useful for big sites like Amazon etc.
• Small overlap between users
• Mostly static data
[Figure: Euclidean distance. Labels: "Min drømmeapplikasjon", "Pattern Matching in Scala", "Rating", "Match"]
Euclidean Distance
• Alf's presentations: 1, 25, 56, 57, 58, 98 (6)
• Kari's presentations: 2, 25, 98, 99 (4)
• Presentations in common: 25 and 98 (2)
• Unmatched presentations: (6 - 2) + (4 - 2) = 6
• Distance score: 1 / (1 + sqrt(6)) = 0.29
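The arithmetic on this slide can be written as a small Scala function. This is a sketch that follows the slide's numbers; the names `SimilaritySketch`, `unmatched` and `score` are illustrative, and presentations are represented as sets of integer ids.

```scala
object SimilaritySketch {
  // Presentations attended by exactly one of the two users.
  def unmatched(a: Set[Int], b: Set[Int]): Int = {
    val shared = (a intersect b).size
    (a.size - shared) + (b.size - shared)
  }

  // Distance score from the slide: 1 / (1 + sqrt(unmatched)).
  // Identical sets score 1.0; the score falls towards 0 as the sets diverge.
  def score(a: Set[Int], b: Set[Int]): Double =
    1.0 / (1.0 + math.sqrt(unmatched(a, b).toDouble))
}
```

With Alf's set `Set(1, 25, 56, 57, 58, 98)` and Kari's set `Set(2, 25, 98, 99)`, `unmatched` is 6 and `score` rounds to 0.29.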
Recommended sessions
• Me: 1,2,5,6,7
• Kate (0.31): 5,6,8,9
• Paul (0.41): 1,2,4,5,6
• Mary (0.31): 1,5,8,9
• Recommended: 8 (0.62), 9 (0.62), 4 (0.41)
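The scores on this slide are consistent with a simple rule: a session I have not attended gets the sum of the similarity scores of the users who attended it. The sketch below assumes that rule and the distance score 1 / (1 + sqrt(unmatched)); `RecommenderSketch` and its method names are illustrative.

```scala
object RecommenderSketch {
  // Distance score: 1 / (1 + sqrt(number of unmatched sessions)).
  def similarity(a: Set[Int], b: Set[Int]): Double = {
    val shared = (a intersect b).size
    1.0 / (1.0 + math.sqrt(((a.size - shared) + (b.size - shared)).toDouble))
  }

  // For every session I have not attended, add up the similarity of
  // each user who did attend it. Highest score first, ties by session id.
  def recommend(me: Set[Int], others: Map[String, Set[Int]]): List[(Int, Double)] = {
    val scored = others.values.toList.flatMap { sessions =>
      val sim = similarity(me, sessions)
      (sessions diff me).toList.map(session => (session, sim))
    }
    scored.groupBy(_._1)
      .map { case (session, hits) => (session, hits.map(_._2).sum) }
      .toList
      .sortWith((x, y) => x._2 > y._2 || (x._2 == y._2 && x._1 < y._1))
  }
}
```

For the users on the slide this ranks session 8 and session 9 first (about 0.62 each, from Kate and Mary) and session 4 last (about 0.41, from Paul).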
Demo
More Map/Reduce
Several iterations
Iteration 1
Iteration 2
Iteration 3
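Chaining iterations means that the output of one Map/Reduce job becomes the input of the next. The sketch below shows the idea with a tiny in-memory job runner; `ChainSketch`, `job`, and the two example iterations (a word count followed by a histogram of the counts) are illustrative assumptions and not part of the talk.

```scala
object ChainSketch {
  // In-memory stand-in for one Map/Reduce job: map every record,
  // group the mapped values by key, then reduce each key.
  def job[K1, V1, K2, V2, K3, V3](input: List[(K1, V1)])(
      mapper: (K1, V1) => List[(K2, V2)])(
      reducer: (K2, List[V2]) => List[(K3, V3)]): List[(K3, V3)] = {
    val mapped = input.flatMap { case (k, v) => mapper(k, v) }
    mapped.groupBy(_._1).toList.flatMap { case (k, pairs) => reducer(k, pairs.map(_._2)) }
  }

  // Iteration 1: (line number, line) -> (word, count).
  def wordCounts(lines: List[(Int, String)]): List[(String, Int)] = {
    val mapper = (lineNo: Int, line: String) =>
      line.split(" ").toList.filter(_.nonEmpty).map(w => (w, 1))
    val reducer = (word: String, ones: List[Int]) => List((word, ones.sum))
    job(lines)(mapper)(reducer)
  }

  // Iteration 2 reads the output of iteration 1:
  // (word, count) -> (count, number of words with that count).
  def histogram(counts: List[(String, Int)]): List[(Int, Int)] = {
    val mapper = (word: String, count: Int) => List((count, word))
    val reducer = (count: Int, words: List[String]) => List((count, words.size))
    job(counts)(mapper)(reducer)
  }
}
```

`ChainSketch.histogram(ChainSketch.wordCounts(lines))` runs the two iterations back to back; with Hadoop the hand-over would typically go through files written by the first job.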
Partitioning
[Figure: the keys Paul, Mary, Kate, Lea, Jeff, Ali are divided between two reducers: Jeff, Kate, Mary and Ali, Lea, Paul]
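A partition function decides which reducer receives a given key. The sketch below uses the same formula as Hadoop's default HashPartitioner (hash code with the sign bit masked, modulo the number of reducers). The names `PartitionSketch` and `assign` are illustrative, and the split it produces will not necessarily match the one in the figure.

```scala
object PartitionSketch {
  // Mask the sign bit so the result is never negative, then take the
  // remainder by the number of reducers.
  def partition(key: String, numReducers: Int): Int =
    (key.hashCode & Int.MaxValue) % numReducers

  // Group a list of keys by the reducer they would be sent to.
  def assign(keys: List[String], numReducers: Int): Map[Int, List[String]] =
    keys.groupBy(key => partition(key, numReducers))
}
```

The important property is that the same key always lands on the same reducer, so all values for a key meet in one place.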
Comparison
[Figure: two reducers; the records for Pres 1 (Paul, Lea, Ali) and Pres 2 (Jeff, Mary, Kate) arrive grouped by presentation]
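The comparison phase orders the records by key before they reach the reduce function, so that all values for one key arrive together. A minimal in-memory sketch, assuming the keys have a natural ordering; `ComparisonSketch` and `sortAndGroup` are illustrative names.

```scala
object ComparisonSketch {
  // Sort the records by key with the given ordering (the sort is stable),
  // then merge neighbouring records with equal keys into one group.
  def sortAndGroup[K, V](records: List[(K, V)])(implicit ord: Ordering[K]): List[(K, List[V])] = {
    val sorted = records.sortBy(_._1)
    sorted.foldRight(List.empty[(K, List[V])]) {
      case ((k, v), (k2, vs) :: rest) if ord.equiv(k, k2) => (k, v :: vs) :: rest
      case ((k, v), acc) => (k, List(v)) :: acc
    }
  }
}
```

Given records such as `("Pres 2", "Kate")` and `("Pres 1", "Paul")` in mixed order, the result has one entry per presentation with its users in their original relative order.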
Guidelines
• Never access external sources during computation.
• Your functions should be small and fast
• You might not have all the data available
Hadoop
• Hadoop reuses objects, so remember to clone them if you plan to keep them.
• You can read and write any object that implements org.apache.hadoop.io.WritableComparable:
• write(DataOutput)
• readFields(DataInput)
• compareTo(Object)
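The three methods can be illustrated with a small key class. To keep the sketch runnable without Hadoop on the classpath, it uses a local trait `WritableLike` as a stand-in for Hadoop's interface; that trait, the `UserPair` class and the `roundTrip` helper are assumptions made for illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{DataInput, DataInputStream, DataOutput, DataOutputStream}

// Local stand-in for the three methods listed above (not the Hadoop type).
trait WritableLike[T] extends Comparable[T] {
  def write(out: DataOutput): Unit
  def readFields(in: DataInput): Unit
}

// A pair of user names that can serve as a key.
class UserPair(var first: String, var second: String) extends WritableLike[UserPair] {
  // No-argument constructor: Hadoop creates instances first and fills
  // them through readFields afterwards.
  def this() = this("", "")

  def write(out: DataOutput): Unit = { out.writeUTF(first); out.writeUTF(second) }

  def readFields(in: DataInput): Unit = { first = in.readUTF(); second = in.readUTF() }

  // Order by the first user, then by the second.
  def compareTo(other: UserPair): Int = {
    val c = first.compareTo(other.first)
    if (c != 0) c else second.compareTo(other.second)
  }
}

object UserPairSketch {
  // Serialize a pair to bytes and read it back into a fresh instance.
  def roundTrip(pair: UserPair): UserPair = {
    val bytes = new ByteArrayOutputStream()
    pair.write(new DataOutputStream(bytes))
    val copy = new UserPair()
    copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))
    copy
  }
}
```

Because the fields are overwritten by `readFields`, an instance handed to your code may be refilled with the next record, which is why the slide recommends cloning anything you keep.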
Collaborative Filtering, the Map/Reduce way
Overview
• Create an application that recommends JavaZone presentations.
• Overall goal: Scalable performance
• 4 iterations
• Reading input from a text file
Iteration 1
• Map input: <user>, <presentations>
• Map output: <presentation>, <user>
• Reduce output: <presentation>, <userList>
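Iteration 1 inverts the input: from presentations per user to users per presentation. A minimal in-memory sketch follows; `Iteration1Sketch` is an illustrative name (not the job class from the talk), presentations are integer ids, and the user list is sorted only to make the output deterministic.

```scala
object Iteration1Sketch {
  // Map: (user, presentations) -> one (presentation, user) pair per presentation.
  def map(user: String, presentations: List[Int]): List[(Int, String)] =
    presentations.map(p => (p, user))

  // Reduce: (presentation, users) -> (presentation, userList).
  def reduce(presentation: Int, users: List[String]): (Int, List[String]) =
    (presentation, users.sorted)

  // In-memory stand-in for the framework's grouping between map and reduce.
  def run(input: Map[String, List[Int]]): Map[Int, List[String]] = {
    val mapped = input.toList.flatMap { case (u, ps) => map(u, ps) }
    mapped.groupBy(_._1).map { case (p, pairs) => reduce(p, pairs.map(_._2)) }
  }
}
```

For Alf (1, 25, 56, 57, 58, 98) and Kari (2, 25, 98, 99) the result lists both users under presentations 25 and 98 and a single user under the others.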
Iteration 2
• Map input: <presentation>, <userList>
• Map output: <user>, <userList>
• Reduce input: <user>, <list of userList>
• Reduce output: <userTuplet>, <match count>
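Iteration 2 counts, for each pair of users, how many presentations they share. The sketch below is one possible reading of the slide. It assumes that each pair is emitted once, with the alphabetically smaller user first, and `Iteration2Sketch` is an illustrative name.

```scala
object Iteration2Sketch {
  // Map: every user who attended the presentation receives the full attendee list.
  def map(presentation: Int, users: List[String]): List[(String, List[String])] =
    users.map(u => (u, users))

  // Reduce: count how often each other user appears in this user's lists.
  // Assumption: only pairs with user < other are emitted, so each pair occurs once.
  def reduce(user: String, userLists: List[List[String]]): List[((String, String), Int)] =
    userLists.flatten
      .filter(other => user < other)
      .groupBy(identity)
      .map { case (other, hits) => ((user, other), hits.size) }
      .toList

  // In-memory stand-in for the framework's grouping between map and reduce.
  def run(input: Map[Int, List[String]]): Map[(String, String), Int] = {
    val mapped = input.toList.flatMap { case (p, us) => map(p, us) }
    mapped.groupBy(_._1).toList
      .flatMap { case (u, pairs) => reduce(u, pairs.map(_._2)) }
      .toMap
  }
}
```

With Alf and Kari both listed under presentations 25 and 98, the only output record is the pair (Alf, Kari) with a match count of 2.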
Iteration 3
• Map input: <userTuplet>, <match count>
• Map output: <userTuplet>, <diff>
• Map output: <userTuplet reversed>, <diff>
• Reduce output: <user>, <similaruser>
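Iteration 3 can be read as turning each match count into a diff (the number of unmatched presentations) and then into a similarity score per user. The slide does not say where the per-user totals come from, so the sketch below passes them in as a hypothetical `attended` map. That parameter, the scoring formula 1 / (1 + sqrt(diff)), and the name `Iteration3Sketch` are all assumptions.

```scala
object Iteration3Sketch {
  type Pair = (String, String)

  // Map: compute the diff for the pair and emit it under both orderings,
  // so that each user later sees the other as a candidate.
  // `attended` (number of presentations per user) is a hypothetical extra input.
  def map(pair: Pair, matches: Int, attended: Map[String, Int]): List[(Pair, Int)] = {
    val (a, b) = pair
    val diff = (attended(a) - matches) + (attended(b) - matches)
    List(((a, b), diff), ((b, a), diff))
  }

  // Reduce: turn the diff into a score and key the record by the first user.
  def reduce(pair: Pair, diffs: List[Int]): (String, (String, Double)) = {
    val (user, similar) = pair
    (user, (similar, 1.0 / (1.0 + math.sqrt(diffs.head.toDouble))))
  }

  // In-memory stand-in for the framework's grouping between map and reduce.
  def run(input: Map[Pair, Int], attended: Map[String, Int]): List[(String, (String, Double))] = {
    val mapped = input.toList.flatMap { case (pair, m) => map(pair, m, attended) }
    mapped.groupBy(_._1).toList.map { case (pair, kvs) => reduce(pair, kvs.map(_._2)) }
  }
}
```

For the pair (Alf, Kari) with 2 matches and totals of 6 and 4, both Alf and Kari receive the other as a similar user with a score of about 0.29.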
Iteration 4
• Map input: <user>, <similaruser>
• Map output: <user>, <presentation with score>
• Reduce output: <user>, <presentations>
Demo
Map/Reduce on EC2
Elastic Map/Reduce
• Same code
• Same input
• Different configuration
Upload files
s3cmd.rb put oax-jz10:jar/oax-jz10.jar target/oax.jz10.jar
s3cmd.rb put oax-jz10:input/data.txt data.txt
Create job flow
elastic-mapreduce --create --alive --log-uri s3n://oax-jz10/log
Register iterations
elastic-mapreduce --jobflow j-1NLAIW45QUN4B --jar s3n://oax-jz10/jar/oax-jz10.jar --arg com.openadex.pres.iterations.Iteration1 --arg s3n://oax-jz10/input --arg s3n://oax-jz10/output1
Download output
s3cmd.rb get oax-jz10:output4/part-00000 out
Demo
Summary
• Map/Reduce may be simple
• Map/Reduce can be really powerful
• Collaborative filtering is fun :-)
Thank you
Ole-Martin Mørk
[email protected]/olemartin
del.icio.us/olemartin/jz10
All images are licensed with Creative Commons. See http://bit.ly/mr-photos for details.