Large-scale Recommender Systems on Just a PC
LSRS 2013 keynote (RecSys '13 Hong Kong)
Aapo Kyrölä, Ph.D. candidate @ CMU
http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov
Big Data – small machine


TRANSCRIPT


Large-scale Recommender Systems on Just a PC
LSRS 2013 keynote (RecSys '13 Hong Kong)

Aapo Kyrölä, Ph.D. candidate @ CMU

http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov
Big Data – small machine

This talk has two main goals: 1) to challenge, a little, how we think about scalability: in this case, to show how just a single machine, a Mac Mini, can solve very big problems that people often use something like Hadoop for; 2) to talk about GraphChi, which is my research system, and to show how to implement recommender systems on it.

My Background
- Academic: 5th-year Ph.D. @ Carnegie Mellon. Advisors: Guy Blelloch, Carlos Guestrin (UW).
- Startup entrepreneur

2009–2012:
- Shotgun: parallel L1-regularized regression solver (ICML 2011).
- Internships at MSR Asia (2011) and Twitter (2012).

Habbo: founded 2000

HOW MANY KNOW GRAPHLAB? Because of my industry experience working with very large systems, I always focus on very practical solutions. And it is because of this experience with distributed systems that I really understand the benefits of avoiding them!

Outline of this talk
- Why single-computer computing?
- Introduction to graph computation and GraphChi
- Recommender systems with GraphChi
- Future directions & conclusion

Why on a single machine? Can't we just use the Cloud?

Why use a cluster? Two reasons:
1. One computer cannot handle my problem in a reasonable time.

2. I need to solve the problem very fast.

Let me ask it the other way round: why would you want to use a cluster?

Most people do not have multi-terabyte or petabyte datasets.

Why use a cluster? Two reasons:
1. One computer cannot handle my problem in a reasonable time.

2. I need to solve the problem very fast.

Our work expands the space of feasible (graph) problems on one machine:
- Our experiments use the same graphs, or bigger, than previous papers on distributed graph computation (and we can process the Twitter graph on a laptop).
- Most data is not that big.
Our work raises the bar on the required performance of a complicated distributed system.

Let me ask it the other way round: why would you want to use a cluster?

Benefits of single-machine systems (assuming it can handle your big problems):
- Programmer productivity
- Global state
- Can use real data for development
- Inexpensive to install and administer; less power
- Scalability

Efficient Scaling

[Figure: timeline comparison over a fixed time T. A distributed graph system on 6 machines completes tasks 1–6; doubling to 12 machines gives (significantly) less than 2x throughput. A single-computer system capable of big tasks runs one task per machine and gets exactly 2x throughput with 2x machines (tasks 1–12).]

This is a made-up example to illustrate a point. Relate to the Netflix off-line case.

Here we have chosen T to be the time in which the single-machine system, such as GraphChi, solves one task. Let's assume the cluster system needs 6 machines to solve the problem, and does it about 7 times faster than GraphChi. Then in time T the cluster solves 7 tasks, while 6 machines each running GraphChi solve 6 tasks.

Now if we double the size of the cluster to twelve machines: cluster systems never have linear speedup, so let's assume the performance increases by, say, 50%. Of course these are made-up numbers, but similar behavior appears at some cut-off point anyway. GraphChi, in contrast, will solve exactly twice the number of tasks in time T.


We are not the only ones thinking this way.

Add MSR paper?

Graph computation and GraphChi

Why graphs for recommender systems?
- Graph = matrix: edge(u,v) = M[u,v]. Note: always sparse graphs. (A small sketch of this correspondence follows below.)
- Intuitive, human-understandable representation; easy to visualize and explain.
- Unifies collaborative filtering (typically matrix-based) with recommendation in social networks.
- Random-walk algorithms.
- Local view: vertex-centric computation.
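To make the graph = matrix point concrete, here is a minimal self-contained sketch (not from the talk; the tiny ratings and names are illustrative) storing the same sparse data as bipartite-graph edges and as a sparse matrix:

```cpp
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

// Graph = matrix: a sparse rating matrix M stored as a list of edges,
// where edge(u, v) = M[u, v].
struct Edge { int user, item; float rating; };

int main() {
    std::vector<Edge> edges = {
        {0, 0, 4.0f}, {0, 2, 3.0f}, {1, 1, 5.0f}, {2, 0, 2.0f}};

    // The same data viewed as a sparse matrix lookup:
    std::map<std::pair<int, int>, float> M;
    for (const Edge& e : edges) M[{e.user, e.item}] = e.rating;

    // edge(u, v) and M[u, v] are the same number:
    printf("edge(0,2) = %.1f, M[0,2] = %.1f\n", edges[1].rating, M[{0, 2}]);
}
```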

Vertex-Centric Computational Model
- Graph G = (V, E); directed edges: e = (source, destination).
- Each edge and vertex is associated with a value (of a user-defined type).
- Vertex and edge values can be modified (structure modification is also supported).

[Figure: two vertices A and B joined by an edge; every vertex and edge carries a data value.]

Let's now discuss the computational setting of this work, starting with the basic computational model.

Vertex-centric Programming
- "Think like a vertex."
- Popularized by the Pregel and GraphLab projects.

MyFunc(vertex) { // modify neighborhood }

Note about edge-centric? (A toy sketch of the vertex-centric model follows below.)
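A self-contained toy sketch of this model (not the actual GraphChi API; the PageRank-style update and all names are illustrative):

```cpp
#include <cstdio>
#include <vector>

struct Edge { int src, dst; float value; };

struct Graph {
    int nvertices;
    std::vector<float> vertex_data;
    std::vector<Edge> edges;
};

// "Think like a vertex": the update function sees one vertex and its
// incident edges, and may modify their values.
void update(int v, Graph& g) {
    float sum = 0;
    int out_degree = 0;
    for (Edge& e : g.edges) {
        if (e.dst == v) sum += e.value;   // read in-edges
        if (e.src == v) out_degree++;
    }
    g.vertex_data[v] = 0.15f + 0.85f * sum;
    if (out_degree > 0)                   // spread the new value outwards
        for (Edge& e : g.edges)
            if (e.src == v) e.value = g.vertex_data[v] / out_degree;
}

int main() {
    // A 3-cycle: 0 -> 1 -> 2 -> 0, all values initialized to one.
    Graph g{3, {1, 1, 1}, {{0, 1, 1.0f}, {1, 2, 1.0f}, {2, 0, 1.0f}}};
    for (int iter = 0; iter < 4; iter++)      // the framework sweeps vertices
        for (int v = 0; v < g.nvertices; v++)
            update(v, g);
    for (int v = 0; v < g.nvertices; v++)
        printf("vertex %d: %.3f\n", v, g.vertex_data[v]);
}
```

GraphChi's contribution is running exactly this kind of sweep when the edges live on disk rather than in an in-memory array.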

What is GraphChi?


Both in OSDI '12!

As a recap, GraphChi is a disk-based GraphLab. While GraphLab 2 is incredibly powerful on big clusters, or in the cloud, you can use GraphChi to solve equally big problems on just a Mac Mini. Of course, GraphLab can solve the problems far faster, but I believe GraphChi provides performance that is more than enough for many.

Spin-off of the GraphLab project: a disk-based GraphLab (OSDI '12).

The Main Challenge of Disk-based Graph Computation: Random Access

~100K reads/sec (commodity hardware); ~1M reads/sec (high-end arrays)

(20B edges [Gupta et al. 2013])

GraphChi is Open Source
- C++ and Java versions on GitHub: http://github.com/graphchi
- The Java version has a Hadoop/Pig wrapper, if you really, really want to use Hadoop.

Recsys model training with GraphChi

Overview of Recommender Systems for GraphChi
- Collaborative Filtering toolkit (next slide)
- Link prediction in large networks
- Random-walk based approaches (Twitter); talk on Wednesday.

GraphChi's Collaborative Filtering Toolkit
Developed by Danny Bickson (CMU / GraphLab Inc). Includes:
- Alternating Least Squares (ALS)
- Sparse-ALS
- SVD++
- LibFM (factorization machines)
- GenSGD
- Item-similarity based methods
- PMF
- CliMF (contributed by Mark Levy)

Note: in the C++ version.

A Java version is in development by a CMU team. See Danny's blog for more information: http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html

Two examples: ALS and item-based CF

Example: Alternating Least Squares Matrix Factorization (ALS)
- Reference: Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan: "Large-Scale Parallel Collaborative Filtering for the Netflix Prize" (2008).
- Task: predict ratings for items (movies) by users.
- Model: latent factor model (see next slide).
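For reference, the standard regularized objective that ALS minimizes, where x_u is a user's latent factor, y_i an item's, and lambda the regularization weight (the Zhou et al. paper additionally scales lambda by the vertex degree):

```latex
\min_{x,y} \sum_{(u,i)\in\mathrm{ratings}} \bigl(r_{ui} - x_u^{\top} y_i\bigr)^2
  + \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr)
```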

ALS: Product-Item bipartite graph

[Figure: bipartite graph connecting users to the movies they rated: City of God, Wild Strawberries, The Celebration, La Dolce Vita, Women on the Verge of a Nervous Breakdown.]

[Figure: example ratings (4, 3, 2, 5) on the edges, with a latent factor vector attached to each user and movie vertex.]

A user's rating of a movie is modeled as a dot-product of the two latent factor vectors.

ALS: GraphChi implementation
- The update function handles one vertex at a time (user or movie).
- For each user: estimate latent(user) by minimizing the least-squares error of the dot-product predicted ratings.
- GraphChi executes the update function for each vertex (in parallel) and loads the edges (ratings) from disk.
- Latent factors in memory: needs O(V) memory.
- If the factors don't fit in memory, they can be replicated to the edges, and thus stored on disk.
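A minimal self-contained sketch of one such user update (not the toolkit's code; D and LAMBDA are illustrative choices). With the item factors fixed, the user's factor is the closed-form solution of a small regularized least-squares problem:

```cpp
#include <array>
#include <cstdio>
#include <utility>
#include <vector>

const int D = 5;             // latent dimension (illustrative)
const double LAMBDA = 0.065; // regularization weight (illustrative)

using Vec = std::array<double, D>;
using Mat = std::array<std::array<double, D>, D>;

// Solve the small D x D system A x = b by Gaussian elimination
// (A is symmetric positive definite here, so no pivoting is needed).
Vec solve(Mat A, Vec b) {
    for (int k = 0; k < D; k++)
        for (int r = k + 1; r < D; r++) {
            double f = A[r][k] / A[k][k];
            for (int c = k; c < D; c++) A[r][c] -= f * A[k][c];
            b[r] -= f * b[k];
        }
    Vec x{};
    for (int r = D - 1; r >= 0; r--) {
        double s = b[r];
        for (int c = r + 1; c < D; c++) s -= A[r][c] * x[c];
        x[r] = s / A[r][r];
    }
    return x;
}

// One user update: given (item factor, rating) pairs from the user's edges,
// return x = (sum_i y_i y_i^T + lambda*n*I)^-1 (sum_i r_i y_i).
Vec als_user_update(const std::vector<std::pair<Vec, double>>& ratings) {
    if (ratings.empty()) return Vec{};
    Mat A{};
    Vec b{};
    for (const auto& [y, r] : ratings)
        for (int i = 0; i < D; i++) {
            b[i] += r * y[i];
            for (int j = 0; j < D; j++) A[i][j] += y[i] * y[j];
        }
    for (int i = 0; i < D; i++) A[i][i] += LAMBDA * ratings.size();
    return solve(A, b);
}

int main() {
    // Toy check: a user who rated two items with axis-aligned factors.
    Vec y1{1, 0, 0, 0, 0}, y2{0, 1, 0, 0, 0};
    Vec x = als_user_update({{y1, 4.0}, {y2, 2.0}});
    printf("x = (%.2f, %.2f, ...)\n", x[0], x[1]);  // ~ (3.54, 1.77)
}
```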

Scales to very large problems!

ALS: Performance
Matrix factorization (Alternating Least Squares). Remark: Netflix is not a big problem, but GraphChi will scale at most linearly with input size (ALS is CPU-bound, so it should be sub-linear in the number of ratings).

Example: Item-Based CF
- Task: compute a similarity score [e.g. Jaccard] for each movie pair that has at least one viewer in common. Similarity(X, Y) ~ number of common viewers.
- Output the top-K similar items for each item to a file, or create an edge between X and Y containing the similarity.
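A sketch of one such score: Jaccard similarity computed from two movies' sorted viewer lists (illustrative code, not from the toolkit):

```cpp
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <vector>

// Jaccard(X, Y) = |common viewers| / |viewers of either movie|.
// Both input lists must be sorted by viewer id.
double jaccard(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    double uni = a.size() + b.size() - common.size();
    return uni == 0 ? 0.0 : common.size() / uni;
}

int main() {
    std::vector<int> x = {1, 2, 3, 5}, y = {2, 3, 7};
    printf("Jaccard = %.2f\n", jaccard(x, y));  // 2 common / 5 total = 0.40
}
```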

Problem: enumerating all pairs takes too much time.

[Figure: the same user-movie bipartite graph (City of God, Wild Strawberries, The Celebration, La Dolce Vita, Women on the Verge of a Nervous Breakdown).]

Solution: enumerate all triangles of the graph.

New problem: how to enumerate triangles if the graph does not fit in RAM?

Enumerating Triangles (Item-CF)
- Triangles with edge (u, v) = intersection(neighbors(u), neighbors(v)).
- Iterative, memory-efficient solution (next slide).

Algorithm (PIVOTS):
1. Let the pivots be a subset of the vertices; load all neighbor lists (adjacency lists) of the pivots into RAM.
2. Use GraphChi to load all vertices from disk, one by one, and compare their adjacency lists to the pivots' adjacency lists (similar to a merge).
3. Repeat with a new subset of pivots.
(A sketch of this scheme appears after this section.)

Triangle Counting Performance

Future directions & final remarks

Single-Machine Computing in Production?
- GraphChi supports incremental computation with dynamic graphs: it can keep running indefinitely, adding new edges to the graph, so the model stays constantly fresh.
- However, this requires engineering that is not included in the toolkit.
- Compare to a cluster-based system (such as Hadoop) that needs to recompute from scratch.
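The sketch promised in the pivot algorithm above: a minimal in-memory version. In GraphChi the non-pivot adjacency lists would stream from disk; here a plain loop stands in for that streaming pass, and the batch parameter and all names are illustrative.

```cpp
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <vector>

using AdjList = std::vector<int>;  // sorted neighbor ids

// Pivot scheme: pin a batch of pivot adjacency lists "in RAM", stream every
// vertex past them, and intersect sorted neighbor lists (a merge). Each
// triangle is seen once per edge, hence the final division by 3.
long count_triangles(const std::vector<AdjList>& graph, size_t batch) {
    long total = 0;
    size_t n = graph.size();
    for (size_t start = 0; start < n; start += batch) {
        size_t end = std::min(n, start + batch);    // pivots [start, end)
        for (size_t v = 0; v < n; v++) {            // streamed from "disk"
            for (size_t p = start; p < end; p++) {
                if (p >= v) continue;               // visit each pair once
                // Only pairs joined by an edge can close a triangle:
                if (!std::binary_search(graph[v].begin(), graph[v].end(),
                                        (int)p))
                    continue;
                AdjList common;
                std::set_intersection(graph[p].begin(), graph[p].end(),
                                      graph[v].begin(), graph[v].end(),
                                      std::back_inserter(common));
                total += (long)common.size();
            }
        }
    }
    return total / 3;
}

int main() {
    // Triangle 0-1-2 plus a pendant vertex 3 attached to vertex 1.
    std::vector<AdjList> g = {{1, 2}, {0, 2, 3}, {0, 1}, {1}};
    printf("triangles: %ld\n", count_triangles(g, 2));  // prints 1
}
```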

Efficient Scaling
- Businesses need to compute hundreds of distinct tasks on the same graph. Example: personalized recommendations.
- Parallelize each task? Or parallelize across tasks?

[Figure: the same set of tasks either split across a cluster, or run whole, one task per machine.]

Another, perhaps a bit surprising, motivation comes from thinking about scalability at a large scale.

The industry wants to compute many tasks on the same graph. For example, to compute personalized recommendations, the same task is computed for people in different countries, different interest groups, etc.

Currently, you need a cluster just to compute one single task. To compute tasks faster, you grow the cluster.

But this work allows a different way: since one machine can handle one big task, you can dedicate one task per machine.

Why does this make sense? Clusters are complex and expensive to scale, while in this new model things are very simple: the nodes do not talk to each other, and you can double the throughput by doubling the machines.

There are other motivations as well, such as reducing costs and energy. But let's move on.

Single Machine vs. Cluster
- Most Big Data computations are I/O-bound.
  - Single machine: disk bandwidth + seek latency.
  - Distributed memory: network bandwidth + network latency.
- Complexity / challenges:
  - Single machine: algorithms and data structures that reduce random access.
  - Distributed: administration, coordination, consistency, fault tolerance.
- Total cost.
- Programmer productivity.
- Specialized vs. generalized frameworks.

Single-machine systems are easy to program, but they currently need specialized solutions, whereas with Hadoop etc. you can use the same framework for a wide variety of problems.

Unified Recsys Platform for GraphChi?
- Working with master's students at CMU.
- Goal: the ability to easily compare different algorithms and parameters.
- Unified input and output.
- General programmable API (not just file-based).
- Evaluation process: several evaluation metrics; cross-validation, held-out data.
- Run many algorithm instances in parallel, on the same graph.
- Java.
- Scalable from the get-go.

Recent developments: Disk-based Graph Computation
- Two disk-based graph computation systems were published recently: TurboGraph (KDD '13) and X-Stream (SOSP '13, in October).
- Significantly better performance than GraphChi on many problems; they avoid preprocessing (sharding).
- But GraphChi can do some computations that X-Stream cannot (triangle counting and related), and TurboGraph requires an SSD.
- A hot research area!

Do you need GraphChi, or any system at all?
- Heck, for many algorithms you can just mmap() over your (binary) adjacency list / sparse matrix and write a for-loop (a sketch follows below).
- See Lin, Chau, Kang: "Leveraging Memory Mapping for Fast and Scalable Graph Computation on a PC" (Big Data '13).
- Obviously it is good to have a common API, and some algorithms need more advanced solutions (like GraphChi, X-Stream, TurboGraph).
- Beware of the hype!

Conclusion
- Very large recommender algorithms can now be run on just your PC or laptop.
- Additional performance comes from multi-core parallelism.
- Great for productivity; scale by replicating.
- In general, good single-machine scalability requires care with data structures and memory management: natural with C/C++, while with Java (etc.) you need low-level byte massaging.
- Frameworks like GraphChi hide the low level.
- More work is needed to productize the current work.
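A minimal sketch of that mmap()-and-a-for-loop approach, assuming a hypothetical binary edge-list format of 32-bit vertex-id pairs; it maps the file read-only and counts out-degrees in one sequential pass:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s edges.bin\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    // Map the whole file; the OS pages it in lazily, no buffering code needed.
    const int* edges = static_cast<const int*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    size_t nedges = st.st_size / (2 * sizeof(int));

    size_t nvertices = 0;  // assume ids are 0..max
    for (size_t i = 0; i < 2 * nedges; i++)
        nvertices = std::max(nvertices, (size_t)edges[i] + 1);

    std::vector<long> degree(nvertices, 0);
    for (size_t i = 0; i < nedges; i++)  // the "for-loop" in question
        degree[edges[2 * i]]++;          // out-degree of each source vertex

    printf("%zu vertices, %zu edges\n", nvertices, nedges);
    munmap((void*)edges, st.st_size);
    close(fd);
}
```

Sequential scans like this are exactly the access pattern disks and SSDs serve well; the random-access challenge from earlier in the talk only appears once the algorithm needs per-vertex neighborhoods.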

Thank you!

Aapo Kyrölä, Ph.D. candidate @ CMU, soon to graduate! (Currently visiting U.W.)

http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov