
PAGE: A Partition Aware Engine for Parallel Graph Computation

Yingxia Shao, Bin Cui, Senior Member, IEEE and Lin Ma

Abstract—Graph partition quality affects the overall performance of parallel graph computation systems. The quality of a graph partition is measured by the balance factor and the edge cut ratio. A balanced graph partition with a small edge cut ratio is generally preferred since it reduces the expensive network communication cost. However, according to an empirical study on Giraph, the performance on a well partitioned graph can be two times worse than on a simple random partition. This is because these systems only optimize for simple partition strategies and cannot efficiently handle the increased workload of local message processing when a high quality graph partition is used. In this paper, we propose a novel partition aware graph computation engine named PAGE, which is equipped with a new message processor and a dynamic concurrency control model. The new message processor concurrently processes local and remote messages in a unified way. The dynamic model adaptively adjusts the concurrency of the processor based on online statistics. The experimental evaluation demonstrates the superiority of PAGE over graph partitions with various qualities.

Index Terms—Graph Computation, Graph Partition, Message Processing


1 INTRODUCTION

Massive graphs are prevalent nowadays. Prominent examples include web graphs, social networks and other interaction networks in bioinformatics. An up-to-date web graph contains billions of nodes and trillions of edges. Graph structures can represent various relationships between objects and better model complex data scenarios. Graph-based processing facilitates many important applications, such as linkage analysis [8], [18], community discovery [20], pattern matching [22] and machine learning factorization models [3].

With the stunning growth of a variety of large graphs and diverse applications, parallel processing has become the de facto graph computing paradigm for large scale graph analysis. Many parallel graph computation systems have been introduced, e.g., Pregel, Giraph, GPS and GraphLab [23], [1], [27], [21]. These systems follow the vertex-centric programming model. The graph algorithms in them are split into several supersteps by synchronization barriers. In a superstep, each active vertex simultaneously updates its status based on the neighbors' messages from the previous superstep, and then sends the new status as a message to its neighbors. Since the number of workers (computing nodes) is limited in practice, a worker usually stores a subgraph, not a single vertex, locally, and sequentially executes its local active vertices. The computations of different workers run in parallel.
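To make the vertex-centric model concrete, the following minimal sketch shows a PageRank-style vertex program in Java; the interface and names are illustrative only, not Giraph's or PAGE's actual API.

    public interface VertexContext<M> {
        Iterable<M> messages();          // messages received from the previous superstep
        void sendToNeighbors(M msg);     // enqueue a message for every out-neighbor
        void voteToHalt();               // deactivate this vertex
        long superstep();                // current superstep number
        int numOutEdges();
    }

    public class PageRankVertex {
        private double value = 1.0;

        // Called once per superstep for every active vertex.
        public void compute(VertexContext<Double> ctx) {
            if (ctx.superstep() > 0) {
                double sum = 0.0;
                for (double m : ctx.messages()) sum += m;   // combine neighbors' messages
                value = 0.15 + 0.85 * sum;                  // PageRank update rule
            }
            if (ctx.superstep() < 30) {
                ctx.sendToNeighbors(value / ctx.numOutEdges()); // send new status onward
            } else {
                ctx.voteToHalt();                           // stop after a fixed number of supersteps
            }
        }
    }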

• Yingxia Shao, Bin Cui (corresponding author) and Lin Ma are with the Key Lab of High Confidence Software Technologies (MOE), School of EECS, Peking University, China. E-mail: {simon0227, bin.cui, malin1993ml}@pku.edu.cn

Therefore, graph partition is one of the key components that affect graph computing performance. It splits the original graph into several subgraphs, such that these subgraphs are of about the same size and there are few edges between separated subgraphs. A high quality graph partition is one in which few edges connect different subgraphs while each subgraph is of similar size. The ratio of the edges crossing different subgraphs to the total edges is called the edge cut (ratio). A well balanced partition (or high quality partition) usually has a small edge cut and helps improve system performance, because the small edge cut reduces the expensive communication cost between different subgraphs, and the balance property generally guarantees that each subgraph has a similar computation workload.
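For concreteness, the edge cut ratio of a given assignment can be computed with one scan over the edge list. A minimal sketch, assuming edges are stored as (source, target) pairs and a vertex-to-partition map is available (both representations are illustrative):

    import java.util.Map;

    public final class EdgeCut {
        // Returns the fraction of edges whose endpoints fall into
        // different partitions, given a vertex -> partition-id map.
        public static double edgeCutRatio(long[][] edges, Map<Long, Integer> part) {
            long cut = 0;
            for (long[] e : edges) {                              // e = {source, target}
                if (!part.get(e[0]).equals(part.get(e[1]))) cut++;
            }
            return edges.length == 0 ? 0.0 : (double) cut / edges.length;
        }
    }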

Fig. 1. PageRank on various web graph1 partitions.

(a) Partition quality:

    Scheme    Edge Cut
    Random    98.52%
    LDG1      82.88%
    LDG2      75.69%
    LDG3      66.37%
    LDG4      56.34%
    METIS      3.48%

(b) Computing performance: average time (s/iteration) per partition scheme, broken down into overall cost, sync remote communication cost and local communication cost. [bar chart omitted]

However, in practice, a well balanced graph partition can even lead to a decrease in overall performance in existing systems. Figure 1 shows the performance of the PageRank algorithm on six different

1. uk-2007-05-u, http://law.di.unimi.it/datasets.php; please refer to the detailed experiment setup in Section 6.


partition schemes of a large web graph dataset, and apparently the overall cost of PageRank per iteration increases as the quality of the graph partition improves. As an example, when the edge cut ratio is about 3.48% with METIS, the performance is about two times worse than with the simple random partition scheme whose edge cut is 98.52%. This indicates that a parallel graph system may not benefit from a high quality graph partition.

Figure 1(b) also lists the local communication cost and the sync remote communication cost (explained in Section 2). We can see that, when the edge cut ratio decreases, the sync remote communication cost is reduced as expected. However, the local communication cost unexpectedly increases fast, which directly leads to the degradation of overall performance. This abnormal outcome implies that local message processing becomes a bottleneck in the system and dominates the overall cost when its workload increases.

Many existing parallel graph systems are unaware of this effect of the underlying partitioned subgraphs, and ignore the increasing workload of local message processing when the quality of the partition scheme is improved. Therefore, these systems handle local messages and remote messages unequally and only optimize the processing of remote messages. Though there is a simple extension of a centralized message buffer used to process local and remote incoming messages all together [27], the existing graph systems still cannot effectively exploit the benefit of high quality graph partitions.

In this paper, we present a novel graph computation engine, the Partition Aware Graph computation Engine (PAGE). To efficiently support computation tasks with different partitioning qualities, we develop some unique components in this new framework. First, in PAGE's worker, the communication module is extended with a new dual concurrent message processor. The message processor concurrently handles both local and remote incoming messages in a unified way, thus accelerating message processing. Furthermore, the concurrency of the message processor is tunable according to online statistics of the system. Second, a partition aware module is added to each worker to monitor the partition-related characteristics and adjust the concurrency of the message processor adaptively to fit the online workload.

To fulfill the goal of estimating a reasonable concurrency for the dual concurrent message processor, we introduce the Dynamic Concurrency Control Model (DCCM). Since the message processing pipeline satisfies the producer-consumer model, several heuristic rules are proposed by considering the producer-consumer constraints. With the help of the DCCM, PAGE provides sufficient message process units to handle the current workload, and each message process unit has a similar workload. Consequently, PAGE can adapt to graph partitions of various qualities.

A prototype of PAGE has been set up on top of Giraph (version 0.2.0). The experimental results demonstrate that the optimizations in PAGE can enhance the performance of both stationary and non-stationary graph algorithms on graph partitions with various qualities.

The main contributions of our work are summarized as follows.

• We identify the problem that existing graph computation systems cannot efficiently exploit the benefit of high quality graph partitions.

• We design a new partition aware graph computation engine, called PAGE. It can effectively harness the partition information to guide parallel processing resource allocation and improve computation performance.

• We introduce a dual concurrent message processor. The message processor concurrently processes incoming messages in a unified way and is the cornerstone of PAGE.

• We present a dynamic concurrency control model. The model estimates the concurrency of the dual concurrent message processor by satisfying the producer-consumer constraints, and always generates proper configurations for PAGE when the graph applications or the underlying graph partitions change.

• We implement a prototype of PAGE and test it with real-world graphs and various graph algorithms. The results clearly demonstrate that PAGE performs efficiently over various graph partitioning qualities.

This paper extends a preliminary work [32] in the following aspects. First, we analyze in detail the relationship among the workload of message processing, graph algorithms and graph partitions. Second, the technical specifics behind the dynamic concurrency control model are analyzed clearly. Third, a practical dynamic concurrency control model, which estimates the concurrency in an incremental fashion, is discussed. Fourth, to show the effectiveness of the DCCM, comparison experiments between the DCCM and manual tuning are conducted. Fifth, to show the advantage and generality of PAGE, more graph algorithms, such as diameter estimation, breadth first search and single source shortest path, are run on PAGE.

The remainder of the paper is organized as follows. Section 2 discusses the workload of message processing in graph computation systems. We introduce PAGE's framework in Section 3. Sections 4 and 5 elaborate the dual concurrent message processor and the dynamic concurrency control model, respectively. The experimental results are shown in Section 6. Finally, we review the related work and conclude the paper in Section 7 and Section 8.


2 THE WORKLOAD OF MESSAGE PROCESSING

In Pregel-like graph computation systems, vertices exchange their status through message passing. When a vertex sends a message, the worker first determines whether the destination vertex of the message is owned by a remote worker or the local worker. In the remote case, the message is buffered first. When the buffer size exceeds a certain threshold, the largest buffer is asynchronously flushed, delivering its messages to the destination worker as a single network message. In the local case, the message is directly placed in the destination vertex's incoming message queue [23].
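A minimal sketch of this routing logic follows; the buffer threshold, block layout and class names are our illustrative stand-ins for what a real Pregel-like worker implements.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Queue;

    public class MessageRouter<M> {
        private static final int FLUSH_THRESHOLD = 4096;        // illustrative buffer limit
        private final int myWorker;
        private final Map<Integer, List<M>> remoteBuffers = new HashMap<>(); // per remote worker
        private final Map<Long, Queue<M>> localVertexQueues;    // per-vertex incoming queues

        public MessageRouter(int myWorker, Map<Long, Queue<M>> localVertexQueues) {
            this.myWorker = myWorker;
            this.localVertexQueues = localVertexQueues;
        }

        public void send(long dstVertex, int dstWorker, M msg) {
            if (dstWorker == myWorker) {
                // Local case: place the message directly into the vertex's queue.
                localVertexQueues.get(dstVertex).add(msg);
            } else {
                // Remote case: buffer, then flush as one block once the buffer is full.
                List<M> buf = remoteBuffers.computeIfAbsent(dstWorker, w -> new ArrayList<>());
                buf.add(msg);            // a real block would also carry destination vertex ids
                if (buf.size() >= FLUSH_THRESHOLD) {
                    flushAsync(dstWorker, buf);
                    remoteBuffers.put(dstWorker, new ArrayList<>());
                }
            }
        }

        private void flushAsync(int worker, List<M> block) {
            // Stand-in for an asynchronous network send of the whole block.
        }
    }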

Therefore, the communication cost in a single worker is split into a local communication cost and a remote communication cost. Combined with the computation cost, the overall cost of a worker has three components. The computation cost, denoted by tcomp, is the cost of executing vertex programs. The local communication cost, denoted by tcomml, represents the cost of processing messages generated by the worker itself. The remote communication cost, denoted by tcommr, includes the cost of sending messages to other workers and waiting for them to be processed. In this paper, we use the cost of processing remote incoming messages locally to approximate the remote communication cost. There are two reasons for adopting this approximation. First, the difference between the two costs is the network transfer cost, which is relatively small compared to the remote message processing cost. Second, the waiting part of the remote communication cost is dominated by the remote message processing cost.

The workload of local (remote) message processing determines the local (remote) communication cost. The graph partition influences the workload distribution between local and remote message processing. A high quality graph partition, which is balanced and has a small edge cut ratio, usually means that the local message processing workload is higher than the remote message processing workload, and vice versa.

2.1 The Influence of Graph Algorithms

In reality, besides the graph partition, the actual workload of message processing in an execution instance is related to the characteristics of the graph algorithms as well.

Here we follow the graph algorithm categorization introduced in [17]. On the basis of the communication characteristics of graph algorithms when running on a vertex-centric graph computation system, they are classified into stationary and non-stationary graph algorithms. Stationary graph algorithms have the feature that all vertices send and receive the same distribution of messages between supersteps; examples are PageRank and Diameter Estimation [14].

In contrast, the destination or size of the outgoing messages changes across supersteps in non-stationary graph algorithms. For example, traversal-based graph algorithms, e.g., Breadth First Search and Single Source Shortest Path, are non-stationary ones.

In stationary graph algorithms, every vertex has the same behavior during the execution, so the workload of message processing only depends on the underlying graph partition. When a high quality graph partition is applied, the local message processing workload is higher than the remote one, and vice versa. A high quality graph partition helps improve the overall performance of stationary graph algorithms, since processing local messages is cheaper than processing remote messages, which incurs a network transfer cost.

For the traversal-based graph algorithms belonging to the non-stationary category, it is also true that the local message processing workload is higher than the remote one when a high quality graph partition is applied, because a high quality graph partition always clusters a dense subgraph together, which is then traversed in successive supersteps. However, a high quality graph partition cannot guarantee better overall performance for non-stationary algorithms, because of the workload imbalance of the algorithm itself. This problem can be solved by the techniques in [17], [33].

In this paper, we focus on the efficiency of a worker when graph partitions of different qualities are applied. The system finally achieves better performance by improving the performance of each worker, leaving the workload imbalance to dynamic repartition solutions. The next subsection reveals the drawback of existing systems when handling graph partitions of different qualities.

Fig. 2. Different combinations of computation cost and local/remote communication cost. (a) Combination in Giraph: tcomp and tcomml share one bar, with tsyncr as the waiting cost; tcommr forms a separate bar. (b) Combination in PAGE: tcomp, tcomml and tcommr form independent bars, with tsyncr and tsyncl as the waiting costs. The arrows indicate the ingredients of the overall cost. [diagram omitted]

2.2 The Cost of a Worker in Pregel-like Systems

As mentioned before, the cost of a worker has three components. Under different designs of the communication module, there are several combinations of the above three components that determine the overall cost


Fig. 3. The framework of PAGE. (a) Architecture: PAGE workers, each with a computation module, a communication module and a partition aware module, over a distributed in-memory partitioned graph. (b) Partition aware worker: the partition aware module contains a monitor and the DCCM; the communication module contains the dual concurrent message processor (a remote MP and a local MP) together with the sender and receiver. [diagram omitted]

of a worker. Figure 2 lists two possible combinations and illustrates the fine-grained cost ingredients as well. Components in a single bar mean that their costs are additive because of sequential processing. The overall cost equals the highest one among these independent bars.

The cost combination in Giraph is illustrated in Figure 2(a). The computation cost and the local communication cost are in the same bar, as Giraph directly performs local message processing during the computation. The sync remote communication cost, tsyncr = tcommr − tcomp − tcomml, is the cost of waiting for the remote incoming message processing to finish after the computation and local message processing have finished. This type of combination processes local incoming messages and remote incoming messages unequally, and the computation can be blocked by processing local incoming messages. When the workload of processing local incoming messages increases, the performance of a worker degrades severely. This is the main cause of Giraph's suffering from a well balanced graph partition, as presented in Section 1.
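Stated compactly in our own notation (the source gives the bar diagram, not these formulas), the two combinations in Figure 2 amount to

    T_Giraph = max(tcomp + tcomml, tcommr),
    T_PAGE   = max(tcomp, tcomml, tcommr),

so Giraph pays tsyncr = tcommr − tcomp − tcomml whenever remote processing outlasts the combined computation and local processing, while PAGE pays only the slowest of the three independent components.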

3 PAGE

PAGE, which stands for Partition Aware Graph computation Engine, is designed to support different graph partition qualities and maintain high performance through an adaptive tuning mechanism and new cooperation methods. Figure 3(a) illustrates the architecture of PAGE. Similar to the majority of existing parallel graph computation systems, PAGE follows the master-worker paradigm. The input graph is partitioned and stored distributively across the workers' memory. The master is responsible for aggregating global statistics and coordinating global synchronization. The novel worker in Figure 3(b) is equipped with an enhanced communication module and a newly introduced partition aware module. Thus the workers in PAGE can be aware of the underlying graph partition information and optimize the graph computation task.

3.1 Overview of the Two Modules

The enhanced communication module in PAGE integrates a dual concurrent message processor, which separately processes local and remote incoming messages and allows the system to concurrently process incoming messages in a unified way. The concurrency of the dual concurrent message processor can be adjusted online by the partition aware module to fit the real-time incoming message processing workload. The detailed mechanism of the dual concurrent message processor is described in Section 4.

The partition aware module contains two key components: a monitor and the Dynamic Concurrency Control Model (DCCM). The monitor maintains the necessary metrics and provides this information to the DCCM. The DCCM generates appropriate parameters through an estimation model and changes the concurrency of the dual concurrent message processor. The details of the monitor and the DCCM are presented in Section 5.

With the help of the enhanced communication module and the partition aware module, PAGE can dynamically tune the concurrency of the message processors for local and remote message processing with lightweight overhead, and provide enough message process units to run fluently on graph partitions of different qualities.

3.2 Graph Algorithm Execution in PAGE

The main execution flow of graph computation in PAGE is similar to that of other Pregel-like systems. However, since PAGE integrates the partition aware module, there is some extra work in each superstep; the modified procedure is illustrated in Algorithm 1. At the beginning, the DCCM in the partition aware module calculates suitable parameters based on


metrics from the previous superstep, and then updates the configuration (e.g., concurrency, assignment strategy) of the dual concurrent message processor. During a superstep, the monitor tracks the related statistics of key metrics in the background. The monitor updates the key metrics according to these collected statistics and feeds up-to-date values of the metrics to the DCCM at the end of each superstep.

Note that PAGE reconfigures the message processor in every superstep to handle non-stationary graph algorithms. As discussed in Section 2, the quality of the underlying graph partition may change between supersteps for non-stationary graph algorithms with a dynamic workload balance strategy. Even if a static graph partition strategy is used, the variable workload characteristics of non-stationary graph algorithms require reconfigurations across supersteps. In summary, PAGE can adapt to different workloads caused by the underlying graph partition and by the graph algorithm itself.

Algorithm 1 Procedure of a superstep in PAGE
1: DCCM reconfigures dual concurrent message processor parameters.
2: foreach active vertex v in partition P do
3:   call vertex program of v;
4:   send messages to the neighborhood of v;
5:   /* monitor tracks related statistics in the background. */
6: end foreach
7: synchronization barrier
8: monitor updates key metrics, and feeds them to the DCCM
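In code, the superstep procedure of Algorithm 1 maps onto a worker loop like the sketch below; Dccm, Monitor and barrier() are illustrative stubs, and the vertex types are the ones sketched in Section 1, not the prototype's actual classes.

    class Dccm {
        void reconfigureMessageProcessor() { /* apply the concurrency estimated last superstep */ }
    }

    class Monitor {
        void updateKeyMetrics(Dccm dccm) { /* record t_comp, t_syncl, t_syncr and feed the DCCM */ }
    }

    public class PageWorker {
        private final Dccm dccm = new Dccm();
        private final Monitor monitor = new Monitor();

        void runSuperstep(Iterable<PageRankVertex> activeVertices, VertexContext<Double> ctx) {
            dccm.reconfigureMessageProcessor();         // line 1 of Algorithm 1
            for (PageRankVertex v : activeVertices) {   // lines 2-6: vertex programs send
                v.compute(ctx);                         // messages; the monitor observes
            }                                           // statistics in the background
            barrier();                                  // line 7: global synchronization
            monitor.updateKeyMetrics(dccm);             // line 8: feed metrics to the DCCM
        }

        private void barrier() { /* stand-in for the global synchronization barrier */ }
    }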

4 DUAL CONCURRENT MESSAGE PROCESSOR

The dual concurrent message processor is the core of the enhanced communication module, and it concurrently processes local and remote incoming messages in a unified way. With proper configurations for this new message processor, PAGE can efficiently deal with incoming messages over graph partitions of various qualities.

Fig. 4. Message processing pipeline in PAGE: an incoming message block (header plus messages) is parsed by a message process unit, which dispatches the extracted messages to the destination vertices' queues. [diagram omitted]

As discussed in Section 2, messages are delivered in blocks, because network communication is an expensive operation [10]. But this optimization raises extra overhead: when a worker receives incoming message blocks, it needs to parse them and dispatch the extracted messages to the specific vertices' message queues. In PAGE, the message process unit is responsible for this extra overhead, and it is the minimal independent processing unit in the communication module. A remote (local) message process unit only processes remote (local) incoming message blocks. A message processor is a collection of message process units; the remote (local) message processor only consists of remote (local) message process units. Figure 4 illustrates the pipeline in which a message process unit handles this overhead.
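A message process unit can be sketched as a worker thread that drains a queue of message blocks and dispatches each parsed message to its vertex's queue; the block layout and queue types below are illustrative assumptions, not the prototype's data structures.

    import java.util.List;
    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.BlockingQueue;

    public class MessageProcessUnit<M> implements Runnable {
        /** One parsed entry of a message block: destination vertex plus payload. */
        public static final class Entry<M> {
            final long dstVertex;
            final M msg;
            public Entry(long dstVertex, M msg) { this.dstVertex = dstVertex; this.msg = msg; }
        }

        private final BlockingQueue<List<Entry<M>>> incomingBlocks; // local OR remote blocks
        private final Map<Long, Queue<M>> vertexQueues;             // per-vertex incoming queues

        public MessageProcessUnit(BlockingQueue<List<Entry<M>>> incomingBlocks,
                                  Map<Long, Queue<M>> vertexQueues) {
            this.incomingBlocks = incomingBlocks;
            this.vertexQueues = vertexQueues;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    List<Entry<M>> block = incomingBlocks.take(); // wait for the next block
                    for (Entry<M> e : block) {                    // parse and dispatch
                        vertexQueues.get(e.dstVertex).add(e.msg);
                    }
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();               // clean shutdown
            }
        }
    }

The local and remote message processors are then simply two pools of such units over two separate block queues, which is the dual structure described below.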

According to the cost analysis in Section 2.2, a good solution is to decouple the local communication cost from the computation cost, so that the computation is not blocked by any communication operation. Besides, the communication module can take over both local and remote communication, which makes it possible to process local and remote messages in a unified way. Furthermore, since incoming message blocks are concurrently received from both local and remote sources, it is better to process the local and remote incoming messages separately. These two observations lead us to design a novel message processor, which consists of a local message processor and a remote message processor. This design leads to the cost combination in Figure 2(b). The sync local communication cost, tsyncl = tcomml − tcommr, is analogous to the sync remote communication cost: it is the cost of waiting for the local incoming message processing to finish after all the remote messages have been processed.

Moreover, in parallel graph computation systems, incoming messages are finally appended to vertices' message queues, so different vertices can easily be updated concurrently. Taking this factor into consideration, we deploy concurrent message process units inside both the local and the remote message processor. Therefore, both processors can concurrently process incoming message blocks, and local and remote incoming messages are processed in a unified way.

In summary, the new message processor consists of a local and a remote message processor. This is the first type of concurrency in the message processor. The second type of concurrency lies inside the local and remote message processors. This is why we name this new message processor the dual concurrent message processor.

5 DYNAMIC CONCURRENCY CONTROL MODEL

The concurrency of the dual concurrent message processor heavily affects the performance of PAGE, but it is expensive and also challenging to determine a reasonable concurrency ahead of real execution without any assumption [25]. Therefore, PAGE needs a mechanism to adaptively tune the concurrency of the dual concurrent message processor. The mechanism is named the Dynamic Concurrency Control Model, DCCM for short.

In PAGE, the concurrency control problem can be modeled as a typical producer-consumer scheduling problem, where the computation phase generates messages as the producer, and the message process units in the dual concurrent message processor are the consumers. Therefore, the producer-consumer constraints [29] should be satisfied when solving the concurrency control problem.

In PAGE's situation, the concurrency control problem mainly concerns the consumer-side constraints. Since the behavior of the producers is determined by the graph algorithms, PAGE only needs to adjust the consumers to satisfy the constraints (the behavior of the graph algorithms), which are stated as follows.

First, PAGE provides sufficient message process units to ensure that new incoming message blocks can be processed immediately and do not block the whole system; meanwhile, no message process unit should be idle.

Second, the assignment strategy for these message process units ensures that each local/remote message process unit has a balanced workload, since disparity can seriously degrade the overall performance of parallel processing.

The above requirements yield two heuristic rules:

Rule 1: Ability lower bound: the message processing ability of all the message process units should be no less than the total workload of message processing.

Rule 2: Workload balance ratio: the assignment of the total message process units should match the workload ratio between local and remote message processing.

The following subsections first introduce the mathematical formulation of the DCCM, then discuss how the DCCM mitigates the influence of various graph partition qualities, and finally present the implementation of the DCCM. Table 1 summarizes the frequently used notations in the following analysis.

5.1 Mathematical Formulation of DCCM

In PAGE, the DCCM uses a set of general heuristic rules to determine the concurrency of the dual concurrent message processor. The workload of message processing is the number of incoming messages, and it can be estimated by the incoming speed of messages. Here we use sl and sr to denote the incoming speed of local messages and the incoming speed of remote messages, respectively. The ability of a single message process unit is its speed of processing incoming messages, sp.

TABLE 1
Frequently used notations

    Symbol  Description
    er      Edge cut ratio of a local partition.
    p       Quality of network transfer.
    sp      Message processing speed.
    sg      Message generating speed.
    srg     Speed of remote message generation.
    sl      Incoming speed of local messages.
    sr      Incoming speed of remote messages.
    nmp     Number of message process units.
    nrmp    Number of remote message process units.
    nlmp    Number of local message process units.
    tcomp   Computation cost.
    tcomm   Communication cost.
    tsyncr  Cost of syncing remote communication.
    tsyncl  Cost of syncing local communication.

On the basis of the aforementioned two heuristic rules, the following equations must hold:

    nmp × sp ≥ sl + sr,        (Rule 1)
    nlmp / nrmp = sl / sr,     (Rule 2)        (1)

where nmp stands for the total number of message process units, nlmp represents the number of local message process units, and nrmp is the number of remote message process units. Meanwhile, nmp = nlmp + nrmp.

Solving Equation 1 yields

    nlmp ≥ sl / sp,    nrmp ≥ sr / sp.         (2)

Finally, the DCCM can estimate the concurrency of the local message processor and the concurrency of the remote message processor separately, if the metrics sl, sr, sp are known. The optimal concurrency is reached when the DCCM sets nlmp = ⌈sl/sp⌉ and nrmp = ⌈sr/sp⌉, because at this point PAGE provides sufficient message process units while consuming minimal resources.
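Equation 2 translates directly into code. A minimal sketch, assuming the speeds sl, sr and sp have already been measured in messages per second:

    public final class DccmEstimator {
        /** Optimal concurrency from Equation (2): n = ceil(incoming speed / sp). */
        public static int[] estimate(double sl, double sr, double sp) {
            int nlmp = (int) Math.ceil(sl / sp);  // local message process units
            int nrmp = (int) Math.ceil(sr / sp);  // remote message process units
            return new int[] { nlmp, nrmp };
        }
    }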

5.2 Adaptiveness on Various Graph Partitions

Section 2 analyzed how the workload of message processing is related to both the graph partition and the graph algorithm. In this section, we explain why the DCCM described above can adaptively tune the concurrency of the dual concurrent message processor when the quality of the underlying graph partition changes.

Before the detailed discussion, we first define three metrics.

1) Edge cut ratio of a local partition. This is the ratio between cross edges and total edges in a local graph partition, denoted er. This metric judges the quality of graph partitioning in a worker: a higher ratio indicates lower quality.

2) Message generating speed. This represents the overall generating velocity of all outgoing messages in a worker and implies the total workload for the worker. We denote it by sg.

3) Quality of network transfer. This reflects the degree of network influence on the message generating speed sg. When the generated messages are sent to a remote worker, the speed of the generated messages is reduced from the view of the remote worker, because network I/O operations are slower than local operations. The quality is denoted by p ∈ (0, 1); a larger p indicates a better network environment. In addition, we define the equivalent speed of remote message generation as srg = sg × p.

Now we proceed to reveal the relationship between the DCCM and the underlying graph partition. The following analysis assumes that a stationary graph algorithm is run on a certain partition, because stationary graph algorithms have predictable communication features.

Local incoming messages are those whose source vertex and destination vertex belong to the same partition. Thus, the incoming speed of local messages, sl, equals sg × (1 − er), the generating speed of local messages. Similarly, sr equals srg × er. Then Equation 1 can be rewritten as

    nmp × sp = sg × (1 − er) + srg × er,
    nlmp / nrmp = (sg × (1 − er)) / (srg × er) = (1 − er) / (p × er).        (3)

Solving Equation 3 for nmp, nlmp and nrmp, we obtain

    nmp = (sg / sp) × (1 − (1 − p) × er),                                    (4)
    nlmp = nmp × (1 − er) / (p × er + (1 − er)),                             (5)
    nrmp = nmp × (p × er) / (p × er + (1 − er)).                             (6)
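As a purely hypothetical illustration of Equations 4-6 (the numbers are ours, not measured), take sg/sp = 8, p = 0.5 and er = 0.6. Then

    nmp = 8 × (1 − 0.5 × 0.6) = 5.6,
    nlmp = 5.6 × 0.4 / (0.3 + 0.4) = 3.2,
    nrmp = 5.6 × 0.3 / (0.3 + 0.4) = 2.4,

so rounding up allocates four local and three remote message process units.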

Equations 4, 5 and 6 yield the following observations, which indicate the correlated relationship between the graph partition and the behavior of the DCCM when running stationary graph algorithms on PAGE.

First, PAGE needs more message process units as the quality of graph partitioning grows, but an upper bound still exists. This derives from the fact that, in Equation 4, nmp increases while er decreases, since p is fixed in a given environment. However, the conditions 0 < p < 1 and 0 < er < 1 always hold, so nmp will not exceed sg/sp. Actually, not only do the parameters sg and sp dominate the upper bound of the total number of message process units, but p also heavily affects the exact total number of message process units under various partitioning qualities during execution.

The exact total number of message process units is mainly affected by sg and sp, while er only matters when the network is of really low quality. Usually, in a high-end network environment where p is large, the term (1 − p) × er is negligible regardless of er, and then the status of the whole system (sg, sp) determines the total number of message process units. Only in some specific low-end network environments will the graph partitioning quality severely affect the decision on the total number of message process units.

Unlike the total number of message process units, the assignment strategy is quite sensitive to the parameter er. From Equation 3, we can see that the assignment strategy is heavily affected by (1 − er)/er, as the value of p is generally fixed for a certain network. Many existing systems, like Giraph, do not pay enough attention to this phenomenon and suffer from high quality graph partitions. Our DCCM can easily avoid this problem by handling the assignment online based on Equations 5 and 6.

Finally, when non-stationary graph algorithms are run on PAGE, the influence of the graph partition on the DCCM is similar. The difference is that the edge cut ratio of a local partition is only a hint, not the exact ratio of the local and remote incoming message distribution, because the unpredictable communication features of non-stationary graph algorithms cannot guarantee that a lower er leads to a higher workload of local message processing. However, this does hold for many applications in practice, such as traversal-based graph algorithms.

5.3 Implementation of DCCM

Given the heuristic rules and the characteristics of the DCCM discussed above, we proceed to present its implementation within the PAGE framework. To incorporate the DCCM's estimation model, PAGE needs a monitor that collects the necessary information online. Generally, the monitor needs to maintain three high-level key metrics: sp, sl, sr. However, it is difficult to measure sp accurately. Because the incoming message blocks are not continuous and the granularity of time in the operating system is not precise enough, it is hard for the DCCM to obtain an accurate time cost of processing these message blocks, which leads to an inaccurate sp.

Therefore, we introduce an incremental DCCM based on the original one. Recalling the cost analysis in Section 2, we can materialize the workload by multiplying a speed by the corresponding cost. Equation 2 can then be rewritten as follows (using '=' instead of '≥'):

    sl × tcomp = sp × n′lmp × (tcomp + tsyncr + tsyncl),    (7)
    sr × tcomp = sp × n′rmp × (tcomp + tsyncr),             (8)

where n′lmp and n′rmp are the concurrency of the local and remote message processors in the current superstep, respectively.

The estimated concurrency of the local and remote message processors for the next superstep is then

    nlmp = sl / sp = n′lmp × (1 + (tsyncr + tsyncl) / tcomp),    (9)
    nrmp = sr / sp = n′rmp × (1 + tsyncr / tcomp).               (10)

Finally, the new DCCM can estimate the latest nlmp and nrmp based on the previous values and the corresponding time costs tcomp, tsyncl, tsyncr. The monitor is only responsible for tracking the three cost-related metrics. As the monitor just records the time costs without any additional data structures or complicated statistics, it brings negligible overhead. In our PAGE prototype implementation, we apply the incremental DCCM to automatically determine the concurrency of the system.
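The incremental estimator needs only the three time costs of the previous superstep. A minimal sketch (field and method names are ours):

    public final class IncrementalDccm {
        private int nlmp;  // current local concurrency (n'_lmp)
        private int nrmp;  // current remote concurrency (n'_rmp)

        public IncrementalDccm(int initialLocal, int initialRemote) {
            this.nlmp = initialLocal;
            this.nrmp = initialRemote;
        }

        /** Re-estimates the concurrency for the next superstep via Equations (9)-(10). */
        public void update(double tComp, double tSyncL, double tSyncR) {
            nlmp = (int) Math.ceil(nlmp * (1.0 + (tSyncR + tSyncL) / tComp));
            nrmp = (int) Math.ceil(nrmp * (1.0 + tSyncR / tComp));
        }

        public int localUnits()  { return nlmp; }
        public int remoteUnits() { return nrmp; }
    }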

5.4 Extension to Other Data Processing Scenarios

Although the initial goal of the dual concurrent message processor and the dynamic concurrency control model is to help a graph computation system handle messages efficiently and be partition aware, the techniques can benefit other data processing scenarios as well.

First, as a graph computation system, PAGE can also efficiently handle other structured data whose problems can be remodeled as graphs. For example, sparse matrix-vector multiplication can be remodeled as a vertex-centric aggregation operator on a graph whose adjacency matrix equals the given sparse matrix and whose vertex values come from the given vector [26]; the techniques in PAGE then improve the performance of sparse matrix-vector multiplication under the corresponding graph model.

Second, the techniques in PAGE can also be extended to enhance other big data processing platforms that satisfy the producer-consumer model. Distributed stream computing platforms (e.g., Yahoo S4 [24], Twitter Storm [2]) are one kind of popular platform for processing real-time big data. In Storm, the basic primitives for stream transformations are "spouts" and "bolts". A spout is a source of streams. A bolt consumes any number of input streams, conducts some processing, and possibly emits new streams. In the example of word count over a large document set, the core bolt keeps an in-memory map from word to count; each time it receives a word, it updates its state and emits the new word count. In practice, the parallelism of a bolt is specified by the user, which is inflexible. Since the logic of a bolt satisfies the producer-consumer model, by designing an estimation model similar to the DCCM, a stream computing platform could also automatically adjust the number of processing elements according to the workload.

6 EMPIRICAL STUDIES

We have implemented the PAGE prototype on top of an open source Pregel-like graph computation system, Giraph [1]. To test its performance, we conducted extensive experiments that demonstrate the superiority of our proposal. The following section describes the experimental environment, datasets, baselines and evaluation metrics. The detailed experiments evaluate the effectiveness of the DCCM, the advantages of PAGE compared with other methods, and the performance of PAGE on various graph algorithms.

6.1 Experimental setup

TABLE 2
Graph dataset information

    Graph          Vertices     Edges          Directed
    uk-2007-05     105,896,555  3,738,733,648  yes
    uk-2007-05-u   105,153,952  6,603,753,128  no
    livejournal    4,847,571    68,993,773     yes
    livejournal-u  4,846,609    85,702,474     no

All experiments are run on a cluster of 24 nodes, where each physical node has an AMD Opteron 4180 2.6GHz CPU, 48GB memory and a 10TB disk RAID. Nodes are connected by 1Gbit routers. Two graph datasets are used: uk-2007-05-u [7], [6] and livejournal-u [5], [19]. The uk-2007-05-u dataset is a web graph, while livejournal-u is a social graph. Both graphs are undirected ones created from the original releases by adding reciprocal edges and eliminating loops and isolated nodes. Table 2 summarizes the metadata of these datasets in both directed and undirected versions.

Graph partition scheme. We partition the large graphs with three strategies: Random, METIS and Linear Deterministic Greedy (LDG). METIS [16] is a popular off-line graph partition package, and LDG [30] is a well-known stream-based graph partition solution. The uk-2007-05-u graph is partitioned into 60 subgraphs, and the livejournal-u graph is partitioned into 2, 4, 8, 16, 32 and 64 partitions, respectively. The balance factors of all these partitions do not exceed 1%, and the edge cut ratios are listed in Figure 1(a) and Table 3. The parameter setting of METIS is the same as the METIS-balanced approach in GPS [27].

Furthermore, in order to generate various partition qualities of a graph, we extend the original LDG algorithm to an iterative version. The iterative LDG partitions the graph based on the previous partition result, and gradually improves the partition quality in every following iteration. We name the partition result from iterative LDG LDGid, where a larger id indicates a higher quality of graph partitioning and more iterations executed.
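As a reference point, one pass of LDG places each streamed vertex on the partition holding most of its already-placed neighbors, damped by a capacity penalty [30]; the iterative variant reuses the previous pass's placements for neighbors not yet reassigned in the current pass. The sketch below is our reading of that description, not the authors' code.

    import java.util.HashMap;
    import java.util.Map;

    public final class IterativeLdg {
        // One LDG pass over the vertex stream. 'previous' holds the assignment
        // from the last pass (empty on the first pass).
        public static Map<Long, Integer> pass(Map<Long, long[]> adjacency, int k,
                                              Map<Long, Integer> previous) {
            double capacity = (double) adjacency.size() / k;   // balanced partition size
            int[] sizes = new int[k];
            Map<Long, Integer> next = new HashMap<>();
            for (Map.Entry<Long, long[]> e : adjacency.entrySet()) {
                int[] neighbors = new int[k];                  // placed neighbors per partition
                for (long u : e.getValue()) {
                    Integer part = next.getOrDefault(u, previous.get(u));
                    if (part != null) neighbors[part]++;
                }
                int best = 0;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < k; i++) {
                    // LDG score: neighbor count damped by the load of partition i;
                    // ties go to the least loaded partition.
                    double score = neighbors[i] * (1.0 - sizes[i] / capacity);
                    if (score > bestScore || (score == bestScore && sizes[i] < sizes[best])) {
                        best = i;
                        bestScore = score;
                    }
                }
                sizes[best]++;
                next.put(e.getKey(), best);
            }
            return next;
        }
    }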

Baselines. Throughout all experiments, we use two baselines for comparison with PAGE.

The first one is Giraph. However, as we note from Figure 2(a), local message processing and computation run serially in the default Giraph. This cost combination model is inconsistent with our evaluation, so we modify Giraph to process local messages asynchronously, allowing it to concurrently run computation, local message processing and remote message processing. In the following experiments, Giraph refers to this modified version. Note that this modification does not decrease the performance of Giraph.

The other baseline is derived from a technique used in GPS. One optimization in GPS applies a centralized message buffer and sequentially processes incoming messages without synchronizing operations, which decouples the local message processing from the computation and treats local and remote messages equivalently. We implement this optimization on the original Giraph and denote it as Giraph-GPSop.

Metrics for evaluation. We use the following metrics to evaluate the performance of a graph algorithm on a graph computation system.

• Overall cost. The whole execution time of a graph algorithm running on a computation system. Due to the concurrency of the computation and communication models, this metric is generally determined by the slower of the computation and the communication.

• Sync remote communication cost. The cost of waiting for all I/O operations to complete after the computation has finished. This metric reveals the performance of remote message processing.

• Sync local communication cost. The cost of waiting for all local messages to be processed after syncing the remote communication. This metric indicates the performance of local message processing.

All three metrics are measured by the average time cost per iteration. The relationship among these metrics is shown in Figure 2.

6.2 Evaluation on DCCM

The Dynamic Concurrency Control Model (DCCM) is the key component of PAGE: it determines the concurrency of the dual concurrent message processor, balances the workload between remote and local message processing, and hence improves the overall performance. In this section, we demonstrate the effectiveness of the DCCM by first presenting the concurrency automatically chosen by the DCCM based on its estimation model, then showing that these chosen values are close to manually tuned good parameters, and finally showing that the DCCM converges efficiently to a good concurrency for the dual concurrent message processor.

6.2.1 Concurrency chosen by DCCM

Fig. 5. Assignment strategy (nlmp vs. nrmp) on different partition schemes in PAGE: number of message process units per scheme (Random, LDG1-LDG4, METIS). [bar chart omitted]

Figure 5 shows the concurrency of the dual concurrent message processor automatically determined by the DCCM when running PageRank on the uk-2007-05-u graph. The variation of the concurrency across graph partition schemes is consistent with the analysis in Section 5.2.

Figure 5 yields two meaningful observations. First, the total number of message process units, nmp, is always seven across the six partition schemes. This is because the cluster has a high-speed network, and the quality of network transfer, p, is accordingly high. Second, with the decrease of the edge cut ratio (from left to right in Figure 5), the nlmp decided by the DCCM increases smoothly to handle the growing workload of local message processing, and the selected nrmp decreases correspondingly. From these parameter variations across partition schemes, we again conclude that the assignment strategy is more sensitive to the edge cut ratio than the total number of message process units nmp is.

6.2.2 Results by manual tuning

Here we conduct a series of manual parameter tuning experiments to discover the best configurations for the dual concurrent message processor when running PageRank on uk-2007-05-u in practice. This helps verify that the parameters determined by the DCCM are effective.

We tune the parameters nlmp and nrmp one by one on a specific partition scheme. While one variable is tuned, the other is kept sufficiently large that it does not seriously affect the overall performance. When we conduct the tuning experiment on


Fig. 6. Tuning on Random and METIS partition schemes; black points are the optimal choices. (a) nrmp tuning on Random; (b) nlmp tuning on Random; (c) nrmp tuning on METIS; (d) nlmp tuning on METIS. Each panel plots the average time (s/iteration) of the overall cost and the corresponding sync communication cost against the number of units. [plots omitted]

the METIS scheme to find the best number of local message process units (nlmp), for example, we manually provide sufficient remote message process units (nrmp) with the DCCM feature off, and then increase nlmp for each PageRank execution instance until the overall performance becomes stable.

Figure 6 shows the tuning results of running PageRank on the Random and METIS partition schemes, respectively. We do not list the results for LDG1 to LDG4, as they all lead to the same conclusions as the above two schemes. The basic rule for determining the proper configuration in manual tuning is to choose the earliest point at which the overall performance is close to the stable one.

As shown in Figure 6(a), when the number of remote message process units exceeds five, the overall performance converges and changes only slightly. This indicates that a remote message processor with five message process units is sufficient for this workload, i.e., running PageRank on randomly partitioned uk-2007-05-u. The sync remote communication cost can still decrease a bit as nrmp keeps increasing, but a large number of message process units also affects other parts of the system (e.g., consuming more computation resources), so the overall performance remains stable because of the tradeoff between the two factors (the number of message process units and the influence on other parts of the system).

From Figure 6(b), we can easily see that one local message process unit is sufficient to handle the remaining local message processing workload under the Random scheme; more message process units do not bring any significant improvement. Based on this tuning experiment, six message process units in total are enough: one for the local message processor and the other five for the remote message processor.

In Figure 5, the parameters chosen by the DCCM for the Random partition scheme are six message process units for the remote message processor and one for the local message processor. Though they do not exactly match the off-line tuned values, they fall into the range where the overall performance has converged, and the difference is small. So the DCCM generates a set of parameters almost as good as the best ones acquired from off-line tuning.

By analyzing Figures 6(c) and 6(d), we can see that seven message process units are sufficient for the METIS scheme, of which five belong to the local message processor and the remaining two to the remote message processor. Here, too, the DCCM produces results similar to the manually tuned parameters.

Through this set of parameter tuning experiments, we verify the effectiveness of the DCCM and find that the DCCM chooses appropriate parameters.

Fig. 7. Performance of adaptive tuning by the DCCM: per-iteration overall cost, sync remote communication cost and local communication cost. (a) Random partition scheme; (b) METIS partition scheme. [plots omitted]

6.2.3 Adaptivity of DCCM

In this subsection, we show the results on the adaptivity of PAGE across graph partition schemes. All the experimental results show that the DCCM is sensitive to both the partition quality and the initial concurrency setting. It responds fast and can adjust PAGE to a better status within a few iterations.

We run PageRank on the Random and METIS graph partition schemes, setting the concurrency of the remote and local message processors randomly at the beginning, and let the DCCM automatically adjust the concurrency.

Figure 7(b) illustrates each iteration's performance of PageRank running on the METIS partition scheme of uk-2007-05-u. It clearly shows that PAGE can rapidly adjust itself to achieve better performance for the task. It takes about 93 seconds to finish the first iteration, where the sync local communication cost is around 54 seconds. After the first iteration, PAGE reconfigures the concurrency of the dual concurrent message processor and achieves a better overall cost in successive iterations by speeding up the processing of local messages. The second iteration takes 44 seconds, and the sync local communication cost is close to


Fig. 8. PageRank performance on different systems across partition schemes (Random, LDG1-LDG4, METIS), showing average time (s/iteration) for overall cost, sync remote communication cost and sync local communication cost. (a) Performance on PAGE; (b) Performance on Giraph; (c) Performance on Giraph-GPSop. [bar charts omitted]

Fig. 9. The performance of various graph algorithms on Giraph, Giraph-GPSop and PAGE over Random and METIS partitions. (a) PageRank (average time, s/iteration); (b) Diameter Estimation (average time, s/iteration); (c) BFS (total time, s); (d) SSSP (total time, s). [bar charts omitted]

zero already. Similar results are obtained for the Random partition scheme, as shown in Figure 7(a).

6.3 Comparison with Other Methods

In this section, we compare the performance of PAGE with the Pregel-like baselines, i.e., Giraph and Giraph-GPSop. We first present the advantage of PAGE by profiling a PageRank execution instance, followed by an evaluation on various graph algorithms. In the end, we show that PAGE can also handle the situation where varying the number of partitions changes the graph partition quality.

6.3.1 Advantage of PAGE

We demonstrate that PAGE maintains high performance across various graph partition qualities. Figures 8(a) and 8(b) show the PageRank performance across partition schemes on PAGE and Giraph. We find that, with increasing quality of graph partitioning, Giraph suffers from the workload growth of local message processing, and its sync local communication cost rises fast. In contrast, PAGE scalably handles the surging workload of local message processing and keeps the sync local communication cost close to zero. Moreover, the overall performance of PAGE actually improves along with the quality of graph partitioning. In Figure 8(a), when the edge cut ratio decreases from 98.52% to 3.48%, the performance improves by 14% in PAGE; in Giraph, the performance is degraded by about 100% at the same time.

From Figure 8(c), we notice that Giraph-GPSop achieves better performance with improving graph partition quality, as PAGE does, but PAGE is more efficient than Giraph-GPSop over graph partitions of all qualities. Compared with Giraph, PAGE always wins across the various graph partitions, with improvements ranging from 10% to 120%. However, Giraph-GPSop only beats Giraph, by around 10%, on the METIS partition scheme, which produces a really well partitioned graph. For the Random partition, Giraph-GPSop is even about 2.2 times worse than Giraph. It is easy to see that the central message buffer in Giraph-GPSop leads to this phenomenon, as Figure 8(c) illustrates that the sync local communication cost (see footnote 2) is around 40 seconds, though it is stable across the six partition schemes. Overall, as a partition aware system, PAGE can well balance the workloads of local and remote message processing under different graph partitions.

6.3.2 Performance on Various Graph Algorithms

In this section, we show that PAGE helps improve the performance of both stationary and non-stationary graph algorithms. The experiments were run on PAGE, Giraph and Giraph-GPSop with two stationary graph algorithms and two non-stationary graph algorithms. The two stationary graph algorithms are PageRank and diameter estimation, and

2. Due to the central message buffer, we treat all messages as local messages and count their cost as local communication cost in Giraph-GPSop.


[Figure: three panels, one per partition scheme: (a) LDG, (b) METIS, (c) Random. Each plots average time (s/iteration) against the partition number (2 to 64), showing the Giraph and PAGE overall costs and sync local comm. costs.]

Fig. 10. Results of various partition numbers on the social graph.

the two non-stationary ones are breadth-first search (BFS) and single-source shortest path (SSSP).

Figures 9(a) and 9(b) illustrate the performance of the stationary graph algorithms on the Random and METIS partition schemes of the uk-2007-05-u graph dataset. For the stationary graph algorithms, when the quality of the graph partition improves, PAGE effectively exploits the benefit of the high quality partition and improves the overall performance. PAGE outperforms both Giraph and Giraph-GPSop because it concurrently processes incoming messages in a unified way.

Figures 9(c) and 9(d) present the performance of the non-stationary graph algorithms. As with the stationary graph algorithms, PAGE surpasses Giraph and Giraph-GPSop, which implies that PAGE's architecture also facilitates non-stationary graph algorithms. However, the performance does not improve when the quality of the partition scheme increases. The reason has been discussed in Section 2: non-stationary graph algorithms have a workload imbalance problem, which can be addressed by dynamic partition strategies [17], [33].
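To see why such algorithms are non-stationary, consider BFS in the vertex-centric model. The following is an illustrative sketch only: the Vertex scaffolding and method names are assumptions, not Giraph's exact API. Because only vertices first reached in the current superstep send messages, the active set follows the frontier, and the message workload per worker shifts from superstep to superstep no matter how the graph is partitioned.

    import java.util.List;

    // Minimal vertex-centric scaffolding; names are illustrative
    // stand-ins, not the exact API of Giraph or Pregel.
    abstract class Vertex {
        long value = Long.MAX_VALUE;  // BFS depth; MAX_VALUE = unvisited
        abstract void compute(List<Long> messages, long superstep);
        void sendMessageToAllNeighbors(long msg) { /* framework-provided */ }
        void voteToHalt()                        { /* framework-provided */ }
        boolean isSource()                       { return false; }
    }

    class BfsVertex extends Vertex {
        @Override
        void compute(List<Long> messages, long superstep) {
            long depth = value;
            if (superstep == 0 && isSource()) depth = 0;
            for (long m : messages) depth = Math.min(depth, m);
            if (depth < value) {
                value = depth;
                // Only vertices first reached in this superstep send
                // messages, so the active set follows the BFS frontier
                // and the per-worker workload shifts across supersteps:
                // the algorithm is non-stationary.
                sendMessageToAllNeighbors(depth + 1);
            }
            voteToHalt(); // inactive until an incoming message reactivates it
        }
    }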

6.3.3 Performance with Varying Numbers of Partitions

Previous experiments were all conducted on a web graph partitioned into a fixed number of subgraphs, i.e., 60 partitions for uk-2007-05-u. In practice, the number of graph partitions can vary, and different numbers of partitions result in different partition qualities. We ran a series of experiments to demonstrate that PAGE also handles this situation efficiently. Here we present the results of running PageRank on a social graph, livejournal-u. Table 3 lists the edge cut ratios of livejournal-u partitioned into 2, 4, 8, 16, 32 and 64 partitions by LDG, Random and METIS, respectively.
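The edge cut ratios reported in the table are simply the fraction of edges whose endpoints are assigned to different partitions. A minimal sketch of the computation follows; the edge-list layout and partition map are assumed input representations, not part of PAGE.

    import java.util.Map;

    // Edge cut ratio of a partition assignment: count the fraction of
    // edges whose endpoints fall in different subgraphs.
    final class EdgeCut {
        static double edgeCutRatio(long[][] edges, Map<Long, Integer> partitionOf) {
            if (edges.length == 0) return 0.0;
            long cut = 0;
            for (long[] e : edges) {
                // e[0] and e[1] are the two endpoint vertex ids of one edge
                if (!partitionOf.get(e[0]).equals(partitionOf.get(e[1]))) {
                    cut++; // a cut edge crosses two subgraphs
                }
            }
            return (double) cut / edges.length;
        }
    }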

TABLE 3
Partition quality of livejournal-u

Partition Scheme   LDG (%)   Random (%)   METIS (%)
2 Partitions        20.50      50.34         6.46
4 Partitions        34.24      75.40        15.65
8 Partitions        47.54      87.86        23.54
16 Partitions       52.34      94.04        28.83
32 Partitions       55.55      97.08        32.93
64 Partitions       57.36      98.56        36.14

First, Figures 10(a), 10(b) and 10(c) all show that both PAGE and Giraph perform better as the partition number increases, across all three partition schemes. This is expected, since parallel graph computing systems benefit from higher parallelism: when the graph is partitioned into more subgraphs, each subgraph is smaller, each worker carries less workload, and the overall performance improves. On the other hand, a large number of subgraphs brings heavy communication, so once the partition number reaches a certain threshold (e.g., sixteen in this experiment), the improvement becomes less significant. This phenomenon confirms that parallel processing is a sound choice for large-scale graphs.

Second, the improvement of PAGE over Giraph shrinks as the number of partitions grows. With more partitions, the quality of the graph partitioning decreases, which means the local message processing workload decreases; since Giraph performs well on low quality graph partitionings, the performance gap between PAGE and Giraph is small when the number of partitions is large. Besides, the point where PAGE and Giraph reach similar performance varies with the graph partitioning algorithm: in the METIS scheme they converge around 64 partitions, while in the Random scheme they converge at about four. The reason is that different graph partitioning algorithms produce different partition qualities, and a poor algorithm generates a low quality partition even when the number of partitions is small.

Third, PAGE always performs better than Giraph across the three partition schemes for any fixed number of partitions; the reason has been discussed in Section 6.3.1. But the improvement of PAGE decreases as the number of partitions increases, because the workload on each node becomes light when the partition number is large. For a relatively small graph like livejournal, when the partition number is around 64, each subgraph contains only tens of thousands of vertices. Nevertheless, the results suffice to show the robustness of PAGE across various graph partitions.

7 RELATED WORK

This work is related to several research areas. Not only are graph computation systems involved; graph partitioning techniques, and the effective integration of the two, are also essential to push forward current parallel graph computation systems. We briefly discuss these related research directions below.

Graph computation systems. Parallel graph computation is a popular technique for processing and analyzing large scale graphs. Different from traditional big data analysis frameworks (e.g., MapReduce [11]), most graph computation systems store graph data in memory and cooperate with other computing nodes via message passing [13]. Besides, these systems adopt the vertex-centric programming model, which frees users from tedious communication protocols. Such systems also provide fault tolerance and high scalability compared to traditional graph processing libraries, such as Parallel BGL [12] and CGMgraph [9]. Several excellent systems exist, like Pregel, Giraph, GPS and Trinity.

Since messages are the key intermediate results in graph computation systems, all of these systems apply optimization techniques to message processing. Pregel and Giraph handle local messages in the computation component but process remote messages concurrently. This design is only optimized for simple random partitions and cannot efficiently use a well partitioned graph. Building on Pregel and Giraph, GPS [27] applies several further optimizations; for message processing, GPS uses a centralized message buffer in each worker to decrease the number of synchronizations. This optimization enables GPS to utilize high quality graph partitions, but it remains preliminary and does not extend to a variety of graph computation systems. Trinity [28] optimizes the global message distribution with bipartite graph partitioning techniques to reduce memory usage, but it does not discuss the message processing of a single computing node. In this paper, PAGE focuses on the efficiency with which a worker processes messages.
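The contrast can be made concrete with a sketch of unified message processing on one worker: local and remote messages enter the same queue and are drained by the same pool of threads, so a single concurrency knob covers both kinds of workload. All names here are illustrative assumptions; this is a sketch of the idea, not PAGE's actual implementation.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    // One worker's message path: local compute threads and the network
    // receiver both call deliver(); a fixed pool of processor threads
    // drains the shared queue, so raising the pool size absorbs a
    // growing local workload just as it does a remote one.
    final class UnifiedMessageProcessor {
        private final BlockingQueue<Message> inbox = new LinkedBlockingQueue<>();
        private final ExecutorService pool;

        UnifiedMessageProcessor(int concurrency) {
            pool = Executors.newFixedThreadPool(concurrency);
            for (int i = 0; i < concurrency; i++) {
                pool.submit(this::drain);
            }
        }

        // Called by both local senders and the remote-message receiver.
        void deliver(Message m) { inbox.add(m); }

        private void drain() {
            try {
                while (true) {
                    Message m = inbox.take();  // blocks until a message arrives
                    m.applyToVertexStore();    // e.g., append to the target
                                               // vertex's incoming-message list
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shut down cleanly
            }
        }

        interface Message { void applyToVertexStore(); }
    }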

In terms of message processing techniques, real-time stream processing systems are also related. A stream processing system is in essence a message processing system, since streams (or events) are delivered by message passing. Backman et al. [4] introduced a system-wide mechanism, built on simulation-based search heuristics, to automatically determine the parallelism of a stream processing operator. In this paper, PAGE applies a node-level dynamic control model, but the basic idea can also guide the design of a system-wide solution.
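As a rough illustration of the shape of such a node-level control loop, the sketch below resizes a worker's message-processor pool from two online statistics. The backlog-based rule is an assumption made for exposition; it is not the heuristic rules from which PAGE actually derives its concurrency.

    // Node-level feedback loop: periodically compare message arrival and
    // consumption rates and resize the processor pool accordingly.
    final class ConcurrencyController {
        private int threads;
        private final int maxThreads;

        ConcurrencyController(int initialThreads, int maxThreads) {
            this.threads = initialThreads;
            this.maxThreads = maxThreads;
        }

        // Called once per sampling interval with the online statistics.
        int adjust(double arrivalRate, double consumptionRate) {
            if (arrivalRate > consumptionRate && threads < maxThreads) {
                threads++;  // backlog is growing: add a processor thread
            } else if (arrivalRate < 0.5 * consumptionRate && threads > 1) {
                threads--;  // processors are mostly idle: shed one
            }
            return threads; // the worker resizes its pool to this value
        }
    }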

Graph partitioning algorithms. To evaluate the performance of distributed graph algorithms, Ma et al. introduced three measures: visit times, makespan and data shipment. As efficiency (makespan) remains the dominant factor, they suggested sacrificing visit times and data shipment for makespan, which advocates a well-balanced graph partition strategy when designing distributed algorithms [22]. Various graph partitioning algorithms pursue this objective as well. METIS [16], [15] is an off-line graph partitioning package that can produce high quality graph partitionings subject to a variety of requirements, but it is expensive to use METIS to partition large graphs. More recently, streaming graph partitioning models have become appealing [30], [31]. In the streaming model, a vertex arrives together with its neighborhood, and its partition id is decided based on the current partial graph information. The model is suitable for partitioning large input graphs in a distributed loading context, especially for state-of-the-art parallel graph computation systems. [30] described the difficulty of the problem and identified ten heuristic rules, among which the Linear Deterministic Greedy (LDG) rule performs best. LDG assigns a vertex to the partition in which the vertex has the most neighbors, and applies a linear penalty function to balance the workload.
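A minimal sketch of the LDG rule as just described, scoring each partition by |P_i ∩ N(v)| * (1 - |P_i|/C) following [30]; the capacity C, the data layout and the tie-breaking are assumptions.

    import java.util.List;
    import java.util.Set;

    // Place each arriving vertex in the partition that already holds the
    // most of its neighbors, discounted by a linear penalty on how full
    // that partition is. Tie-breaking (e.g., preferring the least-loaded
    // partition) is omitted for brevity.
    final class LdgPartitioner {
        // partitions.get(i) holds the vertex ids already assigned to partition i
        static int assign(long vertex, Set<Long> neighbors,
                          List<Set<Long>> partitions, double capacity) {
            int best = 0;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < partitions.size(); i++) {
                Set<Long> p = partitions.get(i);
                long common = neighbors.stream().filter(p::contains).count();
                double score = common * (1.0 - p.size() / capacity); // linear penalty
                if (score > bestScore) {
                    bestScore = score;
                    best = i;
                }
            }
            partitions.get(best).add(vertex);
            return best;
        }
    }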

Besides, several studies [17], [33], [27] focus on dynamic graph repartitioning strategies to achieve a well-balanced workload. Dynamic graph repartitioning and PAGE are orthogonal. To balance the workload, these strategies repartition the graph according to the online workload, so the quality of the underlying graph partition changes as repartitioning proceeds. Existing graph computation systems may suffer from a high quality graph partition at some point, but PAGE mitigates this drawback and improves performance by further decreasing the cost of each worker.

8 CONCLUSION

In this paper, we have identified the partition unaware problem in current graph computation systems and its severe drawbacks for efficient parallel processing of large scale graphs. To address this problem, we proposed a partition aware graph computation engine named PAGE, which monitors three high-level key running metrics and dynamically adjusts its system configuration. In the adjustment model, we elaborated two heuristic rules to effectively extract the system characteristics and generate proper parameters. We have implemented a prototype system, and extensive experiments show that PAGE is an efficient and general parallel graph computation engine.


ACKNOWLEDGMENTS

The research is supported by the National Natural Science Foundation of China under Grant No. 61272155 and the 973 Program under No. 2014CB340405.

REFERENCES

[1] Giraph. https://github.com/apache/giraph.
[2] Storm. http://storm.incubator.apache.org/.
[3] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola. Distributed large-scale natural graph factorization. In WWW, pages 37–48, 2013.
[4] N. Backman, R. Fonseca, and U. Cetintemel. Managing parallelism for stream processing in the cloud. In HotCDP, pages 1:1–1:5, 2012.
[5] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In SIGKDD, pages 44–54, 2006.
[6] P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. In WWW, pages 587–596, 2011.
[7] P. Boldi and S. Vigna. The WebGraph framework I: compression techniques. In WWW, pages 595–602, 2004.
[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW, pages 107–117, 1998.
[9] A. Chan, F. Dehne, and R. Taylor. CGMgraph/CGMlib: Implementing and testing CGM graph algorithms on PC clusters and shared memory machines. Journal of HPCA, pages 81–97, 2005.
[10] G. Cong, G. Almasi, and V. Saraswat. Fast PGAS implementation of distributed graph algorithms. In SC, pages 1–11, 2010.
[11] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, pages 107–113, 2004.
[12] D. Gregor and A. Lumsdaine. The Parallel BGL: A generic library for distributed graph computations. In POOSC, 2005.
[13] C. A. R. Hoare. Communicating sequential processes. CACM, pages 666–677, 1978.
[14] U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A peta-scale graph mining system implementation and observations. In ICDM, pages 229–238, 2009.
[15] G. Karypis and V. Kumar. Parallel multilevel graph partitioning. In IPPS, pages 314–319, 1996.
[16] G. Karypis and V. Kumar. Multilevel algorithms for multi-constraint graph partitioning. In SC, pages 1–13, 1998.
[17] Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: a system for dynamic load balancing in large-scale graph processing. In EuroSys, pages 169–182, 2013.
[18] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, pages 604–632, 1999.
[19] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In WWW, pages 695–704, 2008.
[20] J. Leskovec, K. J. Lang, and M. Mahoney. Empirical comparison of algorithms for network community detection. In WWW, pages 631–640, 2010.
[21] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. PVLDB, pages 716–727, 2012.
[22] S. Ma, Y. Cao, J. Huai, and T. Wo. Distributed graph pattern matching. In WWW, pages 949–958, 2012.
[23] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135–146, 2010.
[24] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In ICDMW, pages 170–177, 2010.
[25] J. Ousterhout. Why threads are a bad idea (for most purposes). In USENIX Winter Technical Conference, 1996.
[26] A. Roy, I. Mihailovic, and W. Zwaenepoel. X-Stream: Edge-centric graph processing using streaming partitions. In SOSP, pages 472–488, 2013.
[27] S. Salihoglu and J. Widom. GPS: a graph processing system. In SSDBM, pages 22:1–22:12, 2013.
[28] B. Shao, H. Wang, and Y. Li. Trinity: A distributed graph engine on a memory cloud. In SIGMOD, pages 505–516, 2013.
[29] H. Simonis and T. Cornelissens. Modelling producer/consumer constraints. In CP, pages 449–462, 1995.
[30] I. Stanton and G. Kliot. Streaming graph partitioning for large distributed graphs. In SIGKDD, pages 1222–1230, 2012.
[31] C. Tsourakakis, C. Gkantsidis, B. Radunovic, and M. Vojnovic. FENNEL: Streaming graph partitioning for massive scale graphs. In WSDM, pages 333–342, 2014.
[32] Y. Shao, J. Yao, B. Cui, and L. Ma. PAGE: A partition aware graph computation engine. In CIKM, pages 823–828, 2013.
[33] Z. Shang and J. X. Yu. Catch the wind: Graph workload balancing on cloud. In ICDE, pages 553–564, 2013.

Yingxia Shao received his BSc in Information Security from Beijing University of Posts and Telecommunications in 2011. He is now a third-year PhD candidate in the School of EECS, Peking University. His research interests include large-scale graph analysis, parallel computing frameworks and scalable data processing.

Bin Cui is a Professor in the School of EECS and Vice Director of the Institute of Network Computing and Information Systems at Peking University. His research interests include database performance issues, query and index techniques, multimedia databases, Web data management, and data mining. He has served on the Technical Program Committees of various international conferences including SIGMOD, VLDB and ICDE, and as Vice PC Chair of ICDE 2011, Demo Co-Chair for ICDE 2014, and Area Chair of VLDB 2014. He is currently on the Editorial Boards of the VLDB Journal, IEEE Transactions on Knowledge and Data Engineering (TKDE), Distributed and Parallel Databases (DAPD), and Information Systems. Prof. Cui has published more than 100 research papers, and is the winner of the Microsoft Young Professorship award (MSRA 2008) and the CCF Young Scientist award (2009).

Lin Ma is a junior undergraduate student in the School of EECS, Peking University. He is an intern in the Data and Information Management group. Topics he has been working on include large graph processing, parallel graph computing and graph algorithms.