
Using Data Accessibility for Resource Selection in Large-scale Distributed Systems

Jinoh Kim, Member, IEEE, Abhishek Chandra, Member, IEEE, and Jon B. Weissman, Senior Member, IEEE

Abstract—Large-scale distributed systems provide an attractive scalable infrastructure for network applications. However, the loosely-coupled nature of this environment can make data access unpredictable, and in the limit, unavailable. We introduce the notion of accessibility to capture both availability and performance. An increasing number of data-intensive applications require not only considerations of node computation power but also accessibility for adequate job allocations. For instance, selecting a node with intolerably slow connections can offset any benefit of running on a fast node. In this paper, we present accessibility-aware resource selection techniques by which it is possible to choose nodes that will have efficient data access to remote data sources. We show that the local data access observations collected from a node's neighbors are sufficient to characterize accessibility for that node. By conducting trace-based, synthetic experiments on PlanetLab, we show that the resource selection heuristics guided by this principle significantly outperform conventional techniques such as latency-based or random allocations. The suggested techniques are also shown to be stable even under churn despite the loss of prior observations.

Index Terms—Data Accessibility, Resource Selection, Passive Network Performance Estimation, Data-intensive Computing, Large-scale Distributed Systems.

I. INTRODUCTION

Large-scale distributed systems provide an attractive scalable infrastructure for network applications. This virtue has led to the deployment of several distributed systems in large-scale, loosely-coupled environments such as peer-to-peer computing [1], distributed storage systems [2]–[4], and Grids [5]–[7]. In particular, the ability of large-scale systems to harvest idle cycles of geographically distributed nodes has led to a growing interest in cycle-sharing systems [8] and @home projects [9], [10]. However, a major challenge in such systems is the network unpredictability and the limited bandwidth available for data dissemination. For instance, the BOINC project [11] reports an average throughput of only about 289 Kbps, and a significant proportion of BOINC hosts show an average throughput of less than 100 Kbps [1]. In such an environment, even a few MBs of data transfer between poorly connected nodes can have a large impact on the overall application performance. This has severely restricted the amount of data used in such computation platforms, with most computations taking place on small data objects.

Emerging scientific applications, however, are data-intensive and require access to a significant amount of dispersed data.

The authors are with the University of Minnesota, 4-192 EECS Bldg, 200 Union St. SE, Minneapolis, MN 55455, USA. E-mail: {jinohkim,chandra,jon}@cs.umn.edu.

Manuscript received April 30, 2008; revised October 28, 2008; accepted January 6, 2009.

This work was supported in part by National Science Foundation grants ITR-0325949 and CNS-0643505.

Such data-intensive applications encompass a variety of domains such as high energy physics [12], climate prediction [13], astronomy [14] and bioinformatics [15]. For example, in high energy physics applications such as the Large Hadron Collider (LHC), thousands of physicists worldwide will require access to shared, immutable data at the scale of petabytes [16]. Similarly, in the area of bioinformatics, a set of gene sequences could be transferred from a remote database to enable comparison with input sequences [17]. In these examples, performance depends critically on efficient data delivery to the computational nodes. Moreover, the efficiency of data delivery for such applications would critically depend on the location of data and the point of access. Hence, in order to accommodate data-intensive applications in loosely-coupled distributed systems, it is essential to consider not only the computational capability, but also the data accessibility of computational nodes to the required data objects. The focus of this paper is on developing resource selection techniques suitable for such data-intensive applications in large-scale computational platforms.

Data availability has been widely studied over the past few years as a key metric for storage systems [2]–[4]. However, availability is primarily used as a server-side metric that ignores client-side accessibility of data. While availability implies that at least one instance of the data is present in the system at any given time, it does not imply that the data is always accessible from any part of the system. For example, while a file may be available with 5 nines (i.e. 99.999% availability) in the system, real access from different parts of the system can fail due to reasons such as misconfiguration, intolerably slow connections, and other networking problems. Similarly, the availability metric is silent about the efficiency of access from different parts of the network. For example, even if a file is available to two different clients, one may have a much worse connection to the file server, resulting in a much greater download time compared to the other. Therefore, in the context of data-intensive applications, it is important to consider the metric of data accessibility: how efficiently a node can access a given data object in the system.

The challenge we address is the characterization of accessibility from individual client nodes in large distributed systems. This is complicated by the dynamics of wide-area networks, which rule out static a-priori measurement, and the cost of on-demand information gathering, which rules out active probing. Additionally, relying on global knowledge obstructs scalability, so any practical approach must rely on local information. To achieve accessibility-aware resource selection, we exploit local, historical data access observations. This has several benefits. First, it is fully scalable as it does


not require global knowledge of the system. Second, it is inexpensive as we employ observations of the node itself and its directly connected neighbors (i.e. one hop away). Third, past observations are helpful to characterize the access behavior of the node. For example, a node with a thin access link is likely to show slow access most of the time. Last, by exploiting relevant access information from the neighbors, it is possible to obviate the need for explicit probing (e.g. to determine network performance to the server), thus minimizing system and network overhead.

Our key research contributions are as follows:

• We present accessibility estimation heuristics which employ local data access observations, and demonstrate that the estimated data download times are fairly close to real measurements, with 90% of the estimates lying within 0.5 relative error in live experimentation on PlanetLab.

• We infer the latency to the server based on prior neighbor measurements without explicitly probing the server. For this, we extend existing estimation heuristics [18]–[20] to work more accurately with a limited number of neighbors. Our enhancement gives accurate results even with only a few neighbors.

• We present accessibility-aware resource selection techniques based on our estimation functions and compare them to optimal and conventional techniques such as latency-based and random selection. Our results indicate that our approach not only outperforms the conventional techniques, but does so over a wide range of operating conditions.

• We investigate the impact of churn prevalent in loosely-coupled distributed systems. Churn is critical to our resource selection techniques because we assume that nodes lose all past observations when they rejoin the system. Despite this stringent memory-loss property, the results show that our techniques perform well even under a certain degree of churn.

II. ACCESSIBILITY-BASED RESOURCE SELECTION

In this section, we present our system model, followed by an overview of the accessibility-based resource selection algorithm that uses data accessibility to select appropriate compute nodes in the system.

A. System Model

Our system model consists of a network of compute nodes that provide computational resources for executing application jobs, and data nodes that store data objects required for computation. In our context, data objects can be files, database records, or any other data representations. We assume both compute and data nodes are connected in an overlay structure. We do not assume any specific type of organization for the overlay. It can be constructed by using typical overlay network architectures, such as unstructured [21], [22] and structured [23]–[26], or any other techniques. However, we assume that the system provides built-in functions for object store and retrieval so that objects can be disseminated and accessed by any node across the system. Each node in the network can be a compute node, data node, or both.

Since scalability is one of our key requirements, we do not assume any centralized entities holding system-wide information. For this reason, any node in the system can submit a job in our system model. A job is defined as a unit of work which performs computation on an object. To allocate a job, a submission node, called an initiator, selects a compute node from a set of candidates. We assume the use of a resource discovery algorithm [7], [8], [27] to determine the set of candidate nodes, though it may not consider data locality in its choice. Once the initiator selects a node, the job is transferred to the selected node, called a worker. The worker then downloads the data object required for the job from the network and performs the computation. When the job execution is finished, the worker returns the result to the initiator.

Formally, job $J_i$ is defined as a computation unit which requires object $o_i$ to complete the task. We assume that objects, e.g. $o_i$, have already been staged in the network and perhaps replicated to a set of nodes $R_i = \{r_i^1, r_i^2, \ldots\}$ based upon projected demand. The job $J_i$ is submitted by the initiator. From the given candidates $C = \{c_1, c_2, \ldots\}$, the initiator selects one (i.e. worker $\in C$) to allocate the job.

B. Resource Selection

Figure 1 illustrates the resource selection process in our system model once the initiator has a set of candidate nodes to choose from. To select one of the given candidates, the initiator first queries the candidates for relevant information that can be used for job allocation, since there is no entity with global information (Figure 1(a)). The candidates offer the relevant information (Figure 1(b)), based on which the initiator allocates the job to the selected worker (Figure 1(c)). To incorporate the impact of data access on the performance of job execution, our goal is to select the best candidate node in terms of accessibility to a data node (server) holding object $o_i$.

Due to the decentralized nature of our system, we would like to make this selection without assuming any global knowledge. To achieve this goal, we use an accessibility-based ranking function to order the different candidate nodes. Since our goal is to maximize the efficiency of data access from the selected worker node, we use the expected data download time as the metric to quantify accessibility. Thus, given a set of candidates $C$ for job $J_i$ that requires access to object $o_i$, each candidate node $c_m$ returns its accessibility $accessibility_{c_m}(J_i)$ in terms of the estimated download time for the object $o_i$, and the initiator then selects the node with the smallest accessibility value. Note that since we assume the lack of any global knowledge, these estimates are based on the local information available to the individual candidate nodes. Therefore, sometimes a candidate cannot provide any meaningful estimate of its accessibility to the required data object. In this case, the candidate simply returns a value of $\infty$, indicating the lack of any information. The initiator would filter out such a candidate.


Fig. 1. Accessibility-based resource selection

If all candidates return $\infty$, one of the candidates is randomly selected. Formally, the selection heuristic $H_s$ is defined as follows:

$$H_s : C \to c_m \quad \text{such that} \quad accessibility_{c_m}(J_i) = \min_{n=1,\ldots,|C|} \left( accessibility_{c_n}(J_i) \right).$$
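To make the initiator's side of $H_s$ concrete, the following is a minimal Python sketch; the query interface `query_accessibility` is a hypothetical stand-in for whatever RPC the deployed system would use to ask a candidate for its estimate.

```python
import math
import random

def select_worker(candidates, job, query_accessibility):
    """Select the candidate with the smallest estimated download time (H_s).

    query_accessibility(c, job) is assumed to return the candidate's
    accessibility estimate in seconds, or math.inf when the candidate
    has no usable local information.
    """
    estimates = {c: query_accessibility(c, job) for c in candidates}
    # Filter out candidates that returned infinity (no information).
    usable = {c: t for c, t in estimates.items() if t != math.inf}
    if not usable:
        # All candidates returned infinity: fall back to random selection.
        return random.choice(list(candidates))
    return min(usable, key=usable.get)
```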

Having described the accessibility-based resource selection process, the question is how the candidate nodes can estimate their accessibility using local information (e.g. their own observations of the object, if known, or their neighbors'), and what factors they can use for this estimation. We explore this question in the next section.

III. ACCESSIBILITY ESTIMATION

A. Accessibility Parameters

To answer the above question, we first investigate what parameters would impact accessibility in terms of data download time. Intuitively, a node's accessibility to a data object will depend on two main factors: the location of the data object with respect to the node, and the node's network characteristics, such as its connectivity, bandwidth, and other networking capabilities. We have explored a variety of parameters to characterize these factors, and found some interesting correlations. For this characterization, we conducted experiments on PlanetLab [28] with 133 hosts over three weeks. In these experiments, 18 data objects of 2 MB each were randomly distributed over the nodes, and over 14,000 download operations were carried out to form a detailed trace of data download times. To measure inter-node latencies, an ICMP ping test was repeated nine times over the 3-week period, and the minimal latency was selected to represent the latency for each pair. We next give a brief description of the main results of this study.

The first result is the correlation of latency and download speed (defined as the ratio of downloaded data size to download time) between node pairs. Figure 2(a) plots the relationship between RTT and download speed. We find a negative correlation (r = −0.56) between them, indicating that smaller latency between client and server would lead to better performance in downloading. Thus, latency can be a useful factor when estimating accessibility between node pairs.

In addition, we discovered a correlation between the download speed of a node for a given object and the past average download speed of the node, as shown in Figure 2(b) (r = 0.60). The intuition behind this correlation is that past download behavior may be helpful to characterize the node in terms of its network characteristics such as its connectivity and bandwidth. For example, if a node is connected to the network with a bad access link, it is almost certain that the node will yield low performance in data access to any data source. This result suggests that the past download behavior of a node can be a useful component for accessibility estimation.

Based on the statistical correlations we discovered, we next present estimation techniques to predict the data access capabilities of a node for a data object. Note that we do not assume global knowledge of these parameters (e.g. pairwise latencies between different nodes), but use hints based on local information at candidate nodes to get accessibility estimates. It is worth mentioning that it is not necessary to estimate the exact download time; rather, our intention is to rank nodes based on accessibility so that we can choose a good node for job allocation. Nonetheless, if the estimation has little relevance to the real performance, then the ranking may deviate far from the desired choices. Hence we require that the estimation techniques demonstrate sufficiently accurate results which can be bounded within a tolerable error range.

B. Self-Estimation

As described in III-A, the latency to the server¹ and the download speed of a node are useful to assess its accessibility to a data object. We first provide an estimation technique that uses historical observations made by a node during its previous downloads to estimate these parameters. Note that these past downloads can be of any objects and need not be for the object in question. We refer to this technique as self-estimation.

To employ past observations in the estimation process, we assume that the node records the access information it has observed. Suppose $H_h^i$ is the $i$-th download entry at host $h$. This entry includes the following information: object name, object size, download elapsed time, server, distance to server, and timestamp. As a convention, we use dot (.) notation to refer to an item of the entry; for example, $H_h^i.size$ represents the object size in the $i$-th observation at host $h$, and $|H_h|$ denotes the number of observations at host $h$.
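As an illustration only, a download history entry could be represented as below; the field names mirror the items listed above, but the record type itself is our own construction.

```python
from dataclasses import dataclass

@dataclass
class DownloadEntry:
    """One download observation H_h^i recorded at host h."""
    name: str         # object name
    size: float       # object size (bytes)
    elapse: float     # download elapsed time (seconds)
    server: str       # server the object was downloaded from
    distance: float   # distance to the server (see the distance metric below)
    timestamp: float  # when the download was observed
```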

We first estimate a distance factor between the node and the server, based on their inter-node latency.

¹For ease of exposition here, we assume each data object is located on a single server. However, we relax this assumption and consider data replication in our experiments in Section IV-I.

Page 4: Using Data Accessibility for Resource Selection in …jon/papers/tpds_20_6.pdfUsing Data Accessibility for Resource Selection in Large-scale Distributed Systems Jinoh Kim, Member,

[Figure: two scatter plots of download speed (KB/s). (a) Correlation of RTT (msec) and download speed, cor = −0.56. (b) Correlation of past download speed (KB/s) and download speed, cor = 0.60.]

Fig. 2. Correlations of access parameters

For this, we consider several related latency models for the distance metric: RTT and the square root of RTT. These are often used in TCP studies to cope with congestion efficiently and improve system throughput. Studies of window-based [32] and rate-based [33] congestion control revealed that system throughput is inversely proportional to RTT and to the square root of RTT, respectively. We consider both latency models for the distance metric and compare them to see which is preferable later in this section. The mean distance of node $h$ to the servers is then computed by:

$$Distance_h = \frac{1}{|H_h|} \sum_{i=1}^{|H_h|} H_h^i.distance$$

We then characterize the network characteristics of the node by estimating its mean download speed based on prior observations. The mean download speed of node $h$ is defined as:

$$DownSpeed_h = \frac{1}{|H_h|} \sum_{i=1}^{|H_h|} \frac{H_h^i.size}{H_h^i.elapse}$$

Using the above factors, we estimate the expected download time for a host $h$ to download object $o$ as:

$$SelfEstim_h(o) = \delta \cdot \frac{size(o)}{DownSpeed_h} \quad (1)$$

where

$$\delta = \frac{distance_h(server(o))}{Distance_h}$$

Here, $size(o)$ denotes the size of object $o$, $server(o)$ denotes the server for object $o$, and $distance_a(b)$ denotes the distance between nodes $a$ and $b$.

Intuitively, the parameter $\delta$ gives the ratio of the distance to the server for object $o$ to the mean distance the node has observed. A smaller $\delta$ means that the distance to the server is closer than the average distance, and hence the estimated download time is likely to be smaller than for previous downloads. The other part of Equation 1 uses the mean download speed to derive the estimated download time as being proportional to the object size.
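Taken together, Equation 1 and the two history averages admit a direct transcription. The sketch below assumes the hypothetical DownloadEntry record from above, and that the node already knows (or has inferred) its distance to the object's server.

```python
def mean_distance(history):
    """Mean distance of the host to the servers in its history (Distance_h)."""
    return sum(e.distance for e in history) / len(history)

def mean_down_speed(history):
    """Mean download speed over the history (DownSpeed_h)."""
    return sum(e.size / e.elapse for e in history) / len(history)

def self_estimate(history, object_size, dist_to_server):
    """Expected download time per Equation 1."""
    delta = dist_to_server / mean_distance(history)
    return delta * object_size / mean_down_speed(history)
```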

To see how well self-estimation performs, we conducted a simulation with the data set mentioned earlier in this section.

[Figure: CDF comparison of self-estimation results; x-axis: Ratio = (Estimated/Measured), y-axis: Cumulative Fraction; curves for Distance = RTT + 1 and Distance = √(RTT + 1).]

Fig. 3. Self estimation result

To assess the accuracy, we compute the ratio of the estimated result to the measured one. Thus, an estimated-to-measured ratio of 1 means that the estimation is perfect. If the ratio is 0.5, it means an underestimation by a factor of two, whereas a ratio of 2 means an overestimation by a factor of two. In the simulation, the node attempts estimation using Equation 1 with the observations it measured in the data set. The estimation was performed against all the actual measurements.

Figure 3 presents the estimation results of self-estimation in a cumulative distribution graph of the ratio of the estimated to the measured. As can be seen in the figure, $\sqrt{RTT}$ shows better accuracy than the native RTT. Using $\sqrt{RTT}$, nearly 90% of the total estimations occur within a factor of two (i.e. between 0.5 and 2 on the x-axis). In contrast, the native RTT yields 79% of the total estimations within the same error margin. Based on this result, we make use of the square root of RTT as the distance metric². With this distance metric, we can see that a significant portion of the estimations occur around the ratio 1, indicating that the estimation function is fairly accurate. We will see in Section IV that this level of accuracy is sufficient for use as a ranking function to rank different candidate nodes for resource selection.

²We set $distance = \sqrt{RTT + 1}$, where RTT is in milliseconds and 1 is added to avoid division by zero.

We then investigated the impact of the number of observations on estimation. For this, we traced how many estimates reside within a factor of two of the measured ones, and observed that self-estimation produces fairly accurate results even with a limited number of observations. Initially, the fraction was quite small (below 0.7), but it sharply increased as more observations were made. With 10 observations, for example, the fraction goes beyond 0.8, and it approaches 0.9 with 20 observations. This result allows us to maintain a finite, small number of observations (by applying a simple aging-out technique, for example) to achieve a certain degree of accuracy.

Since self-estimation does not require prior observations for the object in question, it must first search for the server and then determine the network distance to it. Search is often done by flooding in unstructured overlays [34], or by routing messages in structured overlays [23]–[26], which may introduce extra traffic. Distance determination would require probing, which adds additional overhead.

C. Neighbor Estimation

While self-estimation uses a node's prior observations to estimate the accessibility to a data object, it is possible that the node may have only a few prior download observations (e.g. if it has recently joined the network), which could adversely impact the accuracy of its estimation. Further, as mentioned above, self-estimation also needs to locate the object's server and determine its latency to the server to get a more accurate estimation. This server location and probing could add additional overhead and latency to the resource selection.

To avoid these problems, we now present an estimation approach that utilizes the prior download observations from a node's neighbors in the network overlay for its estimation. We call this approach neighbor estimation. The goal of this approach is to avoid any active server location or probing. Moreover, by utilizing the neighbors' information, it is likely to obtain a richer set of observations to be used for estimation. However, the primary challenge with using neighbor information is to correlate a neighbor's download experience to the node's experience, given that the neighbor may be at a different location and may have different network characteristics from the node. Hence this work differs from previous passive estimation work [35], [36], which exploited topological or geographical similarity (e.g. the same local network or the same IP prefix). Instead, we characterize the node with respect to data access, and then make an estimation by correlating the characterized values to those from the neighbor, thus enabling the sharing of observations without any topological constraints between neighbors.

To assess the downloading similarity between a candidate node and a neighbor, we first define the notion of download power (DP) to quantify the data access capability of a node. The idea is that a node with higher DP is considered to be superior in downloading capability to a node with lower DP.

[Figure: computed DP versus time (number of computations of DP) for sampled nodes; DP values stabilize over time.]

Fig. 4. Snapshot of DP changes

We formulate DP for a host $h$ as follows:

$$DP_h = \frac{1}{|H_h|} \sum_{i=1}^{|H_h|} \left( \frac{H_h^i.size}{H_h^i.elapse} \times H_h^i.distance \right) \quad (2)$$

Intuitively, this metric combines the metrics of download speed and distance defined in the previous subsection. As seen from Equation 2, DP ∝ download speed, which is intuitive, as it captures how fast a node can download data in general. Further, we also have DP ∝ distance to the server, which implies that for the same download speed to a server, the download power of a node is considered higher if it is more distant from the server. Consider an example to understand this relation between download power and distance. Suppose that two client nodes, one in the US and one in Asia, access data from servers located in the US. Then, if the two clients show the same download time for the same object, the one in Asia might be considered to have better downloading capability for more distant servers, as the US client's download speed could be attributed to its locality. Hence, access over greater distance is given greater weight in this metric. To minimize the effect of download anomalies and inconsistencies, we compute DP as the average across the node's history of downloads from all servers. Figure 4 shows a snapshot of DP value changes for 10 sampled nodes. We can see that DP values become stable as more observations accumulate over time. According to our observations, node DP changes of greater than ±10% accounted for less than 1% of the whole.
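Equation 2 is again a simple average over the download history; a one-function sketch under the same assumed record type:

```python
def download_power(history):
    """Download power DP_h (Equation 2): mean of speed x distance."""
    return sum((e.size / e.elapse) * e.distance for e in history) / len(history)
```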

Now, we define a function for neighbor estimation at host $h$ by using information from neighbor $n$ for object $o$:

$$NeighborEstim_h(n, o) = \alpha \cdot \beta \cdot elapse_n(o) \quad (3)$$

where

$$\alpha = \frac{DP_n}{DP_h}, \qquad \beta = \frac{distance_h(server(o))}{distance_n(server(o))},$$

and $elapse_n(o)$ is the download time observed by the neighbor for the object. It is possible that the neighbor has multiple observations for the same object, in which case we pick the smallest download time as the representative.

Intuitively, to estimate the download time for object $o$ based on the information from neighbor $n$, this function uses the relevant download time of the neighbor. As a rule, the estimation result is the same if all conditions are equivalent to the neighbor's. To account for differences, we employ the two parameters $\alpha$ and $\beta$. The parameter $\alpha$ compares the download powers of the node and the neighbor for similarity. If the DP of the node is higher than the neighbor's, the function gives a smaller estimated time because the node is considered superior to the neighbor in terms of accessibility. The parameter $\beta$ compares the distances to the server, so that if the node's distance to the server is smaller than the neighbor's, the resulting estimate will be smaller³. These correlations enable us to share observations between neighbors without any topological restrictions.

[Figure: CDF of neighbor estimation results; x-axis: Ratio = (Estimated/Measured), y-axis: Cumulative Fraction.]

Fig. 5. Neighbor estimation result

Figure 5 illustrates the cumulative distribution of neighbor estimation results. As in the self-estimation simulation, the estimation was made against all the actual measurements with a relevant observation measured at any other node. In other words, for each measured value, an estimation was attempted for every single observation that any other node had measured for that object. As seen from the figure, a substantial portion of the estimated values are located near the ratio 1. Nearly 84% of estimations reside within a factor of two of the corresponding measurements. This suggests that neighbor estimation produces useful information to rank nodes with respect to accessibility.
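Equation 3 translates into a small function; in this sketch, both distance arguments are assumed to be available to the candidate (Section III-D describes how the candidate's distance to the server can be inferred without probing).

```python
def neighbor_estimate(dp_self, dp_neighbor,
                      dist_self, dist_neighbor, neighbor_elapse):
    """Expected download time per Equation 3.

    neighbor_elapse is the neighbor's smallest observed download time
    for the object in question.
    """
    alpha = dp_neighbor / dp_self      # download-power similarity
    beta = dist_self / dist_neighbor   # relative distance to the server
    return alpha * beta * neighbor_elapse
```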

While neighbor estimation is useful for the assessment of accessibility, multiple neighbors can provide different information for the same object. For example, if three neighbors offer their observations to a node, there can be three estimates which may have different values. Thus we need to combine those different estimates to obtain more accurate results. We examined several combination functions, such as the median, truncated mean, and weighted mean, and observed that taking the median value works well even with a small number of neighbors. Given that the number of neighbors providing relevant observations may be limited in many cases, we believe that taking the median is a good choice. Hence, combining multiple estimates is done by:

$$NeighborEstim_h(o) = \operatorname{median}_{n_i \in N'}\left( NeighborEstim_h(n_i, o) \right),$$

where $N'$ is the subset of the neighbor set $N$ ($N' \subseteq N$) which only includes neighbor nodes offering $NeighborEstim_h(n_i, o)$.

³We discuss how the server distance can be estimated without active probing in Section III-D.

We observed that combining multiple estimates with the median function significantly improves the accuracy. Although omitted due to space reasons, estimation with 4 neighbor observations yielded nearly 90% of estimates within a factor of two, compared to 84% with a single neighbor observation. It exceeds 92% with 8 neighbor observations.
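Combining the per-neighbor estimates then reduces to a median over whichever neighbors could answer, for example:

```python
from statistics import median

def combined_neighbor_estimate(per_neighbor_estimates):
    """Median of the available per-neighbor estimates (Section III-C).

    Returns None when no neighbor offered an estimate; the candidate
    would then report infinity to the initiator.
    """
    if not per_neighbor_estimates:
        return None
    return median(per_neighbor_estimates)
```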

To realize neighbor estimation, it is necessary to gather information from the neighbor nodes. This can be done by on-demand requests, background communication, or any hybrid of the two. Piggybacking on periodic heartbeats in the overlay network can be a practical option to save overhead.

D. Inferring Server Latency without Active Probing

While neighbor estimation requires the latency to the server as a parameter (Equation 3), we can avoid the need for active probing by exploiting the server latency estimates obtained from the neighbors themselves. If a neighbor has contacted the server, it could obtain the latency at that time by using a simple latency computation technique, e.g. the time difference between the TCP SYN and SYNACK when performing the download, and this latency information can be offered to its neighbor nodes. By utilizing the latency information observed at the neighbor nodes, it is possible to minimize the additional estimation overhead of server location and pinging.

According to the study in [39], a significant portion of total paths (> 90%) satisfied the triangle inequality. We also observed that 95% of the total paths in our data satisfied this property. The triangulated heuristic estimates the network distance based on this property. It infers the latency between peers using a set of landmarks which hold precalculated latency information between the peers and themselves [20]. The basic idea is that the latency between nodes $a$ and $c$ lies between $|latency(a, b) - latency(b, c)|$ and $latency(a, b) + latency(b, c)$, where $b$ is one of the landmarks ($b \in B$). With a set of landmarks, it is possible to obtain a set of lower bounds ($L_A$) and upper bounds ($U_A$). If we define $L = \max(L_A)$ and $U = \min(U_A)$, then the range $[L, U]$ should be the tightest stretch with which all inferred results agree. For the inferred value, Hotz [18] suggested $L$ because it is admissible for use in an A* search heuristic, while $U$ and linear combinations of $L$ and $U$ are not admissible. Guyton and Schwartz [19] employed $(L + U)/2$, and most recently Eugene and Zhang reported that $U$ performs better than the others [20].
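For reference, a minimal version of the classic triangulated heuristic is sketched below; `lat` is assumed to map known node pairs to measured latencies.

```python
def triangulated_bounds(a, c, landmarks, lat):
    """Per-landmark latency bounds for the unknown pair (a, c).

    lat[(x, y)] is the known latency between x and y; each landmark b
    is assumed to know its latency to both a and c.
    Returns (lower_bounds, upper_bounds); L = max(lower), U = min(upper).
    """
    lows = [abs(lat[(a, b)] - lat[(b, c)]) for b in landmarks]
    highs = [lat[(a, b)] + lat[(b, c)] for b in landmarks]
    return lows, highs
```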

In our system model, we can use neighbors as the landmarks because they hold latency information both to the candidate and to the object server. By applying the triangulated heuristic, therefore, we can infer the latency between the candidate and the server without probing. However, we found that the existing heuristics are inaccurate with a small number of neighbors, which may be common in our system model. Hence we enhance the triangulated heuristic to account for a limited number of neighbors.

Our approach works by handling several situations that contribute to inaccuracy.


[Figure: RTT inference error versus number of landmarks (neighbors) for the heuristics L, U, (L+U)/2, and Enhanced. (a) Absolute error (msec). (b) Relative error.]

Fig. 6. Latency inference results

For example, it is possible to have $L > U$ due to outliers for which the triangle inequality does not hold. Consider the following situation: all but one landmark give reasonable latencies, but that one gives fairly large low and high bounds; the expected convergence would not occur, leading to an inaccurate answer. To overcome this problem, we remove all $L_i \in L_A$ which are greater than $U$, so we obtain a new range that satisfies $L < U$. After doing so, we observed that taking the simple mean produces much better results than the existing approaches.

We also observed a problematic situation where a significant portion of the inferred low bounds suggest similar values, but the high bounds have a certain degree of variance. This happens when node $c$ is close to $a$ but the landmarks are all far from node $a$. For this, we consider a weighted mean based on standard deviations ($\sigma$). The intuition behind this is that if multiple inferred bounds suggest similar values for either the low or the high bounds, it is likely that the real latency is around that point. We take the weighted mean when the range fails to converge because it is too wide, in which case picking any one of $L$, $U$, and $(L + U)/2$ is likely to be highly inaccurate. The weighted mean is defined as follows:

$$L \cdot \left( 1 - \frac{\sigma_{L_A}}{\sigma_{L_A} + \sigma_{U_A}} \right) + U \cdot \left( 1 - \frac{\sigma_{U_A}}{\sigma_{L_A} + \sigma_{U_A}} \right)$$
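A sketch of the enhanced estimator combining the two fixes (outlier trimming followed by a simple mean, with a fallback to the deviation-weighted mean when the [L, U] range is too wide). We read "simple mean" as the mean of L and U after trimming, and the wideness test is a tunable threshold; both are our assumptions, since the text does not pin down these details.

```python
from statistics import pstdev

def enhanced_latency_estimate(lows, highs, wide_factor=2.0):
    """Infer latency from per-landmark bounds (see triangulated_bounds).

    1. Trim lower bounds exceeding U = min(highs) so that L < U holds.
    2. If the [L, U] range is reasonably tight, take the simple mean.
    3. Otherwise take the standard-deviation-weighted mean of L and U.
    """
    U = min(highs)
    trimmed = [l for l in lows if l <= U]
    L = max(trimmed) if trimmed else 0.0
    if U <= wide_factor * max(L, 1e-9):
        return (L + U) / 2.0                 # simple mean on a tight range
    s_l, s_u = pstdev(lows), pstdev(highs)
    if s_l + s_u == 0:
        return (L + U) / 2.0
    # Weighted mean: the tighter-clustered bound set gets the larger weight.
    w_l = 1.0 - s_l / (s_l + s_u)
    w_u = 1.0 - s_u / (s_l + s_u)
    return L * w_l + U * w_u
```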

We report the evaluation results with the absolute error as well as the relative error for clarity. For example, consider two measured latencies of 1 ms and 100 ms and corresponding estimations of 2 ms and 200 ms: the two estimations give the same picture with respect to relative error (relative error = 1 in this example), but they convey different information with respect to absolute error. In fact, a 1 ms difference is usually acceptable for latency inference, but a 100 ms error is not.

Figure 6 demonstrates the inference results. As reported in [20], the heuristic employing $U$ is overall better than the other two existing heuristics. However, we can see that our enhanced heuristic substantially outperforms the existing heuristics with respect to both the relative and absolute error metrics. In particular, the enhanced heuristic works well even when the number of landmarks is small.

TABLE I
DOWNLOAD TRACES

Trace   # nodes   # objects   # downloads
1M      153       72          22,509
2M      231       83          25,934
4M      167       107         28,439
8M      158       85          26,105

Since the number of neighbors which can offer the relevant latency information may be limited, the enhanced heuristic is desirable in our design. In other words, it is possible to infer the latency to the server with fairly high accuracy even when only a few neighbor nodes can provide relevant information.

IV. EVALUATION

A. Experimental Setup

We conducted over 100K actual downloading experiments over a span of 5 months with 241 PlanetLab nodes geographically distributed across the globe. For this, we deployed a Pastry [25], [40] network, a structured overlay based on a DHT ring. We distributed data objects of four sizes, 1M, 2M, 4M, and 8M bytes, over the network, each object with a unique key. We then generated a series of random queries so that the selected nodes download the relevant objects. Table I provides the details of the download traces. In the simulations, we use a mixture of all traces rather than individual traces, unless otherwise mentioned.

To evaluate resource selection techniques, we designed and implemented a simulator which takes as input the ping maps and the collective downloading traces, and outputs performance results according to the selection algorithms. Initially, the simulator constructs a network in which nodes are randomly connected to each other with a predefined neighbor size, without any locality or topological considerations. To minimize error due to the construction, we repeated simulations 50 times and report the results with 95% confidence intervals as needed. After constructing the network, the simulator runs each resource selection algorithm. Initially, it constructs a virtual trace in which the list of candidates and the download time from each candidate are recorded. The candidate nodes are randomly chosen for each allocation.


[Figure: ratio to optimal versus time over 100K allocations (Mix, C=8, N=8) for RANDOM, PROXIM, SELF, and NEIGHBOR.]

Fig. 7. Performance over time

As a candidate may have more than one actual download record for a server, the download time is also randomly selected from them. The simulator then selects a worker based on each selection algorithm, and the download time for the selected worker is returned from the virtual trace.

For our evaluation, we compare the resource selection techniques based on our estimation techniques with two conventional techniques: random and latency-based selection. The following describes the resource selection techniques:

• OMNI: Optimal selection.

• RANDOM: Random selection.

• PROXIM: Latency-based selection.

• SELF: performs the selection by self-estimation. One exception is that it allows the node to make an estimate from partial observations if it has any observations to the object server⁴. This can improve accuracy. If no estimate is available, it performs random selection.

• NEIGHBOR: performs the selection based on neighbor estimation. If no estimate is available, it performs random selection.

B. Performance Comparison over Time

We begin by presenting the performance comparison over time. Figure 7 compares the performance over 100K consecutive job allocations. As the default, we set both the candidate size and the neighbor size to 8 (this applies to all the following experiments, unless otherwise mentioned). Overall, the proposed techniques yield good results: SELF is the best across time, and NEIGHBOR works better than PROXIM most of the time. RANDOM yields poor performance with a significant degree of variation, as expected. PROXIM is about 3 times optimal, with a relatively high degree of variation compared to the suggested techniques. SELF works best, approaching about 1.4 times optimal at the end of the simulation. This shows that simple consideration of past access behavior in addition to latency greatly helps in choosing a good candidate.

⁴This is done by a simple statistical estimator: $\frac{size(o)}{DownSpeed_h(s)}$, where $DownSpeed_h(s)$ stands for the mean download speed from the node to the server $s$.

[Figure: ratio to optimal versus neighbor size (Mix, C=8, Allocation=50k) for RANDOM, PROXIM, SELF, and NEIGHBOR.]

Fig. 9. Impact of neighbor size

NEIGHBOR is poor at first, but outperforms PROXIM after about 6K simulation time steps. This is because there are many more chances of random selection in the first stage; after warming up, however, it exploits neighbor observations, leading to better performance. Nonetheless, NEIGHBOR shows a noticeable gap to SELF. This can be explained mainly by the hit rate on relevant observations from the neighbors; we observed that the average number of observations was around 2 even at the end of the simulation, while neighbor estimation yields stable results with more than 4 observations, as discussed in III-C. Thus, NEIGHBOR could perform better with a higher hit rate; improving the hit rate is part of our future work.

C. Impact of Candidate Size

In our system model, a set of candidate nodes is evaluated for accessibility before allocating a job. We now investigate the impact of the candidate size (|C|). Figure 8 demonstrates the performance changes with respect to candidate size. In Figure 8(a), the mean ratio to optimal increases with the candidate size. This is because OMNI has many more chances to see better candidates to choose from, resulting in larger performance gaps. Nonetheless, we can see that the suggested techniques work better with more candidates, making their slopes gentle compared to the conventional ones. Figure 8(b) compares the mean download time for the selection techniques. As seen in the figure, SELF continues to produce diminished elapsed times as the candidate size increases, yielding the best results among the selection techniques. NEIGHBOR follows SELF, with considerable gaps over the conventional techniques. Interestingly, PROXIM shows unstable results with greater fluctuation than RANDOM over the candidate sizes. This result indicates that the proposed techniques not only work better than the conventional ones across candidate sizes, but also further improve as the candidate size increases.

D. Impact of Neighbor Size

We next investigate the impact of neighbor size on NEIGHBOR (the other heuristics are not affected by this parameter). Figure 9 shows how the selection techniques respond across the number of neighbors (|N|). As can be seen in the figure,


[Figure: impact of candidate size (Mix, N=8, Allocation=50k). (a) Mean ratio to optimal versus candidate size for RANDOM, PROXIM, SELF, and NEIGHBOR. (b) Mean download elapsed time (sec) versus candidate size for OMNI, RANDOM, PROXIM, SELF, and NEIGHBOR.]

Fig. 8. Impact of candidate size

increasing the neighbor size dramatically improves the performance, while the other techniques show no change, as expected. For example, the average download time at |N| = 16 drops to about 70% of that at |N| = 2. The ratio to optimal also drops from 4.0 at |N| = 2 to 2.6 at |N| = 16. This is because the node has more chances to obtain relevant observations with more neighbors, thus decreasing the possibility of random selection. This result suggests that NEIGHBOR will work better in environments where the node has connectivity to a greater number of neighbors.

E. Impact of Data Size

We continue by investigating how the selection techniques work over different data sizes. Since the size of accessed objects can vary depending on the application, selection techniques should work consistently across a range of data sizes. In this experiment, we run the simulation with the individual traces rather than the mixture of traces. In Figure 10, we can see a linear relationship between data size and mean download time. However, each technique shows a different slope: SELF and NEIGHBOR increase more gently than the conventional heuristics. With a simple calculation, the slopes (i.e. ∆y/∆x) of the techniques are RANDOM = 10.9, PROXIM = 8.1, SELF = 3.8, and NEIGHBOR = 5.1. This result implies that the proposed techniques not only work consistently across different data sizes, but are also much more useful for data-intensive applications.

F. Timeliness

While it is crucial to choose good nodes for job allocation, it is also important to avoid bad nodes when making a decision. For instance, selecting intolerably slow connections may lead to job incompletion due to excessive downloading cost or timeouts. However, it is almost impossible to pick good nodes every time because there are many contributing factors.

We observed how many times the techniques choose slow connections. Figure 11 shows the cumulative distributions of the speed of connections on log-log scales, and we can see that the proposed techniques more often avoid slow connections.

[Figure: mean elapsed time (sec) versus data size (1MB, 2MB, 4MB, 8MB; C=8, N=8, Allocation=50k) for OMNI, RANDOM, PROXIM, SELF, and NEIGHBOR.]

Fig. 10. Impact of data size

SELF most successfully excludes low-speed connections, and NEIGHBOR also performs better than the conventional techniques. When we count the number of poor connections selected, SELF chose connections under 5 KB/s fewer than 30 times, while PROXIM made over 290 such selections, almost an order of magnitude more than SELF. One interesting result is that PROXIM selects poor connections more frequently than RANDOM (293 and 194 times, respectively). This implies that relying on latency information alone greatly increases the chance of very poor connections, thus leading to unpredictable response times. In contrast, our proposed techniques successfully reduce the chances of choosing low-speed connections by taking accessibility into account.

G. Multi-object Access

Many distributed applications request multiple objects [41], meaning that a job of such an application accesses more than one object to complete its task. For example, bioinformatics applications such as BLAST [15] repeatedly access remote gene databases to compare patterns. We conducted experiments to see the impact of multi-object access. Figure 12 shows the results where jobs are required to access multiple objects. As can be seen in the figure, the ratio to optimal gradually decreases with an increasing number of objects for all selection techniques.


[Figure: CDFs of download speed (KB/s) on log-log scales (Mix, C=8, N=8, Allocation=100k) for SELF, NEIGHBOR, RANDOM, and PROXIM.]

Fig. 11. Cumulative distribution of download speed

[Figure: ratio to optimal versus number of objects to access (Mix, C=8, N=8, Allocation=50k) for RANDOM, PROXIM, SELF, and NEIGHBOR.]

Fig. 12. Multi-object access

This is because even optimally selected nodes may not have good performance for some objects, resulting in greatly increased download times. SELF and NEIGHBOR not only consistently outperform the conventional techniques across the number of objects, but they also approach optimal (ratio = 1.24 and 1.55, respectively, when the number of objects is 8). To sum up, the suggested techniques also work better than the conventional techniques for multi-object access.

H. Impact of Churn

Churn is prevalent in loosely-coupled distributed systems. To see the impact of churn, we assume that the session lengths of nodes are exponentially distributed. In this context, session length is measured in simulation time. For example, if the session length of a node is 100, the node changes its status to inactive after 100 simulation time steps. The node then joins again after another 100 time steps. We assume that nodes lose all past observations when they change status. Therefore, churn will have a greater impact on our selection techniques because we rely on historical observations. In contrast, the conventional techniques suffer little from churn since they have no dependence on past observations. The virtual trace excludes objects for which the relevant servers are inactive. We tested three mean session lengths, s = 100, s = 1000, and s = 10000, corresponding to extreme, severe, and light churn rates, respectively.

Figure 13 illustrates the impact of churn. As mentioned, there is little impact on the conventional techniques, while our techniques are degraded in performance due to the loss of observations. In Figure 13(a), SELF is comparable to PROXIM even under extreme churn. NEIGHBOR degrades and becomes worse than PROXIM under severe churn (s = 1000). This is because NEIGHBOR is likely to fail to collect the relevant observations, thus relying more on random selection, while SELF can perform reasonably accurate estimation with only a dozen observations. Nonetheless, NEIGHBOR still works better than PROXIM under light churn (s = 10000), with lower overhead. Figure 13(b) explains why NEIGHBOR suffers under severe and extreme churn. In the figure, the neighbor estimation rate is the fraction of decisions that NEIGHBOR makes based on neighbor estimation rather than random selection. Under light churn, the neighbor estimation rate is still over 90%, but it drops to 60–70% under severe churn, implying that 30–40% of the decisions were made by random selection. Under extreme churn, the neighbor estimation rate drops below 10%, so NEIGHBOR essentially reduces to RANDOM.

To summarize, the proposed techniques are fairly stable under churn, in which nodes suffer from loss of observations. The results show that SELF is comparable to PROXIM even under extreme churn, while NEIGHBOR is comparable to PROXIM when churn is light.

I. Impact of Replication

In loosely-coupled distributed systems, replication is often used to disseminate objects to provide locality in data access as well as high availability. We investigate the impact of replication to see if the proposed techniques work consistently in replicated environments.

For this, we construct replicated environments in which same-sized objects in the traces are grouped according to the replication factor, and each object in a group is considered a replica. The virtual trace is then constructed based on the object groups. In detail, for all objects in a group, a randomly selected download time from each candidate is recorded in the virtual trace. The simulator then returns the download time according to the selected candidate and the replica server.

RANDOM works the same as in the non-replicated environment, using a random function to choose both a candidate and a replica server. PROXIM measures latencies from every candidate to every server and selects the pair with the smallest latency. SELF is similar to PROXIM: each candidate calculates the accessibility to each replica server and reports the best one. In the case of NEIGHBOR, the candidate gathers all the relevant information from its neighbors; if it finds more than one server, the NeighborEstim(o) function is performed against each server, and the best result is reported to the initiator. For both SELF and NEIGHBOR, the initiator finally selects the candidate with the best accessibility.
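A minimal sketch of the resulting selection flow for SELF and NEIGHBOR follows; the estimate(cand, srv) callback is an illustrative stand-in for either self estimation or NeighborEstim, returning a predicted download time (lower is better):

```python
def select_candidate(candidates, replica_servers, estimate):
    """Each candidate evaluates its accessibility (predicted download
    time) to every replica server and reports its best; the initiator
    then picks the candidate with the best report."""
    reports = {}
    for cand in candidates:
        # candidate side: keep the best replica server's estimate
        reports[cand] = min(estimate(cand, srv) for srv in replica_servers)
    # initiator side: pick the candidate with the best reported accessibility
    return min(reports, key=reports.get)
```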

Figure 14 shows performance changes across replication factors (Rf). One would expect the performance of all selection techniques to improve as the replication factor increases because of data locality, and the results agree with this expectation, as shown in Figure 14(b).


Fig. 13. Impact of churn (Mix, C=8, N=8): (a) mean download elapsed time vs. mean session length (Allocation=50k; OMNI, RANDOM, PROXIM, SELF, NEIGHBOR); (b) neighbor estimation rate over time (Allocation=100k; no churn and session lengths 100, 1000, 10000).

PROXIM's mean download time diminishes significantly (by nearly half) under replication, but it is still worse than that of the proposed techniques. SELF and NEIGHBOR outperform the conventional techniques across all replication factors. In Figure 14(a), we can see that SELF further reduces its ratio to optimal as the replication factor increases, while the others' ratios increase; NEIGHBOR widens its gap over the conventional techniques with increasing replication factor.

Next, we investigate the impact of churn in replicated environments. First, we fix the replication factor at 4 and observe the performance change over a set of mean session lengths. As can be seen in Figure 15(a), the results are fairly similar to those under churn in the non-replicated environment; however, SELF is slightly worse than PROXIM under extreme churn. NEIGHBOR is comparable to PROXIM under light churn but degrades under severe and extreme churn, as without replication. We then investigate performance sensitivity to the replication factor under light churn (i.e., s = 10000). As seen in Figure 15(b), SELF is much better than PROXIM across all replication factors, and NEIGHBOR is fairly comparable to PROXIM despite a greater chance of random selection.

To summarize, the proposed selection techniques consistently outperform the conventional techniques in replicated environments. The results under churn are fairly consistent with those without replication: SELF is comparable to PROXIM under severe churn, and NEIGHBOR is comparable to PROXIM under light churn.

V. RELATED WORK

A. Resource discovery and allocation

The work on resource discovery and allocation is closely related to ours. Condor [42] provides a matchmaking framework offering a stateless matching service. In [7], the authors presented decentralized matchmaking based on aggregation of resource information and CAN (Content Addressable Network) routing [24]. The CCOF (Cluster Computing on the Fly) project [8] seeks to harvest CPU cycles by using search methods in a peer-to-peer computing environment. SWORD [27] provides distributed resource discovery via a multi-attribute range search against a DHT on which the periodic measurements of each node are stored. All of these techniques focus on per-node characteristics of individual nodes, e.g., CPU, memory, disk space, and network interface. In contrast, our approach is concerned with pair-wise characteristics for resource selection, with consideration of their impact on end-to-end data access. SWORD [27] does provide network coordinates as location information by using Vivaldi [37], but latency alone is not sufficient for bandwidth-demanding applications.

B. Network performance estimation

Much research has been carried out on network performance estimation for selection problems over the past decade, employing a variety of estimators. Some work [35], [43]–[45] focused on estimating available bandwidth, the minimum available bandwidth of the links along a path. Predicting RTT [20], [37], [46]–[48] has also been studied extensively because it is a widely used metric in the Internet today. TCP throughput [49]–[51] is another frequently used metric for network performance. In this paper, we employ accessibility to quantitatively determine the data access capability between a pair of nodes.

Network performance estimation techniques fall into three classes with respect to their measurement methods: active probing, partial probing, and passive observing. Active probing injects a chain of packets to measure network performance. The resulting estimates are generally accurate because they reflect current network conditions, but active probing incurs both measurement delay and additional traffic overhead. Meanwhile, many latency prediction techniques use partial probing: they infer the latency of n² pairs from O(n) probes. However, as latency may not be directly correlated with network bandwidth, selections relying only on latency can mislead bandwidth-demanding applications. Passive techniques utilize previously collected observations for estimation. This approach is attractive because it makes timely predictions with little additional traffic. However, existing techniques are limited to topological regions (e.g., LANs or IP prefixes) for sharing observations. In contrast, our passive estimation technique, neighbor estimation, has no such topological or geographical constraints, and yields good accuracy, sufficient for the selection problems to which it has been applied.
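For concreteness, here is a minimal sketch of the kind of coordinate update used by partial-probing latency predictors such as Vivaldi [37] (simplified: a fixed adaptation step delta instead of Vivaldi's adaptive error weighting, and plain Euclidean coordinates with no height vector). After only O(n) probes per node, the latency between any of the n² pairs is predicted as the distance between coordinates:

```python
import math

def vivaldi_update(xi, xj, rtt, delta=0.25):
    """One simplified Vivaldi-style update: move node i's coordinate xi
    so that its Euclidean distance to node j's coordinate xj better
    matches the measured RTT.  Returns node i's updated coordinate."""
    dist = math.dist(xi, xj)
    error = rtt - dist  # positive: predicted too close, move apart
    if dist == 0:
        direction = [1.0] + [0.0] * (len(xi) - 1)  # arbitrary unit vector
    else:
        direction = [(a - b) / dist for a, b in zip(xi, xj)]
    return [a + delta * error * d for a, d in zip(xi, direction)]

# Predicted latency between two nodes is then simply math.dist(x_i, x_j).
```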

Fig. 14. Performance under replicated environments (Mix, C=8, N=8, Allocation=50k): (a) mean ratio to optimal and (b) mean download elapsed time, across replication factors 1, 2, 4, and 8 (RANDOM, PROXIM, SELF, NEIGHBOR; OMNI in (b)).

Fig. 15. Impact of churn under replication (Mix, C=8, N=8, Allocation=50k): (a) ratio to optimal vs. mean session length, with replication factor 4; (b) ratio to optimal vs. replication factor, under light churn (mean session length = 10000).

VI. CONCLUSION

Accessibility is a crucial concern for an increasing number of data-intensive applications in loosely-coupled distributed systems. Such applications require more sophisticated resource selection due to unpredictable bandwidth and connectivity. In this paper, we presented decentralized, scalable, and efficient resource selection techniques based on accessibility. Our techniques rely only on local, historical observations, keeping the network overhead tolerable. We showed that our estimation techniques are sufficiently accurate to provide a meaningful rank order of nodes based on their accessibility. Our techniques outperform conventional approaches and are reasonably close to the optimal selection. In particular, the self estimation-based selection approached 1.4 times optimal over time, and the neighbor estimation-based selection was within 2.6 times optimal with 16 neighbors, compared to a proximity-based selection that was over 3 times optimal. With respect to mean elapsed time, the self and neighbor estimation-based selections were 52% and 70% more efficient, respectively, than proximity-based selection. We also investigated how our techniques behave under node churn and showed that they work well in churn conditions in which nodes suffer from the loss of observations. Finally, we showed that our techniques consistently outperform conventional techniques in replicated environments.

In this work, we focused on the performance aspect of the accessibility metric. The next step is to capture availability as well as performance, taking dynamism into account. In addition, we plan to extend our work with system-wide dissemination of observations so that a node has more chances to find relevant observations for estimation. This is feasible because neighbor estimation imposes no topological or geographical constraints on utilizing observations from other nodes.

REFERENCES

[1] D. P. Anderson and G. Fedak, “The computational and storage potential of volunteer computing,” in Proceedings of CCGRID, 2006, pp. 73–80.

[2] A. Haeberlen, A. Mislove, and P. Druschel, “Glacier: Highly durable, decentralized storage despite massive correlated failures,” in Proceedings of NSDI, May 2005.

[3] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao, “Oceanstore: An architecture for global-scale persistent storage,” in Proceedings of ACM ASPLOS, November 2000.

[4] R. Bhagwan, K. Tati, Y.-C. Cheng, S. Savage, and G. M. Voelker, “Total recall: system support for automated availability management,” in Proceedings of NSDI, 2004, pp. 25–25.


[5] A. Chien, B. Calder, S. Elbert, and K. Bhatia, “Entropia: architecture and performance of an enterprise desktop grid system,” Journal of Parallel and Distributed Computing, vol. 63, no. 5, pp. 597–610, 2003.

[6] D. Kondo, A. A. Chien, and H. Casanova, “Resource management for rapid application turnaround on enterprise desktop grids,” in Proceedings of SC, 2004, p. 17.

[7] J.-S. Kim, B. Nam, P. Keleher, M. Marsh, B. Bhattacharjee, and A. Sussman, “Resource discovery techniques in distributed desktop grid environments,” in Proceedings of GRID, September 2006.

[8] D. Zhou and V. Lo, “Cluster computing on the fly: resource discovery in a cycle sharing peer-to-peer system,” in Proceedings of CCGRID, 2004, pp. 66–73.

[9] D. P. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer, “Seti@home: an experiment in public-resource computing,” Communications of the ACM, vol. 45, no. 11, pp. 56–61, 2002.

[10] “Search for extraterrestrial intelligence (SETI) project,” http://setiathome.berkeley.edu.

[11] “BOINC: Berkeley open infrastructure for network computing,” http://boinc.berkeley.edu/.

[12] “PPDG: Particle physics data grid,” http://www.ppdg.net.

[13] N. Massey, T. Aina, M. Allen, C. Christensen, D. Frame, D. Goodman, J. Kettleborough, A. Martin, S. Pascoe, and D. Stainforth, “Data access and analysis with distributed federated data servers in climateprediction.net,” Advances in Geosciences, vol. 8, pp. 49–56, Jun. 2006.

[14] G. B. Berriman, A. C. Laity, J. C. Good, J. C. Jacob, D. S. Katz, E. Deelman, G. Singh, M.-H. Su, and T. A. Prince, “Montage: The architecture and scientific applications of a national virtual observatory service for computing astronomical image mosaics,” in Proceedings of Earth Sciences Technology Conference, 2006.

[15] “BLAST: The basic local alignment search tool,” http://www.ncbi.nlm.nih.gov/blast.

[16] W. Hoschek, F. J. Jaen-Martinez, A. Samar, H. Stockinger, and K. Stockinger, “Data management in an international data grid project,” in Proceedings of GRID, 2000, pp. 77–90.

[17] Y.-M. Teo, X. Wang, and Y.-K. Ng, “Glad: a system for developing and deploying large-scale bioinformatics grid,” Bioinformatics, vol. 21, no. 6, pp. 794–802, 2005.

[18] S. Hotz, “Routing information organization to support scalable interdomain routing with heterogeneous path requirements,” Ph.D. dissertation, 1994.

[19] J. D. Guyton and M. F. Schwartz, “Locating nearby copies of replicated internet servers,” SIGCOMM Computer Communication Review, vol. 25, no. 4, pp. 288–298, 1995.

[20] E. Ng and H. Zhang, “Predicting internet network distance with coordinates-based approaches,” in Proceedings of IEEE INFOCOM, 2002, pp. 170–179.

[21] E. Cohen and S. Shenker, “Replication strategies in unstructured peer-to-peer networks,” in Proceedings of SIGCOMM, 2002, pp. 177–190.

[22] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker, “Search and replication in unstructured peer-to-peer networks,” in Proceedings of SIGMETRICS, 2002, pp. 258–259.

[23] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup service for internet applications,” in Proceedings of SIGCOMM, 2001, pp. 149–160.

[24] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker, “A scalable content-addressable network,” in Proceedings of SIGCOMM, 2001, pp. 161–172.

[25] A. Rowstron and P. Druschel, “Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems,” in IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), November 2001, pp. 329–350.

[26] B. Zhao, L. Huang, J. Stribling, S. Rhea, A. Joseph, and J. Kubiatowicz, “Tapestry: A resilient global-scale overlay for service deployment,” IEEE Journal on Selected Areas in Communications, 2003.

[27] D. Oppenheimer, J. Albrecht, D. Patterson, and A. Vahdat, “Design and implementation tradeoffs for wide-area resource discovery,” in Proceedings of HPDC, 2005.

[28] “PlanetLab,” http://www.planet-lab.org.

[29] S. G. Dykes, K. A. Robbins, and C. L. Jeffery, “An empirical evaluation of client-side server selection algorithms,” in Proceedings of INFOCOM, 2000, pp. 1361–1370.

[30] R. Wolski, “Experiences with predicting resource performance on-line in computational grid settings,” SIGMETRICS Performance Evaluation Review, vol. 30, no. 4, pp. 41–49, 2003.

[31] K. Lai and M. Baker, “Measuring bandwidth,” in Proceedings of IEEE INFOCOM, Mar. 1999.

[32] J. Padhye, V. Firoiu, D. F. Towsley, and J. F. Kurose, “Modeling TCP Reno performance: a simple model and its empirical validation,” IEEE/ACM Transactions on Networking, vol. 8, no. 2, pp. 133–145, 2000.

[33] O. B. Akan, “On the throughput analysis of rate-based and window-based congestion control schemes,” Computer Networks, vol. 44, no. 5, pp. 701–711, 2004.

[34] Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker, “Making gnutella-like p2p systems scalable,” in Proceedings of SIGCOMM, 2003, pp. 407–418.

[35] S. Seshan, M. Stemm, and R. H. Katz, “SPAND: Shared Passive Network Performance Discovery,” in Proceedings of the USENIX Symposium on Internet Technologies and Systems, Monterey, CA, Dec. 1997, pp. 135–146.

[36] M. Andrews, B. Shepherd, A. Srinivasan, P. Winkler, and F. Zane, “Clustering and server selection using passive monitoring,” in Proceedings of INFOCOM, 2002, pp. 1717–1725.

[37] F. Dabek, R. Cox, F. Kaashoek, and R. Morris, “Vivaldi: a decentralized network coordinate system,” in Proceedings of SIGCOMM, 2004, pp. 15–26.

[38] R. Zhang, C. Tang, Y. C. Hu, S. Fahmy, and X. Lin, “Impact of the inaccuracy of distance prediction algorithms on internet applications - an analytical and comparative study,” in Proceedings of INFOCOM, 2006.

[39] L. Tang and M. Crovella, “Virtual landmarks for the internet,” in Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement (IMC), 2003, pp. 143–152.

[40] “FreePastry,” http://freepastry.org.

[41] H. Yu, P. B. Gibbons, and S. Nath, “Availability of multi-object operations,” in Proceedings of NSDI, 2006.

[42] R. Raman, M. Livny, and M. Solomon, “Matchmaking: Distributed resource management for high throughput computing,” in Proceedings of HPDC, 1998, p. 140.

[43] A. B. Downey, “Using pathchar to estimate internet link characteristics,” in Proceedings of ACM SIGCOMM, 1999, pp. 241–250.

[44] S. Keshav, “Packet-pair flow control,” IEEE/ACM Transactions on Networking, 1995.

[45] R. L. Carter and M. Crovella, “Server selection using dynamic path characterization in wide-area networks,” in Proceedings of INFOCOM, 1997, pp. 1014–1021.

[46] P. Francis, S. Jamin, C. Jin, Y. Jin, D. Raz, Y. Shavitt, and L. Zhang, “Idmaps: a global internet host distance estimation service,” IEEE/ACM Transactions on Networking, vol. 9, no. 5, pp. 525–540, 2001.

[47] B. Wong, A. Slivkins, and E. G. Sirer, “Meridian: a lightweight network location service without virtual coordinates,” SIGCOMM Computer Communication Review, vol. 35, no. 4, pp. 85–96, 2005.

[48] M. Costa, M. Castro, A. Rowstron, and P. Key, “Pic: Practical internet coordinates for distance estimation,” in International Conference on Distributed Computing Systems, 2004.

[49] R. Wolski, “Dynamically forecasting network performance using the network weather service,” Cluster Computing, vol. 1, no. 1, pp. 119–132, 1998.

[50] Q. He, C. Dovrolis, and M. Ammar, “On the predictability of large transfer tcp throughput,” in Proceedings of SIGCOMM, 2005, pp. 145–156.

[51] “PlanetLab Iperf,” http://jabber.services.planet-lab.org/php/iperf.

Jinoh Kim received his BE and MS degrees in computer science and engineering from Inha University, Korea, in 1991 and 1994, respectively. From 1991 to 2005, he was with ETRI, Korea, where he worked on ATM management, IP over ATM, network security, and policy-based security management. His research interests include distributed computing (including peer-to-peer and high-performance computing), distributed systems, and network security and management. He is a member of the IEEE and the IEEE Computer Society.


Abhishek Chandra received his B.Tech. degree in Computer Science and Engineering from IIT Kanpur, India, in 1997, and M.S. and PhD degrees in Computer Science from the University of Massachusetts Amherst in 2000 and 2005, respectively. He is currently an Assistant Professor in the Department of Computer Science and Engineering at the University of Minnesota. His research interests are in the areas of Operating Systems, Distributed Systems, and Computer Networks. He received the NSF CAREER award in 2007, and his dissertation titled "Resource Allocation for Self-Managing Servers" was nominated for the ACM Dissertation Award in 2005. He is a member of ACM, IEEE, and USENIX.

Jon B. Weissman is a leading researcher in the area of high performance distributed computing. His involvement dates back to the influential Legion project at the University of Virginia during his Ph.D. He is currently an Associate Professor of Computer Science at the University of Minnesota, where he leads the Distributed Computing Systems Group. His current research interests are in Grid computing, distributed systems, high performance computing, resource management, reliability, and e-science applications. He works primarily at the boundary between applications and systems. He received his B.S. degree from Carnegie-Mellon University in 1984, and his M.S. and Ph.D. degrees from the University of Virginia in 1989 and 1995, respectively, all in computer science. He is a senior member of the IEEE and an awardee of the NSF CAREER Award (1995).