
Accelerating Large Scale Scientific Exploration through Data Diffusion

Ioan Raicu*, Yong Zhao*, Ian Foster#*+, Alex Szalay-
{iraicu,yongzh}@cs.uchicago.edu, [email protected], [email protected]

*Department of Computer Science, University of Chicago, IL, USA
+Computation Institute, University of Chicago & Argonne National Laboratory, USA
#Math & Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
-Department of Physics and Astronomy, The Johns Hopkins University, Baltimore, MD, USA

Abstract

Scientific and data-intensive applications often require exploratory analysis on large datasets, which is often carried out on large scale distributed resources where data locality is crucial to achieving high system throughput and performance. We propose a “data diffusion” approach that acquires resources for data analysis dynamically, schedules computations as close to data as possible, and replicates data in response to workloads. As demand increases, more resources are acquired and “cached” to allow faster response to subsequent requests; resources are released when demand drops. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on the application workloads and the performance characteristics of the underlying infrastructure. The data diffusion concept is reminiscent of cooperative Web caching and peer-to-peer storage systems. Other data-aware scheduling approaches assume static or dedicated resources, which can be expensive and inefficient if load varies significantly. The challenge in our approach is that we must co-allocate storage resources with computation resources in order to enable the efficient analysis of possibly terabytes of data without prior knowledge of the characteristics of application workloads. To explore data diffusion, we have developed Falkon, which provides dynamic acquisition and release of resources and the dispatch of analysis tasks to those resources. We have extended Falkon to allow the compute resources to cache data to local disks, and to perform task dispatch via a data-aware scheduler. The integration of Falkon and the Swift parallel programming system provides us with access to a large number of applications from astronomy, astrophysics, medicine, and other domains, with varying datasets, workloads, and analysis codes.

1 Overview

We want to support interactive analysis of large quantities of data, a requirement that arises, for example, in many scientific disciplines. Such analyses require turnaround measured in minutes or seconds. Achieving this performance can demand hundreds of computers to process what may be many terabytes of data. As applications scale, data sets grow, and the resources used increase, data locality will be crucial to the successful and efficient use of large scale distributed systems for many scientific and data-intensive applications [9].

One approach to delivering such performance, adopted, for example, by Google [5, 13], is to build large compute-storage farms dedicated to storing data and responding to user requests for processing. However, such approaches can be expensive (in terms of idle resources) if load varies significantly over the two dimensions of time and/or the data of interest.

Grid infrastructures allow for an alternative data diffusion approach, in which the resources required for data analysis are acquired dynamically, in response to demand. As request rates increase, more resources are acquired, either “locally,” if available, or “remotely” if not; the location only matters in terms of associated cost tradeoffs. Both data and applications can diffuse from low-cost archival or slower disk storage to newly acquired resources for processing. Acquired resources (computers and associated storage, and also data) can be “cached” for some time, thus allowing more rapid responses to subsequent requests. If demand drops, resources can be released, allowing their use for other purposes. In principle, this approach can provide the benefits of dedicated hardware without the associated high costs, depending crucially, of course, on the nature of application workloads and the performance characteristics of the grid infrastructures on which the system is deployed.

This data diffusion concept is reminiscent of cooperative Web caching [22] and peer-to-peer storage systems [20]. (Other data-aware scheduling approaches tend to assume static resources [1, 2, 3, 4, 10].) Also, others have performed simulations on data-aware scheduling [12]. But the devil is in the details, and the details are different in many respects. We need to allocate not just storage but also computing power. Users do not want to retrieve the data, but to analyze it. Datasets may be terabytes in size. Further complicating the situation is our limited knowledge of workloads.


In order to explore this concept and gain experience with its practical realization, we are developing a Fast and Light-weight tasK executiON framework (Falkon) [6, 14], which provides for dynamic acquisition and release of resources (“workers”) and the dispatch of analysis tasks to those workers. We describe here how Falkon has been extended to include data caching, thus enabling the management of tens of millions of files spanning potentially hundreds to thousands of storage resources. This paper presents the design of the data caching extensions to Falkon along with a performance analysis using both micro-benchmarks and a large scale astronomy application.

2 Motivation

Many applications are composed of pieces that compute on some data, and hence it is worthwhile to examine a real production Grid, namely the TeraGrid (TG) distributed infrastructure.

The TG is the world's largest, most comprehensive distributed cyber-infrastructure for open scientific research. Currently, TG resources include more than 250 teraflops of computing capability (4.8K nodes with over 14K processors) and more than 1 petabyte of spinning disk and 30 petabytes of archival data storage, with rapid access and retrieval over high-performance networks (10~30 Gb/s links). Of the 4.8K nodes in the TG, 4.2K nodes have local dedicated disks that are underutilized (both in terms of storage and I/O bandwidth). With a modest average disk size of 67GB per node, this totals 283TB of disk in addition to the 1PB of shared storage resources. The locally installed disks on most TG nodes have an aggregate theoretical peak throughput on the order of 3000 Gb/s, in contrast with the shared storage resources, which have an aggregate peak throughput on the order of 100 Gb/s.

We have experience with certain astronomy-specific data access patterns on the TG that show an order of magnitude difference in performance between processing data directly from local disk and accessing the data from shared storage resources (i.e., GPFS) [7, 8, 176]. With more than an order of magnitude performance improvement when using local disks, there is good motivation for middleware capable of harnessing this relatively idle storage resource transparently for large scale scientific applications.

It has been argued that data intensive applications cannot be executed efficiently in grid environments because of the high costs of data movement. But if data analysis workloads have internal locality of reference [177], then it can be feasible to acquire and use even remote resources, as high initial data movement costs can be offset by many subsequent data analysis operations performed on that data. We can imagine data diffusing over an increasing number of CPUs as demand increases, and then contracting as load reduces.

We envision “data diffusion” as a process in which data moves stochastically around the system, and in which different applications can reach a dynamic equilibrium. One can think of a thermodynamic analogy for an optimizing strategy, in terms of the energy required to move data around (“potential wells”) and a “temperature” representing random external perturbations (“job submissions”) and system failures. We propose exactly such a stochastic optimizer.

3 Hypothesis

The analysis of large datasets typically follows a split/merge pattern: an analysis query is split into independent tasks to be computed, after which the results from all the tasks are merged back into a single aggregated result. We define a workload to be a complex query that can be decomposed into simpler tasks, or a set of queries that together answer some broader analysis question. As the size of scientific data sets and the resources required for analysis increase, data locality becomes crucial to the efficient use of large scale distributed systems for scientific and data-intensive applications [9]. We hypothesize that significant performance improvements can be obtained in the analysis of large datasets by leveraging information about data analysis workloads rather than individual data analysis tasks. We believe it is feasible to allocate large-scale computational resources and caching storage resources that are relatively remote from the original data location, co-scheduled together to optimize the performance of entire data analysis workloads.
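
To illustrate the split/merge pattern concretely, the following sketch (our own illustration, not code from Falkon or Swift; the file names and the per-file operation are hypothetical) splits a query over a set of files into independent tasks and then merges the partial results into a single answer:

```python
from concurrent.futures import ThreadPoolExecutor

def split(query_files, files_per_task=4):
    """Split a query over many files into independent tasks (file batches)."""
    return [query_files[i:i + files_per_task]
            for i in range(0, len(query_files), files_per_task)]

def run_task(file_batch):
    """Stand-in for an analysis task: here we just 'measure' each file name."""
    return sum(len(name) for name in file_batch)

def merge(partial_results):
    """Merge the independent per-task results into one aggregated result."""
    return sum(partial_results)

if __name__ == "__main__":
    files = [f"sdss_field_{i}.fits" for i in range(10)]   # hypothetical file names
    tasks = split(files)
    with ThreadPoolExecutor() as pool:                     # tasks are independent
        partials = list(pool.map(run_task, tasks))
    print("aggregated result:", merge(partials))
```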

4 Related Work

Full-featured local resource managers (LRMs) such as Condor [24, 38], Portable Batch System (PBS) [39], Load Sharing Facility (LSF) [40], and SGE [41] support client specification of resource requirements, data staging, process migration, check-pointing, accounting, and daemon fault recovery. Falkon, in contrast, is not a full-featured LRM: it focuses on efficient task dispatch and thus can omit some of these features in order to streamline task submission. This narrow focus is possible because Falkon can rely on LRMs for certain functions (e.g., accounting) and clients for others (e.g., recovery, data staging).

The BOINC “volunteer computing” system [42, 43] has a similar architecture to that of Falkon. BOINC’s database-driven task dispatcher is estimated to be capable of dispatching 8.8M tasks per day to 400K workers. This estimate is based on extrapolating from smaller synthetic benchmarks of CPU and I/O overhead, on the task distributor only, for the execution of 100K tasks. By comparison, Falkon has been measured to execute 2M (trivial) tasks in two hours, and has scaled to 54K managed executors with similarly high throughput. This test, along with other throughput tests achieving 487 tasks/sec, suggests that Falkon can provide higher throughput than BOINC.

Multi-level scheduling has been applied at the OS level [27, 30] to provide faster scheduling for groups of tasks for a specific user or purpose by employing an overlay that does lightweight scheduling within a heavier-weight container of resources: e.g., threads within a process or pre-allocated thread group.

Frey et al. pioneered the application of this principle to clusters via their work on Condor “glide-ins” [32]. Requests to a batch scheduler (submitted, for example, via Globus GRAM4 [50]) create Condor “startd” processes, which then register with a Condor resource manager that runs independently of the batch scheduler. Others have also used this technique. For example, Mehta et al. [36] embed a Condor pool in a batch-scheduled cluster, while MyCluster [34] creates “personal clusters” running Condor or SGE. Such “virtual clusters” can be dedicated to a single workload. Thus, Singh et al. find, in a simulation study [35], a reduction of about 50% in completion time, due to reduction in queue wait time. However, because they rely on heavyweight schedulers to dispatch work to the virtual cluster, the per-task dispatch time remains high.

In a different space, Bresnahan et al. [48] describe a multi-level scheduling architecture specialized for the dynamic allocation of compute cluster bandwidth. A modified Globus GridFTP server varies the number of GridFTP data movers as server load changes.

Appleby et al. [46] were among several groups to explore dynamic resource provisioning within a data center. Ramakrishnan et al. [47] also address adaptive resource provisioning with a focus primarily on resource sharing and container level resource management. Our work differs in its focus on the provisioning of non-dedicated resources managed by LRMs.

Data Management: The Globus Toolkit includes two components (the Replica Location Service [126] and the Data Replication Service [127]) that can be used to build data management services for Grid environments. Several large projects (Mobius [130] and ATLAS [128]) have implemented their own data management systems to support their respective applications. Google also has its own data management system, BigTable, a distributed storage system for structured data [144]. Finally, GFarm is a Grid file system that supports high-performance distributed and parallel data computing [117]; Yamamoto also explores the use of GFarm for petascale data-intensive computing [118].

Coupling Data and Compute Management: Data management on its own is useful, but not as useful as it could be if it were coupled with compute resource management as well. Ranganathan et al. performed simulation studies of computation and data scheduling algorithms for data Grids [125]. The GFarm team implemented a data-aware scheduler in GFarm using the LSF scheduler plugin mechanism [2]. Finally, from Google, the combination of BigTable, the Google File System (GFS) [5], and MapReduce [146] yields a system that could potentially have all the advantages of our proposed data-centric task farms, in which computations and data are co-located on the same resources. Although both Google and GFarm address the coupling of data and computational resources, they both assume a relatively static set of resources, one that is not commonly found in today’s Grids. In our work, we further extend this fusion of data and compute resource management by also enabling dynamic resource provisioning, which essentially allows our solution to operate over today’s batch-based Grids. It is worth noting that both Google’s approach and GFarm generally require dedicated resources, which makes the problem more tractable than in our setting, where resources are shared among many users through batch-scheduled systems and are leased for finite periods of time.

4.1 Astronomy Applications

Other work is underway to apply TeraGrid resources to explore and analyze the large datasets being federated within the NSF National Virtual Observatory (NVO) [102, 105]. For example, Montage [103] is a portable, compute-intensive, custom astronomical image mosaic service that, like AstroPortal, makes extensive use of grid computing. Our work is distinguished by its focus on the “stacking” problem and by its architecture that allows for flexible and dynamic provisioning of data and computing resources.

HPDC related work

As opposed to traditional solutions to this problem (i.e. distributed file systems such as GFS [cite], GFarm [cite], dCache [cite], and GPFS [cite]) that are typically more rigid and need to be statically installed and configured, we have opted for a more dynamic application-level solution.


4.2 Background Information: The Bigger Picture

Some of our recent work involving resource provisioning [cite] and task dispatching [cite] addresses some of the issues mentioned in the overview section. This section describes our ongoing work and how 3DcacheGrid is a key missing piece in creating a highly optimized system for the execution of data-intensive applications.

Falkon: To enable the rapid execution of many tasks on compute clusters, we have developed Falkon, a Fast and Light-weight tasK executiON framework. Falkon uses (1) multi-level scheduling to separate resource acquisition (via, e.g., requests to batch schedulers) from task dispatch, and (2) a streamlined dispatcher. Multi-level scheduling, introduced in operating systems research in the 1990s, has been applied to clusters by the Condor [cite] team and others, while streamlined dispatchers are found in, e.g., BOINC [cite]. Falkon’s integration of the two techniques delivers performance not provided by any other system. Microbenchmarks show that Falkon throughput (440 tasks/sec) and scalability (54,000 executors, 2,000,000 queued tasks) are one to two orders of magnitude better than other resource managers [cite]. We have separated the two tasks of provisioning and scheduling within Falkon. We have defined and implemented a dynamic resource provisioning (DRP) component that has various allocation and de-allocation policies, and can adaptively provision resources for Falkon even in light of varying workloads. Microbenchmarks show that DRP can allocate resources on the order of tens of seconds across multiple Grid sites and can reduce average queue wait times by up to 95% (effectively yielding queue wait times within 3% of ideal) [cite]. The proposed 3DcacheGrid system fits nicely within the Falkon architecture to handle data management for applications transparently, as well as to expose the necessary information to the Falkon scheduler to make it data-aware.

Applications: Prior to the work presented in this paper, we had mostly left data management issues to shared file systems, as both Swift and Falkon used to depend on a shared file system within each Grid site. This has worked very well for both synthetic experimentation and for non-data-intensive applications. However, there are data-intensive applications that we have not been able to scale to large numbers of processors due to their large datasets and particular data access patterns (many small random I/O reads/writes) on the shared file system. Applications executed by the Swift parallel programming system [cite] see end-to-end run time reductions of up to 90% for large-scale astronomy and medical applications, relative to versions that execute tasks via separate scheduler submissions. The astronomy applications include image stacking services (i.e. AstroPortal [cite]) and mosaic services (i.e. Montage [cite]); the datasets (digital imaging data) used by these applications are SDSS [11], GSC-II [181], 2MASS [182], and POSS-II [183]. These datasets are generally large (multiple terabytes) and contain many objects (100 million+) separated into many files (1 million+). Another application set we identified is from the astrophysics domain and would utilize simulation data (as opposed to image data from the astronomy applications) from the Flash dataset [cite] from the ASC / Alliances Center for Astrophysical Thermonuclear Flashes. The applications that we have identified are volume rendering and vector visualization. The dataset is composed of 32 million files (1000 time steps times 32K files) taking up about 15TB of storage resources. These datasets contain both temporal and spatial locality, which should offer ample optimization opportunities within the proposed data management system.


5 Architecture

5.1 Falkon [6]

To enable the rapid execution of many tasks on distributed resources, Falkon combines (1) multi-level scheduling [15, 16] to separate resource acquisition (i.e. requests to batch schedulers) from task dispatch, and (2) a streamlined dispatcher to achieve one to two orders of magnitude higher throughput (487 tasks/sec) and scalability (54K executors, 2M queued tasks) than other resource managers [6].

The Falkon architecture comprises a set of (dynamically allocated) executors that cache and analyze data; a dynamic resource provisioner (DRP) that manages the creation and deletion of executors; and a dispatcher that dispatches each incoming task to an executor. The provisioner uses tunable allocation and de-allocation policies to provision resources adaptively. Microbenchmarks show that DRP can allocate resources in tens of seconds across multiple Grid sites and reduce average queue wait times by up to 95% (effectively yielding queue wait times within 3% of ideal) [14].

Prior to the work presented in this paper, we have assumed that each task scheduled by Falkon accesses its input and output data at remote locations (i.e., a shared file system, GridFTP server, web server, etc.). This strategy works surprisingly well in many cases, but of course cannot scale to more data-intensive applications. Examples of applications in which problems arise include image stacking [7, 8] and mosaic services [17, 18] in astronomy, which access digital image datasets that are typically large (multiple terabytes) and contain many objects (100M+) stored in many files (1M+).

Finally, Falkon support in the Swift parallel programming system [19] allows Swift applications to run over Falkon without modification.

5.1.1 Falkon Architecture

Falkon consists of a dispatcher, a provisioner, and zero or more executors (Figure 1). Figure 2 shows the series of message exchanges that occur between the various Falkon components. As we describe the architecture and the components' interaction, we denote the message numbers from Figure 2 in curly braces; some messages have two numbers, denoting both a send and a receive, while others have only a single number, denoting a simple send.

The dispatcher accepts tasks from clients and implements the dispatch policy. The provisioner implements the resource acquisition policy. Executors run tasks received from the dispatcher. Components communicate via Web Services (WS) messages (solid lines in Figure 2), except that notifications are performed via a custom TCP-based protocol (dotted lines).

Figure 1: Falkon architecture overview

Figure 2: Falkon components and message exchange

The dispatcher implements the factory/instance pattern, providing a create instance operation to allow a clean separation among different clients. To access the dispatcher, a client first requests creation of a new instance, which returns a unique endpoint reference (EPR). The client then uses that EPR to submit tasks {1,2}, monitor progress (or wait for notifications {8}), retrieve results {9,10}, and (finally) destroy the instance.

A client “submit” request takes an array of tasks, each with a working directory, command to execute, arguments, and environment variables. It returns an array of outputs, each with the task that was run, its return code, and optional output strings (STDOUT and STDERR contents). A shared notification engine among all the different queues is used to notify executors that work is available for pick up. This engine maintains a queue, on which a pool of threads operate to send out notifications. The GT4 container also has a pool of threads that handle WS messages. Profiling shows that most dispatcher time is spent communicating (WS calls, notifications). Increasing the number of threads should allow the service to scale effectively on newer multicore and multiprocessor systems.
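
As a rough illustration of this factory/instance interaction (the actual interface is a GT4 Web Service; the class and method names below are hypothetical in-memory stand-ins), a client session proceeds roughly as follows:

```python
import uuid

class MockDispatcher:
    """In-memory stand-in for the Falkon dispatcher factory (illustrative only)."""
    def __init__(self):
        self.instances = {}

    def create_instance(self):
        # Factory pattern: each client gets its own instance, identified by an EPR.
        epr = str(uuid.uuid4())
        self.instances[epr] = []
        return epr

    def submit(self, epr, tasks):
        # A submit request carries an array of task descriptions.
        self.instances[epr].extend(tasks)

    def results(self, epr):
        # Each output pairs the task with a return code and captured stdout/stderr.
        return [{"task": t, "exit_code": 0, "stdout": "", "stderr": ""}
                for t in self.instances[epr]]

    def destroy_instance(self, epr):
        del self.instances[epr]

dispatcher = MockDispatcher()
epr = dispatcher.create_instance()
dispatcher.submit(epr, [{"dir": "/tmp", "cmd": "echo", "args": ["hello"], "env": {}}])
print(dispatcher.results(epr))
dispatcher.destroy_instance(epr)
```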

The dispatcher runs within a Globus Toolkit 4 (GT4) [51] WS container, which provides authentication, message integrity, and message encryption mechanisms, via transport-level, conversation-level, or message-level security [52].

The provisioner is responsible for creating and destroying executors. It is initialized by the dispatcher with information about the state to be monitored and how to access it; the rule(s) under which the provisioner should create/destroy executors; the location of the executor code; bounds on the number of executors to be created; bounds on the time for which executors should be created; and the allowed idle time before executors are destroyed.

The provisioner periodically monitors dispatcher state {POLL} and, based on policy, determines whether to create additional executors, and if so, how many, and for how long. Creation requests are issued via GRAM4 [50] to abstract LRM details.

A new executor registers with the dispatcher. Work is then supplied as follows: the dispatcher notifies the executor when work is available {3}; the executor requests work {4}; the dispatcher returns the task(s) {5}; the executor executes the supplied task(s) and returns results, including return code and optional standard output/error strings {6}; and the dispatcher acknowledges delivery {7}.

5.1.2 Falkon Execution Model

Each task is dispatched to a computational resource, selected according to the dispatch policy. If a response is not received after a time determined by the replay policy, or a failed response is received, the task is re-dispatched according to the dispatch policy (up to some specified number of retries). The resource acquisition policy determines when and for how long to acquire new resources, and how many resources to acquire. The resource release policy determines when to release resources.
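
A minimal sketch of this replay behavior, assuming a next-available dispatch policy and a bounded number of retries (the function and the toy run() hook are illustrative, not Falkon's actual code):

```python
import queue

def dispatch_with_retries(task, executors, run, max_retries=3):
    """Sketch of the replay policy: re-dispatch a failed or timed-out task,
    choosing the next available executor each time (next-available policy)."""
    for attempt in range(max_retries + 1):
        executor = executors.get()            # next available executor
        try:
            ok, result = run(executor, task)  # returns (succeeded?, result)
        finally:
            executors.put(executor)           # executor becomes available again
        if ok:
            return result
    raise RuntimeError(f"task {task!r} failed after {max_retries} retries")

# Toy usage: two 'executors', a run() that fails once and then succeeds.
executors = queue.Queue()
for name in ("exec-1", "exec-2"):
    executors.put(name)
failures = {"count": 1}
def run(executor, task):
    if failures["count"] > 0:
        failures["count"] -= 1
        return False, None
    return True, f"{task} done on {executor}"
print(dispatch_with_retries("stack-image-42", executors, run))
```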

Dispatch policy. We consider here a next-available policy, which dispatches each task to the next available resource. We assume here that all data needed by a task is available in a shared file system.

Resource acquisition policy. This policy determines the number of resources, n, to acquire; the length of time for which resources should be requested; and the request(s) to generate to LRM(s) to acquire those resources. We have implemented five strategies that variously generate a single request for n resources, n requests for a single resource, or a series of arithmetically or exponentially larger requests, or that use system functions to determine available resources. Due to space restrictions, in the experiments reported in this paper, we consider only the first policy (“all-at-once”), which allocates all needed resources in a single request.
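
The following sketch shows one way the request sizes could be computed under the strategies named above (all-at-once, one-at-a-time, arithmetically or exponentially increasing); it reflects our reading of these policies, and the exact rules in Falkon may differ:

```python
def allocation_requests(n, strategy):
    """Return the list of LRM request sizes used to acquire n resources
    under each strategy (our interpretation; Falkon's exact rules may differ)."""
    if strategy == "all-at-once":          # one request for all n resources
        return [n]
    if strategy == "one-at-a-time":        # n requests for a single resource each
        return [1] * n
    requests, step, total = [], 1, 0
    while total < n:
        size = min(step, n - total)
        requests.append(size)
        total += size
        step = step + 1 if strategy == "arithmetic" else step * 2  # exponential doubles
    return requests

print(allocation_requests(10, "all-at-once"))   # [10]
print(allocation_requests(10, "arithmetic"))    # [1, 2, 3, 4]
print(allocation_requests(10, "exponential"))   # [1, 2, 4, 3]
```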

Resource release policy. We distinguish between centralized and distributed resource release policies. In a centralized policy, decisions are made based on state information available at a central location. For example: “if there are no queued tasks, release all resources” or “if the number of queued tasks is less than q, release a resource.” In a distributed policy, decisions are made at individual resources based on state information available at the resource. For example: “if the resource has been idle for time t, the resource should release itself.” Note that resource acquisition and release policies are typically not independent: in most batch schedulers, a set of resources allocated in a single request must all be de-allocated before the requested resources become free and ready to be used by the next allocation. Ideally, one must release all resources obtained in a single request at once, which requires a certain level of synchronization among the resources allocated within a single allocation. In the experiments reported in this paper, we used a distributed policy, releasing individual resources after a specified idle time was reached. In the future, we plan to improve our distributed policy by coordinating between all the resources allocated in a single request to de-allocate all at the same time.
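
A minimal sketch of the distributed release policy used in our experiments, assuming each executor simply tracks its own idle time and releases itself once a configured threshold is exceeded (the class and attribute names are illustrative):

```python
import time

class Executor:
    """Minimal sketch of the distributed release policy: an executor releases
    itself after being idle longer than a configured threshold."""
    def __init__(self, idle_limit_s):
        self.idle_limit_s = idle_limit_s
        self.last_work = time.monotonic()
        self.released = False

    def on_task(self):
        self.last_work = time.monotonic()   # any dispatched task resets the idle clock

    def check_idle(self):
        idle = time.monotonic() - self.last_work
        if idle > self.idle_limit_s and not self.released:
            self.released = True            # in Falkon this would deregister and exit
        return self.released

executor = Executor(idle_limit_s=0.1)
time.sleep(0.2)
print("released:", executor.check_idle())   # True: idle longer than the limit
```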

5.2 Enhancing Falkon with Data Diffusion

The primary goal of data diffusion is to allow data-intensive applications to achieve a clean separation of concerns between their core logic and the complicated task of managing large data sets, while improving disk subsystem utilization and ultimately end-to-end application performance. We found that storage resource latency and bandwidth can significantly impact data-intensive application performance; we propose to utilize a tiered hierarchy of storage resources to automatically cache data in faster and smaller storage tiers, improving the performance of applications that exhibit locality in their data access patterns.

We introduce data diffusion by incorporating data caches in executors and data location-aware task scheduling algorithms in the dispatcher. In our initial implementation, individual executors manage their own cache content, using local cache eviction policies, and communicate changes to cache content to the dispatcher. The dispatcher can then associate with each task sent to an executor information about where to find non-cached data. In the future, we will also analyze and (if analysis results are encouraging) experiment with alternative approaches, such as decentralized indices and centralized management of cache content.

5.2.1 Data Diffusion Architecture

We define and implement a hybrid centralized/decentralized management system for data diffusion, in which there is both a centralized index, for ease of finding the needed data, and a decentralized set of indexes (one per storage resource) that manage the actual cache content and cache eviction policies for the corresponding storage resource. This hybrid architecture allows the data diffusion implementation to strike a good balance between low latency to the data and good scalability.

Data diffusion essentially handles the data indexing necessary for efficient data discovery and access, and decides the cache content (via the cache eviction policy). It offers efficient management of large datasets (in terms of number of files, dataset size, number of storage resources used, number of replicas, and the size of the queries performed), and at the same time offers performance improvements, through effective caching strategies, to applications that have data locality in their workloads and data access patterns.
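
A simplified sketch of the centralized side of this hybrid index, assuming the engine only needs to map each file to the set of executors currently caching it and to absorb add/evict reports from executors (the class and method names are hypothetical, not the actual 3DcacheGrid code):

```python
from collections import defaultdict

class CentralIndex:
    """Centralized file -> {executors that cache it} map, playing the role of
    the central engine in this sketch (illustrative only)."""
    def __init__(self):
        self.locations = defaultdict(set)

    def report_add(self, executor_id, filename):
        self.locations[filename].add(executor_id)      # executor cached a new file

    def report_evict(self, executor_id, filename):
        self.locations[filename].discard(executor_id)  # executor evicted the file

    def where(self, filename):
        # Empty set means: fetch from the globally accessible storage (e.g. GPFS/GridFTP).
        return set(self.locations[filename])

index = CentralIndex()
index.report_add("exec-7", "sdss_field_001.fits")
index.report_add("exec-3", "sdss_field_001.fits")
index.report_evict("exec-7", "sdss_field_001.fits")
print(index.where("sdss_field_001.fits"))              # {'exec-3'}
```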

Finally, we have integrated data diffusion into Falkon [cite] and have implemented a data-aware scheduler that minimizes the overall network traffic generated by placing computational tasks close to the data. In order to gain access to a large class of applications, we have added Falkon support to the Swift parallel programming system [cite], which ultimately allows Swift-enabled applications to transparently run over Falkon and the data diffusion machinery without any application modifications.

Figure 3 shows the architecture overview of Falkon with both the data management and the data-aware scheduler enabled. The 3DcacheGrid architecture is a hybrid one, both centralized (the 3DcacheGrid engine) and distributed (the executor index service on each compute node). The data-aware scheduler performs a centralized query to locate the data caches corresponding to particular tasks, after which each compute resource that receives tasks performs its own local lookups to locate the data caches. The actual management and reading/writing of a particular data cache entry is done directly on the corresponding storage resource in a completely distributed fashion. Each storage resource maintains its own cache eviction policy and has full control over its resources. Each storage resource has the ability to get data from globally accessible storage resources (i.e. a well defined dataset on a GridFTP server, or on a shared file system). Furthermore, each can get data from other storage resources that have a cached copy of the original data. Each executor learns where the needed cached data is located from the centralized 3DcacheGrid engine. It is worth noting that when the scheduling decision is made at the data-aware scheduler, for very little extra cost we can bundle the already computed data cache locations into the task description as it is dispatched to compute nodes. An alternative to this hybrid approach would have been to make the executor caches completely distributed, requiring any executor to find other caches via traditional distributed methods; while this approach might be more scalable, it would have imposed significantly higher latencies and likely would have negatively impacted the rate at which tasks could be processed.

Figure 3: Architecture overview of Falkon extended with the data diffusion (data management and data-aware scheduler)

5.2.2 Data Diffusion Execution Model

We assume that data are never modified after initial creation. Hence, we need not check whether cached data is stale. This assumption is valid for the Swift applications with which we work. We have implemented four well-known cache eviction policies [21]: Random, FIFO (First In First Out), LRU (Least Recently Used), and LFU (Least Frequently Used).
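
As an illustration of one of these policies, the sketch below implements an LRU cache bounded by total bytes and keyed by file name; it is a minimal stand-in for a per-executor cache, not Falkon's actual implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Per-executor cache bounded by total bytes; evicts the least recently used file."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.files = OrderedDict()           # filename -> size, oldest first

    def access(self, filename, size):
        if filename in self.files:           # cache hit: mark as most recently used
            self.files.move_to_end(filename)
            return True
        while self.used + size > self.capacity and self.files:
            _, evicted_size = self.files.popitem(last=False)   # evict the LRU file
            self.used -= evicted_size
        self.files[filename] = size          # cache miss: file is fetched and cached
        self.used += size
        return False

cache = LRUCache(capacity_bytes=20)
print(cache.access("a.fits", 10))   # False: miss, cached
print(cache.access("b.fits", 10))   # False: miss, cached
print(cache.access("a.fits", 10))   # True: hit, 'a' becomes most recent
print(cache.access("c.fits", 10))   # False: miss, evicts 'b' (least recently used)
```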

Data caching at executors implies the need for data-aware scheduling. We implement three policies: first-available, max-cache-hit, and max-compute-util.

The first-available policy ignores data location information when selecting an executor for a task; it simply chooses the next available executor. Thus, all data needed by a task must, with high probability, be transferred from the globally accessible storage resource, or another storage resource that has the data cached.

The max-cache-hit policy uses information about data location to dispatch each task to the executor that has the largest amount of the data needed by that task. If that executor is busy, task dispatch is delayed until the executor becomes available. This strategy can be expected to reduce data movement operations relative to first-available, but may lead to load imbalances, especially if data popularity is not uniform.

The max-compute-util policy also leverages data location information, but in a different way. It always sends a task to an available executor, but if several executors are available, it selects the one that has the most data needed by the task.

In all three cases, we pass with each task information about remote resources (workers or storage systems) on which missing data can be found. Thus, the executor can fetch needed data without further lookups.
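
The following sketch captures the essence of the three policies, given the set of files a task needs and each executor's cached files; it simplifies away queueing and tie-breaking, and the data structures are hypothetical:

```python
def choose_executor(task_files, executors, policy):
    """Pick an executor for a task under the three policies described above
    (simplified sketch; executors is a list of dicts with 'id', 'available', 'cached')."""
    available = [e for e in executors if e["available"]]
    overlap = lambda e: len(task_files & e["cached"])   # files already cached there

    if policy == "first-available":          # ignore data location entirely
        return available[0]["id"] if available else None
    if policy == "max-compute-util":         # never idle a CPU: best overlap among available
        return max(available, key=overlap)["id"] if available else None
    if policy == "max-cache-hit":            # best overlap overall; may wait for a busy node
        best = max(executors, key=overlap)
        return best["id"] if best["available"] else ("wait-for", best["id"])
    raise ValueError(policy)

executors = [
    {"id": "e1", "available": True,  "cached": {"f1"}},
    {"id": "e2", "available": False, "cached": {"f1", "f2", "f3"}},
    {"id": "e3", "available": True,  "cached": set()},
]
task = {"f1", "f2", "f3"}
print(choose_executor(task, executors, "first-available"))   # 'e1'
print(choose_executor(task, executors, "max-compute-util"))  # 'e1' (best among available)
print(choose_executor(task, executors, "max-cache-hit"))     # ('wait-for', 'e2')
```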

6 Performance Evaluation

6.1 Testbed Description

Table 1 lists the platforms used in the experiments. Latency between these systems was one to two milliseconds. We assume a one-to-one mapping between executors and nodes in all experiments. There were in total 162 nodes in TG_ANL_IA32 and TG_ANL_IA64, with around 100 nodes typically available for our experiments.

Table 1: Platform descriptions

Name          # of Nodes   Processors              Memory   Network
TG_ANL_IA32   98           Dual Xeon 2.4GHz        4GB      1Gb/s
TG_ANL_IA64   64           Dual Itanium 1.5GHz     4GB      1Gb/s
TP_UC_x64     122          Dual Opteron 2.2GHz     4GB      1Gb/s
UC_x64        1            Dual Xeon 3GHz w/ HT    2GB      100Mb/s
UC_IA32       1            Intel P4 2.4GHz         1GB      100Mb/s

6.2 Data-Aware Scheduler Performance

This section aims solely at evaluating the dispatcher's ability to handle lookups of data location information and to keep the data management state information up to date. In these experiments, we assume a dataset of 1.5M files (each 6MB in size) and a set of task requests, each of which requires F of these files, selected randomly according to a uniform distribution (this is representative of the SDSS astronomy dataset [11], which we also investigate further with some real application workloads). We assume that each executor has 100GB of local storage (representative of the TeraGrid [23] local storage resources on each compute node). We run the Falkon dispatcher using the max-compute-util policy, with local threads emulating the task receipt and cache update operations performed by workers; no actual computation or data transfer is performed in these experiments.
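
A sketch of how such a synthetic workload can be generated, assuming tasks simply draw F file names uniformly at random from the 1.5M-file dataset (the naming scheme is illustrative):

```python
import random

def synthetic_workload(num_tasks, files_per_task, num_files=1_500_000):
    """Generate tasks that each reference F files chosen uniformly at random
    from a 1.5M-file dataset (the ~6MB file size is implied, not modeled here)."""
    for _ in range(num_tasks):
        yield [f"file_{random.randrange(num_files):07d}" for _ in range(files_per_task)]

for task in synthetic_workload(num_tasks=3, files_per_task=4):
    print(task)
```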

Figure 4 shows the measured throughput in tasks per second. We see that the number of files per task that the dispatcher can handle while maintaining at least 100 tasks/sec ranges from around 512 (with a single storage resource) to 64 (for 32K storage resources). These results are encouraging in terms of achievable scheduling rates and data management overhead. Note that scheduling throughput drops dramatically (to less than 1 task/sec) as the number of files per task approaches and surpasses 1024 and the number of storage resources surpasses 8192.


Figure 4: Data-aware scheduler performance for max-compute-util; throughput (tasks/sec, log scale) as a function of the number of files per task (1–1024) and the number of storage resources (1–32768)

Figure 5 shows the performance of the various scheduling policies in terms of the cache hit rates achieved on a workload that is repeated four times. It is interesting to note that the first-available policy yields relatively poor cache hit rates beyond two nodes. The max-compute-util policy performs significantly better, with a slight and steady decline from 100% to 97% cache hit rates for up to 64 nodes. Finally, the max-cache-hit policy has the best cache hit ratio (100% in all cases), but at the expense of potentially poor load balancing for workloads whose access patterns are not uniformly distributed.

Figure 5: Cache hit performance of the 3 different scheduling policies (first-available, max-compute-util, max-cache-hit); cache hit rate (local disk) vs. number of nodes (1–64)

6.3 Modeling Local Disk and Shared File System Performance

In order to understand how well the proposed data diffusion works, we decided to model the shared file system (GPFS) performance in the ANL/UC TG cluster where we conducted all our experiments. The following graphs represent 160 different experiments, covering 19.8M files, transferring 3.68TB of data, and consuming 162.8 CPU hours; the majority of this time was spent measuring GPFS performance, but a small percentage was also spent measuring the local disk performance of a single node. The dataset we used was composed of 5.5M files making up 2.4TB of data. To eliminate as much variability or bias as possible from the results (introduced by Falkon itself), we wrote a simple program that takes a few parameters, such as the input list of files, the output directory, and the length of time to run the experiment (never repeating any files within an experiment); the program then randomizes the input files and runs the workload of reading, or reading+writing, the corresponding files in 32KB chunks (buffers larger than 32KB did not offer any improvement in read/write performance on our testbed). Experiments were ordered such that the same files would only be repeated after many other accesses, making the probability of those files being in any cache small.
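
The measurement program is only described at a high level above; its inner loop might look like the following sketch, which reads (or reads and copies) a file in 32KB chunks (the paths and sizes used in the usage example are placeholders):

```python
import os

CHUNK = 32 * 1024   # 32KB buffers; larger buffers gave no improvement on our testbed

def read_file(path):
    """Read a file sequentially in 32KB chunks; returns bytes read."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            total += len(chunk)
    return total

def read_write_file(src, dst):
    """Read a file and write a copy, both in 32KB chunks; returns bytes copied."""
    total = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while chunk := fin.read(CHUNK):
            fout.write(chunk)
            total += len(chunk)
    return total

if __name__ == "__main__":
    import tempfile
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(256 * 1024))          # small stand-in for a dataset file
    print("read bytes:", read_file(tmp.name))
    print("copied bytes:", read_write_file(tmp.name, tmp.name + ".copy"))
```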

Most graphs (unless otherwise noted) represent the GPFS read or read+write performance for 1 to 64 (1, 2, 4, 8, 16, 32, 64) concurrent nodes accessing files ranging from 1 byte to 1GB in size (1B, 1KB, 10KB, 100KB, 1MB, 10MB, 100MB, 1GB).

As a first comparison, we measured the read and read+write performance of one node as it performed the operations on both the local disk and the shared file system. Figure 6 shows that the local disk is more than twice as fast as the shared file system in all cases, a good motivation to favor local disk access whenever possible. Note that this gap would likely grow as the access patterns shift from few large I/O calls (as was the case in these experiments) to many small I/O calls (which is the case for many astronomy applications and datasets).

Figure 6: Comparing performance between the local disk and the shared file system GPFS from one node; throughput (Mb/s) vs. file size for local read, local read+write, GPFS read, and GPFS read+write

Notice that the GPFS read performance (Figure 7) tops out at 3420 Mb/s for large files, and that it can achieve 75% of its peak bandwidth with files as small as 1MB if there are enough nodes concurrently accessing GPFS. It is worth noting that the performance increase beyond 8 nodes is only apparent for small files; for large files, the difference is small (<6% improvement from 8 nodes to 64 nodes). This is because there are 8 I/O servers serving GPFS, and 8 nodes are enough to saturate the 8 I/O servers given large enough files.

Figure 7: Read performance for GPFS expressed in Mb/s; only the x-axis is logarithmic; 1-64 nodes for GPFS; 1B – 1GB files

The read+write performance (Figure 8) is lower than the read performance, topping out at 1123 Mb/s. Just as in the read experiment, there seems to be little gain from having more than 8 nodes concurrently accessing GPFS (with the exception of small files).

Figure 8: Read+write performance for GPFS expressed in Mb/s; only the x-axis is logarithmic; 1-64 nodes for GPFS; 1B – 1GB files

These final two graphs (Figure 9 and Figure 10) show the theoretical read and read+write throughput (measured in Mb/s) for local disk access. These results are theoretical, as they are simply a derivation of the 1-node performance, extrapolated linearly to additional nodes (2, 4, 8, 16, 32, 64) under the assumption that local disk accesses on different nodes are completely independent of each other. Notice that the read+write throughput approaches 25 Gb/s (up from about 1 Gb/s for GPFS) and the read throughput 76 Gb/s (up from 3.5 Gb/s for GPFS). This upper-bound potential is a strong motivation for applications to favor the use of local disk over shared disk, especially as applications scale beyond the size of the statically configured set of I/O servers servicing the shared file systems normally found in Grid environments.
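
The extrapolation itself is trivial; expressed as code, with illustrative single-node numbers of the same order as those reported for the 1-node case in Section 6.4 (the exact measured values differ):

```python
def extrapolated_local_throughput(single_node_mbps, num_nodes):
    """Linear model: local disks are assumed independent, so aggregate
    throughput is just the 1-node measurement times the node count."""
    return single_node_mbps * num_nodes

# Illustrative single-node figures (~1.1 Gb/s read, ~0.4 Gb/s read+write for
# large files); these are rough stand-ins, not the exact measured values.
for nodes in (1, 8, 64):
    print(nodes, "nodes:",
          extrapolated_local_throughput(1100, nodes), "Mb/s read,",
          extrapolated_local_throughput(400, nodes), "Mb/s read+write")
```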

Figure 9: Theoretical read performance of local disks expressed in Mb/s; only the x-axis is logarithmic; 1-64 nodes for local disk access; 1B – 1GB files

Figure 10: Theoretical read+write performance of local disks expressed in Mb/s; only the x-axis is logarithmic; 1-64 nodes for local disk access; 1B – 1GB files

6.4 Measuring the Data Diffusion Performance

The results in this section cover 672 different experiments, which took a total of 1016 CPU hours to complete. The experiments covered various configurations and were performed for a varying number of nodes (1, 2, 4, 8, 16, 32, 64) and varying file sizes (1B, 1KB, 10KB, 100KB, 1MB, 10MB, 100MB, 1GB). For all experiments (with the exception of the 100% data locality experiments), the data initially resided on the shared file system.

We defined 8 different configurations to test and compare, each of which had two variations (read and read+write). The 8 configurations are:

1. Model (local disk): performance of the local disk as described in the previous section (Figure 9 and Figure 10)

2. Model (shared file system): performance of the shared file system (GPFS) as described in the previous section (Figure 7 and Figure 8)

3. Falkon (next-available policy): performance of Falkon using next-available scheduling policy, which does not include any data management or data aware scheduling; the scheduler simply does load balancing across its available executors and performs its operations to and from the shared file system

4. Falkon (next-available policy) + Wrapper: this is the same as configuration (3) above, except that it performs all task executions through a wrapper similar to what is found in many applications; the wrapper script creates a temporary scratch directory on the shared file system, makes a symbolic link to the input file(s), performs the respective operation (read or read+write), and finally removes the temporary scratch directory from the shared file system along with any symbolic links

5. Falkon (first-available policy – 0% locality): performance of Falkon using first-available scheduling policy, which includes data management but not the data aware scheduling; the workload used did not repeat any files, and hence it produced 0% cache hits and all files were first read from the shared file system to populate the local disk caches, and then the operation was performed on the locally cached data

6. Falkon (first-available policy – 100% locality): this is the same as (5) above, with a new workload, essentially repeating the same workload from (5) four times; the caches are first populated (not as part of the timed experiment), and then the workload which repeats four times could yield cache hit rates as high as 100%

7. Falkon (max-compute-util policy – 0% locality): this is similar to (5), but uses the data-aware scheduler to send tasks to the executors that have the most data cached; the workload has 0% data locality and hence cache hit rates are 0%

8. Falkon (max-compute-util policy – 100% locality): this is similar to (7), but with the workload repeated four times which could yield cache hit rates as high as 100%

The summary of all the experiments can be found in Figure 11 (read) and Figure 12 (read+write), which show the throughput performance measured in Mb/s for large files (100MB) across the various configurations and a varying number of nodes. It is no surprise that configuration (8), max-compute-util, which uses a data-aware scheduler to dispatch tasks to nodes that already have some data cached on local disk, has the best performance, reaching 61.7 Gb/s with 64 nodes (~94% of the peak theoretical throughput as computed from the local disk model). Even the first-available policy, which does not use a data-aware scheduler and dispatches tasks randomly to any executor, performs better (~5.7 Gb/s) than the shared file system alone (~3.1 Gb/s).

Figure 11: Read throughput (Mb/s) for large files (100MB) for the 8 configurations for 1 – 64 nodes

The read+write performance is similarly good for the max-compute-util policy, yielding 22.7 Gb/s (~96% of the peak theoretical throughput), while without data-aware scheduling the throughput drops to 6.3 Gb/s, and when simply using the shared file system, the throughput is a mere 1 Gb/s split among the 64 nodes.


Figure 12: Read+write throughput (Mb/s) for large files (100MB) for the 8 configurations for 1 – 64 nodes

In addition to I/O throughput, it is also interesting to look at the overhead the various configurations impose on Falkon for workloads with many small files (1 byte in size). Figure 13 shows the throughput measured in tasks/sec for 64 nodes for the various configurations. Note that for workloads with no data access (i.e., sleep 0), Falkon can achieve 487 tasks/sec [6]. It is worth noting that the data management and the data-aware scheduler impact overall throughput by 5~20%, depending on the particular configuration and workload.

Figure 13: Scheduling and data management overhead (with 64 nodes); throughput (tasks/sec) for read and read+write workloads, from the no-data-access case through configurations (3)–(8)

It is also interesting to note that configuration (4), Falkon (next-available policy) + Wrapper, has relatively poor performance (20~21 tasks/sec). The limiting factor is that every task creates a directory on the shared file system, creates a symbolic link to its input data, and then removes the directory from the shared file system. While 21 tasks/sec is sufficient for applications that access large quantities of data in each invocation on a moderate pool of computational resources, it can be a significant bottleneck for large scale applications with relatively small data access patterns. Many applications that read and write files from many compute processors through a shared file system use such a wrapper to cleanly separate the data of different application invocations. Essentially, any metadata access or change on the shared file system can be prohibitively expensive for large scale applications. Our work moves much of this load from the shared file system to the local disks, at the cost of extra management that we have measured to scale significantly better than current approaches based on shared file systems.
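The wrapper pattern described above can be sketched as follows (an assumed structure for illustration, not the exact wrapper used in these experiments); note that every step, directory creation, symbolic linking, and directory removal, is a metadata operation against the shared file system, which explains the ~20 tasks/sec ceiling.

# Illustrative per-task wrapper sketch: a scratch directory on the shared
# file system, symbolic links to the inputs, run the task, then clean up.
import os, shutil, subprocess, tempfile

def run_with_wrapper(shared_fs_root, input_files, command):
    task_dir = tempfile.mkdtemp(dir=shared_fs_root)       # mkdir on shared FS
    try:
        for path in input_files:                           # one symlink per input
            os.symlink(path, os.path.join(task_dir, os.path.basename(path)))
        subprocess.run(command, cwd=task_dir, check=True)  # run the task
    finally:
        shutil.rmtree(task_dir)                            # rmdir on shared FS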

In the interest of focusing on the most interesting parts of these experiments, we investigate three portions of the results from Figure 11 and Figure 12: the 1-node, 8-node, and 64-node cases. The 1-node and 64-node cases are interesting because they are the extremes in our tests; the 8-node case is interesting because the shared file system we used in many of our experiments has 8 I/O nodes servicing requests.

Figure 14 and Figure 15 both show that for the 1-node case, the first-available and max-compute-util policies perform nearly as well as the local disk model when data locality is 100% (~1.1 Gb/s for read and ~0.4 Gb/s for read+write), while all the other cases perform close to the shared file system model (~0.5 Gb/s for read and ~0.2 Gb/s for read+write).


Figure 14: Read throughput (Mb/s) for different file sizes (1B to 1GB) for the 8 configurations for 1 node



Figure 15: Read+write throughput (Mb/s) for different file sizes (1B to 1GB) for the 8 configurations for 1 node

The 8-node experiments (Figure 16 and Figure 17) are interesting since the shared file system had 8 I/O nodes serving the data access. Notice that in this case only the max-compute-util policy with 100% data locality performs better than the shared file system (~8 Gb/s compared with ~3 Gb/s); when data locality is 0%, the throughput is at best ~2 Gb/s.


Figure 16: Read throughput (Mb/s) for different file sizes (1B to 1GB) for the 8 configurations for 8 nodes


Figure 17: Read+write throughput (Mb/s) for different file sizes (1B to 1GB) for the 8 configurations for 8 nodes

As the number of concurrent nodes increases, we measured larger performance improvements from both the first-available policy (~5.7 Gb/s) and the max-compute-util policy with 100% data locality (~61.7 Gb/s), compared with the shared file system performance (~3.3 Gb/s).


Figure 18: Read throughput (Mb/s) for different file sizes (1B to 1GB) for the 8 configurations for 64 nodes



Figure 19: Read+write throughput (Mb/s) for different file sizes (1B to 1GB) for the 8 configurations for 64 nodes

Overall, the shared file system offers good performance for up to 8 concurrent nodes (mostly because there are 8 I/O nodes servicing GPFS); however, as the number of nodes requiring data access grows beyond 8, the various data management and data-aware scheduling configurations significantly outperform the shared file system. The improved performance can mostly be attributed to the linear increase in I/O bandwidth in proportion to compute power; however, without mechanisms to manage large datasets across many storage resources, these performance improvements will not be realized.
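A back-of-the-envelope check of this linear-scaling argument, using only numbers already quoted above (roughly 1 Gb/s of local-disk read bandwidth per node, ~3.3 Gb/s for the shared file system, and the measured 61.7 Gb/s at 64 nodes), could look like the following sketch; the per-node figure is an approximation.

# Aggregate read bandwidth under the local-disk model grows linearly with
# the number of nodes, while the shared file system stays roughly flat.
per_node_disk_gbps = 1.03      # approximate single-node local-disk read rate
nodes = 64
peak_local = per_node_disk_gbps * nodes          # linear scaling with nodes
shared_fs_gbps = 3.3                             # roughly flat beyond 8 nodes
measured = 61.7                                  # max-compute-util, 100% locality
print(f"model peak: {peak_local:.1f} Gb/s, measured: {measured} Gb/s "
      f"({measured / peak_local:.0%} of peak), shared FS: {shared_fs_gbps} Gb/s")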

This section showed the potential performance improvement of our proposed data diffusion, which has been integrated into Falkon to allow applications that use Falkon to scale and perform better than they would on shared file systems. The following section investigates the performance of an astronomy application using the data diffusion mechanisms in Falkon; it also discusses the integration of Falkon as a resource provider in Swift, and the broad range of applications suitable for testing and evaluating the proposed data diffusion.

7 Applications

7.1 Image Stacking in Astronomy Datasets

The first analysis that we have implemented in our AstroPortal prototype is “stacking” of image cutouts from different parts of the sky. This function can help to statistically detect objects that are otherwise too faint to be seen. Astronomical image collections usually cover an area of sky several times (in different wavebands, at different times, etc.). On the other hand, there are large differences in the sensitivities of different observations: objects detected in one band are often too faint to be seen in another survey. In such cases we would still like to see whether these objects can be detected, even if only in a statistical fashion. There has been growing interest in re-projecting each image to a common set of pixel planes and then stacking the images. Stacking improves the signal to noise ratio, and after co-adding a large number of images there is a detectable signal from which to measure the average brightness, shape, etc. of these objects. While this has been done manually for years for a small number of pointing fields, performing this task systematically over wide areas of sky has not yet been done. It is also expected that much fainter sources (e.g., unusual objects such as transients) can be detected in stacked images than in any individual image. AstroPortal gives the astronomy community a new tool to advance their research and opens doors to new opportunities.
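The core of the stacking operation can be illustrated with a toy example (not the AstroPortal implementation): assuming the cutouts have already been re-projected onto a common pixel grid, co-adding them averages away uncorrelated noise, so the signal-to-noise ratio of a faint source grows roughly with the square root of the number of images.

# Toy stacking example: a faint source (signal 0.2, noise sigma 1.0) placed
# at the same pixel in every cutout emerges once the cutouts are co-added.
import numpy as np

rng = np.random.default_rng(0)
n_images, size, signal = 100, 32, 0.2
cutouts = rng.normal(0.0, 1.0, (n_images, size, size))
cutouts[:, 16, 16] += signal                     # same sky position in each cutout

stack = cutouts.mean(axis=0)                     # co-add (average) the cutouts
print("single-image pixel S/N ~", signal / 1.0)
print("stacked pixel S/N ~", stack[16, 16] / (1.0 / np.sqrt(n_images)))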

7.1.1 Astronomy Datasets

Astronomy faces a data avalanche, with breakthroughs in telescope, detector, and computer technology allowing astronomical surveys to produce terabytes (TB) of images. Well-known large astronomy datasets that could potentially be used by AstroPortal include the Sloan Digital Sky Survey (SDSS) [11], the Guide Star Catalog II (GSC-II) [181], the Two Micron All Sky Survey (2MASS) [182], and the Palomar Observatory Sky Survey (POSS-II) [183]. Such astronomy datasets are generally large (TB+) and contain many objects (>200M). For example, SDSS has 300M objects in 10TB of data; GSC-II has 1000M objects with 8TB of data; 2MASS has 500M objects in 10TB of data; and POSS-II has 1000M objects in 3TB of data.

Figure 20: [Plot of the object count per file across the dataset files]


Figure 21:

We defined four different workloads from the SDSS DR4 dataset (320M objects): one uniform sample and three based on a Zipf distribution of object popularity, with alpha set to 0.1, 1.0, and 2.0.

Table 2: Workload Characteristics

Stacking Size    Object Distribution    Working Set Size    Working Set Size    Locality
(# of objects)                          (# of files)        (# of objects)      (accesses per file)
10000            ZIPF - alpha 0.1       1915                1924                5.22
10000            ZIPF - alpha 1.0       2755                2819                3.63
10000            ZIPF - alpha 2.0       6110                6259                1.64
10000            UNIFORM                9771                10000               1.02
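For reference, a workload of this shape could be generated along the following lines; this is a hypothetical sketch in which the catalog size, seed, and exact Zipf parameterization are illustrative assumptions, not the generator actually used to produce Table 2.

# Draw object IDs to stack, either uniformly or with rank-based (Zipf-like)
# popularity; the working set is the number of distinct objects drawn.
import numpy as np

def make_workload(catalog_size, n_stackings, alpha=None, seed=0):
    rng = np.random.default_rng(seed)
    if alpha is None:                              # UNIFORM workload
        return rng.integers(0, catalog_size, n_stackings)
    ranks = np.arange(1, catalog_size + 1)
    weights = ranks ** -float(alpha)               # rank-based popularity
    return rng.choice(catalog_size, n_stackings, p=weights / weights.sum())

ids = make_workload(catalog_size=500_000, n_stackings=10_000, alpha=1.0)
print("stackings:", len(ids), "distinct objects:", len(np.unique(ids)))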

7.1.2 Performance Evaluation

[Plot: stacking time (sec) by original dataset location (Web Server, Shared File System WAN, Shared File System LAN, Data Diffusion) for the ZIPF alpha 0.1, 1.0, 2.0 and UNIFORM workloads]

Figure 22: Stacking Performance Evaluation

[Plot: stacking time (sec) vs. object size (10x10, 100x100, 500x500) for Shared File System (LAN) and Data Diffusion]

Figure 23: Object Size Effect

[Plot: time (sec) of 1st and 2nd runs for Data Diffusion and Shared File System (LAN) under the UNIFORM and ZIPF 0.1, 1.0, 2.0 workloads]

Figure 24: Data Diffusion Effect

[Plot: instantaneous and average throughput (cutouts/sec) over time (sec) for the 10K-object UNIFORM workload, 1st and 2nd runs]

Figure 25: 10K*2 Stacking Workload


[Plot: instantaneous and average throughput (cutouts/sec) over time (sec) for the 100K-object UNIFORM workload]

Figure 26: 100K Stacking Workload

7.2 Swift

We have integrated Falkon into the Karajan [25, 19] workflow engine, which in turn is used by the Swift parallel programming system [19]. Thus, Karajan and Swift applications can use Falkon without modification. We observe reductions in end-to-end run time by as much as 90% when compared to traditional approaches in which applications use batch schedulers directly.

[Diagram components: SwiftScript and its compiler, abstract computation, virtual data catalog, specification and execution, execution engine (Karajan with the Swift runtime), scheduling, Swift runtime callouts, status reporting, provenance data and provenance collector, launchers on virtual nodes running application tasks (AppF1, AppF2) over files, and provisioning via the Falkon resource provisioner (including Amazon EC2)]

Figure 27: Swift Architecture

Swift has been applied to applications in the physical sciences, biological sciences, social sciences, humanities, computer science, and science education. Table 3 characterizes some applications in terms of the typical number of tasks and stages.

Table 3: Swift applications [19]; all could benefit from Falkon

Application                                        #Tasks/workflow    #Stages
ATLAS: High Energy Physics Event Simulation        500K               1
fMRI DBIC: AIRSN Image Processing                  100s               12
FOAM: Ocean/Atmosphere Model                       2000               3
GADU: Genomics                                     40K                4
HNL: fMRI Aphasia Study                            500                4
NVO/NASA: Photorealistic Montage/Morphology        1000s              16
QuarkNet/I2U2: Physics Science Education           10s                3 ~ 6
RadCAD: Radiology Classifier Training              1000s              5
SIDGrid: EEG Wavelet Processing, Gaze Analysis     100s               20
SDSS: Coadd, Cluster Search                        40K, 500K          2, 8
SDSS: Stacking, AstroPortal                        10Ks ~ 100Ks       2 ~ 4
MolDyn: Molecular Dynamics                         1Ks ~ 20Ks         8

We illustrate the distinctive dynamic features in Swift using an fMRI [44] analysis workflow from cognitive neuroscience, and a photorealistic montage application from the national virtual observatory project [18, 17].

We have run several applications through Swift over Falkon (without data diffusion): fMRI, Montage, MolDyn, and Econ. We will investigate the performance benefits of data diffusion on these applications (in addition to the image stacking astronomy application) for the final paper.

8 Conclusions

It has been argued that data intensive applications cannot be executed in grid environments because of the high costs of data movement. But if data analysis workloads have internal locality of reference, then it can be feasible to acquire and use even remote resources, as high initial data movement costs can be offset by many subsequent data analysis operations performed on that data. We can imagine data diffusing over an increasing number of CPUs as demand increases, and then contracting as load reduces.

We envision "data diffusion" as a process in which data is stochastically moving around in the system, and that different applications can reach a dynamic equilibrium this way. One can think of a thermodynamic analogy of an optimizing strategy, in terms of energy required to move data around (“potential wells”) and a "temperature" representing random external perturbations (“job submissions”) and system failures. This paper proposes exactly such a stochastic optimizer.

As a first step towards exploring this concept, we have extended our Falkon system to support basic data diffusion mechanisms. Next on our agenda is exploring the performance of these mechanisms with real applications and workloads. Then, we will likely start to explore more sophisticated algorithms. For example, what should we do when an executor is released? Should we simply discard cached data, should it be moved to another active executor, or should it be moved to some permanent and shared storage resource? The answer will presumably depend on workload and system characteristics.

9 Acknowledgements

This work was supported in part by the NASA Ames Research Center GSRP Grant Number NNA06CB89H and by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy, under Contract DE-AC02-06CH11357. We also thank TeraGrid and the Computation Institute at the University of Chicago for hosting the experiments reported in this paper.

10 References

[1] O. Tatebe, N. Soda, Y. Morita, S. Matsuoka, S. Sekiguchi. "Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing", proceedings of the 2004 Computing in High Energy and Nuclear Physics (CHEP04), September 2004.

[2] W. Xiaohui, W.W. Li, O. Tatebe, X. Gaochao, H. Liang, J. Jiubin. "Implementing data aware scheduling in Gfarm using LSF scheduler plugin mechanism", proceedings of 2005 International Conference on Grid Computing and Applications (GCA'05), pp.3-10, 2005.

[3] M. Ernst, P. Fuhrmann, M. Gasthuber, T. Mkrtchyan, C. Waldman. “dCache, a distributed data storage caching system,” CHEP 2001.

[4] P. Fuhrmann. “dCache, the commodity cache,” Twelfth NASA Goddard and Twenty First IEEE Conference on Mass Storage Systems and Technologies 2004.

[5] S. Ghemawat, H. Gobioff, S.T. Leung. “The Google file system,” Proceedings of the 19th ACM SOSP (Dec. 2003), pp. 29-43.

[6] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, M. Wilde. “Falkon: a Fast and Light-weight tasK executiON framework”, to appear at IEEE/ACM SC 2007.

[7] I. Raicu, I. Foster, A. Szalay. “Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets”, IEEE/ACM SC 2006.

[8] I. Raicu, I. Foster, A. Szalay, G. Turcu. “AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis”, TeraGrid Conference 2006, June 2006.

[9] A. Szalay, J. Bunn, J. Gray, I. Foster, I. Raicu. “The Importance of Data Locality in Distributed Computing Applications”, NSF Workflow Workshop 2006.

[10] F. Schmuck and R. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters," In Proc. of the First Conference on File and Storage Technologies (FAST), Jan. 2002.

[11] SDSS: Sloan Digital Sky Survey, http://www.sdss.org/

[12] K. Ranganathan, I. Foster, Simulation Studies of Computation and Data Scheduling Algorithms for Data Grids , Journal of Grid Computing, V1(1) 2003.

[13] J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.

[14] I. Raicu, C. Dumitrescu, I. Foster. Dynamic Resource Provisioning in Grid Environments, to appear at TeraGrid Conference 2007.

[15] G. Banga, P. Druschel, J.C. Mogul. “Resource Containers: A New Facility for Resource Management in Server Systems.” Symposium on Operating Systems Design and Implementation, 1999.

[16] J.A. Stankovic, K. Ramamritham, D. Niehaus, M. Humphrey, G. Wallace, “The Spring System: Integrated Support for Complex Real-Time Systems”, Real-Time Systems, May 1999, Vol 16, No. 2/3, pp. 97-125.

[17] J.C. Jacob, et al. ”The Montage Architecture for Grid-Enabled Science Processing of Large, Distributed Datasets.” Proceedings of the Earth Science Technology Conference 2004

[18] G.B. Berriman, et al. ”Montage: a Grid Enabled Engine for Delivering Custom Science-Grade Image Mosaics on Demand.” Proceedings of the SPIE Conference on Astronomical Telescopes and Instrumentation. 2004.

[19] Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. von Laszewski, I. Raicu, T. Stef-Praun, M. Wilde. “Swift: Fast, Reliable, Loosely Coupled Parallel Computation”, to appear at IEEE Workshop on Scientific Workflows 2007.

[20] R. Hasan, Z. Anwar, W. Yurcik, L. Brumbaugh, R. Campbell. “A Survey of Peer-to-Peer Storage Techniques for Distributed File Systems”, Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05) - Volume II - Volume 02, 2005.

[21] S. Podlipnig, L. Böszörmenyi. “A survey of Web cache replacement strategies”, ACM Computing Surveys (CSUR), Volume 35 , Issue 4 (December 2003), Pages: 374 – 398.

[22] R. Lancellotti, M. Colajanni, B. Ciciani, "A Scalable Architecture for Cooperative Web Caching", Proc. of Workshop in Web Engineering, Networking 2002, Pisa, May 2002.

[23] TeraGrid, http://www.teragrid.org/

[24] D. Thain, T. Tannenbaum, and M. Livny, “Distributed Computing in Practice: The Condor Experience” Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005.

[25] Swift Workflow System: www.ci.uchicago.edu/swift, 2007.

[26] Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. von Laszewski, I. Raicu, T. Stef-Praun, M. Wilde. “Swift: Fast, Reliable, Loosely Coupled Parallel Computation”, IEEE Workshop on Scientific Workflows 2007.

[27] I. Foster, J. Voeckler, M. Wilde, Y. Zhao. “Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation”, SSDBM 2002.

[28] J.-P Goux, S. Kulkarni, J.T. Linderoth, and M.E. Yoder, “An Enabling Framework for Master-Worker Applications on the Computational Grid,” IEEE International Symposium on High Performance Distributed Computing, 2000.

[29] I. Foster, C. Kesselman, S. Tuecke, “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”, International Journal of Supercomputer Applications, 15 (3). 200-222. 2001.

[30] G. Banga, P. Druschel, J.C. Mogul. “Resource Containers: A New Facility for Resource Management in Server Systems.” Symposium on Operating Systems Design and Implementation, 1999.

[31] J.A. Stankovic, K. Ramamritham, D. Niehaus, M. Humphrey, G. Wallace, “The Spring System: Integrated Support for Complex Real-Time Systems”, Real-Time Systems, May 1999, Vol 16, No. 2/3, pp. 97-125.

[32] J. Frey, T. Tannenbaum, I. Foster, M. Frey, S. Tuecke, “Condor-G: A Computation Management Agent for Multi-Institutional Grids,” Cluster Computing, 2002.

[33] G. Singh, C. Kesselman, E. Deelman, “Optimizing Grid-Based Workflow Execution.” Journal of Grid Computing, Volume 3(3-4), December 2005, pp. 201-219.

[34] E. Walker, J.P. Gardner, V. Litvin, E.L. Turner, “Creating Personal Adaptive Clusters for Managing Scientific Tasks in a Distributed Computing Environment”, Workshop on Challenges of Large Applications in Distributed Environments, 2006.

[35] G. Singh, C. Kesselman E. Deelman. “Performance Impact of Resource Provisioning on Workflows”, USC ISI Technical Report 2006.

[36] G. Mehta, C. Kesselman, E. Deelman. “Dynamic Deployment of VO-specific Schedulers on Managed Resources,” USC ISI Technical Report, 2006.

[37] D. Thain, T. Tannenbaum, and M. Livny, “Condor and the Grid", Grid Computing: Making The Global Infrastructure a Reality, John Wiley, 2003. ISBN: 0-470-85319-0.

[38] E. Robinson, D.J. DeWitt. “Turning Cluster Management into Data Management: A System Overview”, Conference on Innovative Data Systems Research, 2007.

[39] B. Bode, D.M. Halstead, R. Kendall, Z. Lei, W. Hall, D. Jackson. “The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters”, Usenix, 4th Annual Linux Showcase & Conference, 2000.

[40] S. Zhou. “LSF: Load sharing in large-scale heterogeneous distributed systems,” Workshop on Cluster Computing, 1992.

[41] W. Gentzsch, “Sun Grid Engine: Towards Creating a Compute Power Grid,” 1st International Symposium on Cluster Computing and the Grid, 2001.

[42] D.P. Anderson. “BOINC: A System for Public-Resource Computing and Storage.” 5th IEEE/ACM International Workshop on Grid Computing, 2004.

[43] D.P. Anderson, E. Korpela, R. Walton. “High-Performance Task Distribution for Volunteer Computing.” IEEE Conference on e-Science and Grid Technologies, 2005.

[44] The Functional Magnetic Resonance Imaging Data Center, http://www.fmridc.org/, 2007.

[45] G.B. Berriman, et al., “Montage: a Grid Enabled Engine for Delivering Custom Science-Grade Image Mosaics on Demand.” SPIE Conference on Astronomical Telescopes and Instrumentation. 2004.

[46] K. Appleby, S. Fakhouri, L. Fong, G. Goldszmidt, M. Kalantar, S. Krishnakumar, D. Pazel, J. Pershing, and B. Rochwerger, “Oceano - SLA Based Management of a Computing Utility,” 7th IFIP/IEEE International Symposium on Integrated Network Management, 2001.

[47] L. Ramakrishnan, L. Grit, A. Iamnitchi, D. Irwin, A. Yumerefendi, J. Chase. “Toward a Doctrine of Containment: Grid Hosting with Adaptive Resource Control,” IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC06), 2006.

[48] J. Bresnahan. “An Architecture for Dynamic Allocation of Compute Cluster Bandwidth”, MS Thesis, Department of Computer Science, University of Chicago, December 2006.

[49] C. Catlett, et al., “TeraGrid: Analysis of Organization, System Architecture, and Middleware Enabling New Types of Applications,” HPC 2006.

[50] M. Feller, I. Foster, and S. Martin. “GT4 GRAM: A Functionality and Performance Study”, TeraGrid Conference 2007.

[51] I. Foster, “Globus Toolkit Version 4: Software for Service-Oriented Systems,” Conference on Network and Parallel Computing, 2005.

[52] The Globus Security Team. “Globus Toolkit Version 4 Grid Security Infrastructure: A Standards Perspective,” Technical Report, Argonne National Laboratory, MCS, 2005.

[53] I. Raicu, I. Foster, A. Szalay. “Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets”, IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC06), 2006.

[54] I. Raicu, I. Foster, A. Szalay, G. Turcu. “AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis”, TeraGrid Conference 2006.

[55] J.C. Jacob, et al. “The Montage Architecture for Grid-Enabled Science Processing of Large, Distributed Datasets.” Earth Science Technology Conference 2004.

[56] E. Deelman, et al. “Pegasus: a Framework for Mapping Complex Scientific Workflows onto Distributed Systems”, Scientific Programming Journal, Vol 13(3), 2005, 219-237.

[57] T. Tannenbaum. “Condor RoadMap”, Condor Week 2007.

[58] K. Ranganathan, I. Foster, “Simulation Studies of Computation and Data Scheduling Algorithms for Data Grids”, Journal of Grid Computing, V1(1) 2003.

[59] A. Iosup, C. Dumitrescu, D.H.J. Epema, H. Li, L. Wolters, “How are Real Grids Used? The Analysis of Four Grid Traces and Its Implications”, IEEE/ACM International Conference on Grid Computing (Grid), 2006.

[60] A. Iosup, M. Jan, O. Sonmez, and D.H.J. Epema, “The Characteristics and Performance of Groups of Jobs in Grids”, EuroPar 2007.

[61] O. Tatebe, N. Soda, Y. Morita, S. Matsuoka, S. Sekiguchi. “Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing”, Computing in High Energy and Nuclear Physics (CHEP04), 2004.

[62] W. Xiaohui, W.W. Li, O. Tatebe, X. Gaochao, H. Liang, J. Jiubin. “Implementing data aware scheduling in Gfarm using LSF scheduler plugin mechanism”, International Conference on Grid Computing and Applications (GCA'05), pp.3-10, 2005.

[63] M. Ernst, P. Fuhrmann, M. Gasthuber, T. Mkrtchyan, C. Waldman. “dCache, a distributed data storage caching system,” Computing in High Energy and Nuclear Physics (CHEP01), 2001.

[64] P. Fuhrmann. “dCache, the commodity cache,” Twelfth NASA Goddard and Twenty First IEEE Conference on Mass Storage Systems and Technologies 2004.


[65] S. Ghemawat, H. Gobioff, S.T. Leung. “The Google file system,” 19th ACM SOSP, pp. 29-43, 2003.

[66] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, M. Wilde. “Falkon: a Fast and Light-weight tasK executiON framework”, IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC07), 2007.

[67] I. Raicu, I. Foster, A. Szalay. “Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets”, IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC06), 2006.

[68] I. Raicu, I. Foster, A. Szalay, G. Turcu. “AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis”, TeraGrid Conference 2006.

[69] A. Szalay, J. Bunn, J. Gray, I. Foster, I. Raicu. “The Importance of Data Locality in Distributed Computing Applications”, NSF Workflow Workshop 2006.

[70] F. Schmuck and R. Haskin, “GPFS: A Shared-Disk File System for Large Computing Clusters,” Conference on File and Storage Technologies (FAST), 2002.

[71] I. Raicu, C. Dumitrescu, I. Foster. “Dynamic Resource Provisioning in Grid Environments”, TeraGrid Conference 2007.

[72] J.C. Jacob, et al. “The Montage Architecture for Grid-Enabled Science Processing of Large, Distributed Datasets.” Earth Science Technology Conference 2004.

[73] Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. von Laszewski, I. Raicu, T. Stef-Praun, M. Wilde. “Swift: Fast, Reliable, Loosely Coupled Parallel Computation”, IEEE Workshop on Scientific Workflows 2007.

[74] R. Hasan, Z. Anwar, W. Yurcik, L. Brumbaugh, R. Campbell. “A Survey of Peer-to-Peer Storage Techniques for Distributed File Systems”, International Conference on Information Technology: Coding and Computing (ITCC'05) - Volume II - Volume 02, 2005.

[75] S. Podlipnig, L. Böszörmenyi. “A survey of Web cache replacement strategies”, ACM Computing Surveys (CSUR), Volume 35 , Issue 4, Pages: 374 – 398, 2003.

[76] R. Lancellotti, M. Colajanni, B. Ciciani, “A Scalable Architecture for Cooperative Web Caching”, Workshop in Web Engineering, Networking 2002, 2002.

[77] C. Catlett, et al., “TeraGrid: Analysis of Organization, System Architecture, and Middleware Enabling New Types of Applications,” HPC 2006.

[78] D. Thain, T. Tannenbaum, and M. Livny, “Distributed Computing in Practice: The Condor Experience” Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, 2005.

[79] Swift Workflow System: www.ci.uchicago.edu/swift

[80] I. Foster, J. Voeckler, M. Wilde, Y. Zhao. “Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation”, SSDBM 2002.

[81] J.-P Goux, S. Kulkarni, J.T. Linderoth, and M.E. Yoder, “An Enabling Framework for Master-Worker Applications on the Computational Grid,” IEEE International Symposium on High Performance Distributed Computing, 2000.

[82] I. Foster, C. Kesselman, S. Tuecke, “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”, International Journal of Supercomputer Applications, 15 (3). 200-222, 2001.

[83] J. Frey, T. Tannenbaum, I. Foster, M. Frey, S. Tuecke, “Condor-G: A Computation Management Agent for Multi-Institutional Grids,” Cluster Computing, vol. 5, pp. 237-246, 2002.

[84] E. Walker, J.P. Gardner, V. Litvin, E.L. Turner, “Creating Personal Adaptive Clusters for Managing Scientific Tasks in a Distributed Computing Environment”, Workshop on Challenges of Large Applications in Distributed Environments, 2006.

[85] G. Singh, C. Kesselman E. Deelman. “Performance Impact of Resource Provisioning on Workflows”, Technical Report, USC ISI, 2006.

[86] G. Mehta, C. Kesselman, E. Deelman. “Dynamic Deployment of VO-specific Schedulers on Managed Resources”, Technical Report, USC ISI, 2006.

[87] E. Robinson, D.J. DeWitt. “Turning Cluster Management into Data Management: A System Overview”, Conference on Innovative Data Systems Research, 2007.

[88] B. Bode, D.M. Halstead, R. Kendall, Z. Lei, W. Hall, D. Jackson. “The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters”, Usenix Linux Showcase & Conference, 2000.

[89] S. Zhou. “LSF: Load sharing in large-scale heterogeneous distributed systems,” Workshop on Cluster Computing, 1992.

[90] D.P. Anderson. “BOINC: A System for Public-Resource Computing and Storage.” IEEE/ACM International Workshop on Grid Computing. November 8, 2004.

[91] D.P. Anderson, E. Korpela, R. Walton. “High-Performance Task Distribution for Volunteer Computing.” IEEE International Conference on e-Science and Grid Technologies, 2005.

[92] The Functional Magnetic Resonance Imaging Data Center, http://www.fmridc.org/, 2007.

[93] K. Appleby, S. Fakhouri, L. Fong, G. Goldszmidt, M. Kalantar, S. Krishnakumar, D. Pazel, J. Pershing, and B. Rochwerger, “Oceano - SLA Based Management of a Computing Utility,” IFIP/IEEE International Symposium on Integrated Network Management, 2001.

[94] L. Ramakrishnan, L. Grit, A. Iamnitchi, D. Irwin, A. Yumerefendi, J. Chase. “Toward a Doctrine of Containment: Grid Hosting with Adaptive Resource Control,” IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC06), 2006.

[95] J. Bresnahan. “An Architecture for Dynamic Allocation of Compute Cluster Bandwidth”, MS Thesis, Department of Computer Science, University of Chicago, 2006.

[96] M. Feller, I. Foster, and S. Martin. “GT4 GRAM: A Functionality and Performance Study”, TeraGrid Conference 2007.

[97] The San Diego Supercomputer Center (SDSC) DataStar Log, March 2004 thru March 2005, http://www.cs.huji.ac.il/labs/parallel/workload/l_sdsc_ds/index.html.

[98] V. Nefedova, et al., “Molecular Dynamics (LDRD)”, 2007, http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ApplicationStatus.

[99] T. Stef, et al., “Moral Hazard”, 2007, http://www.ci.uchicago.edu/wiki/bin/view/SWFT/ApplicationStatus.

[100] ASC / Alliances Center for Astrophysical Thermonuclear Flashes, 2007, http://www.flash.uchicago.edu/website/home/.

[101] A. Szalay, J. Gray, A. Thakar, P. Kuntz, T. Malik, J. Raddick, C. Stoughton and J. Vandenberg, “The SDSS SkyServer - Public Access to the Sloan Digital Sky Server Data”, ACM SIGMOD, 2002.

[102] R. Williams, A. Connolly, J. Gardner. “The NSF National Virtual Observatory”, TeraGrid Utilization Proposal to NRAC, Multi-year, Large Research Collaboration, 2003.

[103] J.C. Jacob, D.S. Katz, T. Prince, G.B. Berriman, J.C. Good, A.C. Laity, E. Deelman, G. Singh, M.H. Su. “The Montage Architecture for Grid-Enabled Science Processing of Large, Distributed Datasets”, Earth Science Technology Conference, 2004.

[104] Sloan Digital Sky Survey / SkyServer, http://cas.sdss.org/astro/en/, 2007.

[105] US National Virtual Observatory (NVO), http://www.us-vo.org/index.cfm, 2007.

[106] MEDICUS (Medical Imaging and Computing for Unified Information Sharing): http://dev.globus.org/wiki/Incubator/MEDICUS, 2007.

[107] R. Buyya and M. Murshed. “GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing”, The Journal of Concurrency and Computation: Practice and Experience (CCPE), Volume 14, Issue 13-15, Wiley Press, 2002.

[108] A. Sulistio, C.S. Yeo, and R. Buyya. “A Taxonomy of Computer-based Simulations and its Mapping to Parallel and Distributed Systems Simulation Tools”, International Journal of Software: Practice and Experience, Volume 34, Issue 7, Pages: 653-673, Wiley Press, 2004.

[109] GridBus Team, “GridSim: A Grid Simulation Toolkit for Resource Modelling and Application Scheduling for Parallel and Distributed Computing”. The GridBus Project, Grid Computing and Distributed Systems (GRIDS) Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, 2007.

[110] V. Singh, J. Gray, A.R. Thakar, A.S. Szalay, J. Raddick, B. Boroski, S. Lebedeva, B. Yanny, “SkyServer Traffic Report – The First Five Years,” MSR-TR-2006-190, 2006.


[111] M. Danelutto, “Adaptive Task Farm Implementation Strategies”, Euromicro Conference on Parallel, Distributed and Network-based Processing, pp. 416-423, IEEE press, ISBN 0-7695-2083-9, 2004, A Coruna (E), 11-13 February 2004.

[112] E. Heymann, M.A. Senar, E. Luque, and M. Livny, “Adaptive Scheduling for Master-Worker Applications on the Computational Grid”, IEEE/ACM International Workshop on Grid Computing (GRID00), 2000.

[113] H. González-Vélez, “An Adaptive Skeletal Task Farm for Grids”, International Euro-Par 2005 Parallel Processing, pp. 401-410, 2005.

[114] H. Casanova, M. Kim, J.S. Plank, J.J. Dongarra, “Adaptive Scheduling for Task Farming with Grid Middleware”, International Euro-Par Conference, 1999.

[115] D. Petrou, G.A. Gibson, G.R. Ganger, “Scheduling Speculative Tasks in a Compute Farm”, IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC05), 2005.

[116] F.J.L. Reid, “Task farming on Blue Gene”, EEPC, Edinburgh University, 2006.

[117] O. Tatebe, N. Soda, Y. Morita, S. Matsuoka, S. Sekiguchi, “Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing”, Computing in High Energy and Nuclear Physics (CHEP04), 2004.

[118] O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, S. Sekiguchi, “Grid Datafarm Architecture for Petascale Data Intensive Computing”, IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002), pp.102-110, 2002

[119] N. Yamamoto, O. Tatebe, S. Sekiguchi, “Parallel and Distributed Astronomical Data Analysis on Grid Datafarm”, IEEE/ACM International Workshop on Grid Computing (Grid 2004), pp.461-466, 2004.

[120] W.H. Bell, D.G. Cameron, L. Capozza, A.P. Millar, K. Stockinger, and F. Zini. “OptorSim - A Grid Simulator for Studying Dynamic Data Replication Strategies”, International Journal of High Performance Computing Applications, 17(4), 2003.

[121] C. Nicholson, D.G. Cameron, A.T. Doyle, A.P. Millar, K. Stockinger, “Dynamic Data Replication in LCG 2008”, UK e-Science All Hands Conference 2006.

[122] A.T. Doyle, C. Nicholson. “Grid Data Management: Simulations of LCG 2008”, Computing in High Energy and Nuclear Physics (CHEP06), 2006.

[123] A. Sulistio, U. Cibej, B. Robic and R. Buyya. “A Tool for Modeling and Simulation of Data Grids with Integration of Data Storage, Replication and Analysis”, Technical Report, GRIDS-TR-2005-13, Grid Computing and Distributed Systems Laboratory, University of Melbourne, Australia, 2005.

[124] N. Muthuvelu, J. Liu, N.L. Soe, S. Venugopal, A. Sulistio and R. Buyya, “A Dynamic Job Grouping-Based Scheduling for Deploying Applications with Fine-Grained Tasks on Global Grids”, Australasian Workshop on Grid Computing and e-Research (AusGrid 2005), 2005.

[125] K. Ranganathan and I. Foster, “Simulation Studies of Computation and Data Scheduling Algorithms for Data Grids”, Journal of Grid Computing, V1(1) 2003.

[126] A.L. Chervenak, N. Palavalli, S. Bharathi, C. Kesselman, R. Schwartzkopf, “The Replica Location Service”, International Symposium on High Performance Distributed Computing Conference (HPDC-13), June 2004.

[127] A. Chervenak, R. Schuler. “The Data Replication Service”, Technical Report, USC ISI, 2006.

[128] M. Branco, “DonQuijote - Data Management for the ATLAS Automatic Production System”, Computing in High Energy and Nuclear Physics (CHEP04), 2004.

[129] D. Olson and J. Perl, “Grid Service Requirements for Interactive Analysis”, PPDG CS11 Report, September 2002.

[130] S. Langella, S. Hastings, S. Oster, T. Kurc, U. Catalyurek, J. Saltz. “A Distributed Data Management Middleware for Data-Driven Application Systems” IEEE International Conference on Cluster Computing, 2004.

[131] S. Hastings, S. Oster. “DPACS: A Distributed Scaleable Image Archival and Browsing Infrastructure”, Technical Report, The Ohio State University, Department of Biomedical Informatics, 2004.

[132] S. Hastings, S. Oster, S. Langella, T. Kurc, J. Saltz, “Grid PACS: A Grid-Enabled System for Management and Analysis of Large Image Datasets”, Advancing Practice, Instruction and Innovation Through Informatics (APIII), Frontiers in Oncology and Pathology Informatics, 2004.

[133] A. Chervenak, E. Deelman, C. Kesselman, B. Allcock, I. Foster, V. Nefedova, J. Lee, A. Sim, A. Shoshani, B. Drach, D. Williams, D. Middleton. “High-Performance Remote Access to Climate Simulation Data: A Challenge Problem for Data Grid Technologies”, Parallel Computing, Special issue: High performance computing with geographical data, Volume 29 , Issue 10 , pp 1335 – 1356, 2003.

[134] C. Giannella, H. Dutta, K. Borne, R. Wolff, H. Kargupta, “Distributed Data Mining for Astronomy Catalogs”, Workshop on Mining Scientific and Engineering Datasets, 2006.

[135] H. Dutta, C. Giannella, K. Borne and H. Kargupta, “Distributed Top-K Outlier Detection in Astronomy Catalogs using the DEMAC system”, SIAM International Conference on Data Mining, 2007.

[136] N. Chrisochoides, A. Fedorov, A. Kot, N. Archip, P. Black, O. Clatz, A. Golby, R. Kikinis, S.K. Warfield. “Toward Real-Time, Image Guided Neurosurgery Using Distributed and Grid Computing”, IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC06), 2006.

[137] S.G. Erberich, M.E. Bhandekar, A. Chervenak, R. Yordy, C. Larsen, M.D. Nelson, C. Kesselman. “Next Generation Enterprise PACS: Integration of Open Grid Service Architecture with Commercial PACS”, Educational Exhibit, Informatics-3131, RSNA Annual Meeting, 2006.

[138] S.G. Erberich, M. Bhandekar, A. Chervenak, M.D. Nelson, C. Kesselman, “DICOM grid interface service for clinical and research PACS: A globus toolkit web service for medical data grids”, International Journal of Computer Assistant Radiology and Surgery, 1:87-105; p100-102, Springer, 2006.

[139] S.G. Erberich, M. Dixit, A. Chervenak, V. Chen, M.D. Nelson, C. Kesselmann. “Grid Based Medical Image Workflow and Archiving for Research and Enterprise PACS Applications”, PACS and Imaging Informatics, SPIE Medical Imaging, 6145-32, 2006.

[140] S. Hastings, S. Oster, S. Langella, T.M. Kurc, T. Pan, U.V. Catalyurek, and J.H. Saltz, “A Grid-Based Image Archival and Analysis System”, Journal of the American Medical Informatics Association, 12:286-295, DOI 10.1197/jamia.M1698, 2005.

[141] M. Beynon, T.M. Kurc, U.V. Catalyurek, C. Chang, A. Sussman, J.H. Saltz. “Distributed Processing of Very Large Datasets with DataCutter”, Parallel Computing, Vol. 27, No. 11, pp. 1457-1478, 2001.

[142] D.L. Adams, K. Harrison, C.L. Tan. “DIAL: Distributed Interactive Analysis of Large Datasets”, Conference for Computing in High Energy and Nuclear Physics (CHEP 06), 2006.

[143] D.T. Liu, M.J. Franklin. “The Design of GridDB: A Data-Centric Overlay for the Scientific Grid”, VLDB04, pp. 600-611, 2004.

[144] F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R.E. Gruber. “Bigtable: A Distributed Storage System for Structured Data”, Symposium on Operating System Design and Implementation (OSDI'06), 2006.

[145] R. Pike, S. Dorward, R. Griesemer, S. Quinlan. “Interpreting the Data: Parallel Analysis with Sawzall”, Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure 13:4, pp. 227-298, 2005.

[146] J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters”, Symposium on Operating System Design and Implementation (OSDI'04), 2004.

[147] Digital Dig - Data Mining in Astronomy, http://www.astrosociety.org/pubs/ezine/datamining.html, 2007.

[148] NASA’s Data Mining Resources for Space Science, http://rings.gsfc.nasa.gov/~borne/nvo_datamining.html, 2007.

[149] Framework for Mining and Analysis of Space Science Data, http://www.itsc.uah.edu/f-mass/, 2007.

[150] The ClassX Project: Classifying the High-Energy Universe, http://heasarc.gsfc.nasa.gov/classx/, 2007.

[151] The AUTON Project, http://www.autonlab.org/autonweb/showProject/3/, 2007.

[152] GRIST: Grid Data Mining for Astronomy, http://grist.caltech.edu, 2007.


[153] A. Iosup, C. Dumitrescu, D.H.J. Epema, H. Li, L. Wolters, “How are Real Grids Used? The Analysis of Four Grid Traces and Its Implications”, IEEE/ACM International Conference on Grid Computing (Grid), 2006.

[154] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/, 2007.

[155] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, M. Wilde. “Dynamic Resource Provisioning in Grid Environments”, Technical Report, University of Chicago, 2007, http://people.cs.uchicago.edu/~iraicu/research/reports/up/DRP_v01.pdf.

[156] TeraPort Cluster, http://teraport.uchicago.edu/, 2007.

[157] Tier3 Cluster, http://twiki.mwt2.org/bin/view/UCTier3/WebHome, 2007.

[158] I. Raicu, I. Foster, A. Szalay. “3DcacheGrid: Dynamic Distributed Data Cache Grid Engine”, Technical Report, University of Chicago, 2006, http://people.cs.uchicago.edu/~iraicu/research/reports/up/HPC_SC_2006_v09.pdf.

[159] I. Raicu, I. Foster. “Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore”, Technical Report, University of Chicago, 2006, http://people.cs.uchicago.edu/~iraicu/research/reports/up/Storage_Compute_RM_Performance_06.pdf.

[160] I. Raicu, Y. Zhao, I. Foster, A. Szalay. “Data Diffusion Delivers Dynamic Digging,” Technical Report, University of Chicago, 2007, http://people.cs.uchicago.edu/~iraicu/research/reports/up/data-diffusion_hpdc_hot-topics_v05.pdf.

[161] I. Raicu, Y. Zhao, I. Foster, A. Szalay. “A Data Diffusion Approach to Large Scale Scientific Exploration,” under review at the 2007 Microsoft eScience Workshop at RENCI, http://people.cs.uchicago.edu/~iraicu/research/reports/up/data-diffusion_mses07_v04-submitted.pdf.

[162] I. Raicu, I. Foster. “SkyServer Web Service”, Technical Report, University of Chicago, 2007, http://people.cs.uchicago.edu/~iraicu/research/reports/up/SkyServerWS_06.pdf.

[163] I. Raicu, I. Foster. “Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets”, NASA GSRP Proposal, Ames Research Center, NASA, February 2006 -- Award funded 10/1/06 - 9/30/07, http://people.cs.uchicago.edu/~iraicu/research/reports/up/NASA_proposal06.pdf.

[165] Amazon, “Amazon Elastic Compute Cloud (Amazon EC2)”, http://www.amazon.com/gp/browse.html?node=201590011, 2007.

[166] B. Sotomayor, K. Keahey, I. Foster, T. Freeman. “Enabling Cost-Effective Resource Leases with Virtual Machines,” HPDC Hot Topics, 2007.

[167] K. Keahey, T. Freeman, J. Lauret, D. Olson. “Virtual Workspaces for Scientific Applications,” SciDAC 2007.

[168] R. Bradshaw, N. Desai, T. Freeman, K. Keahey. “A Scalable Approach To Deploying And Managing Appliances,” TeraGrid Conference 2007.

[169] B. Sotomayor, “A Resource Management Model for VM-Based Virtual Workspaces”, Masters paper, University of Chicago, 2007.

[170] Y. Zhao, M. Hategan, I. Raicu, M. Wilde, I. Foster. “Swift: a Parallel Programming Tool for Large Scale Scientific Computations”, under review at Journal of Grid Computing.

[171] R.L. White, D.J. Helfand, R.H. Becker, E. Glikman, and W. Vries. “Signals from the Noise: Image Stacking for Quasars in the FIRST Survey”, The Astrophysical Journal, Volume 654, pages 99 – 114, 2007.

[172] I. Raicu, I. Foster, Z. Zhang. “Enabling Serial Job Execution on the BlueGene Supercomputer with Falkon,” work in progress, http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS/Falkon_BG.

[173] SDSS: Sloan Digital Sky Survey, http://www.sdss.org/.

[174] Globus Incubation Management Project, http://dev.globus.org/wiki/Incubator.

[175] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster and M. Wilde. “Falkon: A Proposal for Project Globus Incubation”, Globus Incubation Management Project, 2007, http://people.cs.uchicago.edu/~iraicu/research/reports/up/Falkon-GlobusIncubatorProposal_v3.pdf.

[176] I. Raicu, I. Foster. “Characterizing Storage Resources Performance in Accessing the SDSS Dataset,” Technical Report, University of Chicago, 2005, http://people.cs.uchicago.edu/~iraicu/research/reports/up/astro_portal_report_v1.5.pdf

[177] I. Raicu, I. Foster. “Characterizing the SDSS DR4 Dataset and the SkyServer Workloads,” Technical Report, University of Chicago, 2006, http://people.cs.uchicago.edu/~iraicu/research/reports/up/SkyServer_characterization_2006.pdf

[178] I. Raicu, C. Dumitrescu, I. Foster. “Provisioning EC2 Resources,” work in progress, http://www.ci.uchicago.edu/wiki/bin/view/VDS/DslCS/Falkon_EC2.

[179] I Foster, C Kesselman, S Tuecke, "The Anatomy of the Grid", Intl. Supercomputing Applications, 2001.

[180] SDSS: Sloan Digital Sky Survey, http://www.sdss.org/

[181] GSC-II: Guide Star Catalog II, http://www-gsss.stsci.edu/gsc/GSChome.htm

[182] 2MASS: Two Micron All Sky Survey, http://irsa.ipac.caltech.edu/Missions/2mass.html

[183] POSS-II: Palomar Observatory Sky Survey, http://taltos.pha.jhu.edu/~rrg/science/dposs/dposs.html

[184] R Williams, A Connolly, J Gardner. “The NSF National Virtual Observatory, TeraGrid Utilization Proposal to NRAC”, http://www.us-vo.org/pubs/files/teragrid-nvo-final.pdf

[185] JC Jacob, DS Katz, T Prince, GB Berriman, JC Good, AC Laity, E Deelman, G Singh, MH Su. "The Montage Architecture for Grid-Enabled Science Processing of Large, Distributed Datasets", Earth Science Technology Conference, 2004.

[186] TeraGrid, http://www.teragrid.org/

[187] TeraGrid: Science Gateways, http://www.teragrid.org/programs/sci_gateways/

[188] I Foster. “A Globus Toolkit Primer,” 02/24/2005. http://www-unix.globus.org/toolkit/docs/development/3.9.5/key/GT4_Primer_0.6.pdf

[189] B Allcock, J Bresnahan, R Kettimuthu, M Link, C Dumitrescu, I Raicu, I Foster. "The Globus Striped GridFTP Framework and Server", IEEE/ACM SC 2005.

[190] SDSS Data Release 4 (DR4), http://www.sdss.org/dr4/

[191] Sloan Digital Sky Survey / SkyServer, http://cas.sdss.org/astro/en/

[192] I Raicu, C Dumitrescu, M Ripeanu, I Foster. “The Design, Performance, and Use of DiPerF: An automated DIstributed PERformance testing Framework”, to appear in the Journal of Grid Computing, Special Issue on Global and Peer-to-Peer Computing 2006.

[193] US National Virtual Observatory (NVO), http://www.us-vo.org/index.cfm

[194] JAVA FITS library: http://heasarc.gsfc.nasa.gov/docs/heasarc/fits/java/v0.9/

[195] I. Foster, V. Nefedova, L. Liming, R. Ananthakrishnan, R. Madduri, L. Pearlman, O. Mulmo, M. Ahsant. “Streamlining Grid Operations: Definition and Deployment of a Portal-based User Registration Service.” Workshop on Grid Applications: from Early Adopters to Mainstream Users 2005.

[196] Hanisch, R. J.; Farris, A.; Greisen, E. W.; Pence, W. D.; Schlesinger, B. M.; Teuben, P. J.; Thompson, R. W.; Warnock, A., III. Definition of the Flexible Image Transport System (FITS) Astronomy and Astrophysics, v.376, p.359-380, September 2001.