
World Applied Programming, Vol (1), No (5), December 2011. 309-321

ISSN: 2222-2510

2011 WAP journal. www.waprogramming.com


Dynamic Hierarchical Model

for Fault Tolerant Grid Computing

Mohammed REBBAH * Yahya SLIMANI

Computer Science Department, University of Mascara,

LRGB Laboratory, EDTEC Group

Mascara, Algeria

[email protected]

Computer Science Department, University of El Manar,

Tunis, Tunisia

[email protected]

Abdelkader BENYETTOU Lionel BRUNIE

University of Sciences and Technology of Oran

Mohammed BOUDIAF,

Oran, Algeria

[email protected]

University of Lyon, CNRS, INSA-Lyon, LIRIS,

UMR5205, F-69621,

Lyon, France

[email protected]

Abstract: Our contribution in this paper is twofold. First, we propose a dynamic hierarchical model for the grid, which models the grid as a dynamic n-ary tree composed of a root, a set of intermediate levels whose number depends on the number of available resources, and a lowest level containing the resources that execute jobs. Second, we support our model with a fault tolerance mechanism based on distribution and swapping techniques. The distribution technique, adopted to tolerate faults in the intermediate levels, keeps jobs in their leaves and reconnects the children of the failed nodes to the siblings of their parents without any replication. The implementation of our model over Globus Toolkit 4 extends its functionality to tolerate faults.

Keywords: Grid computing, Fault tolerance, Dynamic hierarchical model, Distribution, Swapping, Globus Toolkit.

I. INTRODUCTION

Grid computing was introduced by Foster and Kesselman [1] to define a distributed computing infrastructure for advanced science and engineering. A grid is a collection of distributed computing resources, available over a local or wide area network, that appears to an end user or application as one large virtual computing system. The main goal of this infrastructure is to provide shared heterogeneous services and resources that users and applications can access to solve computationally intensive problems and to reach large storage spaces.

Most grid models are static: the grid is formed, in general, by a set of clusters connected through a local area network (LAN) or wide area network (WAN), and every cluster contains a set of local sites, each managing a set of nodes that execute the users' jobs. This organization is often called the G/S/M model [27, 28]. Although these models have yielded encouraging results, they are limited in some areas, such as:

1. Grid resources mismanagement: this mismanagement goes in both directions. When the grid has few resources, it is pointless to structure them into a large number of levels; conversely, when the number of resources grows, more levels of hierarchy are typically required.

2. Mismanagement of the dynamicity: one of the fundamental characteristics of the grid is the dynamicity of its resources. This dynamicity cannot be handled by a static hierarchical model.

In this paper, we propose a dynamic hierarchical model, which models the grid as a dynamic n-ary tree composed of a root, a set of intermediate levels whose number depends on the number of available resources, and a lowest level containing the resources that execute jobs. The dynamic nature of the proposed model is related to the number of available resources in the grid and their distribution in the tree.

The large computing potential of computational grids is often hampered by their susceptibility to failures, which include process failures, machine crashes and network failures. In grid computing, fault management is a very important and difficult problem for grid application developers. The failure of resources fatally affects job execution; therefore, fault tolerance functionality is essential in grid computing [2]. A computational grid consists of large sets of diverse, geographically distributed resources that are grouped into virtual computers for executing specific applications. As the number of grid system components increases, the probability of failure becomes higher than in traditional parallel computing [1].

In this paper, we present two fault tolerance mechanisms based on the distribution and swapping techniques. We have implemented our model as a grid service (Dynamic Hierarchical Model for Fault Tolerant Grid; DHM-FTGrid) over Globus Toolkit 4. The proposed fault tolerance mechanism treats crash faults and network failures. It is based on error recovery by the distribution and swapping techniques.

The rest of this paper is organized as follows. Section 2 gives an overview of different works on fault tolerance in grid computing. Related work is discussed in Section 3. Section 4 defines our proposed model, its various actors and the family concept defined between the levels of the model. Section 5 describes the types of faults treated and the proposed fault tolerance mechanism. Section 6 presents the architecture of the DHM-FTGrid grid service developed over Globus GT4 [29] to validate our proposed model, and experimental results are presented in Section 7. Conclusions and future work are presented in Section 8.

II. FAULT TOLERANCE IN GRID COMPUTING

Failure in large-scale grid systems is and will be a fact of life. Hosts, networks, disks and applications frequently fail, restart, disappear and otherwise behave unexpectedly. Support for the development of fault-tolerant applications has been identified as one of the major technical challenges to address for the successful deployment of computational grids [3,4,5]. Three fault tolerance techniques for grid computing have been of particular importance: (i) checkpointing, or periodically saving the state of a process running on a computational resource so that, in the event of failure, it can be migrated to an operational resource [6,7]; (ii) replication, or maintaining a sufficient number of replicas, or copies, of a process executing in parallel with identical state but on different resources, so that at least one replica is guaranteed to finish the process correctly [8,9,10]; and (iii) rescheduling, or finding, in the event of failure, different resources that can accept and run the failed tasks. The replication of data is an important aspect of providing fault tolerance in data grids [11,12,13].

Several approaches for the implementation of fault tolerance in message-passing applications exist. MPICH-GF [14] is a checkpointing system based on MPICH-G2 [15], a grid-enabled version of MPICH. It handles checkpointing, error detection and process restart in a manner transparent to the user [16]. Pawel Garbacki et al. address the problem of making parallel Java applications based on Remote Method Invocation (RMI) fault tolerant in a way transparent to the programmer [17]. Azzedin and Maheswaran [18] suggested integrating the trust concept into grid resource management. Abawajy [19] presented a Distributed Fault-Tolerant Scheduling (DFTS) policy to provide fault tolerance for job execution in a grid environment. Song [20] developed a security-binding scheme through site reputation assessment and trust integration across grid sites. Congfeng Jiang et al. [21] presented a Fuzzy-logic based Self-Adaptive job Replication Scheduling (FSARS) algorithm to handle the fuzziness or uncertainty of the job replication number, which is highly related to the trust factors behind grid sites or user jobs.

III. RELATED WORK

In centralized systems, decisions are made by a central controller, which maintains all information about the applications and keeps track of all available resources in the system. Centralized systems are simple to implement, easy to deploy, and present few management hassles. However, they are not scalable with respect to the number of grid resources. In hierarchical systems, there is a central manager and multiple lower-level managers. The central manager is responsible for handling the complete execution of an application and for assigning the individual pieces of information of this application to the lower-level managers, whereas each lower-level manager is responsible for mapping this information onto grid resources. The main advantage of a hierarchical architecture is that different management policies can be deployed at the central manager and at the lower levels. However, the failure of the central manager results in the failure of the entire system.

Ranganathan and Foster [22, 23] describe and evaluate various replication strategies for hierarchical data grids. These strategies are defined depending on when, where, and how replicas are created and destroyed. They compare six different replication strategies: No Replication, Best Client, Cascading, Plain Caching, Caching plus Cascading and Fast Spread. One enhancement is support for the hierarchical desktop grid concept as described by Kacsuk et al. [24], which allows a set of projects to be connected to form a directed acyclic graph where work is distributed among the edges of this graph. The hierarchy concept is realized with the help of a modified BOINC client application, the Hierarchy Client, which always runs beside any child project; its only task is to connect to the parent desktop grid, report itself as a powerful client consisting of a given number of processors, and inject fetched workunits into the local desktop grid's database. Generally, a project acting as a parent does not have to be aware of the hierarchy; it only sees the child desktop grid as one powerful client. Marosi et al. [25] show how to implement automatic application deployment in hierarchical desktop grid systems, so that administrators of lower-level desktop grids do not have to deal with deploying applications of higher-level parent desktop grids. Farkas et al. [26] describe an important property of scheduling algorithms for hierarchical desktop grid systems, namely that each child desktop grid runs an instance of one of the scheduling algorithms. The task of the scheduling algorithm is not to send workunits to attached clients, but to determine a number of CPU cores reflecting the performance of the given desktop grid, which is reported by the Hierarchy Client. As the child desktop grid connects to its parent, it presents itself as a powerful client consisting of that many cores, so it processes at most that many workunits originating from its parent in parallel.

All these works did not take into account the dynamic nature of grid resources, which can appear and disappear unpredictably, or the heterogeneity of these resources. However, the dynamic nature of these resources and their heterogeneity impose a further challenge for seamless collaboration. In this paper, we propose a dynamic hierarchical fault tolerance model based on distribution and swapping techniques.

IV. PROPOSED MODEL

A. Grid model

The model supposes that the grid (see Figure 1) is a finite set of $G$ clusters $C_k$, interconnected by gates $gt_k$, $k \in \{0, \ldots, G-1\}$, where each cluster contains one or more sites $S_{jk}$ interconnected by switches $SW_{jk}$, and every site contains some Processor Elements $PE_{ijk}$ and some Storage Elements $SE_{ijk}$, interconnected by a local area network.

Figure 1. Grid topology

B. Dynamic Hierarchical Model

The hierarchical architecture is frequently used for designing complex systems. It organizes the system in hierarchical levels. We propose to structure the resources of the grid in a dynamic n-ary tree composed of n levels described as follows (see Figure 2):

Leaf level: Every node at this level is a leaf and has the following functions:

- Execution of jobs,
- Sending the states of the jobs to the superior level (parent).

Intermediate levels: Nodes at these levels have the functions described below:

- Detection and tolerance of faults of nodes at the lower level (the children),
- Updating the status of the children's jobs,
- Sending the states of their children to the parent.

Root level: This level corresponds to the root of the tree. It consists of a single node that is associated with the entire grid and is called the manager of the grid. Its role is:

- Detection and tolerance of faults of its children,
- Updating the status of its children's jobs.

The user submits a job to the root, which distributes the job fairly to its children. This process spreads through all the intermediate levels down to the leaves, which are designated to execute the jobs. The results are then transmitted from parent to parent up to the root, which transmits them to the end user.

Figure 2. Dynamic Hierarchical Model of a grid.

Where:

R: Root.

DR: Duplicate Root.

N: Node.

$N_{i,j,k}$: identifies each node by (level, number of its parent, its number).

In order to build a balanced initial tree, we define the function A(L, F), which returns the number of nodes required to build the requested tree from the number of levels and the number of children per node:

$$A(L, F) = \sum_{i=0}^{L} F^{i} = \frac{F^{L+1} - 1}{F - 1}, \quad \text{where } F > 1$$

Where:

F: number of children for every node.

L: number of levels.

Example: To build a balanced tree of 3 levels and 2 children for each parent, we need $A(3, 2) = (2^{4} - 1)/(2 - 1) = 15$ nodes.
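As a quick check of the node count, the following minimal sketch assumes that A(L, F) counts a root plus L further levels in which every parent has F children, i.e. the geometric sum 1 + F + ... + F^L. The function name `nodes_required` is ours and is not part of DHM-FTGrid.

```python
def nodes_required(levels: int, fanout: int) -> int:
    """Nodes in a balanced tree: a root plus `levels` levels below it,
    every parent having `fanout` children (assumed reading of A(L, F))."""
    if fanout <= 1:
        raise ValueError("fanout must be greater than 1")
    if levels < 0:
        raise ValueError("levels must be non-negative")
    # Closed form of 1 + F + F^2 + ... + F^L; the division is exact.
    return (fanout ** (levels + 1) - 1) // (fanout - 1)


print(nodes_required(3, 2))  # 15, the example given in the text
```

Under the same assumption, `nodes_required(2, 3)` gives 13, which corresponds to a tree with 9 leaves, 3 intermediate nodes and a root.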


C. Family concept

Because of the dynamicity of nodes in the tree, we define the notion of family: each node of the tree has a family, possibly consisting of a parent, siblings and children. Family members vary according to the position of the node in the tree (see Figure 3).

Leaves: have siblings and a parent.

Intermediate nodes: have all family members.

Root: it has only children.

Figure 3. Family of the node Ni,k,L

V. FAULT TOLERANCE MODEL

A. Types of faults

Our system is able to detect and tolerate crash faults and disconnection faults.

Crash fault: In this case, an entity stops abruptly and is no longer accessible. A crash fault may occur at a leaf, an intermediate node, or the root.

Leaf level: When a leaf is affected by a crash fault, the execution of its jobs stops. No job can be submitted to this leaf, and it can no longer send messages to its parent.

Intermediate level: When a node at this level suffers a crash fault, its fault tolerance manager stops. The node can no longer send messages to its parent or receive messages (state messages) from its children.

Root: When the root experiences a crash fault, all the information of the grid is lost.

Disconnection fault: In this type of fault, the communication medium fails. It occurs when there is an error in the management of the communication between the different elements of the grid, for example a fault in the DNS manager, a connection problem (wiring), or a problem in the system files.

B. Fault Detection

Fault detection is crucial for providing a scalable, dependable and highly available grid computing environment. We use heartbeat messages at the intermediate nodes and the root; detecting a fault is the responsibility of the parent. Each parent periodically sends a heartbeat message to all its children. When it receives no reply from one of its children, it waits for a certain time interval. If during this interval it receives the child's state by another path (through another node of the tree), the fault is a disconnection fault; otherwise it is a crash fault.
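The classification rule can be sketched as follows. This is only an illustration of the decision described in the text: the names `Fault` and `classify_child` are hypothetical, and the two boolean inputs stand for the outcome of the heartbeat and of the waiting interval, whose actual transport is not modelled here.

```python
from enum import Enum


class Fault(Enum):
    NONE = "none"
    DISCONNECTION = "disconnection"
    CRASH = "crash"


def classify_child(replied_to_heartbeat: bool,
                   state_seen_via_other_path: bool) -> Fault:
    """Classify a child after one heartbeat and the waiting interval."""
    if replied_to_heartbeat:
        return Fault.NONE
    # No direct reply: the parent has waited for the time interval.
    # If the child's state arrived through another node, only the link failed.
    if state_seen_via_other_path:
        return Fault.DISCONNECTION
    return Fault.CRASH
```

A parent would call `classify_child` once per child at the end of each waiting interval and hand any `Fault.CRASH` result to its fault tolerance mechanism.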

C. Fault tolerance

The fault tolerance mechanism follows the hierarchical model: each parent tolerates the faults of its children. We propose two fault tolerance techniques in this model, based on distribution and swapping. When a parent detects a crash fault of one of its children, it counts the siblings of the failed child; if the child is an only child, it uses the swapping method; otherwise it uses the distribution method.

1) Distribution method

Upon the detection of a crash fault in the tree, its tolerance is the responsibility of the parent of the failed node. The method consists in distributing the children of the failed node over its siblings: the parent counts the children of the failed node and connects them fairly to its siblings. After its repair, the failed node becomes a leaf in the tree (see Figure 4). The distribution of nodes is from left to right.

Figure 4. Distribution method: Failed node has three children and two siblings
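A minimal sketch of this left-to-right distribution is given below. It assumes that "fairly" means every sibling receives the integer quotient of children per sibling and that any remainder goes to the leftmost siblings; the function name `distribute_children` and the node labels are illustrative only.

```python
def distribute_children(orphans: list[str],
                        siblings: list[str]) -> dict[str, list[str]]:
    """Reattach the children of a failed node to its siblings, left to right.

    Each sibling receives len(orphans) // len(siblings) children; the first
    len(orphans) % len(siblings) siblings receive one extra child (assumption).
    """
    if not siblings:
        raise ValueError("distribution needs at least one sibling")
    base, extra = divmod(len(orphans), len(siblings))
    assignment: dict[str, list[str]] = {}
    start = 0
    for index, sibling in enumerate(siblings):
        share = base + (1 if index < extra else 0)
        assignment[sibling] = orphans[start:start + share]
        start += share
    return assignment


# Scenario of Figure 4: the failed node has three children and two siblings.
print(distribute_children(["c1", "c2", "c3"], ["s1", "s2"]))
# {'s1': ['c1', 'c2'], 's2': ['c3']}
```

No job is moved in this step: only the parent link of each child changes, which is why the method needs no replication.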

2) Swapping method

It consists in replacing a failed node by a leaf of the tree. We choose the least loaded leaf as substitute, to minimize the number of jobs to tolerate; after its repair, the failed node becomes a leaf.

We use this method in two cases:

Swapping a node of the intermediate level: When a parent detects a crash fault of one of its children, it seeks the least loaded leaf to replace the failed node, and the leaf's jobs are tolerated by its siblings. If some jobs still cannot be placed, they are transmitted to the next level (see Figure 5).

Swapping of the root by the duplicate root: If the root fails, all the jobs submitted to the grid are lost. For this reason, we added to the model a Duplicate Root (DR), which is a child of the root. The DR is assigned to detect crash faults of the root and to tolerate them by the swapping method; the root periodically updates the DR. When the root breaks down, the DR takes its place, and the DR is in turn replaced by a leaf of the tree (we choose the least loaded leaf and tolerate its jobs); the root becomes a leaf after repair.
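The choice of the substitute and the handing over of its jobs can be sketched as below. The `Leaf` class, the queue-based notion of load and both function names are assumptions made for illustration; they are not taken from the DHM-FTGrid code.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    name: str
    queue_size: int                      # maximum number of queued jobs
    jobs: list[str] = field(default_factory=list)

    @property
    def free(self) -> int:
        return self.queue_size - len(self.jobs)


def pick_substitute(leaves: list[Leaf]) -> Leaf:
    """Return the least loaded leaf; ties go to the first one found."""
    return min(leaves, key=lambda leaf: len(leaf.jobs))


def hand_over_jobs(substitute: Leaf, siblings: list[Leaf]) -> list[str]:
    """Move the substitute's jobs to its siblings; return jobs that did not fit."""
    pending = list(substitute.jobs)
    substitute.jobs.clear()
    for sibling in siblings:
        while pending and sibling.free > 0:
            sibling.jobs.append(pending.pop(0))
    return pending  # jobs to be transmitted to the next level


leaves = [Leaf("a", 5, ["j1", "j2"]), Leaf("b", 5, ["j3"]),
          Leaf("c", 5, ["j4", "j5", "j6", "j7"])]
substitute = pick_substitute(leaves)
others = [leaf for leaf in leaves if leaf is not substitute]
leftover = hand_over_jobs(substitute, others)
```

In this example leaf `b` is selected, its single job moves to leaf `a`, and `leftover` is empty, so nothing has to be escalated.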

D. Disconnection fault tolerance

Disconnection faults are tolerated from children to parent: when a node loses the connection with its parent, it consults its siblings to find a path through one of them. Once such a path is found, the node transmits its status to its parent through it; if it cannot find a path, the parent is considered to be in crash fault.



E. Tree Restructuring

The structure of the tree changes over time, mostly because of faults tolerated by the distribution technique and the insertion of new leaves; these changes make the tree unbalanced compared to its initial state, where the children are distributed equitably over all levels. We therefore support our model with a tree restructuring method managed by the root: the user defines the number of children for every parent (F), and from the number of available nodes in the tree (Y) we calculate the number of levels by the formula

$$L = \log_{F}\bigl(Y (F - 1) + 1\bigr) - 1$$

(this formula is deduced from the function A(L, F)). Tree restructuring is costly; it is initiated only when necessary.
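The number of levels can also be obtained without logarithms, as in the sketch below. It assumes that A(L, F) counts a root plus L levels of F children per parent, and it rounds up to the next whole level when Y does not fill a balanced tree exactly; that rounding rule and the name `levels_for` are our assumptions.

```python
def levels_for(nodes: int, fanout: int) -> int:
    """Smallest number of levels below the root whose balanced tree
    (root + levels, `fanout` children per parent) can hold `nodes` nodes."""
    if fanout <= 1 or nodes < 1:
        raise ValueError("need fanout > 1 and at least one node")
    levels, capacity, width = 0, 1, 1   # a lone root: 0 levels, 1 node
    while capacity < nodes:
        width *= fanout                 # number of nodes on the next level down
        capacity += width
        levels += 1
    return levels


print(levels_for(15, 2))  # 3
```

Working with integers avoids floating-point error in the logarithm when Y lies exactly on a level boundary.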

Figure 5. Swapping a node in the intermediate level.

VI. IMPLEMENTATION

A. Service DHM-FTGrid architecture

We have integrated our grid service DHM-FTGrid over Globus GT4. DHM-FTGrid is composed of 4 basic modules (see Figure 6):

1. DBTFjob: the database, which contains the tables db_children, db_family and db_jobs.

2. Job Manager: it takes charge of the job from its submission to the grid until its assignment to a leaf; once the job has finished, it transmits the results. It is also responsible for updating the tables of DBTFjob.

3. Fault Detector: each manager detects a fault of its children from the job status messages they send periodically. When a child stops sending its state, the manager pings it; if it receives no response, a fault is detected in this node and the manager transmits the node's status to the Fault Manager.

4. Fault Manager: it is responsible for tolerating the failed node by applying the algorithms explained below; it works in conjunction with the Job Manager, which redistributes the jobs of failed nodes locally or at a higher level.

Our model is composed of a set of hierarchical levels. The leaf level (L0) executes jobs and has no fault tolerance task; fault tolerance is implemented, in different ways, at the first level (L1), the intermediate levels (Li) and the root.

Figure 6. DHM-FTGrid architecture

B. Level L1

Data structure: Every level of the tree uses three tables, db_children, db_family and db_jobs, defined as follows:

Table db_children: is composed of the attributes id_n0, ip_n0, nb_job, size_list and size_list_free.

- id_n0: node identifier.
- ip_n0: IP address of the node.
- nb_job: number of jobs.
- size_list: the maximum size of the queue.
- size_list_free: free size of the queue.

Table db_family: the attributes of this table are id_node, ip_node and type.

- id_node: node identifier.
- ip_node: IP address of the node.
- type: type of node; can be a parent or a sibling.

Table db_jobs: is composed of the attributes id_n0, id_job, job, job_state, duplique and tolerated.

- id_job: job identifier.
- job: contains the following parameters: executable, argument, stdout, stderr.
- job_state: the status of the job; can be Active (in execution), Failed (down), Pending (pending) or Done (free).
- duplique: the job can be in passive or active duplication.
- tolerated: indicates whether the job is tolerated or not.
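For readers who prefer to see the three tables as records, the sketch below mirrors the attribute names listed above. The Python types, the class names and the sample values are assumptions; the text does not specify column types.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    ACTIVE = "Active"      # in execution
    FAILED = "Failed"      # down
    PENDING = "Pending"
    DONE = "Done"


@dataclass
class ChildRow:            # one row of db_children
    id_n0: int
    ip_n0: str
    nb_job: int
    size_list: int         # maximum size of the queue
    size_list_free: int    # free size of the queue


@dataclass
class FamilyRow:           # one row of db_family
    id_node: int
    ip_node: str
    type: str              # "parent" or "sibling"


@dataclass
class JobSpec:             # the `job` attribute of db_jobs
    executable: str
    argument: str
    stdout: str
    stderr: str


@dataclass
class JobRow:              # one row of db_jobs
    id_n0: int
    id_job: int
    job: JobSpec
    job_state: JobState
    duplique: str          # "passive" or "active" duplication
    tolerated: bool


child = ChildRow(id_n0=1, ip_n0="192.0.2.10", nb_job=2,
                 size_list=5, size_list_free=3)
row = JobRow(1, 42, JobSpec("/bin/echo", "hello", "out.txt", "err.txt"),
             JobState.PENDING, "passive", False)
```

The invariant `size_list_free == size_list - nb_job` holds for the sample row; whether the service enforces it is not stated in the text.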

When a node at level L1 detects a fault of one of its children, it launches the fault tolerance process in two phases. First, it selects the jobs of the failed child from the table db_jobs and runs a local fault tolerance by distributing these jobs over its other children according to their loads. Then, if some jobs still cannot be placed, it transmits them to the next level (L2).
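The two phases can be sketched as follows. The sketch assumes that "according to their loads" means each job goes to the child that currently has the most free queue slots; the function name `redistribute_jobs` and the dictionary of free slots are illustrative stand-ins for the size_list_free column.

```python
def redistribute_jobs(jobs: list[str], free_slots: dict[str, int]
                      ) -> tuple[dict[str, list[str]], list[str]]:
    """Place `jobs` on the surviving children, most free slots first.

    Returns (placement per child, jobs to transmit to the next level).
    """
    remaining = dict(free_slots)
    placement: dict[str, list[str]] = {child: [] for child in free_slots}
    escalated: list[str] = []
    for job in jobs:
        # Phase 1: pick the child with the most free queue slots right now.
        target = max(remaining, key=remaining.get, default=None)
        if target is None or remaining[target] == 0:
            # Phase 2: no local room left, the job goes to the next level.
            escalated.append(job)
            continue
        placement[target].append(job)
        remaining[target] -= 1
    return placement, escalated


placement, escalated = redistribute_jobs(
    ["j1", "j2", "j3", "j4"], {"n1": 2, "n2": 1})
```

Here `n1` receives `j1` and `j2`, `n2` receives `j3`, and `j4` is escalated because both queues are full.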

C. Intermediate levels (Li)

Data structures: The tables db_children and db_jobs are each an aggregation of the data stored in the corresponding tables of the level (Li-1), with an additional attribute indicating the ID of each node of level Li-1.

Fault tolerance: When the fault tolerance manager at level (Li) detects a fault of one of its children, the fault tolerance technique is as follows (see Algorithm 1):

1. The manager counts the number of the siblings of the failed node.

2. If this number is greater than or equal to 2, it uses the distribution method (see Algorithm 2).

3. Otherwise, it uses the swapping method (see Algorithm 3).

5. It sends the state to its parent.

Algorithm 1: TF_Ni (db_children, db_jobs, ID_Ni)

Begin
    db_children.first();
    While not db_children.eof() do
        ID = db_children.ID_Ni-1;                  // current child
        If (ping (ID_Ni, ID) == false) then
            // no reply: wait before deciding that it is a crash fault
            If (wait_result() == false) then
                k = search_nbfailed_sibling (db_children);
                If (k >= 2) then
                    distribute_failed_children (ID, k, db_children, db_jobs);
                Else
                    swap_failed (ID, db_children);
                End if
            End if
        Else
            // child is alive: update the state of the finished jobs
            If completed_job (ID, Liste_job) then
                update db_jobs set job_state = "Done" where id_job in Liste_job[id_job];
            End if
            If completed_job (id_ni+1, Liste_job) then
                update db_jobs set job_state = "Done" where id_job in Liste_job[id_job];
            End if
            send_state_job_Done (ID);
        End if
        db_children.Next();
        send_state_Ni_to_Ni+1();
    End while
End.
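The control flow of Algorithm 1 can be sketched in Python as below. Every callable passed to `monitor_children` is a hypothetical stand-in for an operation of the service (ping, waiting for a late state, counting siblings, Algorithms 2 and 3); the job-state bookkeeping of the algorithm is left out to keep the sketch short.

```python
from typing import Callable, Iterable


def monitor_children(
    children: Iterable[str],
    ping: Callable[[str], bool],
    state_arrived_later: Callable[[str], bool],
    sibling_count: Callable[[str], int],
    distribute: Callable[[str, int], None],
    swap: Callable[[str], None],
) -> list[str]:
    """One monitoring pass of a manager over its children.

    Returns the children found in crash fault during this pass.
    """
    crashed = []
    for child in children:
        if ping(child):
            continue                  # child is alive
        if state_arrived_later(child):
            continue                  # reachable by another path: disconnection
        crashed.append(child)
        k = sibling_count(child)
        if k >= 2:
            distribute(child, k)      # distribution method (Algorithm 2)
        else:
            swap(child)               # swapping method (Algorithm 3)
    return crashed


log = []
crashed = monitor_children(
    ["a", "b", "c"],
    ping=lambda c: c == "a",
    state_arrived_later=lambda c: c == "b",
    sibling_count=lambda c: 2,
    distribute=lambda c, k: log.append(("distribute", c, k)),
    swap=lambda c: log.append(("swap", c)),
)
```

With these stubs, only `c` is reported as crashed, and because it has two siblings the distribution branch is taken.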

Algorithm 2: distribute_failed_children (ID, k, db_children, db_jobs)

Begin
    failed_node = [select * from db_children where id_ni-1 == ID];
    F = [select id_ni-1 from db_children where id_ni-1 != ID];
    L = failed_node.count;
    N = L div k;
    M = L mod k;
    failed_node.first();
    If N != 0 then
        F.first();
        For (s=1;s



Algorithm 3: swap_failed (ID, db_children)

Begin
    failed_children = [select * from db_children where id_ni-1 == ID];
    J = [select id_n0 from failed_children where nb_job in (select min(nb_job) from failed_children)];
    update db_children set id_ni-1 = J where id_ni-1 == ID;
    update db_children set id_fils = ID where id_fils == J;
    Toler_job (J, nb_job, db_jobs, ID, id_ni);
End.

D. Root Level

The data structures used in the root are the same as those used at level (Li), except that the table db_family does not exist. The fault tolerance technique also remains the same as in the intermediate levels, except that the root does not have a higher level.

E. Duplicate Root

The DR is responsible for detecting faults of the root. In the case of a disconnection fault, the DR receives updates through one of its siblings. If the root fault is a crash, the DR replaces the root, selects a leaf to replace itself as DR, and tolerates the jobs of this leaf.

VII. EXPERIMENTATION

The aim of the experiments conducted on this model is to show the contribution rate of each level of the tree to the fault tolerance process and to determine which fault tolerance technique (distribution or swapping) is used most. To evaluate the performance of our model, we developed DHM-FTGrid over Globus GT4 on Pentium 4 with a speed of 2.8 GHz, DDR 160 GB and RAM 1 GB. Our system operates under a hierarchical architecture using two tree models:

1. 8/4/2 model uses 15 nodes, distributed as follows: 8 leaves, 4 nodes in L1, 2 in L2 and a root.

2. 9/3 model uses 13 nodes, distributed as follows: 9 leaves, 3 nodes in L1 and a root.

A. Model 8/4/2

The first series of experiments uses the 8/4/2 architecture, where we increased the number of jobs from 5 to 60; each node has a queue of 5 jobs.

We have noted the following conclusions:

1. The levels of tolerance are related to the number of jobs in the grid (see Figure 7),

2. Tolerance techniques (distribution and swapping) are related to the number of levels and children,

3. In our experiments, the swapping method was used most because of the reduced number of children (see Figure 8),

4. The model holds failed jobs at the root when the rate of jobs submitted to the grid exceeds the size of the grid by 250%.

Figure 7. Levels of tolerance



Figure 8. Fault tolerance techniques

B. Model 9/3

The second series of experiments uses the 9/3 architecture, where we increased the number of jobs from 5 to 50; each node has a queue of 5 jobs.

We have noted the following conclusions:

1. The levels of tolerance are related to the number of jobs in the grid (see Figure 10),

2. Tolerance techniques (distribution and swapping) are related to the number of levels and children (see Figure 9),

3. In our experiments, the distribution method was used most because of the higher number of children,

4. The model holds failed jobs at the root when the rate of jobs submitted to the grid exceeds the size of the grid by 133%.

Figure 9. Fault tolerance techniques



Figure 10. Levels of tolerance

VIII. CONCLUSION

In this paper, we have proposed a fault tolerance model adapted to grid computing that takes into account the dynamic nature of the resources, the scalability and the heterogeneity of the grid. This model is completely independent of any physical architecture. We model the grid as a dynamic virtual tree composed of a root for the grid, a set of intermediate levels and a leaf level designated to run jobs. We present two fault tolerance mechanisms based on distribution and swapping. We have implemented DHM-FTGrid, a fault tolerance grid service, over Globus GT4. We observe that the dynamicity of the tree structure responds appropriately to the dynamic nature of grid resources. The distribution technique, adopted to tolerate faults in the intermediate levels, keeps jobs in their leaves and reconnects the children of the failed nodes to the siblings of their parents without any replication.

REFERENCES

[1] Foster, I., Kesselman, C. (eds.): The Grid: Blueprint for a New Computing Infrastructure. San Francisco, Calif.: Morgan Kaufmann Publishers, 1999, 677 pages.

[2] H.-M. Lee, K.-S. Chung, S. Jin, D.-W. Lee, W.-G. Lee, S. Jung, and H.-C. Yu. A fault tolerance service for QoS in grid computing. In LNCS, pages 286-296. Springer-Verlag, 2003.

[3] Garg, R., Singh, A. K.: Fault tolerance grid computing: state of the art and open issues. International Journal of Computer Science & Engineering Survey (IJCSES), Vol. 2, No. 1: 88-97, Feb 2011.

[4] Siva Sathya, S., Syam Babu, K.: Survey of fault tolerant techniques for grid. Computer Science Review, Vol. 4, No. 2: 101-120, 2010.

[5] Dabrowski, C.: Reliability in grid computing systems. Concurrency and Computation: Practice and Experience, Vol. 21, No. 8: 927-959. DOI: 10.1002/cpe.1410, 2009.

[6] Jin, H., Shi, X., Qiang, W. and Zou, D.: DRIC: Dependable Grid Computing Framework. IEICE Transactions on Information and Systems, Vol. E89-D, No. 2: 612-623, February 2006. DOI: 10.1093/ietisy/e89-d.2.612.

[7] Jafar, S., Krings, A., and Gautier, T.: Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing. IEEE Transactions on Dependable and Secure Computing, Vol. 6, No. 1: 32-44, January-March 2009.

[8] Lac, C., Ramanathan, S.: A Resilient Telco Grid Middleware. Proceedings of the 11th IEEE Symposium on Computers and Communications (ISCC'06), June 2006. IEEE Computer Society Press: Los Alamitos, CA, pp. 306-311, 2006.

[9] Jiang, C., Xu, X., Wan, J.: Replication Based Job Scheduling in Grids with Security Assurance. Proceedings of the Third International Symposium on Electronic Commerce and Security Workshops (ISECS '10), Guangzhou, P. R. China, 29-31, pp. 156-159, July 2010.

[10] Sangho, Y., Derrick, K., Bongjae, K., Geunyoung, P., Yookun, C.: Using Replication and Checkpointing for Reliable Task Management in Computational Grids. International Conference on High Performance Computing and Simulation (HPCS), pp. 125-131, France, 2010.

[11] Huedo, E., Montero, R., Llorente, I.: Evaluating the reliability of computational grids from the end user's point of view. Journal of Systems Architecture, 52 (12): 727-736, 2006.

[12] Olteanu, A., Pop, F., Dobre, C., Cristea, C.: Re-scheduling and error recovering algorithm for distributed environments. U.P.B. Scientific Bulletin, Series C, Vol. 73, Iss. 1: 27-38, 2011.

[13] Leyli, M. K., Maryam, E. F., Ali, G.: Reliable Job Scheduler using RFOH in Grid Computing. Journal of Emerging Trends in Computing and Information Sciences, Vol. 1, No. 1: 43-48, July 2010.

[14] Woo, N., Jung, H., Yeom, H. Y., Park, T. and Park, H.: MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes. IEICE Transactions on Information and Systems, 87(7): 1820-1828, 2004.

[15] Nicholas, I. F., Karonis, T., Toonen, B.: MPICH-G2: A Grid-enabled implementation of the Message Passing Interface. Journal of Parallel and Distributed Computing, 63(5): 551-563, May 2003.


[16] Díaz, D., Pardo, X. C., Martín, M. J., González, P.: Application-Level Fault-Tolerance Solutions for Grid Computing. Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'08), IEEE Computer Society, Washington, USA, pp. 554-559, 2008.

[17] Pawel, G., Bartosz, B., Henri, E. B.: Transparent Fault Tolerance for Grid Applications. Proceedings of the European Grid Conference (EGC 2005), Amsterdam, The Netherlands, pp. 671-680, 2005.

[18] Azzedin, F., Maheswaran, M.: Integrating trust into grid resource management systems. In: Proceedings of the International Conference on Parallel Processing (ICPP'02), IEEE Computer Society Press, Los Alamitos, pp. 47-54, 2002.

[19] Abawajy, J.: Fault-Tolerant Scheduling Policy for Grid Computing Systems. In: Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04), Santa Fe, New Mexico, pp. 238-244, 2004.

[20] Song, S., Hwang, K., Kwok, Y.: Trusted grid computing with security binding and trust integration. Journal of Grid Computing 3, pp. 53-73, 2005.

[21] Jiang, C., Wang, C., Liu, X., Zhao, Y.: A Fuzzy Logic Approach for Secure and Fault Tolerant Grid Job Scheduling. Autonomic and Trusted Computing, 4th International Conference, ATC 2007, Hong Kong, China, Volume 4610 of Lecture Notes in Computer Science, pp. 549-558, Springer, July 11-13, 2007.

[22] Ranganathan, K., Foster, I. T.: Identifying Dynamic Replication Strategies for a High-Performance Data Grid. In: Proceedings of GRID'2001, pp. 75-86.

[23] Ranganathan, K., Foster, I.: Design and evaluation of dynamic replication strategies for a high performance data Grid. In: International Conference on Computing in High Energy and Nuclear Physics, 2001.

[24] P. Kacsuk, N. Podhorszki, T. Kiss: Scalable desktop grid system. In: High Performance Computing for Computational Science - VECPAR 2006, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2007, pp. 27-38.

[25] A. Marosi, G. Gombas, Z. Balaton: Secure application deployment in the hierarchical local desktop grid. In: Proceedings of the 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems, DAPSYS 2006, pp. 145-154.

[26] Z. Farkas, A. C. Marosi, P. Kacsuk: Job scheduling in hierarchical desktop grids. In: Remote Instrumentation and Virtual Laboratories, Springer US, 2010, pp. 79-97.

[27] M. Rebbah, C. Mokhtari, M. Khaldi, M. F. Bourasi, O. Smail: Hierarchical model for fault tolerant grid computing over Globus Toolkit. International Congress on Models, Optimization and Security of Systems (ICMOSS2010), Tiaret, May 2010.

[28] B. Yagoubi, Y. Slimani: Task load balancing strategy for grid computing. Journal of Computer Science 3 (3): 186-194, ISSN 1546-9239, 2007.

[29] Globus Toolkit, http://www.globus.org [Oct. 20, 2011].
