final results on architectures and ... - newcastle universityresearch.cs.ncl.ac.uk/cabernet/ ·...

1

DSoS

IST-1999-11585

Dependable Systems of Systems

Final Results on Architectures and Dependable Mechanisms for Dependable SoSs

Report: Deliverable DSC1

Report Delivery Date: 30 March 2003

Classification: Public Circulation

Contract Start Date: 1 April 2000 Duration: 36m

Project Co-ordinator: University of Newcastle upon Tyne

Partners: QinetiQ – UK; INRIA, Rocquencourt – France; LAAS-CNRS, Toulouse – France; TU Wien – Austria; Universität Ulm – Germany; LRI, Orsay - France

Project funded by the European Community under the “Information Society Technology” Programme (1998-2002)

3

List of Authors

J-C. Fabre...................................................................................................................... LAAS-CNRS, Toulouse, F

R. de Lemos ........................................................................................................................University of Kent, UK

C. Gacek ..................................................................................................University of Newcastle upon Tyne, UK

M-C. Gaudel ...................................................................................................................................... LRI, Orsay, F

N. Georgantas ...................................................................................................................INRIA, Rocquencourt, F

V. Issarny (Editor)........................................................................................................... INRIA, Rocquencourt, F

N. Levy .............................................................................................................................INRIA, Rocquencourt, F

A. Romanovsky........................................................................................University of Newcastle upon Tyne, UK

F. Tartanoglu.....................................................................................................................INRIA, Rocquencourt, F

5

Table of Contents

List of Authors .................................................................................................................................................... 3 Table of Contents ................................................................................................................................................ 5 Introduction......................................................................................................................................................... 7

WP Achievements........................................................................................................................................... 7 Related demonstrators..................................................................................................................................... 7 Content of the Report ...................................................................................................................................... 8

Chapter 1 – Architecting DSoSs using the Web Services Architecture ............................................................ 11 1.1. Introduction....................................................................................................................................... 11 1.2. Dependability in the Web Services Architecture .............................................................................. 12

1.2.1 The Web Services Architecture........................................................................................................ 12 1.2.2 Backward Error Recovery for the Web ............................................................................................ 15 1.2.3 Forward Error Recovery for the Web............................................................................................... 18

1.3. From Formal Specification of Composite Web Services to Development Support .......................... 20 1.3.1 Building WSCA-based Composite Services .................................................................................... 20 1.3.2 Specifying Composite Services using WSCAL ............................................................................... 21 1.3.3 From Formal Specification to Implementation ................................................................................ 26

1.4. Enforcing atomicity of Composite Web Services ............................................................................. 27 1.4.1. Guidelines for specifying forward error recovery based on transactional models for multi-database systems ....................................................................................................................................... 27 1.4.2. From systematic undo in collaborative applications to semantic atomicity in composite Web services ................................................................................................................................................... 28

1.5. Conclusion ........................................................................................................................................ 30 Chapter 2 – Architectural Mismatch Tolerance ................................................................................................ 31

2.1. Introduction....................................................................................................................................... 31 2.2. Background ....................................................................................................................................... 32

2.2.1. Architectural Features ............................................................................................................... 32 2.2.2. Style-specific and Application-specific Mismatches................................................................. 32

2.3. Dependability .................................................................................................................................... 33 2.4. Dependability and Mismatches ......................................................................................................... 34

2.4.1. Mismatch Prevention................................................................................................................. 34 2.4.2. Mismatch removal..................................................................................................................... 35

2.5. Architectural Mismatch Tolerance.................................................................................................... 35 2.5.1. ECM Processing........................................................................................................................ 36 2.5.2. Mismatch Treatment ................................................................................................................. 37

2.6. Examples ........................................................................................................................................... 38 2.6.1. Style-specific Mismatch ............................................................................................................ 39 2.6.2. Application-specific Mismatch ................................................................................................. 40 2.6.3. Style- and Application-specific Mismatch ................................................................................ 41

2.7. Conclusions ....................................................................................................................................... 42 References......................................................................................................................................................... 45

7

Final Results on Architectures and Dependable Mechanisms for Dependable SoSs

J-C. Fabre1, R. de Lemos2, C. Gacek3, M-C. Gaudel4, N. Georgantas5, V. Issarny5, N. Levy5, A. Romanovsky3, F. Tartanoglu5

1LAAS-CNRS (Toulouse, F), 2University of Kent (UK),

3University of Newcastle upon Tyne (UK), 4LRI (Orsay, F), 5INRIA (Rocquencourt, F)

Introduction

WP Achievements The main objective of the DSoS Workpackage on architecture and design was to deliver the definition of an environment for the construction of dependable SoSs out of heterogeneous systems and dependability mechanisms assisting in building such systems. Towards that aim, the project has conducted research work in the three following complementary areas:

• Provision of an environment that assists developers during the design phase of DSoSs, through support for architectural modelling and quality analysis of DSoSs. In the environment, DSoS architectures are specified using an extensible UML-based Architecture Description Language (ADL). Extensibility of the ADL allows customisation of the architectural elements’ specification, enabling in particular specialization of the ADL for the quality analysis of DSoSs. In this context, we have been more specifically concentrating on providing support for the quality analysis of DSoSs at design time, while not requiring extensive knowledge of formal modelling techniques (e.g., process algebra, Markov chains, Petri nets, queuing nets) from developers. The ADL has been customized so as to allow: (i) quantitative analysis of the DSoSs quality from the standpoint of reliability and performance, and (ii) qualitative analysis of the DSoSs quality based on model checking.

• Thorough study of fault tolerant mechanisms based on forward error recovery for the development of DSoSs. The focus has in particular been on dealing with large complex DSoSs that may evolve. Results in this context relate to: (i) handling on-line upgrades of component systems at the level of LIFs, (ii) tolerating architectural mismatch at runtime, and (iii) supporting the development of dependable composite Web services.

• Provision of a wrapping technology for improving the robustness of a middleware-based SoSs. This work has further led to the design and implementation of a fault tolerant middleware platform, which is based on reflective architectural concepts and allows for the transparent integration of wrappers. Related Demonstrators

Results in the above areas are presented in Project Deliverables IC2, PCE1, CSDA2, and DSC4 and in this report. In particular, Deliverables CSDA2 and DSC4 discuss demonstration of prototype implementation of the resulting software systems.

Related demonstrators Three major demonstrators have been developed in the context of the DSoS workpackage on architecture and design. These demonstrators integrate prototype implementation of the software systems that are respectively related to results in the areas of development environment, dependability mechanisms based on forward error recovery, and wrapping technology. Architecture-based development environment: The demonstrator of the architecture-based development environment that is presented in Deliverable CSDA2, illustrates the project work on offering supports for the design and analysis of DSoSs at the architectural level. The prototype implementation of the environment builds upon the Rationale Rose UML modelling tool, which allows the definition of add-ins for the integration of new functionalities. The Rationale Rose tool has more specifically been extended with specific add-ins for architecture modelling and for quality analysis. The latter type of add-ins allows for specifying quality parameters associated with the system’s architectural model, and for generating quality models to be processed

8

by existing tools out of the attributed architectural models. Prototype usage has further been experimented using the Travel Agent case study that serves for demonstration. Internet Travel Agency: The demonstration of the Internet Travel Agency that is presented in Deliverable DCS4, illustrates the project work on dependability mechanisms for systems of systems and more specifically on: (i) investigating means of supporting the architectural construction of dependable SoSs through the interconnection of component systems at their LIFs, and (ii) designing and prototyping mechanisms for supporting dependable services in an evolving SoS. The particular focus of the demonstrator is on applying: (i) Coordinated Atomic (CA) actions [Xu et al. 1995] to integrating dependable Internet applications, and (ii) a general framework for handling on-line upgrades of component systems at the level of LIFs ([Jones et al 2002], [Jones et al 2003]) in the Internet environment. Fault tolerant middleware-based system with wrappers: The demonstration of the fault-tolerant middleware-based system that is presented in Deliverable DSC4, illustrates various aspects of the work carried out in DSoS during the lifetime of the project. It is based on a fault tolerant middleware platform which has been developed on purpose and which is based on reflective architectural concepts developed by LAAS for several years [Ruiz-García et al. 2003]. This new platform is composed of various executive layers, namely operating system (Linux V2.5 for x86), a Java Virtual Machine (Java Hot Spot 1.4.1) and a CORBA middleware layer (JOrbacus V4.1.2). The latter is a Java-based implementation of CORBA. The implementation of fault tolerance mechanisms was done using the notion of Portable Interceptors (cf. [OMG 2002]). These have been customized to implement a primary-backup strategy for a banking service. The use of standardized interceptor mechanisms allows the fault tolerance mechanisms to be implemented separately from the application code, and makes it possible for them to be integrated transparently. Remote clients using a similar system configuration can access the provided services through the Internet. In case of primary crash, the backup delivers the expected service to the clients. The fault model associated to this primary-backup strategy is the crash of the server host. However, other kinds of faults may impair the service, namely communication faults (see DSOS workpackage VA, Deliverable IC3) and executive faults (see DSOS workpackage AD, Deliverables IC2 and PCE1) hanging the server application. This kind of faults cannot be tolerated by the native implementation. We demonstrate, that thanks to the use of wrappers, these faults can be detected and the strategy improved to tolerate them. Two kinds of wrappers are used in the demonstration: IIOP message wrappers (communication wrapper) and thread synchronization wrappers (executive support wrappers). This demonstration aggregates lessons learnt from the work carried out in both DSoS workpackages AD and VA. It shows that, thanks to characterization inputs (regarding both the operating system and the middleware levels), the wrapping technology proposed in DSoS can significantly improve the robustness of a middleware-based System-of-Systems. Presentation of the demonstrator is given in Project Deliverable DCS4.

Content of the Report The remainder details project work on the architecture and design of DSoSs that has been carried out over the reporting period. Chapter 1 details the final design of development support for the dependable composition of Web services. The Web services architecture is expected to play a major role in developing next generation distributed systems of systems. However, the architecture needs to evolve to support all the requirements appertained to distributed systems. Addressing such requirements relates, in particular, in reusing solutions from the distributed system community. However, most solutions will not be reusable as is, mainly because of the openness of the Internet. Hence, making evolve the Web services architecture to support the thorough development of distributed systems raises a number of challenges. The DSoS project has been investigating one of the issues raised in this context, which is the dependable composition of Web services, i.e., understanding how fault tolerance should be addressed in the Web services architecture. This leads the project to make two complementary contributions:

• Proposing a solution to the architecting of DSoSs based on the Web services architecture, • Proposing a solution to making composite Web services dependable.

While dependability in closed distributed systems is conveniently addressed by transactions when concerned with both concurrency control and failure occurrences, it can hardly rely on such a mechanism in an open environment. A major source of difficulty lies in the use of backward error recovery which is mainly oriented towards tolerating hardware faults but poorly suited to the deployment of cooperation-based mechanisms over autonomous component systems that often require cooperative application-level exception handling among component systems. One solution to this concern lies in forward error recovery that enables accounting for the

9

specific of Web services and that leads to structure Web services-based systems in terms of co-operative actions. In particular, we are able to address dependable service composition in a way that neither undermines the Web service's autonomy nor increases their individual access latency. Chapter 2 introduces a solution to the handling of architectural mismatches at runtime. The integrity of complex software systems built from existing components or component systems is becoming more dependent on the integrity of the mechanisms used to interconnect these components and, in particular, on the ability of these mechanisms to cope with architectural mismatches that might exist between components. There is a need to detect and handle (i.e., to tolerate) architectural mismatches during runtime because in the majority of practical situations it is impossible to localize and correct all such mismatches during development time. When developing complex software systems, the problem is not only to identify the appropriate components, but also to make sure that these components are interconnected in a way that allows mismatches to be tolerated. The resulting architectural solution should be a system based on the existing components, which are independent in their nature, but are able to interact in well-understood ways. To find such a solution we apply general principles of fault tolerance to dealing with architectural mismatches.

11

Chapter 1 – Architecting DSoSs using the Web Services Architecture Ferda Tartanoglu (INRIA), Marie-Claude Gaudel (LRI), Nikolaos Georgantas (INRIA), Valérie Issarny (INRIA), Nicole Levy (INRIA), Alexander Romanovsky (University of Newcastle)

1.1. Introduction The Web services architecture targets the development of applications based on XML-related standards, hence easing the development of distributed systems through the dynamic integration of applications distributed over the Internet, independently of their underlying platforms. A Web service is a software entity deployed on the Web whose public interface is described in XML. A Web service can interact with other systems by exchanging XML-based messages, using standard Internet transport protocols. The Web service's definition and location (given by a URI) can be discovered by querying common Web service registries. Web services can be implemented using any programming language and executed on heterogeneous platforms, as long as they provide the above features. This allows Web services owned by distinct entities to interoperate through message exchange. Although the modularity and interoperability of the Web services architecture enable complex distributed systems to be easily built by assembling several component services into one composite service, there clearly is a number of research challenges in supporting the thorough development of distributed systems based on Web services. One such challenge relates to the effective usage of Web services in developing business processes, which requires support for composing Web services in a way that guarantees dependability of the resulting composite services. This calls for developing new architectural principles of building such composed systems, in general, and for studying specialized connectors "gluing" Web services, in particular, so that the resulting composition can deal with failures occurring at the level of the individual component services. Several properties of the Web services architecture must be taken into account while addressing the above issues. Web services are decentralized in architecture and in administration. Therefore, individual Web services can have different characteristics (e.g., transactional supports, concurrency policies, access rights), which may not be compliant with each other. Moreover, Web services use Internet transport protocols (e.g., HTTP, SMTP) and interacting with them requires dealing with limitations of the Internet such access latency, timeouts, lost requests and security issues. These specifics of Web services require special care for supporting dependability of complex distributed systems in integrating them. The provision of effective support for the dependable integration of Web services is still an open issue, which has led to tremendous research effort over the last couple of years, both in industry and academia. Since the emerging acceptance of Web services, the DSoS project has started investigating support for the development of dependable composite Web services, as they will undoubtedly become a major class of DSoSs. First results of this work are presented in Project Deliverable PCE1, which introduces an approach to the dependable composition of Web services, based on forward error recovery, further leading to introduce preliminary definition of the WSCAL XML-based language for supporting the development of associated composite Web services. The project work over the reporting period has focused on consolidating the proposed solution by:

• Thoroughly assessing existing work in the area of dependable composition of Web services as it is a very active area of research. In particular, we wanted to thoroughly understand existing solutions relating to research, standardization and development, including latest solutions from major industrial consortia. Those solutions together with ours are compared using the travel agency example, which leads us to highlight the contribution of our work (Section 1.2).

• Revising the definition of the WSCAL language supporting the development of dependable composite Web services, based on the formal specification of coordinated atomic actions on which rely our solution and that was introduced in the Appendix of PCE1. This work includes deeper analysis of possible recovery actions, leading to improve some features on the language. In addition, this work gives us a sound base ground towards offering a correct implementation of the associated runtime support (Section 1.3).

• Investigating further advanced fault tolerant support for composite Web services. The WSCAL language currently allows structuring dependable composite services and specifying forward error recovery actions under the occurrence of faults. It is then up to the developer to specify the recovery actions according to the specifics of the application. Further guidelines are offered for this activity in the light of work in the area of multi-database systems. In addition, we have conducted a

12

preliminary study towards offering automated support for undoing the effect of compensatable operations over shared Web services (Section 1.4).

Although the DSoS project is ending, the project’s partners intend to pursue their joint work in the domain of dependable composition of Web services. In this context, we have initiated a prototype implementation of the proposed development support for the sake of experiment and are investigating further extension (Section 1.5).

1.2. Dependability in the Web Services Architecture This section provides an overview of latest work in the area of dependable composition of Web services [Tartanoglu et al. 2003b]. For the sake of clarity, Section 1.2.1 briefly recalls the base Web services architecture, and discusses Web services composition and related dependability requirements, providing an update to the state of the art survey that was offered in Deliverable PCE1. Then, proposed fault tolerance mechanisms for composite Web services are surveyed: Section 1.2.2 overviews transaction protocols for the Web based on backward error recovery, and Section 1.2.3 presents how forward error recovery can be applied to Web services composition processes. Finally, Section 1.2.4 summarizes our analysis. 1.2.1 The Web Services Architecture Although the definition of the overall Web services architecture is still incomplete, the base standards have already emerged from the W3C, which define a core middleware for Web services, partly building upon results from object-based and component-based middleware technologies. These standards relate to the specification of Web services and of supporting interaction protocols. Furthermore, there already exist various platforms that are compliant with the Web services architecture, including .NET, J2EE and AXIS. Figure 1.1 depicts the technology stack of the base Web services architecture, each layer being defined with common protocol choices. Main standards for the Web services architecture being defined by the W3C Web Service Activity and the Oasis Consortium are the following:

• SOAP (Simple Object Access Protocol) defines a lightweight protocol for information exchange that sets the rules of how to encode data in XML, and also includes conventions for partly describing the invocation semantics (either synchronous or asynchronous) as well as the SOAP mapping to an Internet transport protocol (e.g., HTTP).

• WSDL (Web Services Description Language) is an XML-based language used to specify: (i) the service's abstract interface that describes the messages exchanged with the service, and (ii) the concrete binding information that contains specific protocol-dependent details including the network end-point address of the service.

• UDDI (Universal Description, Discovery and Integration) specifies a registry for dynamically locating and advertising Web services.

Figure 1.1 Web services architecture

Composing Web services relates to dealing with the assembly of autonomous components so as to deliver a new service out of the components' primitive services, given the corresponding published interfaces. We use the travel agent case study to illustrate a composite Web service. The travel agent service assists the user in booking complete trips according to his/her requests and is built by composing several existing Web services (e.g., accommodation and flight booking, and car rental Web services) located through a public Web services registry (see Figure 1.2). Each Web service is an autonomous component, which is not aware of its participation into a composition process. A typical scenario for the travel agent service is as follows:

• The user interacts only with the travel agent whose internal computations are hidden from the user. • The user sends a query to the travel agent with the date for the travel and the itinerary. • The travel agent proposes to the user complete trips satisfying his/her request, after querying

appropriate accommodation, flight and car rental Web services. • The travel agent makes all the respective bookings and reservations on Web services, according to the

user's choices, and returns him/her a confirmation.

13

• If a full trip cannot be found, the travel agent tries to offer the user some replacement trips but no action can be taken without his/her approval.

Figure 1.2 A composite Web service example: The travel agent service

In the current Web services architecture, interfaces are described in WSDL and published through UDDI. However, supporting composition requires further addressing: (i) integration of component services and (ii) the specification of the composition. This calls for the abstract specification of Web services behaviours and of their composition that allows reasoning about the correctness of interactions with individual Web services. In addition, the composition process must not only define the functional behaviour of the composite service in terms of interactions with the composed services, but also its non functional properties, possibly exploiting middleware-related services (e.g., services relating to WS-Security for enforcing secure interactions). Various non-functional properties (e.g., availability, extendibility, reliability, openness, performance, security, scalability) should be accounted for in the context of Web services. However, enforcing dependability of composite Web services is one of the most challenging issues due to the concern for supporting business processes, combined with the fact that the composition process deals with the assembly of loosely-coupled autonomous components. Dependability of composite Web services is further a prime requirement due to the poor and unpredictable quality of existing Web services, and the fact that clients of the composite systems are expected to be in the common case, people with poor computer skills. The above aspects are discussed in the following paragraphs.

Figure 1.3 A conversation for the flight booking Web service

Specifying Conversations: To enable a Web service to be correctly integrated in a composition process, Web services conversations (also referred as to choreography) are to be provided in addition to WSDL interfaces for describing the observable behaviour of a Web service by specifying the protocol in terms of message exchanges that should be implemented for correct interaction with the service. As an illustration, a flight booking Web service can publish its behaviour as shown in Figure 1.3: the booking process starts with the call of the Login operation. If the login succeeds, the user can call the FlightRequest operation. Then, if the result returns a valid list of flights, the BookFlight operation is called by sending the FlightId number of the corresponding flight. The

14

conversation ends if (i) the Login operation fails, (ii) the user calls Logout during the booking process or (iii), the BookFlight operation terminates successfully by sending a confirmation message to the user. The W3C Choreography Working Group aims at extending the Web services architecture with a standard for specifying Web services conversations. Existing proposals to the W3C for specifying conversations include WSCL (Web Services Conversation Language) [W3C-WSCL] and WSCI (Web Service Choreography Interface) [W3C-WSCI]. In addition, WSCI can be used to describe additional properties related to the service behaviour (e.g., transactional properties and exceptional behaviour). Note that the specification of conversations, which enriches the definition of Web services interfaces, should be accounted for, for the definition of Web services registries like UDDI. Instances of Web services should be retrieved with respect to the definition of the service's interfaces, observable behaviours and non-functional properties (e.g., transactional behaviour). WSDL extensibility elements allow the Web service interface definition to be extended and can be, for example, used to add information describing conversations and other properties. In addition, WSCL can further be published in UDDI registries for retrieving Web services with respect to the conversations that the services support [OASIS-UDDI].

Figure 1.4 Composition process for the travel agent service

Specifying Composition Processes: Web services are integrated according to the specification of a composition process. Such a process may be specified as a graph (or process schema) over the set of composed Web services (see Figure 1.4). It is worth noting that if some Web services used in a composition process have associated conversation descriptions (e.g., WSCL or WSCI) for the operations they support, the overall composition must conform to all these descriptions. The specification of a composition graph may be:

• Automatically inferred from the specification of individual services as addressed in [Narayanan 2002]. • Distributed over the specification of the component Web services as in the XL language [Florescu

2002]. • Given separately with XML-based declarative languages as BPEL, BPML, CSDL and SCSL

[BPEL4WS, BPMI, Casati 2001, Yang 2002], or in the form of state-charts as undertaken in [Fauvet 2001].

Although there is not yet a consensus about the best approach for specifying composite Web services, it may be anticipated that this will most likely rely on the XML-based specification of a graph over Web services that is decoupled from the specification of the composed Web services. The main reasons that lead to this conclusion include compliance and complementarity with established W3C standards (i.e., WSDL and SOAP), thus providing reusability, openness and extensibility, but also the fact that it is the approach undertaken by most industrial consortia. Dependability Requirements: Composite Web services have high dependability requirements that call for dedicated fault tolerance mechanisms due to both the specifics of the Web services architecture and limitations of the Internet, which is not a reliable media [Kalyanakrishnan 1999]. The autonomy of component Web services raises challenging issues in specifying composition processes and in particular exceptional behaviours of composite services when dealing with faults. These faults include but are not limited to (i) faults occurring at the level of the Web services, which may be notified by error messages, (ii) faults at the underlying platform (e.g., hardware faults, timeouts), and (iii) faults due to online upgrades of service components and/or of their interfaces. In general, the choice of fault tolerance techniques to be exploited for the development of dependable systems depends very much on the fault assumptions and on the system's characteristics and requirements. There are two main classes of error recovery [Anderson & Lee 1981]: backward (based on rolling system components back to the previous correct state) and forward error recovery (which involves transforming the system components into any correct state). The former uses either diversely implemented software or simple retry; the latter is usually application-specific and relies on an exception handling mechanism [Cristian 1989]. It is a widely accepted fact that the most beneficial way of applying fault tolerance is by associating its measures with system structuring units as this decreases system complexity and makes it easier for developers to apply fault tolerance [Randell 1983]. Structuring units applied for both building distributed systems and providing their fault tolerance are well known: they are distributed transactions and atomic actions. Distributed transactions [Gray

15

1993] use backward error recovery as the main fault tolerance measure in order to satisfy completely or partially the ACID (atomicity, consistency, isolation, durability) properties. Atomic actions [Campbell 1986] allow programmers to apply both backward and forward error recovery. The latter relies on coordinated handling of action exceptions that involves all action participants. Backward error recovery has a limited applicability, and in spite of all its advantages, modern systems are increasingly relying on forward error recovery, which uses appropriate exception handling techniques as a means [Cristian 1989]. Examples of such applications are complex systems involving human beings, COTS components, external devices, several organizations, movement of goods, operations on the environment, real-time systems that do not have time to go back. Integrated Web services clearly fall into this category. A typical example that shows the need for a specialized fault tolerance mechanism for the Web services architecture is the issue of running distributed transactions over several autonomous Web services. As an illustration, Figure 1.5 depicts a sequence diagram example associated with the travel agent service. We consider that for a given trip request (expressed by giving the journey dates and the location), the travel agent finds a list of available hotel rooms and flights to the destination. Then, the user selects a complete trip by choosing a hotel and a flight from the list of trips. Once the user has confirmed the booking, the travel agent makes the hotel reservation, which completes successfully and then, attempts for the flight booking. The flight booking Web service returns an error, informing that the server is busy, and after several attempts, the server returns a NoFlightAvailable error message. Several solutions can be applied to handle this type of faults:

• The hotel reservation can be cancelled -if possible. The user is then informed that this trip is no longer available and he/she can retry booking by selecting another trip (e.g., by changing the dates of the journey).

• Alternative transport means instead of flight can be proposed to the user (e.g., a train ticket to a close city and renting a car).

These examples reflect two different fault tolerance mechanisms: (i) backward error recovery with cancellation/compensation and retry, and (ii) forward error recovery with an application-specific exception handler that handles the error without necessarily trying to restore the system state back. Developing fault tolerant mechanisms for composite Web services has been an active area of research over the last couple of years. Existing proposals mainly exploit backward error recovery, and more specifically, transactions. However, the autonomy of Web services and the Web latency have led to exploit more flexible transactional models and forward error recovery techniques, as discussed in the next two sections.

Figure 1.5 Sequence diagram for the travel agent service

1.2.2 Backward Error Recovery for the Web Transactions have been proven successful in enforcing dependability in closed distributed systems and are extensively exploited for the implementation of primitive (non composite) Web services. However, transactions are not suited for making the composition of Web services fault tolerant in general. Work has in particular been undertaken in the area of distributed transaction protocols supporting the deployment of transactions over the Web, while not imposing long-lived locks over Web resources. We discuss here the two main proposals aimed at the Web services architecture: (i) the Business Transaction Protocol (BTP) [OASIS-BPT], and (ii) Web Services Transaction (WS-Transaction) [WST].

16

Figure 1.6 BTP Atomic Business Transaction

BTP introduces two different transaction models for the Web: (i) the atomic business transactions (or atoms), and (ii) the cohesive business transactions (or cohesions). A composite application can be built from both atoms and cohesions that can be nested. In the atomic business transaction model, several processes are executed within a transaction and either all complete or all fail. This is similar to distributed ACID transactions on tightly coupled systems. However, the isolation property is relaxed and intermediate committed values can be seen by external systems (i.e., systems not enrolled in the transaction). Figure 1.6 illustrates the atomic business transaction model using the travel agent service involving a flight booking Web service (Flight) and an accommodation booking Web service (Hotel). In this scenario, the hotel room booking fails while the flight booking succeeds, which leads to the cancellation of the booked flight before the end of the transaction:

• Travel Agent sends the request messages to Flight and to Hotel Web services. • Flight and Hotel respond (Confirm messages) with listings of available flights and hotel rooms. • Travel Agent orders the bookings by initiating commitments (Prepare messages). • Flight Web service returns Prepared and is ready to commit, while the Hotel Web service returns a Fail

error message. Commit is no longer possible on the Hotel Web service for this transaction. • Travel Agent cancels the transaction on the Flight Web service by sending the Cancel order. • Flight Web service confirms cancellation with the Cancelled message.

The cohesive business transaction model allows non-ACID transactions to be defined by not requiring successful termination of all the transaction's participants for committing. A travel agent service scenario example for illustrating cohesive business transactions is given in Figure 1.7, where the flight booking is performed on two distinct Web services. In this example, the transaction, which was originally initiated with three participants, ends with two commits and one abortion:

• Travel Agent sends the request messages to the two flight booking Web services, Air France and British Airways and to the Hotel Web service.

• Web services return response messages to the Travel Agent. • Travel Agent selects Air France for the flight booking, and therefore sends a Cancel message to British

Airways Web service and a Prepare message to the two other Web services. • Air France and Hotel Web services acknowledge with the Prepared message and British Airways

confirms the cancellation with the Cancelled message. • Travel Agent confirms commits (Confirm messages). • Web services acknowledge (Confirmed messages).

17

Figure 1.7 BTP Cohesive Business Transaction

WS-Transaction [WST] defines a specialization of WS-Coordination [WSC], which is an extensible framework for specifying distributed protocols that coordinate the execution of Web services, and that can be used in conjunction with BPEL. Like BTP, it offers two different transaction models: (i) atomic transactions (AT) and (ii) business activity (BA). An atomic transaction adheres to the traditional ACID properties with a two-phase commit protocol. Note that as opposed to the BTP atomic business transactions, the isolation property is not relaxed in WS-Transactions, which as we mentioned before, is not suitable for the majority of Web service applications. The business activity protocol specifically serves coordinating the execution of open-nested transactions over a set of activities, through a coordinator activity. If there is a need for a coordinated activity to be compensated, the coordinator sends compensate messages to all the participants involved in the activity. Then, each participant replies by sending back either a compensated or a faulted message, depending on whether the required compensation operation was successfully completed or not. However, there is no requirement for an agreement on the outcome, and any participant can leave the coordinated activity in which it is engaged, prior to the termination of peer participants. A WS-Transaction business activity example is shown in Figure 1.8, with an Airline Web service and a Hotel Web service:

• The Travel Agent initiates the Business Activity with the Flight and Hotel participants by sending the Request messages.

• The Flight and Hotel Web services enrol in the transaction (Register messages) by returning list of respective availabilities.

• Travel Agent initiates booking on the Flight Web service (Complete message). • The Flight Web service returns Completed to confirm commitment, while the Hotel Web service returns

an error message Faulted and can no longer commit the transaction. • In order to abort the whole transaction and restore the state, the Travel Agent sends a Compensate

message to the Flight Web service that has already completed the (sub-)transaction and a Forget message to the Hotel Web service.

• The Flight Web service compensates the committed transaction by cancelling the booking order and confirms with the Compensated message sent back to the TravelAgent.

• If the Flight Web service cannot compensate the booked operation, it returns an error message Faulted back to the Travel Agent.

• In this case, the Travel Agent sends a Forget message to the Flight Web service. The flight has been booked and cannot be cancelled.

18

Figure 1.8 WS-Transaction Business Activity

Although there is not yet a consensus on a standard protocol for managing transactions on the Web, various implementations of these protocols are already available: the JOTM1 transaction manager and Cohesions 2implement the BTP protocol, and the Collaxa Orchestration Server3 permits the use of the WS-Transaction protocol when composing Web services with the BPEL language. However, solutions to the dependable composition of Web services that use primarily transactions do not cope with all the specifics of Web services. A major source of penalty lies in the use of backward error recovery in an open system such as the Internet, which is mainly oriented towards tolerating hardware faults but poorly suited to the deployment of cooperation-based mechanisms over autonomous component systems that often require cooperative application-level exception handling among component systems. Even with the use of cohesions, which does not necessarily roll back all participant states in case of abortion of one of them, they do not specify any specific way of providing fault tolerance, so everything is left to programmers. Moreover, cancellation or compensation does not always work in many real-life situations, which involve documents, goods, money as well as humans (clients, operators, managers, etc.) and which require application-specific error handling. 1.2.3 Forward Error Recovery for the Web Forward error recovery, using an exception handling mechanism is extensively exploited in the specifications of composite Web services in order to handle error occurrences (e.g., [BPEL4WS], [BPMI], [WSCI]). For instance, in BPEL, exception handlers (referred to as fault handlers) can be associated to a (possibly nested) activity so that when an error occurs inside an activity, its execution terminates, and the corresponding exception handler is executed. However, when an activity is defined as a concurrent process and at least one embedded activity signals an exception, all the embedded activities are terminated as soon as one signalled exception is caught, and only the handler for this specific exception is executed. Hence, error recovery actually accounts for a single exception and thus cannot ensure recovery of a correct state. The only case where correct state recovery may be ensured is when the effect of all the aborted activities are rolled back to a previous state, which may not be supported in general, in the context of Web services, as discussed previously. The shortcoming of BPEL actually applies to all XML-based languages for Web services composition that integrate support for specifying concurrent activities and exception handling. In general, proposed solutions are low-level ones and lack structuring techniques for building composite services in a dependable way. A solution to the above issue lies in structuring the composition of Web services in terms of Coordinated Atomic actions (CA Actions). As discussed in Deliverable PCE1, this has led us to introduce the notion of Web Services Composition Action (WSCA) that differs from CA actions in relaxing the transactional requirements over external resources (which are not suitable for wide-area open systems) and the introduction of dynamic nesting of WSCAs (i.e., nested calls of WSCAs for the sake of modularity). To recall the base principles of WSCA, the

1 http://www.objectweb.org/jotm/ 2 http://www.choreology.com/ 3 http://www.collaxa.com/

19

travel agent service is used to illustrate how WSCAs can be applied for specifying the composition of Web services. We consider joint booking of accommodation and flights using separate hotel and airline Web services. Then, the composed Web service's operation is specified using WSCAs as follows. The top-level TravelAgent WSCA comprises the User and the Travel participants; the former interacts with the user while the latter achieves joint booking according to the user's request through call to the WSCA that composes the Flight and the Hotel participants. Figure 1.9 depicts the software architecture of the travel agent service where rectangles represents components which can be either WSCA participants or Web services, and elliptical nodes represents connectors such as WSCAs connectors and SOAP connectors.

Figure 1.9 WSCA architecture

Figure 1.10 WSCA for composing Web services

A diagrammatic specification of the WSCAs is shown in Figure 1.10. In TravelAgent, the User participant requests the Travel participant to book a flight ticket and a hotel room for the duration of the given stay. This leads the Travel participant to invoke the JointBooking WSCA that composes the Hotel Web service and the Airline Web service. The participants of the JointBooking WSCA respectively requests for a hotel room and a flight ticket, given the destination and departure and return dates provided by the user. Each request is subdivided into reservation for the given period and subsequent booking if the reservation succeeds. In the case where either the reservation or the booking fails, the participant raises the unavailable exception that is cooperatively handled at the level of the JointBooking WSCA denoted by the greyed box in the figure. If both participants signal the unavailable exception, then Travel signals the abort exception so that the exception gets handled by TravelAgent in a cooperation with the User (e.g., by choosing an alternative date). If only one participant raises the unavailable exception, cooperative exception handling includes an attempt by the other

20

participant to find an alternative booking. If this retry fails, the booking that has succeeded is cancelled and the abort exception is signalled to the calling TravelAgent WSCA for recovery with user intervention. 1.2.4 Discussion Compared to the solutions that introduce transactional supports for composite Web services, using WSCAs mainly differs in that it exploits forward error recovery at the composition level, while enabling exploitation of transactional supports offered by individual Web services -if available. Hence, the underlying protocol for interaction among Web services remains the one of the Web services architecture (i.e., SOAP) and does not need to be complemented with a distributed transaction protocol. Similarly to this solution, the one of [Mikalsen 2002] does not require any new protocol to support distributed open-nested transactions. An open-nested transaction is declared on the client side by grouping transactions of the individual Web services, through call to a dedicated function of the middleware running on the client. The transaction then gets aborted by the middleware using compensation operations offered by the individual Web services, according to conditions set by the client over the outcomes of the grouped transactions. The solution offered by WSCA is then more general since it allows forward error recovery involving several services to be specified at the composition level, enabling in particular to integrate non-transactional Web services while still enforcing dependability of the composite service and partial results of a nested action to be reported to the higher level action.

1.3. From Formal Specification of Composite Web Services to Development Support

Deliverable PCE1 introduced a preliminary XML-based language, called WSCAL, for the specification of composite Web services based on WSCA (PCE1-Chapter 1), together with the formal specification of CA Actions using the B method (PCE1-Appendix A). A major focus of our work during the reporting period has been on revising the definition of WSCAL according to the formal specification of WSCAs, which derives quite directly from the one of CA Actions, as well as according to latest work done within related consortia that were discussed in the previous section. This leads us to introduce a definition of WSCAL based on the formal semantics of associated notions [Tartanoglu et al. 2003]. As such, this gives a sound base ground for reasoning about the behaviour of composite Web services developed using WSCAL but also to devise a robust implementation of the associated runtime support that is correct with respect to the B specification. This section is organized as follows. Section 1.3.1 details the development and deployment processes associated with WSCA-based composite Web services. Section 1.3.2 introduces core elements of the revision of WSCAL. Finally, Section 1.3.3 discusses exploitation of the B formal specification for the implementation of robust run-time support for the execution of services developed using WSCAL. 1.3.1 Building WSCA-based Composite Services Given the notion of WSCA, composite Web services are built by specifying their operations as WSCAs. The Web services that are composed are further declared in an abstract way to enable more flexible integration of different Web service instances through dynamic binding. The development process of a WSCA-based composite service then comprises the definitions of: (i) the required abstract interfaces of the composed Web services that are to be integrated, and (ii) the composition process that gives the binding information as well as the WSCAs’ standard and exceptional behaviour . The composite Web service is then deployed on top of a Web service platform associated with a local WSCA runtime support. Service Integration: Required interfaces of components services to be integrated are declared. The abstract interface of any such service is given in terms of WSDL (limited to the abstract part), which defines the messages exchanged with the Web service. It is worth noting that to enable a Web service to be correctly integrated in a composition process, Web services conversations are to be provided in addition to WSDL interfaces. The W3C Choreography Working Group aims at extending the Web services architecture with a standard for specifying Web services conversations, which includes the WSCI and WSCL proposals that were discussed in the previous section. We therefore extend the abstract interface by providing the documents specifying the protocols of interactions assumed by the service (as given by the WSCL or WSCI documents). The latter is optional although it is advisable to provide it for more rigorous specification of services, allowing in particular enforcing a consistent interaction protocol with the service. The service interface is further enriched with the characterization of the service’s transactional behaviour if offered, in a way similar to existing solutions in the area, e.g., [Mikalsen 2002, W3C-WSCI]. Composition Process: The composition process of WSCA-based composite services is expressed as an independent XML document, which specifies the service instances to be integrated together with provided

21

WSCAs, using WSCAL. The Web service instance associated to a given composed service may be either statically set or dynamically retrieved according to the service’s abstract specification. WSCAs define the operations provided by the composite Web service. The definition of a WSCA specifies the behaviour of the WSCA’s participants including the exception handlers that are executed when an exception occurs, which is characterized by fault elements in the Web services architecture. The abstract definition of the interfaces of the composed Web services given in a WSDL document includes the error messages that may be signalled by the Web service’s operations through the wsdl:fault element. These faults are reported as Fault Detail elements or as SOAP headers in the SOAP response messages. Note that SOAP messages can include additional fault elements in addition to the ones that are described in the WSDL document such as interface faults due to a wrong call of the operation (e.g., bad parameter types, or faults related to the mustUnderstand attribute field of SOAP request messages). Additionally, different kind of faults can occur and can be detected at the participant level (e.g., a timeout during a Web service invocation, a local service failure, or an HTTP error). In all cases, all these faults are mapped to an exception raised by a WSCA participant that invoked the Web service. The exception that has been raised, if it cannot be handled locally by the participant, is propagated to all the WSCA’s participants, which execute the associated cooperative exception handler.

Figure 1.11 WSCA Deployment

Deployment of WSCA-based Composite Services: WSCA-based composite services are deployed as Web services on top of a base Web service middleware (see Figure 1.11). Services that provide some WSCA composite services are referenced in a UDDI Web services registry for client use. Given base middleware for Web services, supporting the deployment of WSCA-based composite Web services requires in addition, a local runtime support for running WSCAs. The runtime support should include mechanisms for WSCA creation, synchronization of participants, exception detection and propagation, and dynamic binding of Web services. The next section introduces the revision of WSCAL for the specification of WSCA-based composite Web services, which define the graph of interactions among Web services. The actual implementation of the service composition may then be decoupled from the specification of the composition process, leaving under the responsibility of the developer to check that his/her implementation conforms to the service’s specification. Alternatively, the implementation may be generated from the specification. 1.3.2 Specifying Composite Services using WSCAL The specification of a composite Web service is given by the WSCAL (Web Service Composition Action Language) XML-based language, which includes declarations of the service instances and the behaviour of the supported operations:

<WSCAL name=nmtoken> <services> Detailed below </services> <WSCA operation=nmtoken exceptionTree=anyURI>* Detailed below </WSCA> </WSCAL>

Service Instances: The service instances are given through the services element that refers to the corresponding abstract interfaces of composed Web services:

<services> <service name=qname>* <definition hrefSchema=anyURI /> <staticService instance=anyURI />* | <dynamicService onCall=boolean multiple=boolean />?

22

</service> </services>

Each service may be statically bound to a specific service instance (defined by the staticService element) and/or dynamically bound to an instance matching the abstract definition of the service interface that is given by the corresponding definition attribute (defined by the dynamicService element). In the former case, concrete binding information is provided through the WSDL document associated with the service’s instance. In the latter case, a matching service instance is located at runtime using a location service such as a UDDI service. Dynamic binding of participants with associated Web services may take place either upon first invocation of the service’s WSCAs or upon instantiation of the composite Web service, according to the value of the onCall Boolean attribute of the given participant. Finally, we allow each service to be bound to a set of instances matching the specification of the associated service rather than a single instance for the sake of availability; this is specified using the multiple Boolean attribute in the DynamicService element and by stating as many instances as required with the staticService element. Then, a unique Web service that is available is chosen for the whole WSCA at the first invocation of the corresponding service. A sample of the services element for the TravelAgent composite service is given below. The service in particular offers the JointBooking WSCA that coordinates booking over the Hotel and Airline Web services, for which instances are dynamically retrieved upon invocation of the WSCA.

<services> <service name=“UserBrowser”> <definition hrefSchema=“http://ta.com/TAUser.req”/> <staticService instance=“http://ta.com/Client.wsdl”/> </service> <service name=“FlightService”> <definition hrefSchema=“http://ta.com/Flight.req”/> <dynamicService onCall=true multiple=true /> </service> <service name=“HotelService”> <definition hrefSchema=“http://ta.com/Hotel.req”/> <dynamicService onCall=true multiple=true /> </service> </services>

WSCA Behaviour: The behaviour of the operations offered by the composite service is defined by the WSCA element:

<WSCA operation=nmtoken exceptionTree=anyURI>* <participant name=nmtoken>* <bind service=qname/>* <input name=nmtoken ? message=qname/> <output name=nmtoken ? message=qname/> <fault name=nmtoken message=qname/>* <state>? <xsd:schema .../>* </state> <behavior> <standard> Statements ... </standard> <coordinatedHandler exception=qname>* Statements ... </coordinatedHandler> </behavior> </participant> </WSCA>

In the above, the operation attribute gives the name of the operation being specified and the exceptionTree attribute gives the XML document of the corresponding exception resolution tree. The exception tree is used to resolve the exceptions that are concurrently raised within WSCAs into a single exception, in a way similar to CA actions. Then, a sequence of participant elements are declared, each specifying:

23

• The services with which the WSCA participant interacts through the bind element that gives the name of the services among those defined by the service elements in the embedding services element.

• Parts of the messages associated with the WSCA that are relevant to the specific participant, which is defined using the input, output and fault elements. The fault elements define the exceptions that may be raised by the participants, which require cooperative exception handling and get composed with the exceptions concurrently raised by peer participants using the exception tree.

• The local state of the specific participant through the state element that defines the local variables. • The behaviour of the specific participant using the behaviour element. The participant behaviour

subdivides into the participant’s standard and exceptional behaviour (as defined by the standard and coordinatedHandler elements). Each such behaviour is defined as a process using classical statements, including in particular interaction with the Web service instance(s) associated with the participant, message exchanges with peer participants and exception handling.

• Classical statements that do not relate to exception handling and interaction with services are quite similar to the ones introduced by XML-based languages for the specification of composite services, and in particular to the former version of WSCAL. We thus detail here only base statements of interests regarding WSCAs, and that have been revised for some of them compared to the initial definition of WSCAL. The call statement allows specifying (synchronous) operation calls where the invoked operation may be either local to the embedding composite Web service or provided by the Web services to which the participant is bound (which may be a WSCA if the service is itself composite). The return statement is the dual statement allowing specifying the message to be returned as partial result of the embedded operations, which is to be merged with the results returned by peer participants. The send statement allows specifying the sending of a message to a peer participant whose dual reception may be expressed using either the blocking wait or the non-blocking onInput statement. Finally, the raise statement allows signalling an exception whose handling is specified using the traditional try statement for defining exception handling scopes:

<try>

Statements ... </try> <localHandler exception=qname>* Statements ... </localHandler>

Coordinated Exception Handling: The specifics of WSCAL come from structuring the operations provided by composite Web services as WSCAs that coordinate the execution of composed Web services operations with respect to failure occurrences, in particular introducing the specification of coordinated exception handling. For each participant, the fault elements define the exceptions that may be raised and require cooperative exception handling. When a participant raises an exception, it is first handled locally by a local handler, defined using the localHandler element, while other participants continue their execution normally without being interrupted. If the local handler fails, or if such a handler is not defined, the participant ultimately raises the exception, which is propagated to all participants, leading to coordinated exception handling, defined using the corresponding coordinatedHandler element. In the case where several participants raise exceptions that are concurrently propagated, then the exceptions are resolved using the document defined in the exceptionTree attribute, which imposes a partial order on all exceptions, and all participants handle this unique exception. The exception is then handled cooperatively by participants that can coordinate their execution in the same way as in the standard behaviour, i.e., by message passing, and they can invoke external WSCAs if needed. The WSCA that executes a coordinated exception handler terminates with one of the following outcomes [Xu et al. 1995]:

• Exceptional outcome: All external operations on Web services and nested WSCAs are performed successfully (e.g., all transactions are validated). An exception is signalled to the upper level: <raise exception=qname />

• Abort outcome: The WSCA has aborted and all external operations are aborted or compensated

according to the transactional behaviour of the accessed Web services as specified in the service interface. An abort exception is signalled to the upper level: <raise exception=qname type=abort />

• Failure outcome: The handling has failed and external operations cannot be aborted or compensated.

This can be due to a failure in performing the cancellation operation or if such an operation is not

24

defined. Responsibility for recovery is passed to the caller of the action. A failure exception is signalled to the upper level with information about failed operations: <raise exception=qname type=failure> Fault Details </raise>

Adopting the concept of coordinated exception handling of CA actions allows dealing with different types of faults in a disciplined way inside atomic units and partial results of nested actions to be reported to the higher level action, keeping error propagation under control. The Travel Agency Example: Still considering the travel agency example whose behaviour is graphically depicted in Figure 1.12, a sample of the participant element specifying the Hotel participants behaviour is given below. An example entry of the exception tree document is: <resolve name=“unavailable”> <exception name=“unavailableHotel”/> <exception name=“unavailableFlight”/> </resolve> <resolve name=“unavailable”> <exception name=“unavailableHotel”/> <exception name=“cancelled”/> </resolve>

Figure 1.12 Another composition for the travel agency

Flight participant’s behaviour is not detailed, as it is similar to the Hotel participant. Coordinated booking exploits open-nested transactions of participating Web services. The two participants of the JointBooking WSCA have similar behaviour. The participant first invokes the reservation operation of the Web service to which it is bound and then books the proposed selection -if any- through call to book. Otherwise, the unavailable exception is raised by the reservation operation, leading to retry an alternative reservation, and ultimately propagating the unavailableHotel exception for cooperative handling at the level of the WSCA. Finally, cooperative handling of unavailableHotel by the WSCA amounts to cancelling the performed booking by the peer participant -if any.

<WSCA operation=“JointBooking” exceptionTree=“http://ta.com/TAExcepTree.xml”/> <participant name=“flight”> Not detailed, similar to hotel participant given below </participant>

25

<participant name=“hotel”> <bind service=“HotelService”/> <input message=“InputMsg” /> <output message=“OutputMsg” /> <fault name=“unavailableHotel” message=“unavailableHotelMsg”/> <state> Not detailed </state> <behavior> <standard> <try> <comment text=“Try reserve a room”/> <call service=“HotelService” operation=“reservation” input = “...” output = “...” fault = “unavailableHotel”/> <comment text=“Book the room that was found”/> <call service=“HotelService” operation=“book” input = “...” output = “...” fault = “BookFailed”/> <comment text=“Return booking information”/> <return element=“...”/> </try> <localHandler exception=“unavailableHotel”> Retry booking, propagates unavailableHotel to all participants (for coordinated error recovery) otherwise: <raise exception=“unavailableHotel”/> </localHandler> </standard> <coordinatedHandler exception=“unavailableHotel”> <comment text=“Signal exception to calling action”/> <raise exception=“unavailableHotel”/> </coordinatedHandler> <coordinatedHandler exception=“unavailableFlight”> <comment text=“Compensate action”/> <try> <call service=“HotelService” operation=“cancel” input = “...” output = “...” fault = “CancelFailed”/> <raise exception=“cancelled”/> </try> <localHandler exception=“CancelFailed”> <comment text=“Signal exception to calling action”/> <raise exception=“CancelFailed” type=“failure”> Fault details </raise> </localHandler> </coordinatedHandler> <coordinatedHandler exception=“unavailable”> <raise exception=“unavailable”/> </coordinatedHandler> </behavior> </participant> </WSCA>

26

1.3.3 From Formal Specification to Implementation B is a complete formal method [Abrial 1996] that supports a large part of the development life cycle, from abstract specification to implementation. The B formal method is a model-based method, which is based on set theory and predicate logic and extended by generalized substitutions. B specifications are represented by abstract machines encapsulating operations and states. Generally speaking, the B method allows us to define abstract machines and refinements over them. During the refinement, non-determinism is reduced and preconditions are relaxed, but the interfaces of the operations remain the same. At the end of the refinement process, an implementation can be written, which corresponds to an executable code. Proofs are an essential part of the model: it should be proven that all operations preserve the invariants of the machine, and that the implementations and refinements preserve the invariants and the behaviour of the initial abstract machine. There are various tools that help writing and proving B specifications. The main of them are B-Tool and Atelier B. Both tools include a type checker, an animator, a proof obligation generator, theorem provers, code translators and documentation facilities. Atelier B has been used in our investigation, however the notation we used is compatible with B-Tool. Our goal in providing the B specification of WSCAs is to offer a general framework that can be instantiated to describe the implementation of a composite Web service that is developed using WSCA. The framework thus defines the dependability properties associated with CA actions, which will be enforced for any service based on them. In a way similar to the B formal specification of CA Actions (See Deliverable PCE1-Appendix A), the B formal specification of WSCAs is given by the CAACTIONS abstract machine. The machine extends machines OBJECTS and PARTICIPANTS, which respectively describe composed Web services that can be accessed or modified by a WSCA, and the participants of a WSCA (see Figure 1.13). The CONST machine further contains global declarations and is seen by all the other machines. The CAACTIONS abstract machine defines operations associated with the execution of WSCAs: creation, termination, interaction with Web services, message exchange between participants, and exception handling. Several invariant properties of the WSCAs have been identified and specified. They are written as constraints on the variables of the abstract machines. The state transformations associated with the execution of WSCAs are further defined by preconditioned operations. Invariant properties and operations are specified in such a way that properties of the dependability mechanisms associated with WSCAs are enforced.

Figure 1.13 Structure of the B specification

The WSCAL language hides the details of the dependability mechanisms such as automatic exception propagation, and more generally the behaviour of the operations described in the B specification and that will be executed during runtime. Furthermore, it enables static analysis to be performed in order to verify the structural properties of a given system described in the invariant of the specification. We are currently working on the implementation of the WSCAL runtime support from the B specification. In order to have an implementation of the language’s runtime support, the abstract machines are to be refined. At the end of the refinement process, we have a set of executable codes that correspond to the implementation of the operations defining the language statements, offered as a class library. Note that, when running WSCAs, some existing libraries such as SOAP-related functions are used. For all these libraries, what is usually known is the interfaces of the offered methods. In order to be able to prove the correctness of the implementation, it would be necessary to: (i) have the formal specification of the behaviour of these methods and (ii) prove that the refinements of the machines that use these methods are correct (in the B sense). During the refinement, the non-

27

determinism will be reduced (e.g., introduction of message queues). The pre-conditions have to be relaxed to take into account all these possible cases.

1.4. Enforcing atomicity of Composite Web Services The previous section has introduced a language for structuring composite Web services in a way that enforces dependability of the resulting services. It is then up to the developer to specify the exceptional behaviour of the service in the presence of faults, according to the specifics of the target application and in particular of the composed services. This task may further be assisted by providing additional built-in fault-tolerance mechanisms for error recovery according to the transactional behaviour of the composed Web services. In particular, we are interested in the inference of the actions to be performed in order to enforce semantic atomicity of composite Web services according to the services’ standard behaviour and the transactional behaviour of composed services (e.g., compensatable operations). Towards that goal, we have undertaken two complementary studies:

• Exploitation of transactional models offered for multi-database systems to specify the exceptional behaviour of composite services, according to the transactional behaviour of their components. Preliminary results of this study are reported in Section 1.4.1; proposed transactional models can be directly exploited by developers while it is our long-term objective to offer automated support for such a task through enhancement of WSCAL.

• Offering automated support for undoing actions performed over composed services, leading to automatically infer compensation actions according to the specification of the transactional behaviour of the composed services. In this context, we have started investigating work done in the area of collaborative work, which allows us to give some hints on how to transpose it into the framework of composite Web services, as reported in Section 1.4.2.

1.4.1. Guidelines for specifying forward error recovery based on transactional models for multi-

database systems A multi-database system (MDBS) [Breitbart et al., 1992] integrates a heterogeneous set of existing and autonomous databases (local DBMS) that provide a set of operations through their interfaces. Local DBMS are autonomous [Du and Elmagarmid, 1989] in the sense of that: (i) they remain unchanged (design autonomy), (ii) transactions are executed independently at each site (execution autonomy), and (iii) they are not supposed to cooperate with each other (communication autonomy). A global transaction is executed by a MDBS as a set of sub-transactions, each of which is a local transaction executed by a local DBMS. Hence, the issue of error recovery in MDBS naturally matches the one of composite Web services since composed Web services may be considered as local DBMSs from the standpoint of error recovery. The main difference lies in the fact that the interfaces considered for local DBMSs amount to primitive transactional operations (begin, commit, abort) while the interfaces of component Web services are complex service-oriented operations (e.g., booking a flight) and are not necessarily transactional. Transactional models for MDBS: Executing global transactions raises several issues [Breitbart et al. 1992]. First is related to ensuring global serializability. Second is the global atomicity problem that arises due to the autonomy of local DBMS, when a DBMS aborts a sub-transaction before the global transaction commits. Finally, there is a global deadlock problem. Several solutions are investigated in the literature to address these problems. For the serializability problem, solutions adopt either a pessimistic approach where the global transactions are delayed to avoid serialization graph cycles, or an optimistic approach where cycles are detected and broken by aborting global transactions. These approaches cause a low degree of concurrency or a large number of aborts. Furthermore, several correctness criteria (e.g., local serializability, two-level serializability) have been introduced to weaken the global serializability notion. For the atomicity problem, two types of actions exists:

• Forward recovery that ensures full atomicity (redo and retry), and • Backward recovery (compensate) that uses a weaker notion of atomicity.

The redo technique re-runs a new sub-transaction consisting of the failed write operations of the sub-transaction. The retry technique runs the entire sub-transaction that has aborted as a new sub-transaction. To provide full atomicity the sub-transaction must not have data dependencies with other sub-transactions of the global transaction and must eventually commit. The compensate technique consists of running sub-transactions in order to compensate effect of committed sub-transactions. Since effects of previously committed sub-transactions can be read by external transactions, the atomicity is said "semantic atomicity" [Garcia-Moline, 1983]. Semantic atomicity is ensured if all sub-transactions either commit or compensate. It does not guarantee that the state will be rolled back, but a semantically equivalent state can be reached. The compensating operation may be an inverse operation of the sub-transaction or a set of other operations. To provide full atomicity, effects of the compensated sub-transactions on other sites must be compensated as well and require thus semantic knowledge

28

of the application. Several transactional models combining both techniques (redo, retry and compensate) that provide semantic atomicity have been proposed:

• Sagas [Garcia-Molina and Salem, 1987] refer to transactions composed by a set of semantically atomic sub-transactions. A compensating transaction is specified for each sub-transaction that must be run if one sub-transaction of a Saga aborts. The concept of Saga is further extended in [Chrysanthis and Rammamritham, 1992] by introducing in the model sub-transactions that can be retried, replaced and ignored.

• Flex Transactions [Elmagarmid et al., 1990] and S-Transaction [Veijalainen et al., 1992] allow specifying alternate but functionally equivalent sub-transactions to be run when a sub-transaction aborts.

• [Mehrotra et al. 1992b] introduce global transactions that are composed of either compensatable or retriable sub-transactions, and of one pivot (non-compensatable and non retriable) sub-transaction. The compensatable sub-transactions must be committed before the pivot, which in turn must commit before the retriable sub-transactions. If the pivot aborts, all the compensatable sub-transactions are compensated for; otherwise, the retriable sub-transactions are attempted until they commit.

• Flexible transactions [Zhang et al. 1994] extend the above protocol by allowing multiple pivot sub-transactions to be specified. In this model, compensatable sub-transactions must commit pivot sub-transactions commits, which must precede retriable sub-transaction commits. In addition a partial order is specified on set of sub-transactions, each set defining an alternative execution for the transaction. Thus, when a sub-transaction aborts, some alternative sub-transaction execution can be chosen to complete the transaction.

• The ConTract model [Wachter and Reuter 1992] can be used to specify global transactions in a formal way. A ConTract is composed of atomic units of execution called "step". The behaviour of each step is specified by a script. Partial results of terminated steps can be viewed by external activities before the commitment of the ConTract. For each step a compensation step must be defined, which can be used when a step aborts and can define cascading compensations. When a recovery takes place, the ConTract can continue its execution from where it was interrupted. Consistency constraints are used by specifying invariants for concurrency control.

Forward error recovery in composite services based on MDBS transactions: The above transactional models for MDBS can conveniently be exploited by developers for specifying forward error recovery associated with composite Web services, using WSCAL, based on the transactional behaviour specified for each composed service. More specifically, the developer is able to choose among existing models, the one that is the most appropriate for a given composite service according to the specific behaviour of the composed services. In addition, it is our long-term goal to offer automated support for such a task through enhancement of WSCAL. We are in particular interested in offering automated support for undoing actions performed over services, leading to automatically infer compensation steps (using the terminology of the aforementioned ConTract model) according to the specification of the transactional behaviour of the composed services. In this context, we have started investigating a solution based on work in the area of collaborative work, as addressed in the next section. 1.4.2. From systematic undo in collaborative applications to semantic atomicity in composite Web

services Cancelling or reversing the effect of a former action is a necessity in most interactive systems. The simplest and most frequent form of this facility is the “undo” command that is available in our usual, individual, text or graphic editors. It provides a way of performing the so-called “linear undo”: Only the last action is undoable, and then the previous one, and so on; Moreover, there are cases where it is impossible to undo this last action. The implementation is based on some history buffer coupled with a “redo stack” (the history buffer behaving as a stack, as well). The “undoing” itself is realized either via state recovery or via the relevant reverse action. As soon as collaborative work is considered, as it is the case for distributed group editors, linear undo is no more suitable since the notion of a last action is not always meaningful. Within this framework, the so-called “selective undo”, which allows selecting and cancelling any (or rather some…) former action, has received a lot of attention [Karsenty and Beaudouin-Lafon 1993, Berlage 1994, Prakash and Knister 1994, Dix et al. 1997, Sun et al. 1998, Ressel and Gunzenhäuser 1999, Sun 2000]. There are some similarities between cooperative work and composite Web services: component Web services that are concurrently accessed may be treated as shared documents for undoing former actions as discussed below. Among the latest results on undoing in group editors, the transformational model presented in [Sun 2000] seems suitable for generalization to other kinds of distributed systems since it avoids completely state recovery and allows the selection and cancellation of any

29

former operation in the history buffer. We present some relevant aspects of this model, and then we give some hints on how to transpose it into the framework of composite Web services. Doing and undoing in collaborative work: The transformational model considers three meta-commands, namely do(O), undo(O) and redo(O), where O is an instance of any operation of the collaborative environment. We briefly presents the way do(O) is dealt with, since it is a necessary introduction to the way undo(O) is realised. The way redo(O) is performed is not addressed here because currently it does not seem to be of interest for composite Web services. Besides, we omit the classical issue of the preservation of causal ordering, which can be solved by well-known techniques [Raynal and Singhal 1996]. There is a distinction between the site where an operation O is generated and immediately executed, and the other sites where the do(O) command is received later on, very likely in a different context. An important notion is intention-preservation: The execution of a do(O) in a remote site must achieve the same effect as executing O at its original site, at the time of its generation. This is realized by some transformations of the parameters of O in order to take into account the difference of context, which is due to the fact that different operations may have been done at the receiving site. The classical example for introducing operation transformation considers the same initial state “abc” for two users. User 1 generates O1 = insert[2,X]. His or her intention is to insert X between “a” and “bc”. Concurrently, user 2 generates O2 = insert[3, Y]. His or her intention is to insert Y between “ab” and “c”. The execution of the command do(O2) when received by user 1 must transform O2 into O’2 = insert[4, Y], including in O2 the impact of O1, before executing it. This is achieved by the so-called “inclusion transformation” IT: O’2 = IT(O2, O1). Conversely, in some cases it is necessary to exclude from the operation Ox to be done the impact of some other operation Oy via the “exclusion transformation” ET(Ox, Oy). It is worth noting that very often IT and ET yield identity, as it is the case, for instance, for O1 when received by user 2. IT and ET must be defined for every couple of operations of the collaborative system. There are two straightforward generalizations to lists L of operations whose impacts must be included (LIT(O,L)) or excluded (LET(O,L)) before doing operation O. Undo is realised on the basis of the above transformations, under the basic assumption that for any operation O a reverse operation noted O is available. Let us consider the case where Undo(Oi) is generated or received at site k, with history buffer HBk = O1…OiOi+1…On. It will be performed by the execution of O’i obtained by transformation of Oi such that: O1…OiOi+1…On O’i has the same effect as O1…Oi Oi O’i+1…O’n The transformation of Oi into O’i includes the impacts of Oi+1…On, and the transformation of Oi+1…On into O’i+1…O’n excludes the impact of Oi. The details of the transformations are given in [Sun et al. 1998]. Thus, at run-time, the following treatments are performed:

- O’i is computed and executed, with O’i = LIT(Oi , HBk[i+1, n]) - The new history buffer is computed, namely: O1…Oi*O’i+1…O’n.

Oi is still there, but marked as a “do-undo pair”. This allows an efficient treatment of possible subsequent Redo(Oi). Undoing in composite Web services via operational transformations: In composite Web services, doing and undoing operations of component services is slightly easier because there is no distributed common document to maintain in a consistent state. However, there remains the problem that some operations requested by other sources can be performed by the component service. These operations must be taken into account when undoing even the last operation previously requested to a component service. Let us consider the following example: a composite Web service, CWS, sends operations O1c and then O2c to the WS Web service; between the executions of O1c and O2c by WS, WS performs operations O1o and then O2o requested by some other sources than CWS; moreover, after O2c, it performs O3o. Then it receives the command Undo(O2c) from CWS. The history buffer of WS is O1c O1o O2o O2c O3o. Following the transformational approach, if we assume that WS provide a reverse operation O2c for O2c and the IT and ET transformations, the undo command can be realised by

- Execution of IT(O2c, O3o), including the impact of O3o and - Transformation of the history buffer into O1c O1o O2o O2c* ET(O3o, O2c)

However, the operations of Web services are of a different nature than the operations of group editors. It turns out that IT and ET may require a bit more than modifications of parameters. Let us consider the following example from the travel agent case study: O2c is a successful flight reservation, and O3o is another reservation for the same flight. If O3o has been successful, then IT(O2c, O3o) = O2c and ET(O3o, O2c) is just O3o. If O3o has not been successful, then the pending reservation must be removed from the waiting list and satisfied, in order to get the same result as if O2c was never executed. In this case ET(O3o, O2c) consists in modifying the indication that O3o was not satisfied. We see that the history buffer also changes of content since it needs to

30

contain more than a simple list of executed operations: for instance, indications on the success or failure of the reservation operations. This model provides an elegant solution for undoing in distributed system under the following conditions:

- For any operations, there is a reverse one - Every component Web services manages an history buffer - Every component Web service defines IT and ET for all couples of operations.

These requirements correspond to a new example of the so-called “transactional attitudes”, which are presented as some essential ingredients for the composition of autonomous Web services by several authors [Mikalsen 2002, Pires et al. 2002] They are practicable if the vocabulary of operations of each Web services is not too large. Besides, there is a significant constraint of efficiency: the Web service must be locked during the computation of the transformations of the reverse operation and of the history buffer. Thus they must not take too much time. These conditions are likely to be fulfilled in many cases. This model and the corresponding transactional attitude seem worth to explore.

1.5. Conclusion The Web services architecture is expected to play a prominent role in the development of next generation, open systems of systems. In that context, the DSoS project has started investigating base solutions towards enforcing dependability of composite services. As discussed in previous Deliverable PCE1, the proposed approach lies in structuring composite services in terms of cooperative atomic actions, based on the CA Actions notion that was developed at the University of Newcastle, leading to introduce the WSCAL XML-based language for the specification of composite services. Work during the reporting period and summarized in this chapter has focused on consolidating our solution. In particular, the extensive research work in the area has led us to conduct a thorough survey of existing solutions so as to clearly establish where our contribution stands, including how it complements existing and emerging standards. We have further proposed improvement of WSCAL, compared to the version that was introduced in Deliverable PCE1, in the light of the formal specification of related notions. Finally, we have started investigating automated support for the inference of recovery actions to be performed in the presence of faults, according to the transactional behaviour of the composed services, so as to further assist the developer in his/her task.

31

Chapter 2 – Architectural Mismatch Tolerance Rogerio de Lemos (University of Kent), Cristina Gacek, Alexander Romanovsky (University of Newcastle) The integrity of complex software systems built from existing components or component systems is becoming more dependent on the integrity of the mechanisms used to interconnect these components and, in particular, on the ability of these mechanisms to cope with architectural mismatches that might exist between components. There is a need to detect and handle (i.e. to tolerate) architectural mismatches during runtime because in the majority of practical situations it is impossible to localize and correct all such mismatches during development time. When developing complex software systems, the problem is not only to identify the appropriate components, but also to make sure that these components are interconnected in a way that allows mismatches to be tolerated. The resulting architectural solution should be a system based on the existing components, which are independent in their nature, but are able to interact in well-understood ways. To find such a solution we apply general principles of fault tolerance to dealing with architectural mismatches.

2.1. Introduction Software architecture can be defined as the structure(s) of a system, which comprise software components, the externally visible properties of those components and the relationships among them [Perry & Wolf 1992] [Shaw & Garlan 1996]. A software architecture is usually described in terms of its components, connectors and their configuration [Medvidovic et al. 1999]: components represent computation units, connectors correspond to the communication protocols, and configurations characterize the topology of the system in terms of the interconnection of components via connectors. As a result of combining several architectural elements using a specific configuration, architectural mismatches may occur [Garlan et al. 1995]. Architectural mismatches are logical inconsistencies between constraints of various architectural elements being composed. An architectural mismatch occurs when the assumptions that a component makes about another component, or the rest of the system, do not match. That is, the assumptions associated with the service provided by a component are different from the assumptions associated with the services required by a component for behaving as specified [Medvidovic et al. 1999] [Gaudel et al. 2003]. These assumptions can be related to the nature of components and connectors (control and data models, and synchronization protocols), the global system structure, or the process of building the system [Garlan et al. 1995] [Shaw & Garlan 1996] [Gaudel et al. 2003]. Traditionally, mismatches have been dealt with statically [Egyed et al. 2000] [Gacek 1998] , by means of analysis and removal. For example, a formal approach has been advocated to uncover architectural mismatches in the behaviour of components, in particular, deadlocks [Compare et al. 1999] [Inverardi et al. 1997]. There are many reasons to support our claim that it is usually non-practicable to statically localize and correct all possible architectural mismatches, and because of this, we believe that it is vital to be able to build systems that can tolerate such mismatches. This is mainly due to the complexity of modern systems and restricted applicability of the static methods of correcting mismatches (c.f. software design faults). First of all, complex applications have complex software architectures in which components are interconnected in complex ways and have many parameters and characteristics to be taken into account, and they have to meet many functional and non-functional requirements that often have to be expressed at the level of software architecture. Secondly, architects make mistakes while defining software architectures, in general, and while dealing with mismatches, in particular. Thirdly, there is a strong trend in using off-the-shelf elements while building complex applications and because of the very nature of such elements some information about their architectural characteristics may be unavailable. Lastly, modern software systems are to be open, flexible and adaptive, and they may undergo dynamic reconfiguration (often by incorporating new components knowledge about which is not available offline), adding uncertainty about the various architectural elements present at any point in time. Instead of dealing with architectural mismatches during development time, which is the conventional approach, this chapter shows how these mismatches can be tolerated during runtime at the architectural level. The rest of the chapter is structured as follows. Section 2.2 discusses architectural mismatches in the context of features that are associated either with the architectural elements or the application being represented by these elements. In Section 2.3, we present some basic dependability concepts that provide the basis for the following discussion (in Section 2.4) on architectural mismatches from the perspective of system dependability. In Section 2.5, we address the notion of mismatch tolerance by discussing in more detail its basic activities. Section 2.6 presents

32

several simple examples that demonstrate the proposed approach. Finally, Section 2.7 concludes with a summary of the contribution and a perspective of future research.

2.2. Background To understand better the ways of tolerating architectural mismatches we will look first into specific characteristics of the individual architectural elements to be composed into a system, as well as into the reoccurring architectural solutions (i.e. architectural styles) applied for building system architectures. 2.2.1. Architectural Features The architectural mismatches occur because of inconsistencies among given architectural elements. These inconsistencies can be stated in terms of the features (i.e. characteristics or properties relevant to system composition) exhibited by the architectural elements to be integrated into the system. Such features have proven to be very useful for static mismatch detection [Gacek 1998]. A considerable number of such features have been determined while studying system composition from the viewpoint of detecting architectural mismatches during system development to allow systems to be corrected by removing mismatches. Concurrency, distribution, supported data transfers (e.g. via shared data variables, explicit data connectors, shared repositories, etc.), dynamism (system ability to change dynamically its topology), encapsulation (provision of well-defined interfaces), layering, backtracking, and re-entrance are some of the examples of the architectural features relevant to possible mismatches. A very useful source of such features can be found in research on online system upgrading, where, for example, additional (meta-) information describing component behaviour is used to deal with component interface upgrades [Jones et al. 2002]. Another dimension of the analysis proposed in [Gacek 1998] is relevance of such features to the particular architectural styles (such as pipe-and-filter, blackboard, etc.) employed in building system architecture: it is clear that some of these features are not applicable to some particular styles. For example, the pipe-and-filter style assumes multithreaded concurrency, no backtracking or re-entrance, while the blackboard style assumes backtracking, imposes no restrictions on types of concurrency and assumes no re-entrance. The overall idea here is that by analysing the characteristics of the architectural elements to be integrated and the styles from which these elements were derived, the system architects are able to localize architectural mismatches earlier in the life cycle. Some examples of architectural mismatches that can be detected by analysing the architectural features are [Gacek 1998]:

- Data transfer from a component that may later backtrack – this mismatch may cause undesired side effects on the overall composed system state.

- Call to a non-re-entrant component - this mismatch may happen when system composition is achieved via a bridging (triggered) call, and the callee is not re-entrant and is already running at the time of the call.

- Sharing or transferring data with different underlying representations - this mismatch happens when sharing or transferring data with different underlying representations, including differences in data formats, units and coordinate systems.

2.2.2. Style-specific and Application-specific Mismatches Architectural features of architectural elements and their groupings may be inherent to the architectural style(s) used, or specific to the application at hand. This occurs because architectural styles impose constraints on the kinds of architectural elements that may be present and on their configurations [Shaw & Garlan 1996], yet they do not prescribe all the features that may be present in an application [Gacek 1998]. During software development, the software architecture is incrementally refined following the refinement of the system definition. Initially, the software architecture is defined in terms of architectural styles, thus binding the style-specific features. Subsequently, as the architecture is further refined towards the life-cycle architecture, application-specific features are bound. This is exemplified on Table 2.1 (adapted from [Gacek et al. 1995]). In the following we will refer to the architectural features pertinent to particular architectural styles as style-specific features. A set of such features is defined in [Gacek 1998]. The features that are defined by the characteristics of the application to be developed but not by the architectural styles employed are called application-specific features. Every time an architectural feature is bound, there is a potential for an architectural mismatch to be introduced. Hence, we refer to architectural mismatches as being:

33

- style-specific - if their presence is brought about by some architectural feature(s) that the style(s) imposes, or

- application-specific - if their presence is due to architectural decisions imposed by the application at hand but not the particular style(s) used.

Early Cycle 1 End of Cycle 1 Cycle 2 Cycle 3 Definition of operational concept and

system requirements

Determination of top-level concept of operations

Determination of top-level concept

of operations

Determination of detailed concept

of operations

Determination of IOC

requirements, growth vector

Definition of system and

software architecture

System scope/ boundaries/ interfaces

System scope/ boundaries/ interfaces

Top-level HW, SW, human

requirements

Choice of life-cycle architecture

Elaboration of software architecture

No explicit architectural

decision

Small number of candidate

architectures described by architectural

styles

Provisional choice of top-level information architecture

Some components of

above TBD (low-risk and/or deferrable)

Binding of architectural

features

No architectural

features explicitly defined

Fixed architectural

features that are defined by

architectural styles, others are

unknown

Architectural features defined by architectural

styles are fixed as are some

application specific ones,

others are unknown

Most architectural features are

fixed, the few unknown ones

relate to parts of the architecture

still to be defined

Table 2.1. Refinement of software architecture under a Spiral Model Development

Identification of the nature of the architectural mismatches, as well as the nature of the architectural features causing these inconsistencies among architectural elements plays a vital role in developing approaches to tolerating such mismatches.

2.3. Dependability Dependability is a vital property of any system justifying the reliance that can be placed on the service it delivers [Laprie 1995]. The causal relationship between the dependability impairments, that is, faults, errors and failures, is essential for characterizing the major activities associated with the dependability means (fault tolerance, avoidance, removal and forecasting). A fault is the adjudged or hypothesized cause of an error. An error is the part of the system state that is liable to lead to the subsequent failure. A failure occurs when a system service deviates from the behaviour expected by the user. Fault tolerance is a means for achieving dependability working under assumptions that a system contains faults (e.g. ones made by humans while developing or using systems, and caused by aging hardware) and aiming at providing the required services in spite of them. Fault tolerance is carried by error processing, aiming at removing errors from the system state before failures happen, and fault treatment, aiming at preventing faults from being once again activated [Laprie 1995]. Error processing typically consists of three steps: error detection, error diagnosis and error recovery. Error detection identifies an erroneous state in the system. Error diagnosis assesses the damage caused by the detected error, or the errors propagated before detection. Error recovery transforms a system state that contains errors into an error free state. Recovery typically takes forms of either backward error recovery or forward error recovery. When the former is applied the system is returned to a previous (assumed to be correct) state; the typical techniques used are application-independent and often work transparently for the application (e.g. atomic transactions and checkpoints). Forward error recovery intents to move the system into a correct state using knowledge about the current erroneous state; this recovery is application-specific by its nature. The most general means for achieving it is exception handling [Cristian 1989]. Fault treatment consists of two steps: fault diagnosis and system repair. Fault diagnosis determines the causes of the error in terms of both location and nature. System repair consists of isolating the fault to avoid its

34

reactivation, reconfiguring the system either by switching on spare components or reassigning tasks among non-failed components, and reinitialising the system by checking, updating and recording the new configuration [Anderson & Lee 1981]. The process of repairing the system usually modifies its structure in order for the system to continue to deliver an acceptable service. Providing system fault tolerance plays an ever-growing role in achieving system dependability as there are many evidences proving that it is not possible to rid the system and system execution from faults. These include the growing complexity of software causing programmers’ bugs, operators’ mistakes, and failures in the environment in which the system operates.

2.4. Dependability and Mismatches In the context of dependability, an architectural mismatch is an undesired, though expected, circumstance, which must be identified as a design fault (in the terminology from [Laprie 1995]). When a mismatch is activated, it produces an error caused by mismatch (ECM) that can either be latent or detected. Similarly to errors, only a subset of ECMs can be detected as such (see Figure 2.1). Additional information is needed to allow an error to be associated with a mismatch. Eventually, there is a system failure when the ECM affects the service delivered by the system.

Fig. 2.1. Detected errors caused by mismatches

For describing the means for dealing with architectural mismatches, we draw an analogy with faults, which can be avoided, removed or tolerated. Faults are tolerated when they cannot be avoided, and their removal is not worthwhile or their existence is not known beforehand. The same kind of issues happens with architectural mismatches. Mismatches can be prevented by imposing strict rules on how components should be built and integrated. Mismatches can be removed when integrating arbitrary components by using static analysis methods and techniques [Gacek 1998]. However, this does not guarantee the absence of mismatches since risk and cost tradeoffs may hinder their removal, or system integrators may not be aware of their existence (similarly, research has shown that residual faults in software systems are inevitable). Consequently, mismatches should be tolerated by processing ECMs and treating mismatches, otherwise the system might fail. In the following, before presenting mismatch tolerance, we discuss in more detail what is mismatch prevention and mismatch removal. 2.4.1. Mismatch Prevention The approaches associated with mismatch prevention attempt to protect a component, or the context of that component, against potential mismatches by adding to the structure of the system architectural solutions. The assumption here is that the integrators are aware of all incompatibilities between system components. For example, if the architectural style of a particular component does not fit the style of the system in which is to be integrated, then a specialized wrapper can be developed as a means of fixing architectural incompatibilities [Perry & Wolf 1992]. There are three classes of structuring techniques for dealing with architectural mismatches, all of which are based on inserting code for mediating the interaction between the components [DeLine 1999]:

- Wrappers – which are a form of encapsulation whereby some component is enclosed within an alternative abstraction, thus yielding to an alternative interface to the component;

- Bridges – which translate some of the assumptions of the components interfaces. Different from a wrapper, a bridge is independent of any particular component, and needs to be explicitly invoked by an external component;

- Mediators – which exhibit properties of both wrappers and bridges. Different from a bridge, a mediator incorporates a planning function that results in the runtime determination of the translation. Similar to wrappers, mediators are first class software architecture entities due to their semantic complexity and runtime autonomy.

ECMs Detected ECMs

Errors

Detected Errors

35

2.4.2. Mismatch removal The approaches associated with mismatch removal are those that aim at detecting architectural mismatches during the integration of arbitrary components [Gacek 1998]. Existing approaches for identifying architectural mismatches are aimed for the development of software, either during the composition of components while evaluating the architectural options [Gacek 1998], or during architectural modelling and analysis [Egyed et al. 2000]. The Architect's Automated Assistant (AAA) approach uses automatic static analysis for performing early risk assessment for the purpose of detecting mismatches during component composition [Gacek 1998]. It is an approach that supports rapid evaluation of components with respect to potential incompatibilities among them. The software integrator gathers the information for the analysis, known as architectural features, from the system requirements and the specification of the components. On the other hand, the technique for architectural modelling relies on the specification of component invariants and services for analysing the architectural conformance of its components. For example, the behavioural conformance of the pre- and post-conditions of two components can be analysed using a model checking tool [Egyed et al. 2000] [Medvidovic et al. 1999]. The above two techniques, evaluation of architectural options and architectural modelling, are argued to be complementary because the former is able to detect mismatches very early during development, while the latter performs a more detailed and precise analysis of component mismatch. The techniques being proposed by these approaches are so specific to the context of software development that they cannot be transposed for runtime detection of error caused mismatches (ECMs). For example, how can we detect during runtime whether components have single or multiple threads, and how can we identify inconsistencies between the pre- and post-conditions among operations of interacting components? However, although it is difficult, in general terms, to relate the detection of errors to specific architectural mismatches that have caused them, it is nevertheless feasible to associate some (detectable) errors to architectural mismatches that may occur in the components’ behaviour, their interfaces, or interaction protocols. For example, a mismatch may occur in the naming of an operation or message, or in the number, ordering, type, and units of parameters [Taylor et al. 1996].

2.5. Architectural Mismatch Tolerance The main motivation for specifying mechanisms for tolerating architectural mismatches at the architectural level, instead of the implementation level, for example, is that the nature of mismatches and the context in which they should be fixed would be lost at the later stages of software development. Making an analogy with fault tolerance, it has been shown that the same type of problem exists when exception handling is not considered in the context of the software life cycle [de Lemos & Romanovsky 2001]. Moreover, we cannot expect that a general runtime mechanism would be able to handle a wide range of architectural mismatches, in the same way, as there is no sufficiently general fault tolerance mechanism that can handle all classes of faults. It is envisaged that different classes of architectural mismatches will require different types of detection mechanisms and fixes that have to be specified at the architectural level. Although the goal is to tolerate architectural mismatches at the architectural level, it is nevertheless necessary to deal with two levels of abstraction: the architectural level, where the mismatches are actually introduced, and the execution level, where ECM processing and mismatch treatment take place. ECM processing comprises three steps [Laprie 1995]:

- Detection of ECMs, which identifies erroneous states that are caused by mismatches; - Diagnosis of ECMs, which assesses the system damages caused by the detected ECMs; - Recovery from ECMs, which brings the system to an ECM-free state.

However, ECM processing is not sufficient if we would like to avoid the recurrence of the same architectural mismatch, so there is the need to treat mismatches, in the same way as faults are treated [Laprie 1995]. Mismatch treatment involves two major steps:

- Mismatch diagnosis, which determines the cause (localization and nature) of the ECM; - System repair, which prevents a new activation of the architectural mismatch; it is performed by

isolating the mismatch, and reconfiguring and reinitialising the system, in order to continue to provide an adequate, perhaps degraded, service.

The intent of fault tolerant techniques is to structure systems to inhibit the propagation of errors, and to facilitate their detection and the recovery from them. Similarly, when dealing with architectural mismatches, there is the need to structure systems at the architectural level in a way that prevents propagation of ECMs, facilitates ECM detection and recovery, and makes it difficult for the architectural mismatches to be reactivated.

36

In addition to system structuring, there is also the need for documenting architectural features of the system, as discussed in Section 2.2. This information is fundamental for distinguishing ECMs from other system errors, architectural mismatches from faults, and for choosing features suitable for tolerating style- and application-specific architectural mismatches. If little or no information is made available at the architectural level, either as interface properties of architectural elements or error codes, then this distinction cannot be characterized. For example, if there is no information about the types of data transferred between two architectural elements but the producer and the consumer assume different types (e.g. measurement units) the following situations are possible:

- An error is detected by the consumer but because there is not enough information it cannot be identified as an ECM, so unsuitable fault tolerance measures are applied (e.g. rollback);

- An error is further propagated outside the consumer and detected by other components. In this case without additional information it is impossible to identify the damage area to be recovered;

- The ECM is not detected and the system fails to deliver the service.

In order to provide the basis for defining an architectural solution for tolerating mismatches, in the rest of this section we present in more detail the activities associated with ECM processing and mismatch treatment. For each of the activities, we take into consideration whether architectural features, both style- and application-specific, are incorporated into the architectural description of a system. 2.5.1. ECM Processing As previously discussed, the detection of an ECM implies the presence of an architectural mismatch. The activation of a mismatch causing a system error depends on whether some conditions are satisfied, these conditions are related to inconsistencies in architectural features. In the following, we present in more detail the different activities associated with ECM processing. 2.5.1.1. ECM Detection Upon error detection, one must first determine whether that particular error can be identified as an ECM. For an error to be detected as an ECM we need additional information at runtime about the system states and the features of the relevant architectural elements that would enable to identify this particular error as an ECM. This ought to be done based on the detected error and on the presence of the conditions required for activating the architectural mismatch. The identification of an error as an ECM will facilitate the process of error recovery, in particularly if the error can be differentiated as being either caused by an application- or style-specific mismatch. For both types of ECMs, error codes should be provided as an outcome of a failed operation, and these codes should be related to architectural features of the system (as it will be seen in the examples in Section 2.6). Provision of an error code to an error caused by a style-specific mismatch could be related to the execution of an operation that violates the properties of an architectural notation, for instance, when in a non-re-entrant pipe-and-filter architecture a filter sends data to another filter that is already processing data from other source. On the other hand, provision of an error code to an error caused by an application-specific mismatch could be related, for example, to the semantic discrepancy of data received from other component; this error code should help, for instance, identify that the data received has the wrong type, such as, instead of receiving a value in meters, the value is in feet. Identification of a system error as an ECM is not essential if provisions are made in the later stages of mismatch tolerance for processing the error and treating the fault accordingly. However, the later an error is identified as an ECM or a fault as a mismatch, the more costly and more uncertain (mainly in its successful outcome) the respective processes of recovery and repair are. One of the techniques that can be used for detecting ECMs is executable assertions. 2.5.1.2. ECM Diagnosis The purpose of ECM diagnosis is to assess the damages caused by the detected ECM. During damage assessment it is necessary to identify all the erroneous states of the system before initiating recovery from the ECM, for this purpose there is no need to differentiate system errors from ECMs. If an ECM is not detected close to where it is activated, the propagation can render impossible the error recovery. This is usually the case for errors caused by an application-specific mismatch. The propagation of such ECM to other architectural elements depends on the encapsulation properties of the architectural language used to describe the system. Ideally the error should be contained within the component where the mismatch is activated. On the other hand, an error caused by a style-specific mismatch is more capable of affecting the whole architectural configuration of a system than a single component due to the lack of diversity in the style-specific features of the architecture. For example, in a blackboard architecture where only some of the components are able to backtrack, the impact of a component backtracking has to be assessed in the context of the whole system architecture to identify which

37

components’ states might have been affected by the backtracking. For both style- and application-specific mismatches, the process of damage assessment can be performed either by using static or dynamic techniques [Anderson & Lee 1981]. 2.5.1.3. ECM Recovery The purpose of ECM recovery (which can be one of the form: backward, forward or compensation, as well as their combination) is to replace at the architectural level an ECM state by an error-free state. The level of difficulty encountered for recovering from ECMs very much depends on the specific characteristics of the ECM, the application, and the error containment capabilities of the architectural style. In general terms, the type of ECM, whether style- or application-specific, should dictate the choice of recovery form. For errors caused by style-specific mismatches, backward recovery is more appropriate because they are application independent and require general approaches for recovering. If the architecture provides adequate error containment capabilities, ECM recovery may consist of eliminating existing erroneous states within an architectural element, this can be done by rolling back to an error-free state that the element had prior to the detection of the ECM. For example, if a component semantically checks the information it provides to other components for potential errors then it can be assumed that errors that might occur within the component are not propagated to the rest of the system. On the other hand, if the architecture does not provide adequate error containment capabilities, then the recovery at the architectural element level might not be sufficient, and there is the need to have a coordinated recovery involving several system components and connectors. For example, if a component needs to rollback and there are other components in the system that cannot rollback then some system coordination might be needed to rid the system of the ECMs. For errors caused by application-specific mismatches, forward recovery is more appropriate since knowledge about the application allows bringing the system into a new (correct) state from which the processing can resume. In particularly forward recovery in the form of exception handling can be used for dealing with those errors that are anticipated. For example, if a component detects a semantic discrepancy in the value of a variable that is transferred by other component with a different underlying representation, then the component can calculate a new value (assuming it knows the correct underlying context, which, again, can be documented in a form of corresponding architectural features), and resume normal processing. In those cases where we cannot distinguish whether the ECM is either style- or application specific, or even an error cannot be identified as an ECM, error recovery should follow a general approach based on backward error recovery. In these situations, as in all those in which not enough information is provided for supporting process of tolerating a mismatch, error recovery often becomes intrinsically complex. 2.5.2. Mismatch Treatment The treatment of mismatches aims to avoid mismatches from being further activated once their nature and location have been identified. As an activity following ECM processing, mismatch treatment attempts to avoid the re-activation of mismatches. If enough information regarding architectural features is made available as the interface properties of architectural elements, the process of tolerating mismatches might be reduced to mismatch repair. This can be achieved if, before any operation, architectural elements check for potential mismatches by requesting information about the architectural features of the other elements. After a potential mismatch is localized it should be repaired. For example, in a pipe-and-filter architecture, if a filter before sending its data checks for the status of the other filter and detects that the other filter is already receiving data from other source, then an alternative filter that is able to provide the same kind of services could be sought in the system. As we have already seen in the descriptions of previous activities, the treatment of mismatches depends on whether the relevant architectural features are style- or application-specific. For example, as we will show later, mismatches caused by incompatibilities in the style-specific features of an architecture often require more fundamental changes to the system architecture at hand. In the following, we present in more detail the different activities associated with mismatch treatment, considering again style- and application-specific mismatches. 2.5.2.1. Mismatch Diagnosis The purpose of mismatch diagnosis is to determine the cause of ECMs, in terms of both location and nature, which, in particular, means identification of the architectural elements that failed and the way they failed. This activity is fundamental for the process of mismatch repair since a clear identification of the mismatch is needed before any changes are made on the system architecture. The activity of diagnosis is complicated by the fact that it often requires a lot of information from the system and elaborate means to process this information. The types of information that are necessary: the detected erroneous state (which presumably is cause by a mismatch), the

38

overall state of the system when the ECM is detected, the configuration of the architecture, together with the available information on architectural features. The latter, in particular, provides the means for identifying the nature of the architectural mismatches. Although it is important to known whether a mismatch is style- or application-specific, the identification of the type of mismatch among a list of potential mismatches [Gacek 1998] is equally necessary for selecting the appropriate repair for the architecture. 2.5.2.2. Mismatch Repair The purpose of mismatch repair is to prevent mismatches from being activated again. Since each mismatch is caused by incompatibilities between features of architectural elements (mainly components), the repair of this mismatch can be performed by modifying the system structure. This architectural reconfiguration is performed in runtime, and it is not a simple task as in most cases it requires redundant architectural elements that are intrinsically diverse4 in the way they provide architectural features (both application- and style-specific), since mismatches are design faults. The reconfiguration can be performed in various forms: removal or/and addition of a single component, removal of all the components involved in a mismatch (e.g., a particular architectural style-specific mismatch), replacement of the connector linking the problematic components with a new connector with additional functionalities aiming at avoiding mismatches. For example, in the case of a component that is not able to rollback, this component can be replaced by other component that allows rollback, or an alternative connector can be provided that allows information to be buffered. The dichotomy between style- and application-specific mismatches for system repair is difficult to observe since for repairing some style-specific mismatches it is necessary to rely on application level mechanisms and techniques. In these cases, simple replacing an architectural element is not a viable option due to the lack of diversity in the features of the architectural style, which creates inherent difficulties in repairing some style-specific mismatches. In terms of application-specific mismatches, the repair mechanisms and techniques are essentially application related and as such should exploit available redundancies at the application level. Although the general aim of mismatch repair is to find and employ mechanisms and techniques that are sufficiently general to allow dealing with a wide range of mismatches, in real systems this is difficult to achieve because of three main reasons: mismatches of different types require different ways of reconfiguration and different types of redundant elements, very often not enough system redundancies can be made available, and the most effective way of performing repair is application dependent. Summarizing, in order for the system to tolerate architectural mismatches, it is crucial that the information associated with the architectural features (either style- or application-specific) is documented and encoded in the system in different forms, either as error codes for performing activities associated with ECM processing, or as interface properties for the activities associated with mismatch treatment. If enough information is made available, then the process of tolerating architectural mismatches becomes less complex and less prone to faults. In the following, we demonstrate through examples the different activities associated with mismatch tolerance, including the cases in which the whole process can be improved, sometimes by suppressing some of the activities, when suitable architectural features are exploited during runtime.

2.6. Examples This section demonstrates how mismatches can be tolerated following the framework discussed above. From the whole set of potential architectural mismatches discussed in [Gacek 1998], we have selected three mismatches, which are representative of the different types of mismatches and allow us to show different ways of tolerating them. In order to analyse the particularities associated with style-specific mismatches, the examples will be presented in the context of three architectural styles [Shaw & Garlan 1996]: pipe-and-filter, blackboard, and client-server. Our assumption here is that some of the non-functional properties/attributes of components (in particular, ones related to the architectural features) are published at their interfaces. Depending on the information available and on the way it is processed, we can distinguish three general scenarios in which architectural mismatches can be tolerated.

1. The first scenario falls into the category in which mismatch tolerance is restricted to mismatch repair. The basis for this scenario is the above assumption that features of the architectural elements are provided at their interfaces. The availability of this information allows for a component before engaging into an interaction, trying to identify potential mismatches. If an architectural mismatch is localized, then the activities associated with error processing and mismatch diagnosis can be ignored.

4 By diverse elements we mean here architectural elements that provide the same functionality but have different designs and implementations.

39

2. The second scenario falls into the category in which mismatch tolerance starts with ECM detection. During the interaction between two components when one of them returns an error diagnosed as an ECM, there is no need for the system to perform error diagnosis. Once again, for this to be possible it is necessary that additional information, about its architectural features, be made available at the interface of the architectural components.

3. The third scenario falls into the category in which mismatch tolerance starts with detection of a system error. In this scenario, there is no additional information available about the architectural features. Hence, there is a need to perform error diagnosis to identify the nature of the error as being an ECM, for that, additional information is needed about the state of the system.

In the following, we discuss how three different mismatches can be tolerated in the context of the above three scenarios. For each example of architectural mismatch, we present the type of mismatch, the characteristics of the architectural styles being used, and provide a small architectural configuration capturing the mismatch being discussed. Within this context, we proceed to explain, in detail, the different steps associated with the process of tolerating architectural mismatches. 2.6.1. Style-specific Mismatch As a style-specific mismatch, we consider the mismatch “call to a non-re-entrant component”. This mismatch happens when a component calls another component, and the latter may already be performing the requested execution and this execution is not re-entrant [Gacek 1998]. This mismatch will be analysed in the context of the pipe-and-filter style in which data is transformed by successive components. The components are the filters, which have almost no contextual information and retain no state between executions, and the connectors are the pipes, which are stateless [Shaw & Garlan 1996]. An example of the pipe-and-filter architectural configuration in which the above mismatch can occur is shown in Figure 2.2(a), when FilterA calls FilterC without waiting until the currently executed request from FilterB is completed. In this example we assume that a Unix environment is used, which can provide a runtime documentation related to the appropriate style-specific features. For implementing the first scenario we have to ensure that the resources/ports become exclusive before executing any interaction between the filters. Implementation of the second scenario assumes that FilterB always executes additional (application) code before connecting to FilterC. The third scenario is not applicable here because FilterB always receives information based on the Unix error while accessing a busy filter, so we will not consider it any further.

FilterA

FilterB

FilterC

FilterD

(a) before mismatch treatment

FilterA

FilterB

FilterC

FilterD (b) after mismatch treatment

Fig. 2.2 A non-re-entrant component in a pipe-and-filter architecture

This is how ECM processing and mismatch treatment (see Section 2.5) look like for the two remaining scenarios: a) ECM Detection:

• for scenario 1, no detection needed;

• for scenario 2, such an ECM is detected using the error code generated by Unix.

b) ECM Diagnosis:

• for scenario 1, no diagnosis needed;

40

• for scenario 2, during damage assessment there is no need to identify the nature, since the error code provides enough information.

c) ECM Recovery:

• for scenario 1, no recovery required;

• for scenario 2, different ways of recovering are possible depending on the application (e.g., backward or forward recovery, compensation, and recovery employing time redundancy).

d) Mismatch Diagnosis:

• for both scenarios, the error code should provide enough information to identify the location and the nature of mismatch;

e) Mismatch Repair:

• for both scenarios, system reconfiguration involves the usage of alternative component or connector. The way of reconfiguring the system depends on the application characteristics. Some possible scenarios are as follows: FilterB can switch to using an alternate FilterD (as shown in Figure 2.2(b)); it can create another instance of FilterC and switch to using it; it can use a different type of pipe (e.g. a timed pipe); in the situation when FilterB has a higher priority than FilterA, FilterC could be killed. In Unix environments, a script can execute this repair.

2.6.2. Application-specific Mismatch As an application-specific mismatch, we consider the mismatch “sharing or transferring data with differing underlying representations”. This mismatch occurs when communication between two components concerning a specific data cannot happen because the data being shared or transferred has different underlying representations, which might include, different data formats, units and coordinate systems [Gacek 1998] This mismatch will be analysed in the context of the client-server style, which is representative of data abstract systems in which a component – the server, provides services to other components – the clients. An example of the client-server architectural configuration in which the above mismatch can occur is shown in Figure 2.3(a), when Client1 requests a service from ServerA but provides a value in feet while the server requires it to be in meters. For implementing the two first scenarios, we assume that the application-specific features of both components contain information about the units being used. In the first scenario, this information is part of the interface of the server, while in the second scenario this information is part of the services provided by the server application. The implementation of the third scenario does not assume any additional information is provided. This is how ECM processing and mismatch treatment look like for the three scenarios: a) ECM Detection:


• for scenario 2, an ECM is detected from the application-specific features of the component;

• for scenario 3, a system error is detected.

b) ECM Diagnosis:


• for scenario 2, during damage assessment there is no need to identify the nature, since the error provides enough information.

• for scenario 3, a full damage assessment is necessary.

c) ECM Recovery:


• for scenarios 2 and 3, different ways of recovering are possible depending on the application (backward or forward recovery, or compensation).


41

• for scenarios 1 & 2, the error should provide enough information to identify the location and the nature of mismatch;

• for scenario 3, a full mismatch diagnosis is necessary.

e) Mismatch Repair:

for all the scenarios, system reconfiguration involves the usage of alternative component or connector. One of the ways to repair is for Client1 to request services of another ServerB (as shown in Figure 2.3(b)), which allows processing the request in feet. Another way to repair is to introduce a bridge that performs unit transformation.

Client1 ServerA

ServerB

(a) before mismatch treatment

Client1 ServerA

ServerB

(b) after mismatch treatment

Fig. 2.3 Components with different underlying representations in a client-server architecture

2.6.3. Style- and Application-specific Mismatch As an application-specific mismatch, we consider the mismatch “call or spawn from a subsystem that may later backtrack”. This mismatch occurs when after a component transfers data to other components, it backtracks, which might cause some undesired side effects. This mismatch will be analysed in the context of the blackboard style, which allows building an active repository for sharing and transferring data between clients that run of independent threads [Shaw & Garlan 1996]. Blackboard systems support backtracking, but they are neither re-entrant nor pre-emptive. An example of a blackboard architectural configuration in which the above mismatch can occur is shown in Figure 2.4, when Client1 attempts to backtrack, which is permitted by the BlackboardA, BlackboardB, and Client3, but not by Client2. For implementing the first and the second scenarios, it is necessary to have additional information on the ability of each component to backtrack, on the fact that backtracking is initiated by a component, and on a set of interconnected components to be involved in backtracking. In the first scenario, this information is part of the interfaces of the components, while in the second scenario this information is part of the services provided by the application. The implementation of the third scenario does not assume that any additional information is provided.

BlackboardA

BlackboardB

Client1

Client2

Client3

Fig. 2.4 Components that cannot backtrack in the blackboard architecture

This is how ECM processing and mismatch treatment look like for the three scenarios: a) ECM Detection:

42


• for scenario 2, an ECM is detected from the application-specific features of the components;

• for scenario 3, a system error is detected.

b) ECM Diagnosis:


• for scenario 2, during damage assessment there is no need to identify the nature, since the error provides enough information;

• for scenario 3, a full damage assessment is necessary, which might have affected all the system components.

c) ECM Recovery:


• for scenarios 2 and 3, only forward recovery is possible because recovery is application dependent (backward recovery does not apply because one of the components is not able to backtrack).


• for scenarios 1 and 2, the error should provide enough information to identify the location and the nature of mismatch;

• for scenario 3, a full mismatch diagnosis is necessary.

e) Mismatch Repair:

• for all the scenarios, system reconfiguration involves the usage of alternative component or connector; for example: an alternative component that allows backtracking, or a buffered connector to store data until there is no more risk of backtracking.

From the above exercise, we can draw several conclusions. If the appropriate information for dealing with potential mismatches (for example, documentation of the architectural features and availability of redundant architectural elements) is embedded into the system, right from its architectural conception, then the actual process of tolerating mismatches becomes much simpler. This includes both developing measures for mismatch tolerance and tolerating mismatches at runtime. The reasons for this are that on the one hand some of its activities, like diagnosis, that are complex, time consuming and prone to errors, cease to be necessary, but on the other hand important architectural information related to mismatches and their tolerance is lost during the following phases of the life cycle. Another conclusion is that, although the dichotomy for identifying the nature of mismatches (as style- or application-specific) is clear, the same cannot be said about the process of repairing them, since its techniques might require handling aspects that are particular to style and application. The assumption that architectural features should be part of the interfaces of the components can be weakened by employing other means for retrieving all information related to such properties, for example, using a reflective capability or a specialized registry for storing it.

2.7. Conclusions The problem of tolerating architectural mismatches during runtime can be summarized as follows. When an error caused by mismatch (ECM) is detected in the system, mechanisms and techniques have to recover the state of the system to an error free state, otherwise the erroneous state of the system can propagate, eventually leading to a system failure. However, the detection and recovery of an error is not enough for maintaining the integrity of the system services because if the mismatch, which has caused the detected error, is not treated, it can yet again be activated and be the cause of other errors. Similarly to fault tolerance in which one cannot develop techniques that can tolerate any possible faults, it is difficult to develop techniques that are able to deal with all types of architectural mismatches, hence assumptions have to be made about the types of mismatches that caused the errors to be detected and handled during runtime. In this chapter, we have mainly stated the problems and outlined a general approach to handling architectural mismatches during runtime. Our preliminary analysis shows that a number of particular mismatch tolerance techniques can be developed depending on the application, architectural styles used, types of mismatches, redundancies available, etc. It is clear for us that there will always be situations when mismatches should be avoided or removed rather than tolerated. Beyond the working examples discussed in the chapter, the

43

applicability of the proposed approach to real systems still remains an open issue. However, since the chapter advocates application of general fault tolerant mechanisms and techniques for handling architectural mismatches, the potential limitations of our approach are the same as those associated with traditional fault tolerance when applied to the systems of the same scale and complexity. This, in particular, concerns scalability of the techniques and the ways the systems are structured. In our future work we will be addressing these issues, trying to define in a more rigorous way the applicability of the approach and to develop a set of general mismatch tolerance techniques. Some of the possible approaches are to modify how existing architectural styles are applied, to design a set of connectors capable of tolerating typical mismatches, to extend existing components and connectors with an ability to execute exception handling, and to develop a number of handlers that are specific for mismatches of different types.

45

References

[Abrial 1996] J.-R. Abrial. The B-book: assigning programs to meanings. Cambridge University Press, 1996.

[Anderson & Lee 1981] T. Anderson, and P. Lee. Fault-Tolerance: Principles and Practice. Prentice-Hall Int. Englewood Cliffs, NJ. 1981.

[Avizienis et al. 2001] A. Avizienis, J.-C. Laprie, and B. Randell. Fundamental Concepts of Dependability. Technical Report 739. Department of Computing Science. University of Newcastle upon Tyne. 2001.

[Bass et al. 1998] L. Bass, P. Clements, and R. Kazman. Software Architecture in Practice. Addison-Wesley. 1998.

[Benatallah 2002] B. Benatallah, M. Dumas M-C. Fauvet, and F. Rabhi. Patterns and Skeletons for Parallel and Distributed Computing, chapter 10: Towards Patterns of Web Services Composition. Springer Verlag (UK), 2002.

[Berlage 1994] “A Selective Undo Mechanism for Graphical User Interfaces Based on Command Objects” by Thomas Berlage, ACM Trans. on Computer-Human Interaction, 1(3), 269-294, 1994.

[BPEL4WS] Microsoft and BEA and IBM. Business Process Execution Language for Web Services (BPEL4WS), Version 1.0, 2002. http://www.ibm.com/developerworks/ library/ws-bpel/.

[BPMI] BPMI.org. Business Process Modeling Language (BPML), Version 1.0, 2002. http:// www.bpmi.org/bpml.esp. [Breitbart et al., 1992] Y. Breitbart, H. Garcia-Molina, A. Silberschatz. Overview of Multidatabase Transaction Management. In VLDB Journal, 2, pp. 181-239, 1992, Invited contribution. H.-J. Schek, Editor.

[Campbell 1986] R. H. Campbell and B. Randell. Error recovery in asynchronous systems. IEEE Transactions on Software Engineering, SE-12(8), 1986.

[Casati 2001] F. Casati, M. Sayal, and M-C. Shan. Developing e-services for composing e-services. In Proceedings of CAISE'2001, LNCS 2068, 2001.

[Compare et al. 1999] D. Compare, P. Inverardi, and A. L. Wolf. “Uncovering Architectural Mismatch in Component Behavior”. Science of Computer Programming (33)2. 1999. pp. 101-131.

[Cristian 1989] F. Cristian. “Exception Handling”. Dependability of Resilient Computers. T. Anderson (Ed.). Blackwell Scientific Publications. 1989. pp. 68-97.

[Chrysanthis and Rammamritham, 1992] P. K. Chrysanthis and K. Rammamritham. ACTA: The SAGA continues. In A. K. Elmagarmid, editor, Database Transaction Models for Advanced Applications. Morgan Kaufmann, 1992.

[de Lemos & Romanovsky 2001] R. de Lemos, and A. Romanovsky. “Exception Handling in the Software Lifecycle”. International Journal of Computer Systems Science & Engineering 16(2). March 2001. pp. 167-181.

[DeLine 1999] R. DeLine. “A Catalog of Techniques for Resolving Packaging Mismatch”. Proceedings of the 5th Symposium on Software Reusability (SSR'99). Los Angeles, CA. May 1999. pp. 44-53.

46

[Dix et al. 1997] “The cube extending systems for undo” by AlanDix, Roberta Mancini, and Stefano Levialdi,1997.

[Du and Elmagarmid, 1989] W. Du and A. Elmagarmid. Quasi-serializability: A Correctness Criteria for Global Concurrency Control in InterBase. In Proc. 15th Int. Conf. on Very Large Data Bases, August 1989, pp. 347-355.

[Egyed et al. 2000] A. Egyed, N. Medvidovic, and C. Gacek. “Component-Based Perspective on Software Mismatch Detection and Resolution”. IEE Proceedings on Software 147(6). December 2000. pp. 225-236.

[Elmagarmid et al., 1990] A. K. Elmagarmid, Y. Leu, W. Litwin and M. Rusinkiewicz. A Multidatabase Transaction Model for InterBase. In proc. of the International Conference on Very Large Databases, pages 507-581, Brisbane, Australia, 1990.

[Fauvet 2001] M-C. Fauvet, M. Dumas, B. Benatallah, and H-Y. Paik. Peer-to-peer traced execution of composite services. In Proceedings of TES'2001, LNCS 2193, 2001.

[Florescu 2002] D. Florescu, A. Grunhagen, and D. Kossmann. XL: An XML language for Web service specification and composition. In Proceedings of the WWW'02 Conference, 2002.

[Gacek 1998] C. Gacek. Detecting Architectural Mismatches during System Composition. PhD Dissertation. Center for Software Engineering. University of Southern California. Los Angeles, CA, USA. 1998.

[Gacek et al. 1995] C. Gacek, A. Abd-Allah, B. Clark, and B. Boehm, “On the Definition of Software Architecture”. Proceedings of the First International Workshop on Architectures for Software Systems – In Cooperation with the 17th International Conference on Software Engineering. D. Garlan (Ed.). Seattle, WA, USA. April 1995. pp. 85-95.

[Garcia-Molina, 1983] H. Garcia-Molina. Using semantic knowledge for transaction processing in a distributed database. ACM Transactions on Database Systems, 8:186-213, 1983.

[Garcia-Molina and Salem, 1987] H. Garcia-Molina and K. Salem. Sagas. Proc. of ACM-SIGMOD 1987 International Conference on Management of Data, San Fransisco, 1987.

[Garlan et al. 1995] D. Garlan, R. Allen, and J. Ockerbloom, “Architectural Mismatch: Why Reuse is so Hard”. IEEE Software 12(6). November 1995. pp. 17-26.

[Gaudel et al. 2003] M.-C. Gaudel, V. Issarny, C. Jones, H. Kopetz, E. Marsden, N. Moffat, M. Paulitsch, D. Powell, B. Randell, A. Romanovsky, R. Stroud, F. Taiani. Final Version of DSoS Conceptual Model. DSoS deliverable CSDA1. 2003.

[Gray 1993] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.

[Inverardi et al. 1997] P. Inverardi, A.L. Wolf, and D. Yankelevich. “Checking Assumptions in Component Dynamics at the Architectural Level”. Proceedings of the 2nd International Conference on Coordination Models and Languages. Lecture Notes in Computer Science 1282. Springer, Berlin. September 1997. pp. 46-63.

[Jones et al. 2002] C. Jones, A. Romanovsky, I. Welch. A Structured Approach to Handling On-Line Interface Upgrades. Proceedings of the 26th Annual International Computer Software and Applications Conference (COMPSAC 2002). Oxford, UK. August 2002. IEEE CS Press. pp. 1000-1005.

47

[Jones et al. 2003] C. Jones, P. Periorellis, A. Romanovsky, I. Welch. Structured Handling of On-Line Interface Upgrades in Integrating Dependable Systems of Systems. In Proc. of FIDJI Conference, November 25-26, Luxemburg, 2002. To be published in LNCS, Springer. 2003.

[Kalyanakrishnan 1999] M. Kalyanakrishnan, R.K. Iyer, and J.U. Patel. Reliability of Internet hosts: a case study from the end user's perspective. Computer Networks, (31):47-57, 1999.

[Karsenty and Beaudouin-Lafon 1993] “An Algorithm for Distributed Groupware Applications” by Alain Karsenty and Michel Beaudouin-Lafon, IEEE ICDS’93 Conference, 193-202, 1993.

[Laprie 1995] J.-C. Laprie. “Dependable Computing: Concepts, Limits, Challenges”. Special Issue of the 25th International Symposium On Fault-Tolerant Computing. IEEE Computer Society Press. Pasadena, CA. June 1995. pp. 42-54

[Medvidovic & Taylor 2000] N. Medvidovic, and R. N. Taylor. “A Classification and Comparison Framework for Software Architecture Description Languages”. IEEE Transactions on Software Engineering 26(1). 2000. pp. 70-93.

[Medvidovic et al. 1999] N. Medvidovic, D. S. Rosenblum, and R. N. Taylor. “A Language and Environment for Architecture-Based Software Development and Evolution”. Proceedings of the 21st International Conference on Software Engineering (ICSE'99). Los Angeles, CA. May 1999. pp. 44-53.

[Mehrotra et al., 1992] S. Mehrotra, R. Rastogi, Y. Breitbart, H. F. Korth, and A. Silberschatz. Ensuring transaction atomicity in multidatabase systems. In Proc. of the 12th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. San Diego, CA, 1992.

[Mehrotra et al. 1992b] S. Mehrotra, H. F. Korth, and A. Silberschatz. A transaction model for multidatabase systems. In Proc. of Int. Conf. on Distributed Computing Systems, June 1992.

[Mikalsen 2002] T. Mikalsen, S. Tai, and I. Rouvellou. Transactional attitudes: Reliable composition of autonomous Web services. In DSN 2002, Workshop on Dependable Middleware-based Systems (WDMS 2002), 2002.

[Narayanan 2002] S. Narayanan and S. McIlraith. Simulation, verification and automated composition of Web services. In Proceedings of the WWW'02 Conference, 2002.

[OASIS-BTP] Oasis Committee. Business Transaction Protocol (BTP), Version 1.0, 2002. http://www.oasis-open.org/committees/business-transactions/.

[OASIS-UDDI] Oasis Committee. Universal Description, Discovery and Integration (UDDI), Version 3 Specification, 2002. http://www.uddi.org.

[Oberndorf et al. 1998] P. Oberndorf, K. Wallnau, and A. M. Zaremski. “Product Lines: Reusing Architectural Assets within an Organization”. Software Architecture in Practice. Eds. L. Bass, P. Clements, R. Kazman. Addison-Wesley. 1998. pp. 331-344.

[OMG 2002] Object Management Group (OMG), Chapter 21, "Portable Interceptors", of the "Common Object Request Broker Architecture (CORBA): Core Specification", Version 3.0.2, formal/02-12-02, Dec. 2002

[Perry & Wolf 1992] D. E. Perry, and A. L. Wolf. “Foundations for the Study of Software Architecture”. SIGSOFT Software Engineering Notes 17(4). 1992. pp. 40-52.

48

[Pires et al. 2002] “Building reliable web services compositions” by Paulo F. Pires, Mario R. F. Benevides, and Marta Mattoso, International Workshop on Web Services Research, Standardization, and Deployment - WS-RSD'02, 551-562, 2002.

[Prakash and Knister 1994] “A framework for undoing actions in collaborative systems” by A. Prakash and M. Knisler, ACM Trans. on Computer-Human Interaction, 4(1), 295-315, 1994.

[Raynal and Singhal 1996] “Logical time: capturing causality in distributed systems”by M. Raynal and M. Singhal, IEEE Computer Magazine 29(2), 49-56, 1996.

[Ressel and Gunzenhäuser 1999] “Reducing the problem of group undo” by M. Ressel and R. Gunzenhäuser, ACM conf. on Supporting Group Work, 131-139, 1999.

[Roseblum & Natarajan 2000] D. S. Roseblum, and R. Natarajan. “Supporting Architectural Concerns in Component Interoperability Standards”. IEE Proceedings on Software 147(6). December 2000. pp. 215-223.

[Ruiz-García et al. 2003] J.-C. Ruiz-García, M.-O. Killijian, J.-C. Fabre, and P. Thévenod-Fosse, “Reflective Fault-Tolerant Systems: From Experience to Challenges”, IEEE Transactions on Computers, Special Issue on Reliable Distributed Systems, Vol. 52, n°2, Feb. 2003.

[Shaw & Garlan 1996] M. Shaw, and D. Garlan. Software Architecture: Perspectives on an Emerging Discipline. Prentice-Hall. 1996.

[Sun 2000] “Undo any operation at any time in group editors” by Chengzeng Sun, Proc. of ACM conf. on Computer-Supported Cooperative Work, 2000.

[Sun et al. 1998] “Achieving convergence, causality-preservation, and intention-preservation in real-time cooperative editing systems” by C. Sun, X. Jia, Y. Zhang, Y. Yang and D. Chen, ACM Trans. on Computer-Human Interaction, 5(1), 63-108, 1998.

[Tartanoglu et al. 2003] F. Tartanoglu, V. Issarny, N. Levy, and A. Romanovsky. Formalizing Dependability Mechanisms in B: From Specification to Development Support. In Proceedings of the ICSE 2003 Workshop on Software Architectures for Dependable Systems. Portland, Oregon, USA. May 2003. To appear.

[Tartanoglu et al. 2003b] F. Tartanoglu, V. Issarny, N. Levy, and A. Romanovsky. Dependability in the Web Services Architecture. In Architecting Dependable Systems. LNCS 2677. 2003. To appear.

[Taylor et al. 1996] R. N. Taylor, N. Medvidovic, K. M. Anderson, E. J. Whitehead, J. E. Robbins, K. A. Nies, P. Oreizy, and D. L. Dubrow “A Component- and Message-Based Architectural Style for GUI Software”. IEEE Transactions on Software Engineering 22(6). June 1996.

[Veijalainen et al., 1992] J. Veijalainen, F. Eliassen, and B. Holtkamp. The S-transaction Model. In A. K. Elmagarmid, editor, Database Transaction Models for Advanced Applications. Morgan Kaufmann, 1992.

[W3C-WSCI] W3C. Web Service Choreography Interface (WSCI) 1.0, W3C Note, 2002. http://www.w3.org/TR/wsci/.

[W3C-WSCL] W3C. Web Services Conversation Language (WSCL) 1.0, W3C Note, 2002.

[WSC] Microsoft and BEA and IBM. Web Services Coordination (WS-Coordination), 2002. http://www.ibm.com/developerworks/library/ws-coor/.

49

[WST] Microsoft and BEA and IBM. Web Services Transaction (WS-Transaction), 2002. http://www.ibm.com/developerworks/library/ws-transpec/.

[Xu et al. 1995] J. Xu, B. Randell, A. Romanovsky, C. M. F. Rubira, R. J. Stroud and Z. Wu. “Fault Tolerance in Concurrent Object-Oriented Software through Coordinated Error Recovery” in Proc. the 25th International Symposium on Fault Tolerant Computing (FTCS-25), Pasadena, California, 1995, pp. 499 – 509.

[Yang 2002] J. Yang and P. Papazoglou. Web component: A substrate for Web service reuse and composition. In Proceedings of CAISE'2002, 2002.

[Zhang et al. 1994] A. Zhang, M. Nodine, B. Bhargava, O. Bukhres. Ensuring Relaxed Atomicity for Flexible Transactions in Multidatabase Systems. In Proc. of ACM-SIGMOD 1994 International Conference on Management of Data, Minneapolis, Minnesota, USA, 1987.

final results on architectures and ... - newcastle universityresearch.cs.ncl.ac.uk/cabernet/ ·...

Documents