
Cognitive Behavior Analysis framework for Fault Prediction in Cloud Computing

Reza Farrahi Moghaddam & Fereydoun Farrahi Moghaddam
Synchromedia Lab, ETS, University of Quebec
Montreal (QC), Canada H3C 1K3
Email: [email protected], [email protected]

Vahid Asghari
INRS-EMT, University of Quebec
Montreal (QC), Canada H5A 1K6
Email: [email protected]

Mohamed Cheriet
Synchromedia Lab, ETS, University of Quebec
Montreal (QC), Canada H3C 1K3
Email: [email protected]

Abstract—Complex computing systems, including clusters, grids, clouds, and skies, are becoming the fundamental tools of the green and sustainable ecosystems of the future. However, they can also pose critical bottlenecks and ignite disasters. Their complexity and number of variables can easily go beyond the capacity of any analyst or traditional operational research paradigm. In this work, we introduce a multi-paradigm, multi-layer, and multi-level behavior analysis framework which can adapt to the behavior of a target complex system. It not only learns and detects normal and abnormal behaviors, but can also suggest cognitive responses in order to increase the system's resilience and its grade. The multi-paradigm nature of the framework provides robust redundancy in order to cross-cover possible hidden aspects of each paradigm. After providing the high-level design of the framework, three different paradigms are discussed: Probabilistic Behavior Analysis, Simulated Probabilistic Behavior Analysis, and Behavior-Time Profile Modeling and Analysis. To be more precise, and because of space limitations, we focus in this paper on fault prediction as a specific event-based abnormal behavior. We consider both spontaneous and gradual failure events. The promising potential of the framework is demonstrated using simple examples and topologies. The framework can provide an intelligent approach to balancing the green and high-probability-of-completion (or high-probability-of-availability) aspects of computing systems.

I. INTRODUCTION

Computing systems, in various forms such as clusters, grids, clouds, and skies [9], [10], [17], are scaling not only in terms of the number of involved components and their physical distribution, but are also becoming very heterogeneous, with new components such as sensors and mobile devices. Although this brings more computational and conscious power, it also increases the degree of uncertainty and risk. There are many risk sources involved, such as operators (human error), software bugs, software overload, hardware overload, and hardware failure, among others [24]. Therefore, a full understanding of the system behavior, which brings the ability to predict its normal or abnormal behaviors, is of great importance for scheduling, allocation, binding, and other actions. We call this Behavior Analysis (BA), and we propose a framework with three high-level units: the Behavior Analyzer Unit (BAU), the Behavior Stimulator Unit (BSU), and the Cognitive Responder Unit (CRU). A schematic example of the proposed framework is shown in Figure 1.

Fig. 1. A schematic example of the proposed framework in a sky system.

In a non-technical way, we can consider two high-level modes of transaction between a service provider and its clients:

1) Leasing: The provider dedicates an agreed set of resources to the client for an agreed, limited period of time. The lease can be renewed upon agreement and resource availability. This mode is especially interesting for handling resources which may evanesce or vanish without notice (such as resources powered by intermittent energy sources). The lease mode is a good practice for service providers with an Infrastructure as a Service (IaaS) model or other similar models.

2) Task completion: The service provider guarantees completion of a task (or a volume of tasks) within an agreed period of time. This implicitly implies that the provider is aware of the task's detailed steps.

In reality, there is a chance of failure to deliver the agreed Service-Level Objectives (SLOs). This brings us to two important concepts: the Probability of Completion (PoC) and the Probability of Availability (PoA).


Fig. 2. Schematic diagram of the proposed behavior analysis framework in its systemic picture: (a) the overall picture; (b) the multi-layer nature of the framework.

The ability of a provider to determine, estimate, or measure the PoC (or PoA, depending on its business model) not only enables it to negotiate instrumental Service-Level Agreements (SLAs) with its clients, but also provides a way to grade its various services. In particular, services with High Probability of Completion (HPoC) or High Probability of Availability (HPoA) grades would attract mission-critical clients, such as communication providers and emergency operators. Usually, HPoC (or HPoA) is achieved by resource "over-allocation" and by task "replication." This can be interpreted as a traditional correlation between the HPoC (or HPoA) requirement and a high level of energy/resource consumption (non-greenness) of a service. This is a critical issue because, with the push for the ICT enabling effect and the move toward the Internet of Things, HPoC (or HPoA) will be required by an enormous number of clients; the ICT enabling effect is one of the fundamental instruments for reducing the footprint of other industrial sectors by redirecting non-ICT service calls to the ICT sector [30]. The Internet of Things is also becoming a reality in the near future because of the exponential increase in the number of portable phones, distributed sensors, and Radio Frequency (RF) devices [20].

One way to break the correlation between HPoC (or HPoA) and the non-greenness of services is to add intelligence to determine, predict, and react to possible changes in the PoC (or PoA) in real time. This could help a system to provide a desirable HPoC (or HPoA) with a minimum amount of resources. This analyzer and its implementation are our ultimate research goal, and in this work we present an overview and some preliminary results. Calculation and verification of the PoC (or PoA) of a service can be carried out by analyzing the system configuration. However, real-time variation in the PoC (or PoA) is very critical and can lead to violation of the SLA despite a satisfactory configuration-based predicted PoC (or PoA) value. Therefore, in our framework, we consider three paradigms that compensate for each other's weaknesses: Probabilistic (Statistical Inference) Behavior Analysis, Simulated Probabilistic (Statistical Inference by Means of Simulation) Behavior Analysis, and Behavior-Time Profile Modeling and Analysis. The first two paradigms work based on the configuration of the system, using data collected from experiments and simulations to provide insight into the system behavior. The third paradigm uses machine learning techniques to learn the patterns and behaviors of the system from its time profiles, collected by a set of opportunistic agents across the system.

The organization of the paper is as follows. In section II, a brief introduction of the proposed framework is presented. Fault injection approaches are discussed in section III. The three behavior analysis paradigms and some experimental results are presented in section IV. Related work is described in section V. Finally, conclusions and future prospects are provided in section VI.

II. PROPOSED BEHAVIOR ANALYSIS

A schematic diagram of the proposed framework is presented in Figure 2(a). The computing system under study is represented by several layers, on each of which opportunistic and cognitive agents of the framework reside in order to collect status data and time profiles. All the collected information is directed to the main unit, the Behavior Analyzer Unit (BAU). Using its multi-paradigm approach and based on the collected data, this unit not only infers the current status of the system and its components, but also produces predictions of changes in the system status or of the possible occurrence of abnormal events. As the BAU works based on machine learning techniques, it requires enough samples of various behaviors under different conditions and operations to build its inference models. To convert the learning process from passive to active, and to reduce the learning time, another unit, the Behavior Stimulator Unit (BSU), is considered, which is responsible for "stimulating" the desired behaviors in a controlled manner. The last part of the framework is the Cognitive Responder Unit (CRU), which makes recommendations to the system in order to prevent or compensate for abnormal behaviors/events and their side effects in an optimal way, using all available resources.
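To make the roles of the three units concrete, here is a minimal Python sketch of the data flow between them; all class names, method bodies, and the overload threshold are hypothetical and only mirror the description above, not an actual implementation.

```python
# Minimal sketch of the BAU/BSU/CRU data flow (all names hypothetical).

class BehaviorAnalyzerUnit:
    def __init__(self):
        self.samples = []  # time profiles collected by the opportunistic agents

    def ingest(self, profile):
        self.samples.append(profile)

    def predict_fault(self):
        # Placeholder inference: flag a fault if recent agents report overload.
        return any(p.get("load", 0.0) > 0.9 for p in self.samples[-10:])

class BehaviorStimulatorUnit:
    def stimulate(self, system, scenario):
        # Inject a controlled fault scenario so the BAU can learn from it.
        system.apply_fault(scenario)  # hypothetical system hook

class CognitiveResponderUnit:
    def respond(self, fault_predicted):
        # Recommend a preventive action; the system may accept or ignore it.
        return "migrate-workload" if fault_predicted else "no-action"
```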

The framework considers the following three paradigms: Probabilistic Behavior Analysis, Simulated Probabilistic Behavior Analysis, and Behavior-Time Profile Modeling and Analysis. Each paradigm is considered to work independently and draw its own inferences. The CRU is supposed to combine the conclusions of the three paradigms (using voting, mixture of experts, stacking, cascading [1], or any other strategy) and


Fig. 3. Schematic diagram of the proposed behavior analysis framework in its ecosystemic picture.

make a cognitive decision. Therefore, the CRU cognition could be very different from one system to another, depending on the desired level of dependability, and it can float on a wide spectrum from extremely pessimistic to highly optimistic cognitions.
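As a concrete illustration, a minimal majority-vote combiner over the three paradigms' verdicts could look like the sketch below; the `min_votes` knob is a hypothetical stand-in for the pessimistic-to-optimistic spectrum just described.

```python
def combine_verdicts(verdicts, min_votes=2):
    """Majority-style combiner: `verdicts` holds one boolean per paradigm
    (True = fault predicted). min_votes=1 is a pessimistic CRU (any single
    paradigm triggers a response); min_votes=3 is an optimistic one."""
    return sum(verdicts) >= min_votes

# Example: probabilistic and time-profile paradigms predict a fault,
# the simulated paradigm does not; the majority wins.
assert combine_verdicts([True, False, True]) is True
```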

Let us consider an example of the BA application in upgrading a service grade. Assume that, in a system, the Mean Time Between Failures (MTBF) of the dominant fault is 10 weeks and its Mean Time To Repair (MTTR) is 10 minutes. This leads to an average downtime of 365/70 × 10 ≈ 52 minutes and 8.4 seconds per year, which corresponds to the four-nines availability grade level [29]. If the BA framework can achieve a success rate of 90% in predicting faults 15 minutes before their associated failure, the downtime will be reduced to 5.2 minutes per year, which corresponds to the five-nines availability grade, achieved without any extra investment in the core hardware/software of the system and just by using the BA framework. This upgrade not only increases the profit of the service provider and reduces the fee for the service user, it also reduces the footprint on the environment; services which use hardware with a longer life span have a smaller lifecycle footprint because of the overall lowered manufacturing footprint. This shows the great value of the BA framework, especially the real-time BA paradigm. Although the BA framework is not limited to a specific behavior, we consider only analysis of "fault" events in this work. Analysis of other types of behavior-related events, such as "degradation," and also of the system behavior itself will be considered in the future.
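The arithmetic of this example can be checked directly; the short script below reproduces the 52-minute and 5.2-minute yearly downtime figures and their availability "nines."

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60

def yearly_downtime(mtbf_days, mttr_minutes, prevented_fraction=0.0):
    failures_per_year = 365.0 / mtbf_days
    return failures_per_year * mttr_minutes * (1.0 - prevented_fraction)

def nines(downtime_minutes):
    # Number of "nines" of availability: -log10 of the unavailability.
    return -math.log10(downtime_minutes / MINUTES_PER_YEAR)

base = yearly_downtime(70, 10)           # MTBF = 10 weeks, MTTR = 10 min
with_ba = yearly_downtime(70, 10, 0.9)   # 90% of faults predicted in time
print(base, nines(base))                 # ~52.1 min/year -> ~4 nines
print(with_ba, nines(with_ba))           # ~5.2 min/year  -> ~5 nines
```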

As can be seen from Figure 2(a), the framework works on several layers, from hardware to applications. In each layer, a multi-level approach is considered for representation. At the lowest level, all the system components (even the networking links) of that layer are considered as objects forming a graph based on their functional connectivity to each other. A schematic example of the lowest-level graphs of various layers is shown in Figure 2(b). Each graph hypothetically spreads over the physical and non-physical location coordinates, which can be used to incorporate location intelligence into the framework. Highly frequent cliques or sub-graphs on this level form the super-components which constitute the next level of the representation. The same process leads to higher levels of representation. This brings a vertical scalability to the framework that helps in abstracting the behavior of a large number of components using a few super-components at the higher levels. At the same time, this multi-level representation facilitates horizontal scaling (increasing the number of lowest-level components) of the system, as the scaling can be easily absorbed within the higher levels. In addition to horizontal and vertical scalability, the framework is capable of scaling hierarchically or federatively along the platform dimension. In this dimension, the behavior reported by the BA units of each lower-level platform (ranging from rack, cluster, and data center (node) to cloud and sky) can be recapitulated, aggregated with the behavior of others, and then hierarchically passed to the BA units of the higher-level platform, or federatively shared among the BAs of the platforms at the same level. A schematic example of this concept of scalability from the cloud level to the sky level is shown in Figure 1, in which a skybus enables communications between cloud-level recapitulators and sky-level BA units.
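A minimal sketch of this bottom-up grouping is given below; recognizing repeated sub-graphs by a simple type signature is an assumption made here for brevity, standing in for the frequent clique/sub-graph mining implied above.

```python
from collections import Counter

# Lowest level: components as (type, id) nodes with functional edges.
edges = [(("server", 1), ("switch", 1)), (("server", 2), ("switch", 1)),
         (("server", 3), ("switch", 2)), (("server", 4), ("switch", 2)),
         (("switch", 1), ("switch", 2))]

# Group each switch with its attached servers into a candidate
# super-component, keyed by the switch node.
groups = {}
for a, b in edges:
    if a[0] == "switch" and b[0] == "server":
        a, b = b, a                       # normalize to (server, switch)
    if a[0] == "server" and b[0] == "switch":
        groups.setdefault(b, []).append(a)

# Frequency of each super-component type signature at the next level.
signatures = Counter(
    ("switch",) + ("server",) * len(members) for members in groups.values()
)
print(signatures)  # {('switch', 'server', 'server'): 2} -> one repeated type
```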

Although the limited space of this paper does not allow a full discussion, we want to bring attention to another aspect of the proposed BA framework that arises when the complexity, number of actors, and diversity of a system drastically increase. In these cases, the traditional picture of a "system," especially its implicit "controllability" concept, no longer fits. Instead, a "manageability" concept, in the form of an "ecosystem" picture, seems more appropriate. Obvious examples of these ecosystems are systems highly penetrated into societies, such as: i) cellphone networks and ii) "Y-to-the-home" (Y-TTH) networks. The Y-TTH concept can be seen as a transversal approach compared to the traditional "Fiber-to-the-X" (FTT-X) concept. While FTT-X emphasizes the depth of fiber penetration into the premises, Y-TTH focuses on the access technology touching the premises: metal, fiber, or wireless. An example of a Y-TTH (and, at the same time, an FTT-X) implementation is the Fiber-TTH (FTTH) Reggefiber company in the Netherlands [31].

In both cases of systems highly penetrated into societies, the resulting populace of highly interactive actors, ranging from end users to service providers and computing providers (which include all types of computing resources, especially access networks), forms an ecosystem of diverse actors. Although no governance is expected in these ecosystems, collaborative living among the actors and also alien governance (from outside the ecosystem; for example, sourced from environmental regulations or sustainability reporting requirements) could be the basis of manageability within the ecosystem [8], [5]. Our general picture of these computing ecosystems is provided in Figure 3. The main difference between our picture and the traditional ecosystem-society picture is that we consider all actors, even the society (the end users), to be inside the ecosystem. This implicitly implies that society is a part of any ecosystem, and that socioeconomic footprint indicators should be considered along with the environmental indicators. This picture enables us to build closed loops within the ecosystem, and therefore to avoid any requirement or assumption on the ecosystem boundary conditions.



In our ecosystem picture, there are three major classes of actors: end users, service providers, and computing providers. Although the computing providers class could actually be considered a subset of the service providers class, it is treated as a separate class in order to emphasize the fact that most of the management and governance is actually implemented and executed by these actors. In other words, service providers are considered as "free," and probably selfish, actors who play within the constraints of the governance imposed by the computing providers. The computing providers class is highly general, and also includes active and passive operators, for example. The "transformation" actions between the various classes, shown in Figure 3, illustrate the vague nature of the classes, and represent the transition of actors from one class to another based on their behavior. In other words, the real-time classification of an actor is performed by analyzing its behavior, and there is no official or assigned class membership. For example, a CEO can be transformed from the end user class to the service provider class when he uses his cellphone to participate in providing a service to another actor.

The main BA requirement in the ecosystem is to profile end users and other actors based on their behavior, and to use these profiles in provisioning and also in grading the actors, among other governing actions. In the proposed ecosystem view, as the concept of controllability no longer exists, the three units of the BA paradigm are redefined as follows. The core of the BA solution is still called the Behavior Analyzer Unit (BAU), but with a different mission: profiling the actors. Because of security and trust concerns, it is assumed that the main behavior profiles available to the BAU are sourced from the computing providers, which are presumably its clients. The second unit, the BSU, is replaced with the Actor Simulation Unit (ASU). The ASU provides the required scenarios upon request from the BAU by generating imaginary actors in both the end user and the service provider classes. Furthermore, the CRU is replaced with the Cognitive Advisory Unit (CAU), which provides cognitive advice to the computing providers, and potentially to the service providers, without any guarantee of acceptance by those actors. It is worth noting that each computing provider, or a collection of them, can still have an internal BA solution at their systemic level. From here on, the BA framework in its system picture (shown in Figure 2) will be followed.

III. FAULT INJECTION AND PROPAGATION

In computing systems, having a certain degree of reliability and dependability is very important [3]. For instance, employing low-cost processor components or having software bugs can significantly affect the quality of service (QoS). Sophisticated fault testing techniques must be used to achieve a specific dependability requirement in a system. Fault injection is a technique that can validate the dependability of a system by carefully injecting faults into the system and then observing the resulting effects on the system performance [15].

Fig. 4. The fault injection scenario.

Indeed, this technique accelerates fault occurrence and propagation in the system. At the same time, it can be used to study the behavior of the system in response to faults, and also to trace the behaviors back to the failures.

A fault injection model for a typical system is illustrated in Fig. 4, which consists of the following components:
• Load Generator and Fault Generator: The load generator generates the workload of the target system and provides a foundation for injecting faults. The fault generator injects faults into the system as it executes commands from the workload generator. The injector not only supports different types of faults, but also controls their target component and timing based on the requests from the BAU and its own fault library.
• Data Analyzer and Behavior Analyzer: The behavior analyzer requests fault and failure scenarios in order to complete its models during the fault analysis experiments. Specifically, it tracks the execution of the requests and incrementally improves its behavioral models. The data analyzer is responsible for handling the big data collected from the system, as a preprocessing unit. In addition, the opportunistic agents, which collect data from various components of the system, trim the data before uploading it to the BAU.

The two major categories of common failure causes are software faults [22] and hardware faults [16], [3]; almost 60% of system failures are caused by software faults. Some fault injection schemes are designed to emulate only software failures, such as the JAFL (Java fault loader) [21]. There are also fault injection schemes that can emulate both software and hardware failures, such as the CEDP (Computer Evaluator for Dependability and Performance) [32]. In particular, the JAFL is a fault injector scheme designed for testing fault tolerance in grid computing systems. While most schemes of this class focus only on fault injection at a basic level, such as corrupting code or data [7], the JAFL is a more sophisticated software fault injector that considers a wide range of faults, such as CPU usage, memory usage, and I/O bus usage [21]. On the other hand, faults can also be injected into the hardware of the system. The CEDP is a fault injection scheme developed for quantitative evaluation of system dependability by testing both the software and the hardware of the system. This scheme is also able to characterize the behavior of fault propagation in the system. In the CEDP, a hardware fault represents a transient fault in a CPU register or in a memory block.


Then, during the next execution of the system program, this fault/error will propagate and cause faults in other system states. We will use both of these fault generators in our future work. With the fault injectors incorporated, the BSU can generate the data and profiles required by the BAU to create distributions and models.
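A hedged sketch of how the BSU might drive such an injector is shown below; the `inject` and `snapshot` calls and the fault library contents are hypothetical placeholders, not the actual JAFL or CEDP interfaces.

```python
import random
import time

FAULT_LIBRARY = ["cpu-hog", "memory-leak", "io-saturation", "register-bitflip"]

def injection_campaign(system, bau_requests, seed=42):
    """Replay the fault scenarios requested by the BAU while the load
    generator keeps the system busy, and return the observations."""
    rng = random.Random(seed)
    observations = []
    for request in bau_requests:
        fault = request.get("fault") or rng.choice(FAULT_LIBRARY)
        target = request.get("target", "random-node")
        system.inject(fault, target)            # hypothetical injector call
        time.sleep(request.get("settle_s", 1))  # let the fault propagate
        observations.append(system.snapshot())  # hypothetical telemetry call
    return observations
```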

IV. BEHAVIOR ANALYSIS PARADIGMS

In this section, we present the three paradigms that process and model the behavior data.

A. Probabilistic Behavior Analysis

The probabilistic (statistical inference) analysis, in which the reliability of a system is estimated along the time axis, is a well-known and popular approach [19], [26]. In this paradigm, each layer of the computing system (as shown in Figure 2) is considered as a graph composed of the system components of that layer, connected to each other based on their functional connectivity. This graph can vary over time. As many components are of the same type, the graph can be decomposed into repeated cliques or sub-graphs; given the behavior of the sub-graphs, the behavior of the graph can be easily calculated. The sub-graphs can be considered as super-components and can compose a higher level of representation. The super-components on a level can themselves be merged into sub-graphs (super-components) of the next level. This brings up a multi-level representation for each layer, and also converts the problem into a combinatorial problem of sub-graphs. At each level, a sub-graph consists of a set of directly connected components of that level (which could themselves be sub-graphs (super-components) at a lower level). Some basic examples are shown in Figure 5, where servers, switches, and network connections are shown by blue squares, orange circles, and green ovals, respectively. To represent a graph/clique on the $l$th level with $n$ sub-components and a topology $T$, we use the notation $G^T_{n,l}$. When the details are not required, we use $G_i$ to represent a sub-graph.

For the sake of simplicity, we assume all components are fully maintained/repaired to their best status at $t = 0$. Let us define the PoA (or reliability) $\mathrm{PoA}^{t_0}_G = R(G, t_0)$ as the probability that the component $G$ has not failed over the interval $[0, t_0]$. $\mathrm{PoA}^{t_0}_G$ is a decreasing function of $t_0$, and can be related to the Cumulative Distribution Function (CDF) of failure:

$$\mathrm{PoA}^{t_0}_G = 1 - \mathrm{CDF}_G(t_0).$$

Considering a certain scaling factor $s$, the $\mathrm{CDF}(t_0)$ can be related to a Differential Density Function (DDF), $\mathrm{DDF}^s(P_0)$, where $P_0$ is the CDF value at $t_0$. The DDF is defined as:

$$\mathrm{DDF}^s_G\big(P_0 = \mathrm{CDF}_G(t_0)\big) := \frac{1}{s}\left.\frac{\partial\,\mathrm{CDF}_G(t)}{\partial t}\right|_{t_0},$$

and a CDF can also be inversely expressed based on its DDF by solving the following differential equation:

$$\frac{\partial\,\mathrm{CDF}_G(t)}{\partial t} = s\,\mathrm{DDF}^s_G\big(\mathrm{CDF}_G(t)\big), \qquad \mathrm{CDF}_G(0) = 0.$$
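Numerically, with $s = 1$, the DDF is simply the failure density re-expressed as a function of the CDF value. The following sketch illustrates this using an exponential CDF, an assumption made here purely for illustration:

```python
import math

LAM = 0.1                                   # failure rate (assumed)

def cdf(t):                                 # exponential CDF, for illustration
    return 1.0 - math.exp(-LAM * t)

def poa(t0):                                # PoA = 1 - CDF(t0)
    return 1.0 - cdf(t0)

def ddf(p0, s=1.0, eps=1e-6):
    """DDF(P0) = (1/s) dCDF/dt evaluated at t0 = CDF^{-1}(P0)."""
    t0 = -math.log(1.0 - p0) / LAM          # invert the exponential CDF
    return (cdf(t0 + eps) - cdf(t0 - eps)) / (2 * eps * s)

print(poa(5.0))        # ~0.6065
print(ddf(cdf(5.0)))   # ~0.0607, i.e., LAM * (1 - P0) for the exponential case
```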

Fig. 5. Various examples of sub-graphs.

Fig. 6. (a) The empirical CDF of the lanl05 database compared with its best Weibull and tanh fits. (b) The DDFs of 1- and 2-component systems.

In the rest of the paper, we assume $s = 1$.

If a clique consists of two components, and if full availability is required, we have:

$$\mathrm{PoA}^{t_0}_{G_1 \cap G_2} = \mathrm{PoA}^{t_0}_{G_1}\,\mathrm{PoA}^{t_0}_{G_2} = \big(1 - \mathrm{CDF}_{G_1}(t_0)\big)\big(1 - \mathrm{CDF}_{G_2}(t_0)\big).$$

From this, we can calculate the CDF of the combined system:

$$1 - \mathrm{CDF}(G_1 \cap G_2, t_0) = \big(1 - \mathrm{CDF}_{G_1}(t_0)\big)\big(1 - \mathrm{CDF}_{G_2}(t_0)\big).$$

Then,

$$\mathrm{CDF}(G_1 \cap G_2, t_0) = \mathrm{CDF}_{G_1}(t_0) + \mathrm{CDF}_{G_2}(t_0) - \mathrm{CDF}_{G_1}(t_0)\,\mathrm{CDF}_{G_2}(t_0).$$

Therefore, the DDF of the combined system is:

$$\mathrm{DDF}(G_1 \cap G_2, P_0) = \mathrm{DDF}_{G_1}(P_{0,1}) + \mathrm{DDF}_{G_2}(P_{0,2}) - P_{0,1}\,\mathrm{DDF}_{G_2}(P_{0,2}) - P_{0,2}\,\mathrm{DDF}_{G_1}(P_{0,1}),$$

where $P_{0,1} = \mathrm{CDF}_{G_1}(t_0)$, and so on. For identical components, we have:

$$\mathrm{DDF}(G_1 \cap G_2, P_0) = 2(1 - P_{0,1})\,\mathrm{DDF}_{G_1}(P_{0,1}) = 2\sqrt{1 - P_0}\,\mathrm{DDF}_{G_1}\big(1 - \sqrt{1 - P_0}\big),$$

where $P_{0,1} = \mathrm{CDF}_{G_1}(t_0) = 1 - \sqrt{1 - P_0}$.
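The series-combination formula above can be sanity-checked numerically; the sketch below assumes exponential per-component CDFs (for illustration only) and confirms that the PoA of the two-component clique is the product of the component PoAs.

```python
import math

def cdf(t, lam):
    # Exponential CDF per component (illustrative assumption).
    return 1.0 - math.exp(-lam * t)

lam1, lam2, t0 = 0.1, 0.2, 3.0
c1, c2 = cdf(t0, lam1), cdf(t0, lam2)

combined = c1 + c2 - c1 * c2                 # CDF(G1 ∩ G2, t0)
poa_product = (1 - c1) * (1 - c2)            # PoA of the series system

assert abs((1 - combined) - poa_product) < 1e-12
print(combined, 1 - poa_product)             # identical values
```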


Various CDF functions have been used in the literature. One of the interesting CDF functions is the Weibull distribution, which has been effective for large-scale systems [19]. It has two parameters: the shape parameter $\beta$ and the scale parameter $\delta$. In contrast, in this work, we assume that the CDFs can be approximated by the tanh distribution. We define a tanh CDF distribution as:

$$\mathrm{CDF}_{x_c,x_s}(x) = \frac{1}{Z_{x_c,x_s}}\left\{\tanh\!\left(\frac{x - x_c}{2 x_s}\right) + \tanh\!\left(\frac{x_c}{2 x_s}\right)\right\},$$

where $Z_{x_c,x_s} = 1 + \tanh(x_c/(2 x_s))$ is a normalization factor, $x_c$ is the center parameter, and $x_s$ is the shape parameter. The corresponding tanh DDF function is:

$$\mathrm{DDF}_{x_c,x_s,s}(P_0) = \frac{s - s\,\tanh^2\!\left(\frac{1}{2}\log\left\{\dfrac{\exp(\frac{x_c}{x_s}) - P_0\exp(\frac{x_c}{x_s})}{P_0\exp(\frac{x_c}{x_s}) + 1}\right\}\right)}{2 x_s + 2 x_s \tanh(x_c/(2 x_s))},$$

where $P_0 = \mathrm{CDF}_{x_c,x_s}(t_0)$. As an example, the empirical CDF of the (union-interpreted) lanl05 database [28], retrieved from the Failure Trace Archive (FTA) [18], is compared with its best fits using both the Weibull distribution [18] and the tanh distribution [fitted on the logarithm of time, with optimal parameters $x_c = 5.564(\pm 0.0035)$ and $x_s = 1.577(\pm 0.0030)$] in Figure 6(a). The empirical CDF is shown in black, while the tanh and Weibull distributions are shown as solid blue and dashed red lines, respectively. For the sake of clarity, the absolute differences between the empirical distribution and the fitted distributions are also shown in percentage in the figure. As can be seen, the tanh distribution is a better fit to the real data. This is confirmed by its high p-values (compared to the traditional significance level of 0.05) with respect to the Kolmogorov-Smirnov and the Anderson-Darling goodness-of-fit (GOF) tests: 0.4999 and 0.5705, respectively. To calculate the p-values, we averaged over 1000 p-value estimations, each of which was calculated on a randomly selected set of 30 samples from the real data set. The profiles of the DDF functions of 1- and 2-component cliques using the tanh distribution are shown in Figure 6(b).

For more complex sub-graphs, and when partial availability is required, the calculations become very tedious and error-prone. An example is the graph in Figure 5(c), which is composed of 11 components. In the full-availability case, i.e., requiring all four servers to be up and connected, the graph can be broken down into two cliques of Figure 5(b) and one clique of only one network connection. Therefore, the PoA of the whole graph will be:

$$\mathrm{PoA}^{t_0}_{G_{11,1}} = \mathrm{PoA}^{t_0}_{G_{3,2}} = \big(1 - \mathrm{CDF}(G_{5,1}, t_0)\big)^2 \times \big(1 - \mathrm{CDF}(G_{1,1}, t_0)\big),$$

which can be expanded in full. The formula for the corresponding DDF is not provided because of limited space. In the case of partial availability, for example having at least three servers up and connected, the problem can be expressed as a combinatorial problem in which one instance of each of the cliques shown in Figures 5(a) and 5(b) and a network connection compose the graph.

Fig. 7. Monte Carlo validation of CDFs of 1- and 2-component systems: (a) 1-component; (b) 2-component.

Fig. 8. The simulated Monte Carlo estimation of the CDF and DDF of a 5-component system: (a) CDF; (b) DDF.

The corresponding possible cases and their inter-relations make the calculations very complex. This brings us to the second paradigm, simulated behavior analysis, presented in the next section.

B. Simulated Probabilistic Behavior Analysis

The second paradigm is based on simulations, in which the target system is implemented in a suitable environment, such as grid simulators1 or network simulators2, and the system characteristics are statistically estimated based on the properties of its components. In order to build the statistics of the system, a series of simulated experiments is performed in a Monte Carlo approach, and then some characteristics, such as CDFs, are calculated. In order to show the correctness of the paradigm, the results obtained by the Monte Carlo analysis for one- and two-component systems are estimated and shown in Figure 7. The theoretical results are also shown as dashed lines for the sake of comparison. These results have been obtained by averaging over 1000 simulations. The paradigm can easily estimate the CDF and DDF of any system. For example, the CDF and the DDF of the 5-component sub-graph of Figure 5(b) are estimated, shown in Figure 8, and compared with the two-component case. In this simulation, it is assumed that the servers, the switch, and the network connections have $t_t = 3$ and $t_r = 1$, $t_t = 3.5$ and $t_r = 1$, and $t_t = 4$ and $t_r = 4$, respectively. Some parametric models can be fitted on the simulation results to provide closed-form models. The simulated paradigm can also be used to validate the results of the theoretical models.

1 http://www.cloudbus.org/gridsim/; http://simgrid.gforge.inria.fr/
2 http://www.isi.edu/nsnam/ns/; http://www.nsnam.org/; http://www.omnetpp.org/


Fig. 9. Consolidation of components without lowering the SLO using the PoA estimation: (a) full utilization; (b) less-consuming partial utilization.

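A minimal Monte Carlo sketch for a series system is given below; exponential per-component failure times are an illustrative assumption, and the empirical CDF it returns corresponds to the kind of curve Figures 7 and 8 report.

```python
import numpy as np

def monte_carlo_cdf(rates, t_grid, runs=1000, seed=1):
    """Empirical CDF of a series system's first failure time.
    `rates` holds one exponential failure rate per component (assumed)."""
    rng = np.random.default_rng(seed)
    # One failure time per component per run; the system fails at the minimum.
    failures = rng.exponential(1.0 / np.asarray(rates),
                               size=(runs, len(rates)))
    system_failure = failures.min(axis=1)
    return np.array([(system_failure <= t).mean() for t in t_grid])

t_grid = np.linspace(0, 10, 50)
print(monte_carlo_cdf([1 / 3.0, 1 / 3.5, 1 / 4.0], t_grid)[:5])
```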

An application of the probabilistic behavior analyzers is shown in Figure 9. The required availability is two servers and one switch. By estimating the PoA of the sub-graph consisting of just one switch (shown in Figure 9(b)), the system can shut down one of the switches until the PoA reaches a predefined threshold. This not only saves a considerable amount of energy, but can also increase the lifetime of the components.
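The corresponding consolidation policy can be as simple as the following sketch; the threshold value and action names are hypothetical.

```python
def consolidation_step(poa_reduced, threshold=0.999):
    """Decide whether the redundant switch can stay powered off.
    `poa_reduced` is the estimated PoA of the one-switch sub-graph."""
    if poa_reduced >= threshold:
        return "keep-redundant-switch-off"    # save energy, SLO still met
    return "power-redundant-switch-on"        # restore redundancy

print(consolidation_step(0.9995))  # keep-redundant-switch-off
print(consolidation_step(0.9980))  # power-redundant-switch-on
```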

C. Behavior-Time Profile Modeling and Analysis

The third paradigm works directly with the time profiles of the components. These time profiles, which are collected in an opportunistic way by some agents, are learned and modeled using various machine learning and pattern recognition methods, such as Support Vector Machines (SVMs) [25], among others. The advantage of this paradigm is that it works directly with the patterns, not with their statistical moments. Therefore, it can discover behaviors which may be missed by the other paradigms. A typical scenario for time-profile behavior analysis is shown in Figure 10. In this example, the CPU and memory resource usage of a 2-component system is shown. The BAU discovers a fault at 9:45AM, when the memory usage of the second component increases, based on the models learned in the failure generating phase (section III), and the response of the CRU, based on this detection, prevents a failure at 10:00AM. This not only prevents an instance of SLA violation, it also reduces the downtime by 30 minutes.
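As a hedged illustration of this paradigm, the sketch below trains a one-class SVM (scikit-learn is assumed) on windows of normal CPU/memory profiles and flags the kind of memory ramp-up shown in Figure 10; the windowed feature design is illustrative, not the paper's actual one.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def windows(cpu, mem, width=5):
    """Stack sliding windows of the CPU and memory profiles as features."""
    feats = [np.concatenate([cpu[i:i + width], mem[i:i + width]])
             for i in range(len(cpu) - width)]
    return np.array(feats)

rng = np.random.default_rng(3)
cpu_normal = 0.4 + 0.05 * rng.standard_normal(200)
mem_normal = 0.5 + 0.05 * rng.standard_normal(200)

# Train only on normal behavior; abnormal windows fall outside the boundary.
model = OneClassSVM(nu=0.05, gamma="scale").fit(windows(cpu_normal, mem_normal))

# Faulty profile: memory usage ramps up, as in the 9:45AM event of Fig. 10.
mem_faulty = mem_normal.copy()
mem_faulty[150:] += np.linspace(0.0, 0.4, 50)
verdicts = model.predict(windows(cpu_normal, mem_faulty))  # -1 = abnormal
print((verdicts == -1).nonzero()[0][:5])  # indices of first flagged windows
```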

V. RELATED WORK

There is a large body of literature on fault detection and fault avoidance [19], [26]. In [23], several probabilistic models, such as the Weibull and hyper-exponential distributions, were fitted to empirical data available in three datasets: CSIL, Condor, and Long. The authors showed that the Weibull and exponential distributions are more accurate in modeling the behavior of resources. In [27], a data analysis was carried out on the system logs and reports of a cluster system, using time-series analysis and also rule-based classification for predicting continuous and critical events. A similar analysis was carried out in [12], in which temporal and spatial correlations among failure events in coalition systems are used. In [14], an analysis of runtime data was performed to detect anomalous behaviors; for this purpose, an outlier analysis was used.

In [11], Support Vector Machines (SVMs), random indexing, and runtime system logs were used for predicting failures. In [4], an online predictor for mission-critical distributed systems was presented; the predictor was based on the analysis of network traffic. A modular approach for integrating a cognitive operator into a system was presented in [6]. In [2], cognitive map modeling was used for failure analysis of system interaction failure modes. Finally, treatment learning, which enables tracing failures back in large-scale systems in order to identify the causing component or factor, is of great value [13]. It not only reduces the expenses and the experts' time, it also reduces the chance of secondary failures caused by the experts' human mistakes.

The main highlight of our approach compared to others is its multi-granularity nature, which comes from the various dimensions of the framework. In one dimension, multi-level analysis of the system (graphs) enables the framework to scale horizontally and vertically while avoiding the exponential computational and analysis costs associated with scaling. In another dimension, its multi-layer aspect provides a systematic and separable approach for analyzing the behavior of the non-hardware parts (software, virtualware, etc.). It can be argued that the main performance bottleneck of future systems will be rooted in the errors and faults of their non-hardware parts. Finally, in a third dimension, the multi-paradigm approach of the framework paves the way for cognitive responding in a cross-covering manner.

VI. CONCLUSION AND FUTURE PROSPECTS

A multi-paradigm, multi-layer, and multi-level cognitive behavior analysis framework has been introduced. The framework uses probabilistic (statistical inference), simulated (statistical inference by means of simulation), and time-profile modeling and analysis in order to learn and model various behaviors of complex computing systems. Its multi-paradigm approach enables validation and cross-covering among the paradigms. The framework can perform at multiple granularities thanks to its multi-level and multi-layer approach. This facilitates i) systematic horizontal, vertical, and hierarchical scaling, ii) straightforward integration of non-physical parts (software, virtualware, etc.) in the analysis, and iii) increasing the system dependability, such as the Probability of Availability (PoA), achieved by smart, cross-covered cognitive responding. Also, a new distribution, the tanh distribution, has been introduced, with promising results on a real database. The application of the framework to failure analysis and detection has been discussed in this work. The framework is specially designed toward application in open-source architectures such as OpenStack3 and OpenGSN4, which will be considered as real-system examples in future work. Furthermore, more sophisticated distributions, such as asymmetrical tanh and spline distributions, will be introduced.

3 http://www.openstack.org/
4 http://www.greenstarnetwork.com/


Fig. 10. A typical example of time-profile behavior analysis and its impact on the overall grade improvement.

ACKNOWLEDGMENTS

The authors thank the NSERC of Canada for their financial support.

REFERENCES

[1] Ethem Alpaydin. Techniques for combining multiple learners. In Proceedings of Engineering of Intelligent Systems, pages 6–12. ICSC Press, 1998.

[2] Manu Augustine, Om Yadav, Rakesh Jain, and Ajay Rathore. Cognitive map-based system modeling for identifying interaction failure modes. Research in Engineering Design, 23(2):105–124, 2012.

[3] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, 2004.

[4] Roberto Baldoni et al. Online black-box failure prediction for mission critical distributed systems. Technical Report 3/12 - 2012, MIDLAB, 2012.

[5] Reinette Biggs et al. Toward principles for enhancing the resilience of ecosystem services. Annual Review of Environment and Resources, 37(1), 2012.

[6] Sven Burmester et al. Tool support for the design of self-optimizing mechatronic multi-agent systems. STTT, 10(3):207–222, 2008.

[7] E. Martins, C.M.F. Rubira, and N.G.M. Leme. Jaca: A reflective fault injection tool based on patterns. In DSN'02, pages 483–487, Maryland, USA, 23–26 June 2002.

[8] Malin Falkenmark. Governance as a Trialogue: Government-Society-Science in Transition, chapter Good Ecosystem Governance: Balancing Ecosystems and Social Needs, pages 59–76. Water Resources Development and Management. Springer Berlin Heidelberg, 2007.

[9] Fereydoun Farrahi Moghaddam, Reza Farrahi Moghaddam, and Mohamed Cheriet. Carbon metering and effective tax cost modeling for virtual machines. In CLOUD'12, pages 758–763, Honolulu, Hawaii, USA, June 2012.

[10] Fereydoun Farrahi Moghaddam, Reza Farrahi Moghaddam, and Mohamed Cheriet. Multi-level grouping genetic algorithm for low carbon virtual private clouds. In CLOSER'12, pages 315–324, Porto, Portugal, April 18–21 2012.

[11] Ilenia Fronza, Alberto Sillitti, Giancarlo Succi, Mikko Terho, and Jelena Vlasenko. Failure prediction based on log files using random indexing and support vector machines. Journal of Systems and Software, in press, 2012.

[12] Song Fu and Cheng-Zhong Xu. Exploring event correlation for failure prediction in coalitions of clusters. In SC'07, pages 1–12, Reno, Nevada, 2007. ACM.

[13] Gregory Gay, Tim Menzies, Misty Davies, and Karen Gundy-Burlet. Automatically finding the control variables for complex system behavior. Automated Software Engineering, 17(4):439–468, 2010.

[14] Qiang Guan and Song Fu. auto-AID: A data mining framework for autonomic anomaly identification in networked computer systems. In IPCCC'10, pages 73–80, 2010.

[15] Mei-Chen Hsueh, T.K. Tsai, and R.K. Iyer. Fault injection techniques and tools. Computer, 30(4):75–82, April 1997.

[16] Bing Huang, M. Rodriguez, Ming Li, J.B. Bernstein, and C.S. Smidts. Hardware error likelihood induced by the operation of software. IEEE Transactions on Reliability, 60(3):622–639, 2011.

[17] K. Keahey, M. Tsugawa, A. Matsunaga, and J. Fortes. Sky computing. IEEE Internet Computing, 13(5):43–51, 2009.

[18] Derrick Kondo, Bahman Javadi, Alexandru Iosup, and Dick Epema. The Failure Trace Archive: Enabling comparative analysis of failures in diverse distributed systems. In CCGrid'10, pages 398–407, 2010.

[19] Antonios Litke, Dimitrios Skoutas, Konstantinos Tserpes, and Theodora Varvarigou. Efficient task replication and management for adaptive fault tolerance in mobile grid environments. Future Generation Computer Systems, 23(2):163–178, February 2007.

[20] Daniele Miorandi, Sabrina Sicari, Francesco De Pellegrini, and Imrich Chlamtac. Internet of things: Vision, applications and research challenges. Ad Hoc Networks, 10(7):1497–1516, September 2012.

[21] N. Rodrigues, D. Sousa, and L.M. Silva. A fault-injector tool to evaluate failure detectors in grid-services. In CoreGRID'07, pages 261–271, Heraklion, Crete, Greece, 12–13 June 2007.

[22] R. Natella, D. Cotroneo, J. Duraes, and H. Madeira. On fault representativeness of software fault injection. IEEE Transactions on Software Engineering, accepted, 2012.

[23] Daniel Nurmi, John Brevik, and Rich Wolski. Modeling machine availability in enterprise and wide-area distributed computing environments. In José Cunha and Pedro Medeiros, editors, Lecture Notes in Computer Science (Euro-Par 2005 Parallel Processing), volume 3648, pages 612–612. Springer, 2005.

[24] Fabio Oliveira et al. Barricade: Defending systems against operator mistakes. In EuroSys'10, pages 83–96, Paris, France, 2010. ACM.

[25] Juan José Rodríguez, Carlos J. Alonso, and José A. Maestro. Support vector machines of interval-based features for time series classification. Knowledge-Based Systems, 18(4–5):171–178, August 2005.

[26] Brent Rood and Michael Lewis. Grid resource availability prediction-based scheduling and task replication. Journal of Grid Computing, 7(4):479–500, 2009.

[27] R.K. Sahoo, A.J. Oliner, I. Rish, M. Gupta, J.E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. In KDD'03, pages 426–435, Washington, D.C., 2003. ACM.

[28] B. Schroeder and G.A. Gibson. A large-scale study of failures in high-performance computing systems. IEEE Transactions on Dependable and Secure Computing, 7(4):337–350, 2010.

[29] A.P. Snow and G.R. Weckman. What are the chances an availability SLA will be violated? In ICN'07, page 35, Martinique, 22–28 April 2007.

[30] The Climate Group. SMART 2020: Enabling the low carbon economy in the information age. Technical report, the Global eSustainability Initiative (GeSI), 2008.

[31] Annemijn Van Gorp and Catherine A. Middleton. Fiber to the home unbundling and retail competition: Developments in the Netherlands. Communications and Strategies, 78(2):87–106, June 2010.

[32] Keun Soo Yim, Z. Kalbarczyk, and R.K. Iyer. Measurement-based analysis of fault and error sensitivities of dynamic memory. In DSN'10, pages 431–436, Chicago, IL, USA, June 28–July 1, 2010.