business process impact visualization and anomaly detection

Business process impact visualization and

anomaly detection

Ming C. Hao1

Daniel A. Keim2

Umeshwar Dayal1

Jorn Schneidewind2

1Hewlett Packard Labs, Palo Alto, CA, U.S.A;2University of Konstanz, Konstanz, Germany

Correspondence: Daniel A. Keim, ComputerScience Institute, Universitat Konstanz BoxD78, Universitatsstr. 10, 78457 Konstanz,Germany.Tel: þ49 7531 88 3161;Fax: þ49 7531 88 3062;E-mail: [email protected]

Received: 25 February 2005Revised: 29 November 2005Accepted: 3 January 2006;Online publication date: 10 April 2006

AbstractBusiness operations involve many factors and relationships and are modeled as

complex business process workflows. The execution of these business processes

generates vast volumes of complex data. The operational data are instances ofthe process flow, taking different paths through the process. The goal is to use

the complex information to analyze and improve operations and to optimize

the process flow. In this paper, we introduce a new visualization technique,called VisImpact that turns raw operational business data into valuable

information. VisImpact reduces data complexity by analyzing operational data

and abstracting the most critical factors, called impact factors, which influencebusiness operations. The analysis may identify single nodes of the business flow

graph as important factors but it may also determine aggregations of nodes to

be important. Moreover, the analysis may find that single nodes have certain

data values associated with them which have an influence on some businessmetrics or resource usage parameters. The impact factors are presented

as nodes in a symmetric circular graph, providing insight into core business

operations and relationships. A cause–effect mechanism is built in to determine‘good’ and ‘bad’ operational behavior and to take action accordingly. We have

applied VisImpact to real-world applications, fraud analysis and service contract

analysis, to show the power of VisImpact for finding relationships among themost important impact factors and for immediate identification of anomalies.

The VisImpact system provides a highly interactive interface including drilldown

capabilities down to transaction levels to allow multilevel views of business

dynamics.Information Visualization (2006) 5, 15–27. doi:10.1057/palgrave.ivs.9500115

Keywords: Business intelligence; visual data mining; visualization technique

IntroductionWhat is the cause of an unfulfilled contract? If the problem continues for aday, how much will it cost? Which customer orders will be impacted? Toanswer such questions, many research efforts have focused on howto transform the business process data, as logged by the IT services, intovaluable information. However, owing to multiple factors and thecomplexity of business operations, analysts face the challenge of under-standing the underlying data and finding the important relationships fromwhich to draw conclusions.

Charts, scatter plots, and spreadsheets are commonly used, but areinappropriate to visualize relations between multiple business parametersin a single view. Thus, analysts usually need to view pages of charts andreports to obtain the information they need to make informed decisions.However, the degree of difficulty increases as business operation complex-ity grows. Therefore, analysts need tools that provide them insights into

Information Visualization (2006) 5, 15–27

& 2006 Palgrave Macmillan Ltd. All rights reserved 1473-8716 $30.00

www.palgrave-journals.com/ivs

http://www.palgrave-journals.com/ivs/

http://www.ub.uni-konstanz.de/kops/volltexte/2008/6896/

http://nbn-resolving.de/urn:nbn:de:bsz:352-opus-68969

important business factors in the context of their overallbusiness operations, for example, business operations onwhich these business factors depend on, or the impact ofthe changes of business factors.

In this context, visualization technology plays animportant role and a number of advanced visualizationtechniques for business impact analysis have beenproposed in the past. Parallel coordinates,1 for example,is a geometric projection technique used for multidimen-sional visualization and automatic classification. SeeSoft’sline representation technique2 is used to visualize Year2000 program changes. ILOG JViews is a tool foranalyzing workflow processes. E_Bizinsights is used toprovide dimension-based structures for web path analy-sis. All these example techniques can effectively displaydata for business decision making. The techniques startwith the presentation of whole business operation andprovide drill down capabilities to the analyst to obtaindetailed information.

In this paper, we introduce a new visualizationtechnique called VisImpact. This technique extendsexisting approaches by providing an integrated analysisof business process schema and business process in-stances in order to improve business operations. First,VisImpact analyzes the business process schema and datato identify the important factors that influence thebusiness metrics and resource parameters. These impor-tant impact factors and the corresponding businessprocess instances are then represented as nodes and lineson a symmetric circular graph to display the importantrelationships and details of the process flows. VisImpact isdifferent from ILOG JViews workflow visualizations inthat JViews displays the process workflow schema as astandard graph and uses standard charts to plot highlyaggregated summaries of the workflow execution data.Our approach in contrast uses advanced analysis techni-ques to determine the important impact factors, whichare then visualized as circular graphs. After presenting thebasic idea of the VisImpact technique in the next section,we introduce the analysis techniques integrated inVisImpact for extracting relevant factors from businessprocesses (Section on Determining the impact relation-ships) and describe how these information is mapped tocircular graphs (Section on The circular VisImpact graph).In the Section on The VisImpact system we describe theVisImpact system and show its applications to real-wordfraud analysis and business contract analysis withpromising results (Section on Applications). Section onEvaluation and comparison evaluates the proposedsystem. A case study showing additional applications ofthe technique is presented in Hao et al.3

Basic idea of VisImpactBusiness operation data is inherently complex, mostoften too complex to be directly visualized. Usually thebusiness operations consist of many steps and alterna-tives and every data instance may take a different paththrough the process. To understand the business process

and the impact of certain parameters, in the first stepVisImpact analyzes the process data. As an introductoryexample, Figure 1 shows a simplified version of a productorder activity process schema. Note that this businessprocess is a very simple one; realistic business processesare at least 10 times larger. To simplify the complexity ofthe visualization, VisImpact automatically abstracts im-portant attributes, called impact factors, using datamining techniques ranging from statistical correlationanalysis and partial matching to techniques used inclustering and classification analysis. Additionally, theanalyst may steer the analysis process by interactivelyselecting business attributes and metrics of interestmanually, and the system detects impact factors for thecorresponding selection. The analysis results are thenstored in an impact factor matrix. Finally, VisImpacttransforms these impact factors to nodes, with linesbetween nodes on a symmetric circular graph, represent-ing a portion of the business operation.

The goal was to provide an appropriate visual layout,that is able to represent an abstraction of the underlyingcomplex workflow, but at the same time shows therelationships between discovered relevant business im-pact factors for further visual analysis. In general, eachbusiness workflow can be modeled as directed graphcontaining three different categories of nodes: startnodes, end nodes, and inner nodes including differentnode types like decision or action nodes. The edgesbetween the nodes define the process flow. Typically theprocess flow starts at a start node, passes certain innernodes and ends at an end node. We adapted this principleby generating circular graph layouts consisting of threetypes of nodes, representing three types of nodes ofthe underlying business process schema. The edges betweenthe nodes represent the process instances and their colorrepresents an additional attribute, for example, a businessmetric. Thus, each circular graph visualizes four differentattributes:

� Source Attribute for partitioning the left side of thecircle;

� Intermediate Attribute for partitioning the center axis;� Destination Attribute for partitioning the right side;

Figure 1 A product order activity workflow.

16

� Color Attribute (colored lines) for visualizing business

metrics, such as response time, dollar amount, and

contract fulfillment degree.

Each process instance is shown as a line-connectingsource, intermediate and destination nodes. VisImpactrecords the process instances in the process flow map,so that analysts are able to track the cause–effect pathsin real time. Each circular graph represents a specialimpact relationship, and multiple circular graphsare linked together to show whole business processoperations.

In the product order activity example, VisImpact uses aclustering algorithm to identify types of customers (Gold,Silver, Regular) and order processing duration times (verylong, long, medium, short) and present it in relationshipto the outcome of the decision (Accept, Reject, Cancel) asshown in Figure 2.

The lines are colored according to the duration time. Inthe example, it is obvious that the higher the rank of thecustomer, the faster the processing of the order andthe higher the likelihood of an acceptance decision. Sincethe data is shown on an instance level (and not in anaggregated form) it becomes clear that for a regularcustomer the duration time varies largely and that thelikelihood of a reject decision increases with higherprocessing durations. Note that there are also someexceptions to this general tendency which can be seenby the red line that ends in the accept node. The secondVisImpact graph (see Figure 3) shows a second impactrelationship, namely between specific instances of thenegotiation nodes, duration time, and outcome ofthe decision. Note that in this case the negotiators arepartitioned into two specific negotiators A1 and A2, thena group of negotiators AG, and all other negotiators

Others. The four groups have been identified by theautomatic algorithm based on the similarity of thebusiness process data instances.

In the third graph (see Figure 4), we show howVisImpact can be used to analyze product order contractfulfillment, that is, the ability to deliver certain productswithin a specified time period. There is a penalty cost if acontract is delayed beyond a certain number of days. TheVisImpact system determines that penalty cost is closely

Gold

Silver

Regular

Accept

Reject

Cancel

Exception

Source Attribute (Customer type)

Intermediate Attribute (Duration time)

Destination Attribute(Outcome)

lowDuration time

high

short

medium

very long

long

Figure 2 Product order activity by process duration times

(color represents process duration time).

Gold

Silver

Regular

Accept

Reject

Customer type Negotiators Outcome

low

Duration time

high

A1

A2

AG

Others

Figure 3 Order activity by negotiators.

small business

medium business

Large enterprise

$ 500

$ 1000

Customer type Delay Penalty

low

Delay

high

3 day delayIndividual

1 day delay

2 day delay

Figure 4 Order activity by delay.

17

related to specific delay times. Note that for correlatingpenalty cost and delay time to customers, the systemproposes a different classification of the customer nodes,namely large enterprise, medium business, small business,and individual. To make large volumes of data easy toexplore and interpret, VisImpact links multiple circulargraphs and records the path of process instances in theprocess flow map. Therefore, analysts can quickly click ona node or a line to observe all linked operation flowsacross different graphs in real time.

Formal definition of VisImpactIn the following, we formally introduce the techniquesused to generate VisImpact visualizations. The first step isthe identification of all relevant impact relationship. Forthis step we perform a global correlation analysis and usepartial matching, cluster and classification analysistechniques. The result of this step are triples of relatedattributes which are then visualized as nodes and theinstances are represented as edges of a graph. Theproblem is to find a good graph layout that supportshuman problem solving and decision-making processes.There are some general requirements that graph layoutsfor human consumption should fulfil, known as aes-thetics criteria.4 Some important criteria for VisImpact aredisplay symmetry, edge crossing reduction, uniformvertex distribution, and uniform edge lengths. Addition-ally, the layout should present an ordering of the nodescorresponding to the business parameters and present anabstraction of the data. In addition, the layout shouldallow a visualization of large data volumes. A circularlayout is chosen, because it provides a good compromisebetween all requirements.5

Determining the impact relationshipsThe first step of VisImpact is to determine the mostimportant impact relationships. For this step we use(semi-) automatic data mining techniques, namelystatistical correlation analysis, partial matching techni-ques, as well as cluster and classification analysis.

Statistical correlation and similarity analysis First, wedetermine the pairwise global correlations among allmeasurements as given by Pearson’s correlation matrix.Pearson’s correlation coefficient r between bivariate data,A1i and A2i values (i¼1,y, n) is defined as

r ¼Pn

i¼1 ðA1i � �A1ÞðA2i � �A2ÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPni¼1 ðA1i � �A1Þ

2 Pni¼1 ðA2i � �A2Þ2

q ; ð1Þ

where A1 and A2 are the means of the A1i and A2i values,respectively. If two dimensions are perfectly correlated,the correlation coefficient is 1, in case of an inversecorrelation �1. In case of a perfect correlation, we canomit one of the attributes since it contains redundantinformation. In most cases, however, the correlations arenot perfect and we are interested in high correlationcoefficients and select sets of three highly correlated

attributes to be visualized in VisImpact. Other statisticalcorrelation coefficients such as the Spearman’s correla-tion are provided in VisImpact as well.

An available alternative for adjacently depicting similardimensions is to use the normalized Euclidean distance asa measure for global similarity Simglobal defined as

SimGlobal ðAi; AjÞ ¼

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiXN�1

i¼0

ðb1i � b2

i Þ2

vuut ; ð2Þ

where

bji ¼

aji � MINðAjÞ

MAXðAjÞ � MINðAjÞ:

In order to become more robust against outliers, insteadof using MAX (the 100%-quantile) and MIN (the0%-quantile), we use the 98 and 2% quantile of the attribute.The global similarity measure compares two wholedimensions such that any change in one of the dimen-sions has an influence on the resulting similarity. Thedefined similarity measure allows it to determine triplesof similar attributes for the successional visualization.Since in general, computing similarity measures is a non-trivial task, because similarity can be defined in variousways and for specific domains, the modular design of theVisImpact system allows the integration of specificsimilarity measures with little effort, like similaritymeasures proposed in the context of time series data6 orsimilarity measures presented in.7

Partial similarity In real business process applicationsglobal similarities are rare, since in most cases correla-tions only occur for certain subsets of the data. Imaginefor example two business measures over time, like theduration time for Gold and Silver customers in Figure 2.There may be short periods where the two measures showa similar behavior, for example, because of some globaldevelopment. However, it is unlikely that they behavesimilar over days or weeks. In the impact relationshipanalysis we therefore have to analyze the data for partialsimilarities. In our application scenario, we are especiallyinterested in periods where two attributes behavedsimilar. Thus, given the two variables Ak and Al, thesynchronized partial similarity8 measure is employed , todetect pairwise attributes with periods of similarity in thedata:

SimSyncðAk; AlÞ ¼

maxi;j

ðj � iÞjð0iojoNÞ

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiXj

z¼i

Czoe

vuut8<:

9=;;

ð3Þ

where Cz¼ (bzk�bz

l )2, with bij defined as above and A is

some maximum allowed dissimilarity. This partial simi-larity measure uses the length of the longest sequencethat is at least A-similar (under scaling and translationinvariance). Triples of attributes with pairwise maximumSimSync values are then selected for VisImpact analysis.Depending on the application, the partial similarity may

18

also be an Unsynchronized Partial Similarity.8 In thiscase, two dimensions do not have to be similar at thesame ‘time but in an arbitrary time frame of the samelength. Since computing partial matchings is a time-consuming process, most approaches like those proposedby Yang et al.9 and Faloutsos et al.10 also use someheuristics and index structures to speed up the computa-tion,11 that will be considered in future extensions ofVisImpact.

Cluster analysis For some attributes, the parametervalues are continuous (such as dollar amount), for others,there are large numbers of categorical values (such asexpense requestors). In order to perform a useful impactanalysis, it is important to partition the value rangesappropriately. Cluster analysis can help to do this basedon the characteristics of the data instances. The clusteranalysis may, for example, find out that – based on thecharacteristics of their product order flows – the compa-nies may be partitioned into three groups (gold, silver,regular) and the negotiators into two single ones (A1, A2)and two groups (AG, Others).

There are a large number of clustering methods whichhave been proposed in the literature, one of the mostgeneral techniques is kernel density estimation.12 Inkernel density estimation, the influence of each datapoint is modeled using a kernel function, and the overalldensity of the data is calculated as the sum of the kernelfunctions of all data points. Clusters can be derived froma density function by density based single linkage orhierarchical clustering. Owing to the large number ofanalyses that need to be performed in the VisImpactframework, we have to use an efficient implementationof kernel density estimation, and therefore the DENCLUEalgorithm13 is employed.

Classification analysis In some applications, the goal ofthe data exploration is to understand the relationshipbetween the business process data and some specificbusiness metrics such as response time, dollar amount, ordegree of contract fulfillment. If the analyst is interestedin a specific business metric, we can perform theautomatic analysis with the business metric as targetattribute. The task is to find the business processparameters that are best predicting the outcome of thetarget attribute. A well-known heuristic for this task is theGINI index, which is also used in decision-tree construc-tion. Given a business metric B which is partitioned intoa disjoint set of k classes (e.g. accept, reject) or valueranges (e.g. large, medium, small) denoted by C1,y, Ck,(B¼,i¼1

k Ci), then the GINI index of an attribute Awhich induces a partitioning of A into A1,y, Am isdefined as

InfoGainGINIðB; AÞ ¼Xm

i¼1

jAijjBjGINIðAiÞ; ð4Þ

where

GINIðAiÞ ¼ 1 �Xk

j¼1

jCjjjAij

� �2

:

The InfoGain is determined for all attributes andattribute combinations and the two attributes with thehighest InfoGain with respect to the target attribute B arechosen for visualization. Alternatively, we use theattribute Ax with the highest InfoGain and then repeatthe calculation with Ax as target attribute to find thesecond attribute to be displayed.

The circular VisImpact graphThe business impact visualization is defined as a graphG¼ (V, E), where V is a set of nodes connected by edges E.The node set V is partitioned in k subsets V1,y, Vk

depending on k partitioning attributes. Each edge (u,v)AE implies either uAVi and vAViþ 1 or uAViþ 1 andvAVi, iA1,y, k–1. The nodes V represent the set of dataitems for the corresponding k classes of V and the edgesrepresent the relationships and interactions betweenthem. An edge can have at least two attributes, showingcharacteristics of the relationship, represented by widthand color of the edge.

In the VisImpact System, a special case of circulargraph is used, where the node set V of the graph con-sists of three subsets V1, V2, V3, V¼V1

SV2

SV3,

(Vi

TVj¼f) iaj). The set of source nodes V1, is

determined by the first attribute (source attribute). Thesecond attribute (intermediate attribute) determinesthe subset V2 of intermediate nodes, and the third attribute(destination attribute) determines V3, the set of destina-tion nodes. Corresponding to the definition of thegeneral circular graph, there exist only edges e¼ (u,v)AE between V1 and V2 or V2 and V3. In order to presentthe given nodes and edges in a circular layout, let C¼ (x,y, r) be a circle with center (x, y) and radius r in the 2D-plane. We introduce a screen positioning function f:V-R2, which determines for each node vAV the x/y-position(v.x, v.y) on the circle.

Since we want to visualize the relations and in-teractions between three sets of nodes, we divide thecircle C in three regions to place the nodes from thethree sets V1, V2, V3. The nodes of V1 are placed onthe left side and the nodes of V3 are placed on the rightside of the circle, which means for all nodes vAV1

SV3

holds:

Cr2 ¼ ðvx � CxÞ2 þ ðvy � CyÞ2: ð5Þ

For all nodes viAV1 is vi.x–CxoCx and for all nodesvjAV3 is vj.x–Cx4Cx. The nodes of V2 are placed on thecenter axis of the circle, which means on a line fromPoint P1(Cx, Cy–Cr) to Point P2(Cx, CyþCr), so that forall vjAV2

vjx ¼ Cx and Cy � CrovjyoCy þ Cr:

19

Computing the node positions The placement of nodeson the circle axis is straightforward and depends only onthe selected mapping. To place nodes on the left or righthalf of the circle, the positioning function f employs theradian f to compute the position for each nodedepending on the selected mapping, as shown in Figure 5.For quantitative data, linear mapping is used to map thedata points to the left side, the right side or the centeraxis of the circle, and the radian is determined accord-ingly. Optional, the data points can be placed in anordered equidistant manner. This is especially useful forcategorical data or in cases where the analyst is moreinterested in the process flow than in exact nodevalues. The radian f for a node viAV1 is then defined asfollows:

f ¼ p� a �1

2þ i

n

� �; i ¼ 0; . . . ;n: ð6Þ

The angle a, 0oaop, describes the positioning area ofthe nodes, shown in Figure 5. In order to position thenodes of V1 on the left side of the circle (i.e. 0.5pofo1.5p), we set a¼ cp, 0oco1. For placements on theright side of the circle, that is, for positioning of all nodesvAV3, p–a has to be replaced by a in the equation above.The parameter c separates the nodes on the right and lefthalf of the circle and the nodes in the middle. The termi/n divides the drawing area, given by a, in n equi-distant locations in order to place the n nodes from V1.The radian f is used to compute a position for each nodevi A V1:

vix ¼Cx þ cosðfÞ Cr;

viy ¼Cy þ sinðfÞ Cr:

The node positions for the nodes vAV3 can becomputed analogical. Color coding and tool tip techni-ques are used to represent relevant node attributes.

Weighted node positions In order to give importantinformation in our visualization more attention, anoptional weight function can be used. Instead of justordering the nodes according to their values and thenplace the nodes on the circle in an equidistant manner,this weight function gives important nodes more spaceon the screen while less important nodes get less space,realized by a weighted computation of the radian f, asshown in Figure 5. The weight weighti of a nodeviAVl,iA(1,y, N), depends on a forth attribute A. Wedefine the weight by the ratio of vi’s attribute aiAA andthe sum of all attributes ajAA, |A|¼N, where iA(1,y, N):

weighti ¼ai

PNj¼1

aj

: ð7Þ

After computing a weight for each node, VisImpactorders the nodes by their weights and places them bystarting at the top of the circle. The weighted positioningfunction w distributes the available space on the circle tothe nodes by calculating a weighted radian fweight foreach node viAV1, iA(1,y, N):

fweight ¼p� a �1

2þ wðiÞ

� �;

wðiÞ ¼Xjoi

j¼0

weightj:

ð8Þ

In order to place the nodes in V3 on the right side of thecircle, p–a has to be replaced by a in the formula above.

Placement of categorical attributes In cases where theordering of nodes in the VisImpact visualization is notimplicitly given by the node values, for example, forcategorical attributes like customer name or customer type,the analyst is typically only interested in the process flowbetween certain attributes. The goal then is to find acircular node layout that reduces edge crossings, sincethey may reduce the readability of the resulting graph.Therefore, a placing method that reduces edge crossingsby rearranging single nodes is integrated into theVisImpact system to place nodes with no implicitordering. Since in general, the problem of finding vertexorderings that minimize edge crossings in a layered graphis NP-hard, even for three-layered graphs as used byVisImpact,14 heuristics are needed to solve even moder-ately sized problems.

Let G¼ (V, E), V¼V1

Sy

SVk, Vi

TVj¼f 3 iaj, be a

general circular graph as described above. An orderinglayer Vi, iA(1,y, k–1) is specified by a permutation pi ofVi. We express the ordering of Vi by the permutation pi.Let cross (G, p1,y,pk) be the number of edge crossings in

M (C.x, C.y)

(C.x, C.y - radius)

(C.x, C.y+radius)

V31 (x,y)

cos

sin

radius

0

0.5

1.5

V11

V12

V13

V1n

V21

Source Attribute V1

Intermediate Attribute V2

Destination AttributeV3

c

V2k

V3m

w (V11)

w (V12)

αϕ

ϕ

ϕππ

π

π

Figure 5 Computation of node positions on the circular Vismap

layout.

20

a straight line drawing of G given by p1,y,pk. Theminimum number of edge crossings that can be achievedby reordering the vertices in V1,y, Vk is denoted byopt(G):

OptðGÞ ¼ minp1;...;pk

crossðG; p1; . . . ; pkÞ: ð9Þ

Having three sets of nodes V1, V2, V3, VisImpactcomputes a minimal edge crossing by dividing thisthree-layered crossing minimization problem into twotwo-layered One Sided Crossing Minimization Problem:

Opt 0ðGÞ ¼ minpi ;piþ1

crossðG; pi; piþ1Þ; i ¼ 1; 2: ð10Þ

Opt0(G, pi) denotes the minimal attainable number ofedge crossings by fixing the permutation of Vi andreordering the nodes of Viþ 1. The Barycenter heuristic15

is used to compute such a node ordering. The basic ideaof this heuristic is to simply compute the averageposition, that is, the Barycenter, for each node and thensort the nodes according to these numbers. In typicalapplication scenarios not all three attributes will benominal or categorical without given orders, whichrestricts the crossing minimization process.

The VisImpact system

System architecture and componentsTo analyze large volumes of transaction data with manyimpact factors, VisImpact has been integrated into thevisual data mining system VisMine.16 The system uses aweb browser with a Java activator to allow real-timeinteractive visual data mining over the web. TheVisImpact architecture contains four basic components:

1. Abstract componentThe abstract component of VisImpact derives an impactfactor matrix from the input data. The system can thenautomatically show a number of VisImpact visualiza-tions based attributes with high-impact relationships.Alternatively, the analysts can select a pair of impactfactors from the matrix based on the knowledge of theuser and the application requirements. Then, from thepair of impact factors, VisImpact automatically ab-stracts the third impact factor that has the highestimpact value with the selected pair of impact factors.In addition, VisImpact provides a control window toallow analysts to real-time select impact factors forfurther analysis.

2. Layout componentThis component orders, groups, maps, and weights theabstracted impact factors and transforms their rela-tionships to lines between two nodes according to theprocess flow map. Nodes and lines are laid out on asymmetric circular graph. The width of the linerepresents the number of lines with the same processflow. The color of a line is the average value of a dataitem. Nodes and lines on the circular graph are

weighted by the average value of a data item. Nodeswith higher weights are given more space on thegraph. The label of the nodes with the highest weightsis colored red. Lines with the highest weights aredrawn last to avoid overlapping.

3. Interaction componentTo make the circular graph easy to explore andinterpret, VisImpact provides fade-in, fade-out, click-ing, and drill-down capabilities.

4. Extension componentAdditional circular graphs are generated as the numberof selected attributes grows.

Anomaly detectionVisImpact employs a process flow map to link relatedprocess flows and relationships across different circulargraphs. VisImpact instantly detects exceptions by findingred lines (exceed threshold) or crossed lines (anomaly) ina graph. For example, in fraud analysis (Section on Fraudanalysis), a high fraud amount usually is associated with ahigh fraud count. There could be a potential problem if ahigh fraud amount occurs with a low fraud count asshown later in Figure 6a. VisImpact provides the followingvisual capabilities:

� Fade-in and fade-outVisImpact allows analysts to focus on an outlier andfade in related process paths. The unrelated nodes andflows are faded out. The analyst can easily discover thesource of the problem by tracing the lines starting fromthe anomaly.

� Drill downThe analysts select a single node or a single line to drilldown to the transaction level to display multilevelviews of business dynamics.

ApplicationsWe have experimented with VisImpact for fraud analysis,a case study is presented in,3 and service contract analysisusing real-word business data.

Fraud analysisFraud is one of the major problems faced by manycompanies in the banking, insurance, and telephonyindustries. Over $ 2 billion in fraudulent transactions areprocessed yearly on electronic payments. Transformingraw transaction data into valuable business operationinformation to enable fraud analysis will save companiesmillions of dollars. Fraud analysis specialists require toolsthat help them to better understand fraud behavior andimpact factors as well as to identify unusual exceptions.Typical questions in fraud analysis are:

1. What is the fraud growth rate in recent years and whatare the impact factors?

2. Which sales region and sales type has the most fraud?3. Are there any outliers and what is their cause–effect?

21

To address these three questions, VisImpact first selectsthree highly correlated attributes from the impact factormatrix such as purchase quarter, fraud amount (aggre-gated), and fraud count. Using these three impact factors,VisImpact lays out the nodes and flows in a circular graphas shown in Figure 6a. Figure 6a shows that there are highcorrelations (more parallel lines) between fraud amountand fraud count. Colors represent the value of the fraudamount; red represents a fraud amount that is in the top10%. Most important, there is an outlier (a red line)crossing from low fraud count (5 counts) to a very highcash advance with a fraud amount of $ 28,107,100. Thisexceptional transaction might be a potential problem orerror.

To understand which sales regions or sales have themost fraud, the analyst selects the region as the sourcenode and sales type as the destination node from the userdomain knowledge and draws a second circular graph asshown in Figure 6b. From Figure 6b, it can be learned thatregion 6 has the highest fraud amount with more red,pink, burgundy and blue lines than other regions.Purchase has a higher fraud amount than cash advance.This is because purchase has many more red andburgundy lines than cash advance.

To find outliers and their root-cause, VisImpact uses theprocess flow map to identify related operation paths of atransaction record and to discover exceptions. Mostinterestingly, an outlier is seen as a red line crossingfrom cash to the high fraud amount. The analyst caneasily move the pointer to find detailed informationabout this outlier, such as the amount and purchasequarter. Investigating further, the analyst can select theregion 6 node to focus only on region 6 fraud. The fraudfrom other regions is faded out as shown in Figure 7. Theanalyst can quickly see that the outlier comes from cash

advance. This capability to trace the process flow of atransaction is crucial for finding the cause–effect relation-ship of outliers. Using the above information, thecompany is able to place strict control on certain regions

Figure 6 Fraud analysis. (A) Fraud distribution by purchase quarter: Impact factors are quarter, fraud amount and fraud count. Each

line is a transaction, color represents the value of the fraud amount: fraud increases with time (e.g. more red lines in each quarter).

2001-4Q has the highest fraud amount (red label). An outlier: a red line crosses from low fraud count to high fraud amount (other

lines are nearly parallel – high correlation) (B) Fraud distribution by region: Using fraud amount from (A) and choose two other impact

factors: region and sales type. Region 6 (red) has the highest fraud amount (on the top of the circular graph, more red, pink, and

burgundy; less green lines). Most fraud comes from purchase vs cash. Purchase has more red, burgundy, and blue lines than cash.

Figure 7 Finding the cause of the outlier: Region 6 from

Figure 6b is selected. The outlier is seen as a red line crossing

from cash to the amount of $ 28,107,100. The outlier is linked to

region 6 from the left half of the graph. The outlier is also linked

to fraud count 5 and 2000-4Q in Figure 6a. The cause of the

outlier is a cash advance that happened in 2000-4Q, region 6

with a fraud amount of $ 28,107,100 and fraud count of 5.

22

(countries) and credit card usages. After better under-standing the source of the fraud, the company will beable to take preventive action.

Service contract analysisAll businesses have relationships with customers andsuppliers; they execute business processes to obtainservices from suppliers and add value to deliver servicesto customers. Such service processes are usually modeledby service level objectives (SLOs) and contracts,17 stipu-lated between customers and suppliers. A contracttypically contains SLOs defining what service should bedelivered with what level of quality and within what timeperiod. An important question business managers need topursue is whether their business operations are fulfillingthe SLOs. This is a difficult problem, often complicatedby service performance (e.g. response time, server avail-ability) involved in the execution of business operations.

We have applied VisImpact to a real-world, large-scaledata set of service contract analysis in an effort to betterunderstand SLO operational flows, distributions, andanomalies. In this application, the SLO status indicatesthe probability of a contract becoming unfulfilled(violated), with 0 being the most probable and 4 beingthe least probable. The data set contains 10,061 servicetransactions with over 50 SLO impact factors such as SLOstatus, portal response time, search response, month, day,and hour. VisImpact maps nodes to SLO impact factors,lines to service transactions, line widths to the number ofservice transactions, and colors to the values of selectedimpact factors (i.e., search response time). Nodes areplaced in order according to the selected impact factors.

Operational flows and their distribution VisImpact firstabstracts the three most highly correlated factors fromthe impact factor matrix and constructs a circular graphas shown in Figure 8A. The source nodes show SLO status,the intermediate nodes portal response time, and thedestination nodes search response time. Nodes are con-nected with lines from the process flow map. The color ofa line is the value of the search response time (ms).Figure 8a shows that the SLO status is highly impacted byboth portal and search response times. A transaction withhigh values for search response time often has highvalues for portal response time as shown by thenumerous nearly parallel lines. Note that there are majoroutliers, shown by the red lines crossing from a low portalresponse time to a high search response time.

Time dependency VisImpact generates a second circulargraph to show the time dependency as presented inFigure 8b from user domain knowledge. The transactionprocess flows in Figure 8b are tightly linked to SLO statusin Figure 8a. In Figure 8b, the source nodes are month(6,7,8,9), the intermediate nodes are days (1–31), and thedestination nodes are hours (0–23). Figure 8B showssearch response time distribution over time. The colordenotes the search response time.

Process flow relationships between multiple circulargraphs VisImpact helps to discover that the searchresponse time is the potential root-cause of the unful-filled SLOs. This is seen through the linking of processflows across two circular graphs. As shown in Figures 9a–d,we are able to verify this relationship because (a) inFigure 9a and b the higher probability SLO status (e.g.,SLO status 4) is associated with slower response times, as

Figure 8 Service contract process flows and distribution over time. Portal response time and search response time are highly correlated

as seen by nearly parallel lines in (A). Lines with the highest search/portal response times (top 10%) are colored red. Outliers are

shown by lines crossing from low portal response time to high search response time). SLO Status 4 has the highest search response

time in (A) (more red, pink, burgundy). Month 7 and day 21 have the highest search response time – more blue and burgundy in

month 7, most red lines in day 21. High search response times occurred after the 10th day of a month (more red, pink, burgundy, and

blue).

23

seen in the blue and burgundy colors in Figure 9b thatshow the correlation and (b) in Figure 9c and d the lowestprobability SLO status (e.g. SLO status 0) is associatedwith faster response times, as seen in the yellow andgreen colors in Figure 9d.

Detection of anomalies among impact factors One of thekey functions of VisImpact is to detect process flowanomalies including outliers. In Figure 8a, VisImpacthelps to detect outliers, which are shown as thick redlines drawn from high search response times to low portalresponse times. After the analyst selects the red line andfades out all unrelated connections, a serious searchresponse time problem (occurring in month 9, day 21) isclearly shown in Figure 10a and b. The analyst can movethe pointer to drill down to the detail transaction level tofind out that the problem occurred at a time when asearch engine was unavailable, which caused a long

search time for all previously entered transactions. UsingVisImpact as a real-time monitoring system, theseanomalies can be addressed immediately before the SLOviolation probability becomes worse.

Evaluation and comparisonIn this section we evaluate the results achieved byVisImpact and compare our technique to existingapproaches. Of course, it is a difficult task to measurethe quality of visualization results, since a definitive andstrong set of methodologies for measuring the ‘goodness’or the value of a given visualization is still lacking.18 Firstattempts for providing quality measurements for scatter-plots are provided in Wilkinson et al.19 but there is still alot of research in this field necessary. Therefore, ourevaluation focuses on the comparison of VisImpact withother techniques for multivariate data analysis, inparticular scatterplots and parallel coordinates.1

Figure 9 Process flows and relationships between multiple impact factors. Graphs are generated when the analyst selects SLO status 4/

status 0 in Figure 8a. (A) and (B) show that SLO status 4 is associated with higher response times, as seen by blue and burgundy. (C)

and (D) show that SLO status 0 is associated with lower response times as seen by the yellow and green colors. An outlier is detected in

(C) and a highest search response time node day 21 in (D).

24

To investigate the usability of the VisImpact approach,future work will include a user study, as first user feedbackwas very promising. The main advantage of the proposedVisImpact approach is the integration of automatedanalysis techniques into the visualization process tofocus on relevant attributes or groups of attributes inthe underlying multivariate data sets and thus provideabstract presentations of relevant parts of the data. Thisallows it to break complex business operations down togroups of relevant business attributes that allows a betterbusiness analysis. The analysis results are provided usingan intuitive radial visualization layout. The analyst mayinteract with the system for further data analysis, forexample, by using drill-down and roll-up capabilities. Toshow the benefit of the new technique we used the dataset introduced in the section on Service contract analysisand produced scatter-plot matrices and parallel coordi-nates to analyze the data. Both techniques are commonlyused for analyzing multivariate data.

Figure 11 shows a scatterplot matrix of the data setintroduced in the section on Service contract analysis.The diagonal contains the attribute names and histo-grams showing the data distribution. Above the diagonalthe scaled absolute correlation values of the correspond-ing attributes are presented, below the diagonal the datapoints are shown. Since the scatterplot matrix shows asmany scatterplots as there are pairs of parameters, it maybe hard for the user to detect relevant patterns if thenumber of scatterplots is too high.

The figure shows that there is a high correlationbetween the attributes PortalResponseTime and SloStatus(0.76) and between PortalResponseTime and SearchRe-sponseTime (0.43). Assuming that PortalResponseTime isthe measurement of interest and the analyst is interested

in attributes which are highly correlated to this measure,he has to construct a new scatterplot matrix that onlyshows highly correlated attributes, that is, PortalRespon-seTime, SearchResponseTime and SloStatus. In VisImpactthis step is done automatically as shown in Figure 8.

Figure 10 Discover the cause of anomalies (outliers). (A) and (B) are generated when the analyst selects the node on month 9 day 21

(linked with most red lines in (B)). The lines from the anomalies in (A) are linked to the month 9, day 21, and hours (8–14) node in (B).

All unrelated lines are faded out for easy identification. The analyst is allowed to move the pointer on the red lines and nodes to display

transaction record level information, such as finding server availability in this case.

Figure 11 Scatterplot matrix showing a process flow data set

and the correlation of it’s attributes. In the diagonal, histograms

symbolize the data distribution.

25

Additionally, user feedback indicates that the circularlayout in combination with the node ordering options inthe VisImpact layout provides a much easier way for theanalyst to get insight into the relations between theattributes, since owing to overplotting effects scatterplotsare often hard to read without additional interactiontechniques like brushing.

Another common method for analyzing and visualiz-ing multidimensional data sets are parallel coordinates1

and their various extensions. Basic idea of this techniqueis to take all the axis of the multidimensional space andto arrange them in order but parallel to each other.Figure 12 shows the parallel coordinate plot of thebusiness data set introduced in the section on Servicecontract analysis. Experts in the analysis of such parallelcoordinate plots may derive a great deal of data under-standing from these plots, but the interpretation can bestrongly influenced by the order of the axes.20 There existsome tools that improve the usability of parallel coordi-nates plots by providing interaction and analysis techni-ques like the Xmdv tool,21 but they do not take thespecial needs of business analysis processes, described inthe section on Basic idea of VisImpact, into account anddo therefore not provide the same functionality as theVisImpact system. And there is still the drawback that it ishard to identify correlations or clusters between attributeaxes that are not located next to each other. Therefore,the visflow graph visualization focuses on three single oraggregated attributes in each single view and providesautomated techniques and interaction capabilities forselecting them from the available parameter set.

For example, it is difficult to detect relationshipsbetween the three attributes SloStatus, PortalResponsetime and SearchResponse time in Figure 12a, thereforethe user has to rearrange the axis either automaticallybased on a correlation measure as shown in Figure 12b ormanually. VisImpact automates this process, and presentsfor a given task in each single view only the three mostrelevant attributes or groups of them and fades outunrelated attributes. Additionally, the circular flow maplayout provides more space for placing the data points,for example, according to the weight function, assumingthat the diameter of the circle corresponds to the lengthof a single parallel coordinate axis. Ordering techniquesintegrated in VisImpact additionally help to improve thereadability of the resulting visualization for a better visualinformation representation.

ConclusionIn this paper, we presented the VisImpact technique, anew method for business impact visualization. Theapproach uses a new symmetric circular layout to generategraphs. To simplify the complexity of business operations,VisImpact uses correlation analysis, partial matching,cluster, and classifcation analysis techniques to abstractimportant business impact factors. VisImpact presentsdifferent impact factors in multiple graphs to view thedata from different perspectives. We have addressed thespecial node placement and coloring problems to makethe visualizations useful for discovering patterns andexceptions, and find the cause of business problems bytracing process paths of a transaction record.

Figure 12 (A) shows a Parallel Coordinate Plot for the Business data example introduced in the section on service contract analysis,

Color represents Search Response Time with same colormap as in Figure 8. For non-analysis experts it may be difficult to extract similar

information as shown in Figures 8–10 from the Parallel coordinate plots, even with rearranged dimensions according to attribute

correlation shown in (B), due to the missing visualization capabilities provided by VisImpact.

26

We applied the VisImpact technique to real data setsfrom a wide variety of applications, including fraudanalysis and service contract analysis. The currentexperimental studies show significant advantages of theVisImpact technique in comparison to existing techni-ques in performing root-cause analysis. Future work willinclude the application of the VisImpact technique inother areas such as capacity planning and businessfinancial activities.

AcknowledgmentsWe thank Beth Keer for her encouragement and suggestions;Jorn Schimmelpleng, Stephane Sabiani, Akhil Sahai, Fabio

Casati from HP for providing Service Level Management

techniques and SLM data; Brian Thesing for the commentsand application data, Manish Bhardwaj and Shashidhar

Iddomsetty from HP Consulting and Services for their

comments and data.

References1 Inselberg A, Dimsdale B. Parallel coordinates: a tool for visualizing

multi-dimensional geometry. In: Proceedings of the 1st Conference onVisualization ’90 (VIS ’90). IEEE Press: Los Alamitos, CA, USA, 1990;361–378.

2 Eick SG, Steffen JL, Sumner EE. Seesoft – a tool for visualizing lineoriented software statistics. IEEE Trans Softw Eng 1992; 18: 957–968.

3 Hao MC, Dayal U, Keim DA, Schneidewind J. Visbiz: a businessprocess visualization case study. In: Proceedings of the Eurographics/IEEE-VGTC Symposium on Visualization, EuroVis 2005. Leeds, UK,2005.

4 Kaufmann M, Wagner D. Drawing Graphs: Methods and Models.Springer Verlag: Berlin, 2001.

5 Chuah MC, Eick SG. Managing software with new visual representa-tions. In: Proceedings of the 1997 IEEE Symposium on InformationVisualization (InfoVis ’97). IEEE Computer Society: Washington, DC,USA, 1997; 30.

6 Agrawal R, Ling KI, Sawhney HS, Shim K. Fast similarity search in thepresence of noise, scaling, and translation in time-series databases. In:Proceedings of the 21st International Conference on Very Large DataBases (VLDB’95). Morgan Kaufmann Publishers Inc: San Francisco,CA, USA, 1995; 490–501.

7 Berchtold S, Keim DA, Kriegel H-P. Using extended feature objects forpartial similarity retrieval. The VLDB Journal 1997; 6: 333–348.

8 Mihael Ankerst, Stefan Berchtold, Daniel A.Keim. Similarity clusteringof dimensions for an enhanced visualization of multidimensionaldata. In: Proceedings of the 1998 IEEE Symposium on InformationVisualization (INFOVIS’98) 1998. IEEE Press: Los Alamitos, CA, USA,52.

9 Yang J, Wang W, Yu P. Mining asynchronous periodic patterns in timeseries data. In: Proceedings of the ACM International Conferenceon Knowledge Discovery and Data Mining (SIGKDD). ACM Press:New York, NY, USA, 2000; 275–279.

10 Faloutsos C, Ranganathan M, Manolopoulos Y. Fast subsequencematching in time-series databases. In: Proceedings of the 1994 ACMSIGMOD international conference on Management of data (SIGMOD‘94). ACM Press: New York, NY, USA, 1994; 419–429.

11 Han J, Dong G, Yin Y. Efficient mining of partial periodic patterns intime series database. In: Proceedings of the 15th International

Conference on Data Engineering (ICDE’99). IEEE Computer Society:Washington, DC, USA, 1999; 106.

12 Hinneburg A, Keim DA. Clustering techniques for large data sets fromthe past to the future. In: KDD ‘99: Tutorial Notes of the Fifth ACMSIGKDD International Conference on Knowledge Discovery and dataMining. ACM Press: New York, NY, USA, 1999; 141–181.

13 Hinneburg A, Keim DA. An efficient approach to clustering in largemultimedia databases with noise. In: The Fourth InternationalConference on Knowledge Discovery and Data Mining (KDD’98) 1998;58–65.

14 Eades P, Wormald NC. Edge crossings in drawings of bipartitegraphs. Algorithmica 1994; 11: 379–403.

15 Sugiyama K. Methods for visual understanding of hierarchical systemstructures. IEEE Transactions on Systems, Man, and Cybernetics 1981;1: 109–125.

16 Hao MC, Dayal U, Hsu M, Baker J, Deletto B. A Java-based visualmining infrastructure and applications. In: Proceedings of the 1999IEEE Symposium on Information Visualization (INFOVIS ‘99). IEEEComputer Society: Washington, DC, USA, 1999; 124.

17 Sahai A, Machiraju V, Sayal M, Van Moorsel AP, Casati F. AutomatedSLA monitoring for web services. In: Proceedings of the 13th IFIP/IEEEInternational Workshop on Distributed Systems: Operations andManagement (DSOM ‘02). Springer-Verlag: London, UK, 2002;28–41.

18 Miller N, Hetzler B, Nakamura G, Whitney P. The need for metrics invisual information analysis. In: Workshop on New paradigms inInformation Visualization and Manipulation (NPIV ‘97). ACM Press:New York, NY, USA, 1997; 24–28.

19 Wilkinson L, Anand A, Grossman R. Graph-theoretic scagnostics. In:Proceedings of the IEEE Symposium on Information Visualization(INFOVIS’05). IEEE Press: Washington, DC, USA, 2005; 21.

20 Robert Spence. Information Visualization, 1st edn. ACM Press:Addison-Weslay, 2000.

21 Rundensteiner EA, Ward MO, Yang J, Doshi PR. Xmdvtool: visualinteractive data exploration and trend discovery of high-dimensionaldata sets. In: Proceedings of the 2002 ACM SIGMOD internationalconference on Management of data (SIGMOD ‘02). ACM Press:New York, NY, USA, 2002; 631–631.

27

business process impact visualization and anomaly detection

Documents