
PROCESS PERFORMANCE MODELS – STATISTICAL, PROBABILISTIC & SIMULATION

Vishnu Varthanan Moorthy

and Team

31 Jul 2014

MODELLING IN CMMI

In the CMMI model, the Process Area ‘Organization Process Performance’ calls for establishing (and calibrating) useful Process Performance Models (PPMs). The Quantitative Project Management and Organizational Performance Management process areas benefit from these models by using them to predict outcomes or to understand uncertainties, which helps reduce risk by controlling the relevant processes/sub-processes.

PPMs are built to predict the Quality and Process Performance Objectives and sometimes Business Objectives (using integrated PPMs).

Modelling plays a vital role in CMMI in the form of Process Performance Models. In fact, we have seen organizations decide on their goals and immediately start asking what their Process Performance Model is. This is partly due to a lack of options and clarity, since in software the available data sets are small, and partly due to process variation.

WHAT ARE THE CHARACTERISTICS OF A GOOD PPM?

One or more of the measurable attributes represent controllable inputs tied to a sub-process, to enable "what-if" analyses for planning, dynamic re-planning, and problem resolution.

Process performance models include statistical, probabilistic and simulation-based models that predict interim or final results by connecting past performance with future outcomes.

They model the variation of the factors and provide insight into the expected range and variation of predicted results.

A process performance model can be a collection of models that (when combined) meet the criteria of a process performance model.

THE ROLE OF SIMULATION & OPTIMIZATION

Simulation:

It is the activity of studying the virtual behaviour of a system using a representative model/miniature, by introducing expected variations in the model factors/attributes.

Simulation helps us gain confidence in the results or understand the uncertainty levels.

Optimization:

In the context of modelling, optimization is a technique in which the model outcome is maximized, minimized or brought to a target by varying the factors (with or without constraints) and applying relevant decision rules. The factor values for which the outcome meets the expected values are used as targets for planning and for composing the process/sub-process. This helps us plan for success.

TYPES OF MODELS - DEFINITIONS

Physical Modelling:

The physical state of a system is represented using scaled dimensions, with or without similar components. Such models are common in applied physics. Example: prototype of a bridge, a satellite map, etc.

Mathematical Modelling:

With the help of data, the attributes of interest are used to form a representation of the system. These models are often used when people are largely involved in producing the outcome, or when the outcome cannot be replicated in a laboratory. Example: productivity model, storm prediction, stock market prediction, etc.

Process Modelling:

The entire flow of a process, with its factors and conditions, is modelled. These models are often useful for understanding and correcting bottlenecks in the process/system. Example: airport queue prediction, supply chain prediction, etc.

TREE OF MODELS

[Diagram: tree of models]

Models

Mathematical: Statistical, Probabilistic, Simulation

Process Model: Discrete Event, Continuous

Physical Model

Others

Variations in the Models

Static vs Dynamic

Deterministic vs Stochastic

Discrete vs Continuous

Parametric vs Nonparametric

Examples

Statistical: Regression Models, Artificial Neural Network, ARIMAX, Reliability

Probabilistic: Bayesian Belief Network, Markov Network, PNN

Simulation: Monte Carlo Simulation, Discrete Event Simulation

Discrete Event: Queuing Model + Discrete Event Simulation

Continuous: System Dynamic models

We often observe a mix of techniques or a combination of models being used to produce models that predict the system with less error.

PROCESS OF MODELLING

[Flowchart: process of modelling, limited to Mathematical and Process Modelling]

A) Model objective

B) Collection of data on relevant factors/components

C) Formulation of representation using techniques

D) Prediction of parameters

E) Compare with actuals (validation)

Yes: use for prediction.

No: refine the representation and refine the factors/components (with time and data), then predict and validate again.

MODELLING UNDER OUR PURVIEW

We will see the following models in this presentation

Regression Based Models

Bayesian Belief Networks

Neural Networks

Fuzzy Logic

Reliability Modelling

Process Modelling (Discrete Event Simulation)

Monte Carlo Simulation

System Dynamics (Continuous Simulation)

REGRESSION

Regression is the process of estimating the relationship between the dependent and independent variables, and explaining the dependent variable in terms of the conditional values of the independent variables.

As a model it is represented as Y = f(X) + error, where f has unknown parameters to be estimated.

Y – dependent variable, X – independent variables

A few assumptions related to regression:

The sample of data represents the population

The variables are random and their errors are also random

There is no multicollinearity (correlation amongst independent variables)

Here we are working with multiple regression (many X’s) and assuming linear regression (non-linear regression models also exist).

Each X factor is either a measure of a sub-process/process or a factor that influences the data set/project/sample.

Regression models are often static models built on historical data from multiple uses of the processes (many similar projects/activities).

REGRESSION - STEPS

Perform a logical analysis (e.g. brainstorming with a fishbone diagram) to identify the independent variables (X) for a given dependent variable (Y).

Collect relevant data and draw scatter plots of X vs Y, X1 vs X2, and so on. This helps us see whether there is a relationship (correlation) between X and Y, and also check for multicollinearity issues.

Perform a best-subsets study to find the subset that gives a higher R2 value and a lower standard error.

Develop a model using the relevant settings for the characteristics of the data (continuous and categorical).

From the results, study the R2 value (greater than 0.7 is good), which shows how much of the variation in Y is explained by the X’s. The higher the better.

Study the P values of the individual independent variables; each should be less than 0.05, which means the variable has a significant relationship with Y.

REGRESSION - STEPS

Study the P value from the ANOVA to understand the model fit; it should be less than 0.05.

The VIF (Variance Inflation Factor) should be less than 5 (for sample sizes below 50), otherwise less than 10. If this is violated, the possibility of multicollinearity is high and the X factors should be re-examined.

Study the residual plot; the residuals should be normally distributed, which means the prediction equation produces a best-fit line with variation on either side.

R2 alone does not say a model is the right fit in our context. It indicates that the X’s are relevant to the variation of Y, but it never says that all relevant X’s are part of the model or that there is no outlier influence. Hence, beyond that, we recommend validating the model.

The Durbin-Watson statistic is used to check for autocorrelation using the residuals, and its value ranges from 0 to 4. A value of 0 indicates strong positive autocorrelation (previous data pushes the data of the successive time period up), 4 indicates strong negative autocorrelation (previous data pushes the data of the successive time period down), and 2 indicates no serial correlation.
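The checks in these steps can be reproduced outside a statistical package. The following is a minimal sketch, assuming NumPy is available and using synthetic data for 20 hypothetical projects (the variable names, coefficients and noise level are illustrative assumptions, not the data behind the slides). It computes R2, adjusted R2, VIF and the Durbin-Watson statistic from an ordinary least squares fit.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns coefficients and residuals."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, y - A @ beta

def r_squared(y, residuals, n_predictors):
    """Return (R2, adjusted R2) for a fitted model."""
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    n = len(y)
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj

def vif(X):
    """VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing X_j on the other X's."""
    values = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        _, res = fit_ols(others, X[:, j])
        r2_j, _ = r_squared(X[:, j], res, others.shape[1])
        values.append(1.0 / (1.0 - r2_j))
    return values

def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals (0 to 4)."""
    return float(np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2))

# Synthetic, hypothetical data: 20 projects with size and design complexity as X's.
rng = np.random.default_rng(0)
n = 20
size = rng.uniform(10, 100, n)
complexity = rng.uniform(1, 5, n)
X = np.column_stack([size, complexity])
y = 50 - 0.2 * size - 3.0 * complexity + rng.normal(0, 1.5, n)

beta, residuals = fit_ols(X, y)
r2, adj_r2 = r_squared(y, residuals, X.shape[1])
vifs = vif(X)
dw = durbin_watson(residuals)
print(f"R2={r2:.3f} adjusted R2={adj_r2:.3f} VIF={vifs} Durbin-Watson={dw:.2f}")
```

The P values for the individual coefficients and for the ANOVA are left out of this sketch, since they need a t and F distribution; a statistical package reports them directly.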

REGRESSION - EXAMPLE

Assume a case where Build Productivity is Y, and Size (X1), Design Complexity (X2) and Technology (X3, categorical data) form a model because the organization believes they are logically correlated. They collect data from 20 projects, follow the steps given in the earlier slides and form a regression model. The results are as follows:

[Figure: regression output with the following callouts. Tool: Minitab outputs]

Dummy variable regression is used because the categorical data ‘Technology’ is given, and 0 and 1 are used in the regression.

A lower P value for the regression means the model is significant.

A higher adjusted R2 means all X’s contribute to explaining Y, here to 86.44%, with a low standard error.

A lower P value for all independent variables means they are significant in prediction.

A lower VIF (<5) indicates no multicollinearity risk.

Two regression equations are formed, one per Technology.

The residual plot is approximately normally distributed.

VALIDATING MODEL ACCURACY

It is important to ensure that the model we develop not only represents the system but can also predict outcomes with small residuals. In fact, this is where we can actually understand whether the model meets its purpose.

To check the accuracy we can use the commonly used MAPE (Mean Absolute Percentage Error), which calculates the percentage error between the actual and predicted values across observations:

$$\mathrm{MAPE} = \frac{100}{n} \sum_{k=1}^{n} \left| \frac{A_k - F_k}{A_k} \right|$$

where A_k is the actual value, F_k is the forecast value and n is the number of observations. An error value of less than 10% is acceptable. However, if the values of the observations being forecast are close to 0 (MAPE divides by the actual value), it is better to avoid MAPE and instead use the Symmetric Mean Absolute Percentage Error (SMAPE).
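Both accuracy measures are simple enough to compute by hand. The sketch below is illustrative only: the actual and forecast values are made up, and SMAPE has several definitions in use, so the one in the comment (absolute error divided by the average of the absolute actual and absolute forecast) is an assumption about which variant is meant.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error: 100/n * sum(|A_k - F_k| / |A_k|)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def smape(actual, forecast):
    """Symmetric MAPE (one common variant): 100/n * sum(|F_k - A_k| / ((|A_k| + |F_k|) / 2))."""
    terms = [abs(f - a) / ((abs(a) + abs(f)) / 2.0) for a, f in zip(actual, forecast)]
    return 100.0 * sum(terms) / len(actual)

# Hypothetical actual vs predicted productivity values.
actual = [100.0, 200.0, 50.0]
forecast = [110.0, 180.0, 55.0]

mape_value = mape(actual, forecast)
smape_value = smape(actual, forecast)
print(f"MAPE={mape_value:.2f}% SMAPE={smape_value:.2f}%")
```

Each forecast here is off by exactly 10% of its actual value, so MAPE comes out at 10%, right on the acceptance threshold mentioned above.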

Interpolation & Extrapolation:

Regression models are developed using a certain range of X values, and the relationship holds true within that region. Hence, for any prediction within the existing range of X’s (interpolation), we can rely on the results more. However, the benefit of a model also lies in its ability to predict a situation that has not been seen yet. In such cases we expect the model to predict in a range it has never encountered, or in a region where the entire relationship between the X’s and Y could change significantly; this is extrapolation. Small extrapolations can be considered with the uncertainty in mind, but X values far away from the data used to develop the model should be avoided, as the uncertainty level increases.
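One practical way to apply this is to check every new X value against the range seen when the model was built, before trusting a prediction. The sketch below is a hypothetical helper: the 10% tolerance that separates a "mild" from a "strong" extrapolation is an arbitrary assumption, and the factor names and values are made up.

```python
def extrapolation_report(training_rows, new_row, tolerance=0.10):
    """Classify each X in new_row against the range seen in training_rows.

    tolerance is the fraction of the training range allowed beyond either end
    before an extrapolation is called 'strong' (an assumed threshold).
    """
    report = {}
    for name, value in new_row.items():
        seen = [row[name] for row in training_rows]
        low, high = min(seen), max(seen)
        margin = tolerance * (high - low)
        if low <= value <= high:
            report[name] = "interpolation"
        elif low - margin <= value <= high + margin:
            report[name] = "mild extrapolation"
        else:
            report[name] = "strong extrapolation"
    return report

# Hypothetical training data for two X factors.
training = [
    {"size": 20, "complexity": 1.0},
    {"size": 60, "complexity": 3.0},
    {"size": 100, "complexity": 5.0},
]
result = extrapolation_report(training, {"size": 105, "complexity": 9.0})
print(result)
```

A prediction flagged as a strong extrapolation on any factor is a candidate for extra caution or for collecting more data first.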

VARIANTS IN REGRESSION

Statistical relationship models are mainly selected based on the type of data we have. Whether the X factors and the Y factor are continuous or discrete determines the technique to be used in developing the statistical model.

X's data type | Y continuous | Y discrete
Discrete | ANOVA & MANOVA | Chi-Square & Logit
Continuous | Correlation & Regression (simple/multiple/CART, etc) | Logistic Regression
Few Discrete + Few Continuous | Dummy Variable Regression | Ordinal Logit

By linearity, we can classify a regression as linear, quadratic, cubic or exponential.

Based on the type of distribution in the correlation space, we can use the relevant regression model.

TOOLS FOR REGRESSION

Regression can be performed easily using the Trendline functions of MS Excel. In addition, there are many free plug-ins available on the internet.

However, from the point of view of professional statistical tools, Minitab 17 has features that are easy for users to pick up and control. The tool has added profilers and optimizers, which are useful for simulation and optimization (earlier we depended on external tools for simulation).

SAS JMP is another versatile tool with loads of features. Anyone who has used this tool for some time tends to get hooked on its level of detail and responsiveness. JMP has had interactive profilers for quite a long time and can handle most of the calculations.

In addition, we have SPSS and Matlab, which are also well known.

R is the open source statistical package, which can be extended with relevant add-ins to develop many models.

We recommend considering the experience and competency level of users, licensing cost, complexity of modelling and the ability to simulate and optimize when deciding on the right tool.

Some organizations decide to develop their own tools because their existing data is in other formats; however, we have seen such attempts rarely sustain and succeed. The reasons include too much elapsed time, priority changes, complexity in algorithm development, limited usage, etc. Since most tools support common formats, organizations can consider producing reports/data in these formats to feed into proven tools/plug-ins (it's just a word of free advice).

BAYESIAN BELIEF NETWORKS

A Bayesian Network is a construct in which the probabilistic relationships between variables are used to model and calculate the joint probability of a target.

The network is based on nodes and arcs (edges). Each variable is represented by a node, and its relationship with another node is expressed using an arc. If a node depends on another variable, that variable is its parent node. Similarly, if some other node depends on this node, that node is its child node. Each node carries certain parameters (e.g. Skill is a node with the parameters High, Medium and Low), each with a probability of occurrence (e.g. High 0.5, Medium 0.3, Low 0.2). When there is conditional dependence (the node has a parent), its joint probability is calculated by considering the parent nodes (e.g. Analyze Time being “Less than 4 hrs” or more depends on Skill being High/Medium/Low, which gives 6 different probability values).

The central idea of using this in modelling is that the posterior probability can be calculated from the prior probability of a network that has been developed from beliefs (learning). It is based on Bayes’ Theorem.

Bayesian networks are widely used in the medical field, speech recognition, fraud detection, etc.

Constraints: The method and its supporting learning need assistance, and the computational needs are also high. Hence its usage is minimal in the IT industry; however, with relevant tools in place it becomes more practical to use in IT.
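The Skill and Analyze Time example can be worked through by hand to see how the joint, marginal and posterior probabilities relate. In the sketch below, the Skill priors (0.5, 0.3, 0.2) come from the example above, while the conditional probabilities of Analyze Time being under 4 hours for each skill level are hypothetical values chosen only for illustration.

```python
# Prior probabilities of the parent node (from the example in the text).
prior_skill = {"High": 0.5, "Medium": 0.3, "Low": 0.2}

# P(Analyze Time < 4 hrs | Skill): hypothetical values, not taken from real data.
p_fast_given_skill = {"High": 0.9, "Medium": 0.7, "Low": 0.4}

# Joint probability table: 3 skill levels x 2 time states = 6 values.
joint = {}
for skill, p_skill in prior_skill.items():
    p_fast = p_fast_given_skill[skill]
    joint[(skill, "<4h")] = p_skill * p_fast
    joint[(skill, ">=4h")] = p_skill * (1.0 - p_fast)

# Marginal probability of the child node.
p_fast_marginal = sum(p for (_, state), p in joint.items() if state == "<4h")

# Bayes' theorem: P(Skill | Analyze Time < 4 hrs) = joint / marginal.
posterior = {skill: joint[(skill, "<4h")] / p_fast_marginal for skill in prior_skill}

print(f"P(Analyze Time < 4h) = {p_fast_marginal:.2f}")
print({skill: round(p, 3) for skill, p in posterior.items()})
```

Observing a fast analysis raises the belief that the skill was High above its prior of 0.5, which is the prior-to-posterior update that the network tools automate across many nodes.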

BAYESIAN BELIEF NETWORKS- STEPS

We are going to discuss BBN mainly using the BayesiaLab tool, which has all the expected features to build a comprehensive model, optimize the network and indicate the variables for optimization. We will discuss other tools in an upcoming slide.

A) In Bayesian networks, the data of variables can be in discrete or continuous form; however, continuous data will be discretised using techniques like K-means, equal distance, manual and other methods.

B) The data has to be complete for all observations in the data set for the variables; otherwise the tool helps us fill in the missing data.

C) The structure of the network is important, as it determines the relationships between variables; however, it often represents a dependency rather than a cause-and-effect relationship. Domain experts together with process experts can define the structure (with relationships) manually.

D) As an alternative, machine learning is available in the tool: a set of observations is passed to the tool and, using the learning options (structured and unstructured), the tool plots the possible relationships. The tool uses MDL (Minimum Description Length) to identify the best possible structure. However, we can logically modify the flow by adding/deleting arcs (and then perform parameter estimation to update the conditional probabilities).

BAYESIAN BELIEF NETWORKS- STEPS

E) To ensure that the network is fit for prediction, we have to check the network performance. Normally this is done using test data (separated from the overall data set) to check the accuracy; otherwise the tool takes the whole set to compare the model-predicted values against the actual values. This gives the accuracy of the network in prediction. Anything above 70% is good for prediction.

F) In other models we perform simulation to see the uncertainty in achieving a target, but in a probabilistic model that step is not required, as the model directly gives the probability of achieving it.

G) To perform what-if analysis and understand the role of each variable in maximizing the probability of the target or improving the mean of the target, we can do target optimization. This lets us run a number of trials within the boundaries of variation and see the best-fit values of the variables that give a high probability of achieving the target. Using these values we can compose the process and monitor the sub-process statistically.

H) When we know some of the parameters with certainty, we can set them as hard evidence and calculate the probability. (E.g. if design complexity or skill is a known value, it can be set as hard evidence and the probability of productivity can be calculated.)

I) The Arc Influence diagram helps us understand the sensitivity of the variables in determining the target.
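Steps G and H can be imitated on a toy network by brute force. The sketch below is not how BayesiaLab works internally; it assumes a deliberately simple structure in which Skill, KEDB and ATAT are independent parent nodes of TTAT, and every probability in it is hypothetical. Setting hard evidence fixes a node's state, and "target optimization" here is simply an exhaustive search over all state combinations.

```python
from itertools import product

# Hypothetical priors for three independent parent nodes.
priors = {
    "Skill": {"High": 0.5, "Low": 0.5},
    "KEDB": {"Yes": 0.4, "No": 0.6},
    "ATAT": {"Met": 0.6, "NotMet": 0.4},
}

# Hypothetical conditional table: P(TTAT = Good | Skill, KEDB, ATAT).
p_good = {
    ("High", "Yes", "Met"): 0.90, ("High", "Yes", "NotMet"): 0.60,
    ("High", "No", "Met"): 0.75, ("High", "No", "NotMet"): 0.40,
    ("Low", "Yes", "Met"): 0.70, ("Low", "Yes", "NotMet"): 0.35,
    ("Low", "No", "Met"): 0.50, ("Low", "No", "NotMet"): 0.20,
}

def prob_good(evidence=None):
    """P(TTAT = Good | hard evidence), summing over the nodes left unobserved."""
    evidence = evidence or {}
    names = list(priors)
    total = 0.0
    for combo in product(*(priors[name] for name in names)):
        assignment = dict(zip(names, combo))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the hard evidence
        weight = 1.0
        for name in names:
            if name not in evidence:
                weight *= priors[name][assignment[name]]
        total += weight * p_good[combo]
    return total

baseline = prob_good()
with_skill_high = prob_good({"Skill": "High"})

# Exhaustive "target optimization": the combination maximizing P(TTAT = Good).
names = list(priors)
best_combo = max(
    product(*(priors[name] for name in names)),
    key=lambda combo: prob_good(dict(zip(names, combo))),
)
best_probability = prob_good(dict(zip(names, best_combo)))
print(baseline, with_skill_high, best_combo, best_probability)
```

With realistic networks the parents are rarely independent and the search space grows quickly, which is why dedicated inference and optimization features in a tool matter.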

BAYESIAN - SAMPLE

Assume a case in which we have a goal for Total Turn Around Time (TTAT) with the parameters Good (<=8 hrs) and Bad (>8 hrs). The variables that influence it are Skill, KEDB (Known Error Database) Use, and ATAT (Analyse Turn Around Time) with Met (<=1.5 hrs) and Not Met (>1.5 hrs). How do we go about Bayesian modelling based on the previous steps? (Each incident is captured with such data, and around 348 incidents from a project are used.)

[Figure: tool output with the following callouts]

Directed Acyclic Graph (DAG), i.e. the network with nodes (Skill, etc.) and arcs (connectors) with TTAT.

Probability of each variable and its parameters, using conditional and joint probability.

Total precision is 66.38%, which compares the actual vs model-predicted values; in this case it is marginal to accept.

Ex: The actual count of Bad is 149 in the model and it is predicted 104 times.

The current probability of the Good state of TTAT is 57.46% (refer to the first picture) and after optimization the optimal probability is 70% (ATAT 0 refers to Met, Skill 1 is High and KEDB 1 is Yes).

Target optimization is set to maximize the probability.

BAYESIAN TOOLS

There are a few tools we have worked on to get hands-on experience. When selecting a tool for Bayesian modelling, it is important that the tool can machine learn, analyze and compare networks, and validate the models. In addition, the tool should have optimization capabilities.

GENIE is a tool from the University of Pittsburgh, which can help us learn the model from the data. The joint probability is calculated in the tool, and using hard evidence we can see the final change in probabilities. However, the optimization part (what-if) is more a matter of trial and error and is not performed with a specialized option.

We can use Excel to develop the joint probabilities and verify the values and accuracy of the network against GENIE. The Excel sheet can then be used as input for simulation and optimization with any other tool (e.g. Crystal Ball), and what-if analysis can be performed. For sample sheets, please connect with us at the mail id given in the contact details.

In addition, we have looked at Bayes Server, which also makes building the model simple; however, the optimization part is not as easy as we thought.

NEURAL NETWORK

In general we call it an “Artificial Neural Network (ANN)” because it behaves like the neurons of the human brain (a simpler version of them). The network is made of input nodes and output nodes, which are connected through hidden nodes and links (which carry weights). Just as the human brain trains its neurons through various instances/situations and shapes its reaction to them, the network learns the inputs and the corresponding reaction in the outputs through an algorithm, using machine learning.

There are single-layer feed-forward, multilayer feed-forward and recurrent network architectures. We will look at the single-layer feed-forward architecture here, in which a single layer of nodes uses the inputs to learn towards the outputs.

In a neural network we need the network to learn and develop the patterns and reduce the overall network error. Then we validate the network using a proportion of the data to check the accuracy. If the total mean squared error for learning and validation is low, the network is stable. (In the backpropagation method, the weights of the links are adjusted iteratively through forward and backward passes.)

In general we are expected to use continuous variables; however, discrete data is also supported by the newer tools. An Artificial Neural Network is a black-box technique: the inputs are used to determine the outputs, but through hidden nodes that cannot be explained by mathematical relationships/formulas. It is a non-linear method, which tends to give better results than linear models when the underlying relationships are non-linear.
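To make the forward and backward pass concrete, here is a minimal sketch of a network with one hidden layer trained by backpropagation, assuming NumPy is available. The synthetic data, the tanh activation, the learning rate and the number of epochs are all illustrative assumptions; tools such as JMP use their own fitting algorithms.

```python
import numpy as np

# Synthetic, hypothetical data: two inputs and a non-linear response.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = (np.sin(3 * X[:, 0]) + X[:, 1] ** 2).reshape(-1, 1)

# Hold back 20% of the rows for validation.
split = 160
X_train, y_train = X[:split], y[:split]
X_val, y_val = X[split:], y[split:]

n_hidden = 4  # rule of thumb from the slides: count of X's * 2
W1 = rng.normal(0, 0.5, (2, n_hidden))
b1 = np.zeros((1, n_hidden))
W2 = rng.normal(0, 0.5, (n_hidden, 1))
b2 = np.zeros((1, 1))

def forward(inputs):
    """Forward pass: inputs -> tanh hidden layer -> linear output."""
    hidden = np.tanh(inputs @ W1 + b1)
    return hidden, hidden @ W2 + b2

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

initial_train_mse = mse(forward(X_train)[1], y_train)

learning_rate = 0.05
for epoch in range(3000):
    hidden, pred = forward(X_train)
    # Backward pass: gradients of the mean squared error w.r.t. each weight.
    grad_out = 2.0 * (pred - y_train) / len(X_train)
    grad_W2 = hidden.T @ grad_out
    grad_b2 = grad_out.sum(axis=0, keepdims=True)
    grad_pre = (grad_out @ W2.T) * (1.0 - hidden ** 2)  # derivative of tanh
    grad_W1 = X_train.T @ grad_pre
    grad_b1 = grad_pre.sum(axis=0, keepdims=True)
    # Gradient descent step on every weight and bias.
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1

final_train_mse = mse(forward(X_train)[1], y_train)
val_mse = mse(forward(X_val)[1], y_val)
print(f"train MSE {initial_train_mse:.3f} -> {final_train_mse:.3f}, validation MSE {val_mse:.3f}")
```

Comparing the training and validation errors at the end is the same stability check described above: a validation error far above the training error suggests the network has memorised the training rows.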

NEURAL NETWORKS - STEPS

We are going to explain neural networks using the JMP tool from SAS. As we discussed under regression, this tool is versatile and provides detailed statistics.

A) Collect the data, check it for any high variations and check its accuracy.

B) Use Analyze->Modeling->Neural in the tool and provide the X and Y details. In JMP we can give discrete data as well without any problem.

C) In the next step we are expected to specify the number of hidden nodes we want. Since the standard version of JMP allows a single layer of nodes, we may specify, as a rule of thumb, (count of X’s * 2).

D) We need to specify the method by which the data will be validated. If we have enough data (rule of thumb: data count > count of X’s * 20), we can go ahead with the ‘Holdback’ method, where a certain percentage of the data is kept only for validating the network. Otherwise we can use KFold and give the number of folds (each fold will also be used for validation). In the Holdback method, keep 0.2 (20%) for validation.

E) We get the results with the Generalized RSquare. A value near 1 means the network is contributing to prediction (the variables explain the output well through this neural network). We also have to check the validation RSquare to see how good the results are. Only when the training and validation results are nearly the same is the network stable and usable for prediction. In fact, the validation results in a way give the accuracy of the model, and their error rate is critical to observe.

NEURAL NETWORKS - STEPS

F) The Root Mean Squared Error should be minimal. Typically you can use the Fit Model option in JMP, which fits the best linear models, and compare its RSquare value with the neural network outcome.

G) The best part of JMP is its interactive profiler, which shows the X values and the Y outcome graphically. We can interactively move the values of the X’s and see the change in Y, as well as how the other X’s react at that combination.

H) The profiler comes with a sensitivity indicator (triangle based) and a desirability indicator. This acts as an optimizer: we can set the value of Y we want using specification limits/graphical targets, and see the range of X’s with which we will be able to get it. There are maximization, minimization and target options for Y.

I) Simulation is available as part of the profiler itself. We can fix the values of the X’s (with variation), and the tool provides simulation results using the Monte Carlo simulation technique, which helps us understand the uncertainties.
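The validation rule of thumb in step D and the simulation in step I can be sketched without a profiler. In the example below, `predict_ttat` is a hypothetical stand-in for a fitted model, and the input distributions, the 8-hour upper specification limit and the 5000 runs are assumptions for illustration only.

```python
import random

def choose_validation(n_rows, n_inputs):
    """Rule of thumb from step D: holdback if rows > 20 * count of X's, else k-fold."""
    return "holdback" if n_rows > 20 * n_inputs else "kfold"

def predict_ttat(atat_hours, skill, kedb_used):
    """Hypothetical stand-in for a fitted model that predicts TTAT in hours."""
    hours = 3.0 + 2.5 * atat_hours
    hours -= {"H": 1.5, "M": 0.5, "L": 0.0}[skill]
    hours -= 1.0 if kedb_used else 0.0
    return hours

def simulate(runs=5000, upper_spec_limit=8.0, seed=42):
    """Monte Carlo: sample the X's from assumed distributions and push them through the model."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(runs):
        atat = max(0.1, rng.gauss(1.5, 0.5))            # assumed normal variation
        skill = rng.choices(["H", "M", "L"], weights=[0.5, 0.3, 0.2])[0]
        kedb_used = rng.random() < 0.6                   # assumed 60% KEDB usage
        outcomes.append(predict_ttat(atat, skill, kedb_used))
    mean_ttat = sum(outcomes) / runs
    share_within_spec = sum(value < upper_spec_limit for value in outcomes) / runs
    return mean_ttat, share_within_spec

mean_ttat, share_within_spec = simulate()
print(choose_validation(170, 3), round(mean_ttat, 2), round(share_within_spec, 3))
```

The share of runs inside the specification limit is the kind of confidence figure the simulation step is meant to provide; fixing one input to a constant instead of sampling it mimics setting a known value in the profiler.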

NEURAL NETWORKS - SAMPLE

Assume a case in which we have a goal for Total Turn Around Time (TTAT) (less than 8 hrs is the target). The variables that influence it are Skill (H, M, L), KEDB (Known Error Database) Use (Yes, No) and ATAT (Analyse Turn Around Time). How do we go about neural networks based on the previous steps? (Around 170 data points collected from a project are used.)

[Figure: tool output with the following callouts]

In this case, Skill and KEDB are discrete, while ATAT and TTAT are continuous. Since we give ‘TTAT’ as Y, the machine performs structured learning.

Here we are using the Holdback method for validation, with 20% of the data used for validation and 6 hidden nodes for the activation function.

RSquare for training and validation is more than 0.9, which means Y is explained well by this network and it is stable.

Structure of the inputs, hidden nodes/activation function and output.

RMSE is lower compared to other models.

The simulation function is set up with 5000 runs, and spec limits for Y are given (an output table is possible).

Each variable can be interactively specified with a random or fixed value. This helps in prediction with known/unknown values.

The desirability function shows the specs of the X’s where TTAT achieves the expected value, and the simulation shows the confidence level.

NEURAL NETWORK TOOLS

Matlab has a neural network toolbox, which seems to be user friendly and has many options and logical steps to understand and improve the modelling. What we are not sure about is its simulation and optimization capabilities. The best part is that it provides relevant scripts, which can be modified or run along with existing tools.

JMP has limitations when it comes to neural networks, as only a single hidden layer can be created and the options to modify the learning algorithm are limited. However, JMP Pro has relevant features with many options to fit our need for customization.

Minitab at this moment does not have neural networks. However, SPSS contains neural networks with the capability to form multiple hidden layers.

Nuclass 7.1 is a free tool (the professional version has a cost) that specializes in neural networks. There are many options available to customize the model. However, it is not as easy to use as JMP or SPSS.

PEERForecaster and Alyuda Forecaster are Excel-based neural network forecasting tools. They make it easy to build the model; however, simulation and optimization with controllable variables remain a question mark with these tools.

RELIABILITY MODELLING

Reliability is an attribute of a software product that expresses the probability of performing at the expected level without any failure. The longer the software works without failure, the better the reliability. Reliability modelling is used in software under different conditions, such as defect prediction based on phase-wise defect arrival or testing defect arrival patterns, warranty defect analysis, forecasting the reliability, etc. Reliability is measured on a scale of 0 to 1, where 1 is more reliable.

There is time-dependent reliability, where time is an important measure because defects occur with time, wear-out, etc. There is also non-time-dependent reliability; in this case, though time is the measure that communicates the defect, the defect does not happen just through the passing of time but through executing faulty programs/code over a span of time. This concept is used in the software industry for MTTR (Mean Time To Repair), Incident Arrival Rate, etc.

Software reliability models are normally designed with a distribution curve whose shape shows defect identification/arrival over time reducing from a peak towards a low and flatter trajectory. The shape of the curve is the best-fit model, and most commonly we use the Weibull, logistic, lognormal and smallest extreme value probability distributions to fit. In software it is also possible that every phase or period has a different probability distribution.

Typically the defect data can be used in terms of the count of defects in a period (e.g. 20/40/55 in a day) or the defect arrival time (e.g. 25, 45, 60 minutes between successive defects). The PDF (Probability Density Function) and CDF (Cumulative Distribution Function) are important measures to understand the pattern of defects and to predict the probability of defects in a period/time, etc.
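As an illustration of how the PDF, the CDF and reliability relate, the sketch below uses the two-parameter Weibull distribution, one of the distributions named above. The shape and scale values are hypothetical; in practice they come from fitting the defect arrival data.

```python
import math

def weibull_pdf(t, shape, scale):
    """Weibull probability density at time t (returns 0 for t <= 0 in this sketch)."""
    if t <= 0:
        return 0.0
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

def weibull_cdf(t, shape, scale):
    """Probability that the failure/defect has occurred by time t."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))

def reliability(t, shape, scale):
    """Probability of running beyond time t without failure: R(t) = 1 - CDF(t)."""
    return 1.0 - weibull_cdf(t, shape, scale)

# Hypothetical fitted parameters: shape 1.5, scale 300 hours.
shape, scale = 1.5, 300.0
for hours in (100, 250, 500):
    print(hours, round(weibull_cdf(hours, shape, scale), 3), round(reliability(hours, shape, scale), 3))
```

With a shape of 1 the Weibull reduces to the exponential distribution (constant failure rate); a shape above 1 means the failure rate grows with time.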

RELIABILITY MODELLING- STEPS

We will work on reliability again using JMP, which is well suited to this type of modelling. We will apply reliability to study defect arrival in a maintenance engagement, where the application design complexity and the skill of the people maintaining the software vary. Remember that when we develop a model, we expect something controllable to be in it; if not, these models are only time dependent and can help only in prediction, not in controlling.

In reliability we call the influencers that impact failure Accelerators. We can use the weights or priority of defects as the frequency, and for data points where we are not sure about the time of failure, we use Censor. Right censoring is for a value for which you know only the minimum time beyond which it failed, and left censoring is for a value for which you know only the maximum time within which it failed. If you know the exact value, then by default it is uncensored. There are many variants within reliability modelling; here we are going to use only Fit Life by X modelling.

A) Collect the data as defect arrival times or as defect counts over time. In this case we are going to use Fit Life by X, so we can collect it as time between defects. Also record the application complexity and team skill level along with each data entry.

B) Select “Time to Event” as Y, select the accelerator (complexity measure) and use skill as the separator.

C) Different distributions, categorized by application complexity, are available. Here we have to check the Wilcoxon Group Homogeneity Test for the P value (should be less than 0.05) and the ChiSquare value (should be minimal).

RELIABILITY MODELLING- STEPS

D) To select the best-fit distribution, look at the comparison criteria given in the tool, which shows the -2 log-likelihood, AICc and BIC values. Here the AICc (corrected Akaike’s Information Criterion) should be minimal for the selected distribution. BIC is the Bayesian Information Criterion, which is stricter as it takes the sample size into consideration. (In other tools we might have Anderson-Darling values; in that case select the distribution whose value is less than or around 3, or the lowest.)

E) For the best-fit distribution, study the results for the P value and look at the residual plot (Cox-Snell Residual P-plot) for the distribution of the residuals.

F) The Quantile tab in this tool is used for extrapolation (e.g. in Minitab, we can provide new parameters in a column and predict the values using the estimate option) and for predicting the probability.

G) The variation of the accelerator can be configured, and the probability is normally kept at 0.5 to look at the 50% chance, i.e. the median; the expected mean time can then be set as the LSL and/or USL accordingly. The simulation results will tell us the mean and SD, with graphical results.

H) For optimization while maintaining the accelerator, we can use the Set Desirability function, give a target for “Y” and check the values.

I) Under the Parametric Survival option in JMP, we can check the probability of a defect arrival in a given time, using application complexity and skill level.
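The comparison criteria in step D can be computed directly from a log-likelihood. The sketch below uses the standard definitions AIC = 2k - 2 ln L, AICc = AIC + 2k(k+1)/(n-k-1) and BIC = k ln(n) - 2 ln L, and compares two candidate distributions fitted by closed-form maximum likelihood. The times between defects are hypothetical, all observations are treated as uncensored, and no accelerator is included, so it is a simplification of what Fit Life by X does.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Return -2 log-likelihood, AICc and BIC for a fitted distribution."""
    aic = 2 * n_params - 2 * log_likelihood
    aicc = aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    return {"-2logL": -2 * log_likelihood, "AICc": aicc, "BIC": bic}

def exponential_loglik(times):
    """Log-likelihood at the maximum likelihood estimate rate = n / sum(times)."""
    rate = len(times) / sum(times)
    return len(times) * math.log(rate) - rate * sum(times)

def lognormal_loglik(times):
    """Log-likelihood at the maximum likelihood estimates of mu and sigma^2."""
    logs = [math.log(t) for t in times]
    n = len(logs)
    mu = sum(logs) / n
    variance = sum((x - mu) ** 2 for x in logs) / n
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * variance) - 0.5 * n

# Hypothetical times between defects, in hours.
times_between_defects = [12.0, 30.0, 45.0, 60.0, 75.0, 110.0, 150.0, 210.0, 300.0, 420.0]
n = len(times_between_defects)

candidates = {
    "Exponential": information_criteria(exponential_loglik(times_between_defects), 1, n),
    "Lognormal": information_criteria(lognormal_loglik(times_between_defects), 2, n),
}
best = min(candidates, key=lambda name: candidates[name]["AICc"])
for name, values in candidates.items():
    print(name, {key: round(value, 2) for key, value in values.items()})
print("Lowest AICc:", best)
```

Both AICc and BIC penalise extra parameters, so a two-parameter distribution has to improve the likelihood enough to beat a one-parameter one.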

RELIABILITY MODELLING- SAMPLE

Let's consider the previous example, where the complexity of the applications is maintained at different levels (controllable, assuming the code and design complexity is altered with preventive fixes and analysers) and is an accelerator for defect arrival time (Y), and the skill of the team also plays a role (assuming the applications have been running for quite some time and many fixes have been made). In this case, we want to know the probability of having a mean time of arrival of a defect/incident beyond 250 hrs…

[Figure: tool output with the following callouts]

Select Y and the X’s. Select the default Arrhenius Celsius as the relationship between X and Y.

The best distribution is ordered based on the AICc value in JMP. You can select Nonparametric if the fit is not proper.

Check the P value and ChiSquare.

Check the P value and ChiSquare value of the best distribution (here it is the Weibull). Study the Cox-Snell residual plot for normality; in this case it is normal, so it is a good fit.

Select the Quantile profiler and look at the code complexity variation and the probability level desirability for the Y value. We can simulate and see, for the given X and probability, where the confidence interval of the data falls.

Using the Parametric Survival plot (another option under reliability), we have estimated, for a sample value of 0.54 code complexity and skill 3 and 4, what is the probability of mean time 200… and the probability of survival (or happening) is higher for skill 3.

RELIABILITY MODELLING- TOOLS

Minitab also has reliability modelling and can perform almost all the types of modelling that other professional tools offer. People who are comfortable with Minitab can use these options. However, we have to remember that simulation and optimization are also needed for modelling in CMMI, so we may need to generate outputs, create ranges, and simulate and optimize using Crystal Ball (or any simulation tool).

Reliasoft RGA is another tool with extensive features in reliability modelling. It is a comparatively user-friendly tool and is worth a try if reliability is our key concern.

R: though we don’t talk much about this free statistical package, it comes with loads of add-on packages for every need. We have never tried it, maybe because we are lazy and don’t want to leave the comfort of the GUI abilities of other professional tools.

CASRE and SMERFS are free tools, which we have used in some contexts. However, we never tried accelerators with these tools, so we are not sure whether they have the option of Fit Life by X modelling. For reliability forecasting and growth they are useful at no cost.

The Matlab statistics toolbox also contains reliability modelling features. The SPSS reliability features are good enough for our needs in the software industry. However, JMP is good from the point of view that you need only one tool, which gives modelling, simulation and optimization.

PROCESS MODELLING (QUEUING SYSTEM)

A queuing system is one in which entity arrivals create demand that has to be served by the limited resources assigned in the system. The system distributes its resources to handle various events at any given point in time. The events are handled as discrete events in the system.

A number of queuing systems can be created; however, they are all based on the arrival of elements, server utilization, wait time/time spent in the system, and flows (between servers and within the servers). Discrete events help the queuing model to capture the time stamps of different events and model their variation along with the queue system.

This model helps to understand the resource utilization of servers, bottlenecks in the system events, idle time, etc. Discrete Event Simulation with queues is used in many places like banks, hospitals, airport queue management, manufacturing lines, supply chains, etc.

In the software industry we can use it in application maintenance incident/problem handling, dedicated service teams/functions (ex: estimation team, technical review team, procurement, etc.), standard Change Request handling, and in many contexts where the arrival rate and team size play a role in delivering on time.

We also need to remember that in the software context an element which enters the queue stays in the queue until it is serviced and then departs, unlike in a bank or hospital, where a customer or patient who arrives late may not be serviced and leaves the queue.
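To make the idea concrete, here is a minimal discrete event sketch of such a queue in Python: tickets arrive at random, wait in a first-come-first-served queue without leaving, and are handled by a fixed number of identical team members. The exponential arrival and service times and all numeric values are assumptions for illustration only; a real model would use distributions fitted from the project's own data.

```python
import heapq
import random

def simulate_queue(arrival_rate, service_rate, servers, horizon, seed=0):
    """Simulate a FIFO multi-server queue where nobody abandons the queue.

    arrival_rate: mean tickets arriving per hour (assumed exponential gaps)
    service_rate: mean tickets one person closes per hour (assumed exponential)
    servers:      number of people serving the queue
    horizon:      hours of arrivals to simulate
    """
    rng = random.Random(seed)
    free_at = [0.0] * servers      # time at which each person becomes free
    heapq.heapify(free_at)
    now, waits, busy_hours = 0.0, [], 0.0
    while True:
        now += rng.expovariate(arrival_rate)       # next ticket arrives
        if now > horizon:
            break
        earliest_free = heapq.heappop(free_at)     # first person available
        start = max(now, earliest_free)            # ticket waits if all are busy
        service = rng.expovariate(service_rate)
        heapq.heappush(free_at, start + service)
        waits.append(start - now)
        busy_hours += service
    served = len(waits)
    return {
        "served": served,
        "mean_wait": sum(waits) / served if served else 0.0,
        # Approximate: work started near the end may finish after the horizon.
        "utilization": busy_hours / (servers * horizon),
    }

# Hypothetical setup: about 100 tickets a day, 2 hrs mean handling time,
# 15 people on shift, one month (720 hrs).
print(simulate_queue(arrival_rate=100 / 24, service_rate=0.5, servers=15, horizon=720))
```

Changing `servers` or the rates and re-running is the simplest form of what-if analysis on team size and arrival pattern.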

PROCESS MODELLING - STEPS

We will discuss queuing system modelling using the tool “Processmodel”.

Setting up flow:

A) It's important to understand the actual flow of activities and resources in a system, then draw a graphical flow and verify it.

B) Once we are sure about the graphical representation, we have to provide the distribution of time, entity arrival pattern, resource capacity and assignment, and input and output queue for each entity. The first time, these can be obtained by a time and motion study of the system. The tool has Stat-fit, which helps to calculate the distributions.

C) Now the system contains entity arrivals in a pattern; by adding storage, the entities are retained till they get resolved. Resources can be given in shifts. By using the get and free functions (which can be coded in a simple manner) and by defining scenarios (the controllable variables are given as scenarios and mapped to values), their usage conditions can be modified to suit the actual conditions.

Simulation:

D) The system can be simulated with replications (keep around 5) and for a period of 1 month or more (a month helps in monitoring and control with monthly values).

E) The simulation can be run with or without animation. The results are displayed as output details. The reports can be customized by adding new metrics and formulas.

F) The “Hot Spot” in the output summary refers to idle time of entities or waiting time in queue. This is the immediate area to work on for process change and improvement. If there is no Hot Spot, we need to study the individual activities which have a high standard deviation, a high mean, or both; they become our critical sub processes to control.


Validating Results:

G) It's important to validate whether the model replicates real-life conditions by comparing the actuals with the values predicted by the model. We can use MAPE (Mean Absolute Percentage Error), and the difference should be less than 10%.

Optimization:

H) To find the best combination of resource assignment (ex: with variation in skill and count) across different activities, we can run “SimRunner”. The scenarios which we defined earlier become the controllable factors, and a range (LSL and USL) is provided for each in the tool. Similarly, the objective, which could be to minimize resource usage, increase entity servicing or reduce elapsed time, can be set in the tool.

I) The default values of convergence and simulation length can be left as they are, and the optimization is performed. The tool tries various combinations of scenario values with the existing system and picks the one which meets our target. These values (activity and time taken, resource skill, etc.) can be used for composition of processes.
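The validation step above relies on MAPE. A minimal sketch of the calculation is given below; the actual and predicted values are made-up numbers used only to show the arithmetic, and the 10% limit is the threshold mentioned in the validation step.

```python
def mape(actuals, predictions):
    """Mean Absolute Percentage Error, in percent. Actuals must be non-zero."""
    if not actuals or len(actuals) != len(predictions):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actuals, predictions))
    return 100.0 * total / len(actuals)

# Hypothetical monthly turn around times (hrs): observed vs. model output.
actual_tat = [100.0, 200.0, 150.0]
predicted_tat = [110.0, 180.0, 150.0]

error = mape(actual_tat, predicted_tat)
print(f"MAPE = {error:.2f}%  ->  model acceptable: {error < 10.0}")
```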

PROCESS MODELLING - VALIDATION

A maintenance project receives incidents of different severities (P1, P2, P3, P4). Their count is around 100 a day with hourly variation, and there are 2 shifts with 15 people each (similar skill). The different activities are studied and their elapsed time, count, etc. are given as distributions (with mean, S.D., median and 10%, 90% values). The project team wants to understand their Turn Around Time and SLA compliance. They also want to know their bottlenecks and which process to control.

Flow configuration is performed within the tool window, and distributions for activities are set along with resource assignment.

The window provides information on entities processed, time taken, etc.; it can be customized.

2 replications and 720 hrs have been set in this case. The replications give us a distribution for the different measures.

Entity movement can be studied, and the simulation can be made slow or fast to see visually what happens in a period.

In this case there is no non-value-added Hot Spot, so we will have to monitor and control the Apply Fix step of Priority 2 tickets.

Resolution and Response percentages of SLAs are predicted in this case. By selecting both replications, we can also see the variation.

PROCESS MODELLING -TOOLS

Tools such as Matlab and SAS JMP have their own process flow building capabilities. However, specific to queuing models, we have seen the BPMN process simulation tool, which is quite exhaustive and used by many. The tool has the ability to build and simulate the model.

The ARIS simulation tool is another good tool to develop a process system and perform simulation.

While considering tools, we also need to look at their optimization capabilities, without which we have to do a lot of trial and error for our what-if analysis.

FUZZY LOGIC

Fuzzy Logic is a representation of a model in linguistic variables, handling the fuzziness/vagueness of their values to take decisions. It removes the sharp boundaries used to describe a stratification and allows overlapping. The main idea behind fuzzy systems is that truth values (in fuzzy logic) or membership values are indicated by a value in the range [0,1], with 0 for absolute falsity and 1 for absolute truth.

Fuzzy set theory differs from conventional set theory in that it allows each element to belong to a given set to some degree (0 to 1), whereas in conventional set theory the element either belongs to the set or does not. For example, suppose we calculated someone's skill index as 3.9, and we have a Medium group which contains skill 2.5 to 4 and a High group which contains 3.5 to 5. In this case the member is part of the Medium group to a degree of around 0.07 and of the High group to around 0.22 (illustrative, not calculated, values). This shows the fuzziness. Remember this is not a probability; it is a certainty which shows the degree of membership in a group.

In fuzzy logic the problem is given in terms of linguistic variables; however, the underlying solution is made of mathematical (numerical) relationships determined by fuzzy rules (user given). For example, “if Skill level is High and KEDB usage is High, then Turn Around Time (TAT) is Met” is a rule. To set up this rule, we should study to what extent this has happened in the past. At the same time, the same case will also be part of the Not Met group of TAT to a degree.

In software we use the fuzziness of data (overlapping values) but not exactly the fuzzy rules; instead, we allow a mathematical/stochastic relationship to determine the Y in most cases. We can call this a partial application of fuzzy logic with Monte Carlo simulation.
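The skill index example can be sketched in a few lines of Python. We assume triangular membership functions with the peak at the middle of each group's range, and we combine the two conditions of the rule with the minimum operator, which is a common choice for “and” in fuzzy systems. Both are assumptions for illustration, so the degrees computed here will differ from the indicative 0.07 and 0.22 quoted above; the KEDB usage membership is also a made-up value.

```python
def triangular_membership(x, left, peak, right):
    """Degree (0..1) to which x belongs to a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def rule_strength(*degrees):
    """Firing strength of an AND rule, using the minimum operator."""
    return min(degrees)

skill_index = 3.9
# Assumed shapes: Medium covers 2.5 to 4, High covers 3.5 to 5, peaks at midpoints.
medium = triangular_membership(skill_index, 2.5, 3.25, 4.0)
high = triangular_membership(skill_index, 3.5, 4.25, 5.0)
print(f"Medium: {medium:.2f}, High: {high:.2f}")  # belongs to both groups

# Rule: if Skill is High and KEDB usage is High then TAT is Met.
kedb_usage_high = 0.8  # hypothetical membership degree
print(f"Rule strength for 'TAT is Met': {rule_strength(high, kedb_usage_high):.2f}")
```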

FUZZY LOGIC - SAMPLE

To understand fuzzy logic, we will use the tool qtfuzzylite in this case. Assume that a project uses different review techniques, and the defects they find overlap with each other's output. Similarly, they use different test methods, which also yield results that overlap with each other. The total number of defects found is the target; it is met under a particular combination of review and test method, and we can use fuzzy logic in a modified form to demonstrate it.

a) Study the distributions by review type and configure them as inputs. If there is fuzziness in the data, there can be overlap.

b) Study the test methods and their results, and configure their distributions in the tool.

c) In the output window, configure the Defect Target (Met/Not Met) with target values.

d) The tool helps to form the rules with different combinations; the user has to replace the question marks and give the expected target outcome.

e) In the control, by moving the values of review and test method (especially in the overlapping area), the tool generates a score which tells us the degree of membership in Met and Not Met. The combination with the higher value shows a stronger association with the result.

f) One way to deploy this is to simulate the entire scenario multiple times, thereby making the relationship stochastic rather than deterministic. This means using Monte Carlo simulation to get the range of possible results, or the probability of meeting the target, using fuzzy logic.

Many times in the software industry we don't apply fuzzy logic to the complete extent or model it as it is; instead, the fuzziness of elements is taken and modelled using a statistical or mathematical relationship to identify the range of outputs. This is more of a hybrid version than true fuzzy logic modelling.

FUZZY LOGIC - SAMPLE

Input and output variables are described. Each variable has different techniques and distributions here. We assumed a triangular distribution here, but we are expected to use the true distribution.

The rule statements can be derived by pressing the magic wand button, and we just have to replace the question marks with the values of Total Defect (Met or Not Met) to complete the rules.

As we keep moving towards the different methods, the degree to which we can meet the defect target increases. This also shows that we can achieve the target with more than one type or combination. Using an optimization technique here will reveal the best combination.

The distributions of review types and test methods are configured, and the rule which we set earlier is used to determine the degree of results.

MONTE CARLO SIMULATION

Monte Carlo simulation is used mainly to study the uncertainty in a value of interest. It is a statistical method of simulation which uses distributions and randomness. In a simulation model, the assumptions of the system are built and a conceptual model is created; using the Monte Carlo method, the system is studied over a number of trials with variation in the distributions, which results in a range of outputs.

For example, to study the life of a car engine we can't wait till it really wears out; instead, the engine is simulated under different conditions and assumptions and the wear-out time is noted. In the Monte Carlo method, it is as if we test another 100 such engines and finally plot the results in a histogram. The benefit is that this is not a single point outcome but a range, so we can understand how much the life of the engine could vary. Similarly, since we test many, we can understand the probability of an engine having a life beyond a particular value (ex: 15 years).

Computers have made life easy for us, so instead of struggling for 100 outcomes we can simulate 5000, 10000 or any number of trials using Monte Carlo tools. This method has helped us to convert mathematical and deterministic relationships into stochastic models by allowing ranges/distributions for the factors involved in them, thereby getting the outcome as a range as well.

The model gives us the probability of achieving a target, which in other words is the uncertainty level.

Assume a deterministic relationship of Design Effort (X1) + Code Effort (X2) = Overall Effort (Y). It can be made a stochastic relationship by building the assumptions (variation and distribution) of the variables X1 and X2, running the simulation 1000 times, storing all the results of Y and building a histogram from them. What we get is a range of Y. In each trial, the input values of X1 and X2 are selected randomly from their given ranges. For example, if code effort varies over (10, 45) hrs, then a random value from that range is fed into the equation to get a value of Y.
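The effort example can be tried out with a few lines of Python. The code effort range of (10, 45) hrs comes from the example above and is assumed here to be uniform; the design effort distribution (triangular, 20 to 60 hrs with a most likely value of 35) and the 80 hrs target are assumptions made up for illustration.

```python
import random
import statistics

def simulate_overall_effort(trials=1000, seed=42):
    """Return simulated values of Y = X1 (design effort) + X2 (code effort)."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        design = rng.triangular(20.0, 60.0, 35.0)  # X1: assumed distribution
        code = rng.uniform(10.0, 45.0)             # X2: range from the example
        results.append(design + code)
    return results

def certainty_at_most(values, target):
    """Share of trials in which the outcome stays at or below the target."""
    return sum(1 for v in values if v <= target) / len(values)

efforts = simulate_overall_effort()
print(f"Range of Y: {min(efforts):.1f} to {max(efforts):.1f} hrs")
print(f"Mean of Y:  {statistics.mean(efforts):.1f} hrs")
print(f"Certainty of Y <= 80 hrs: {certainty_at_most(efforts, 80.0):.1%}")
```

Plotting `efforts` as a histogram gives the same kind of picture that the simulation tools show.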

MONTE CARLO SIMULATION - STEPS

The Monte Carlo technique can also be demonstrated using Excel formulas; however, we will discuss the relevant topics based on the Crystal Ball tool (from Oracle), which is an Excel plug-in.

Performing simulation:

A) The data of any variable can be studied for its distribution, central tendency and variation using Minitab or Excel formulas.

B) The names of the influencing variables (X's) are entered in Excel cells and their assumptions (the distributions and their values) are given.

C) Define the outcome variable (Y) and, in the next cell, give the relationship of the X's with Y. It can be a regression formula, a mathematical equation, etc. (with the X's assumption cells mapped into the formula).

D) Define the outcome variable formula cell as the Forecast Cell. This just requires naming the cell and providing a unit for the outcome.

E) In the preferences, we can set the number of simulations we want the tool to perform. If there are many X's, then increase the simulations from 1000 to 10000, etc. As a rule of thumb, keep 1000 simulations per X.

F) Start the simulation. The tool runs the simulations one by one, keeps the outcomes in memory and then plots a histogram of probability of occurrence against values. We can give our LSL/USL targets manually and read the certainty as a %, or vice versa. This helps us to understand the risk against achieving the target.


Optimization:

G) Though the simulation has shown us the uncertainty of the outcome, we have to remember that some X's are controllable (hopefully we have modelled them that way), and by controlling them we can achieve a better outcome. The OptQuest feature in the tool helps us to optimize by picking the right combination of X's.

H) At least one Decision Variable has to be created to run OptQuest. Decision variables are nothing but controllable variables, and without them we can't optimize.

I) Define the objective (maximize/minimize/etc., with or without an LSL/USL); the tool detects the Decision Variables automatically. We can introduce constraints on decision variables (ex: a particular range within which it has to simulate). Run the simulation (optimization is based on simulation). The tool picks random values within the range of the decision variables and records the outcome; the combination of X's that best meets the target for Y is kept as the best choice until something better comes up within the cycles of simulation.

J) The best combination of X's gives the target values to be achieved in the project, and the processes which have the capability to achieve these X's are composed in the project.

MONTE CARLO SIMULATION - SAMPLE

A project team receives and works on medium-size (200-250 FP) development activities. Whenever their internal defects found exceed 90, they have seen that UAT results in fewer defects. They use different techniques of review and testing based on the nature of work/sub-domains; the defects identified by each method overlap, with no distinctness in their ranges. We are expected to find their certainty of finding more than 90 defects and to see with which combination of review and test type the project will find more defects.

Green cells mark the assumptions, blue cells mark the forecast, and amber cells mark the decision variables.

After 5000 trials, the simulation shows the results for the predetermined combination: a certainty of 96.88% of finding more than 90 defects.

The sensitivity chart shows that Test Results influence around 70% of the outcome, followed by Reviews.

Define the objective, decision variables and constraints, and run the optimization.

In the decision variables, the review method and test method are given as discrete values, so the tool takes different methods and combinations and looks for the best fit, which gives the maximum number of defects. We can now compose our process with the given methods, as this will help us to achieve a higher outcome in this case.
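The same idea can be sketched without a tool: treat the review method and test method as discrete decision variables, simulate each combination, and keep the one with the highest certainty of finding more than 90 defects. The sketch simply tries every combination, which is not how OptQuest searches; the method names and the (low, high, most likely) defect ranges are hypothetical and only chosen so that the ranges overlap.

```python
import itertools
import random

# Hypothetical defects found per method: (low, high, most likely).
REVIEW_METHODS = {"inspection": (30, 60, 45), "walkthrough": (20, 50, 35)}
TEST_METHODS = {"exploratory": (35, 65, 50), "scripted": (40, 80, 60)}

def certainty(review, test, target=90, trials=5000, seed=1):
    """Share of trials in which review + test defects exceed the target."""
    rng = random.Random(seed)  # same seed so combinations are compared fairly
    hits = 0
    for _ in range(trials):
        found = (rng.triangular(*REVIEW_METHODS[review])
                 + rng.triangular(*TEST_METHODS[test]))
        if found > target:
            hits += 1
    return hits / trials

def best_combination(target=90):
    """Try every (review, test) pair and return the most certain one."""
    combos = itertools.product(REVIEW_METHODS, TEST_METHODS)
    return max(combos, key=lambda combo: certainty(combo[0], combo[1], target))

for review, test in itertools.product(REVIEW_METHODS, TEST_METHODS):
    print(f"{review:12s} + {test:12s}: {certainty(review, test):.1%}")
print("Best combination:", best_combination())
```

With only a handful of discrete choices, exhaustive search is enough; with continuous or many decision variables, a search-based optimizer becomes necessary.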

MONTE CARLO TOOLS

Tools like JMP, Processmodel and BayesiaLab have simulation features built in, so with them we don't need a Crystal Ball kind of tool.

Recently, in Minitab 17, profilers and optimizers have been added to the regression models, which reduces the need for additional tools. However, this is limited to regression only.

Simulacion is a free tool with acceptable usage limits of up to 65000 iterations and 150 input variables. This is another Excel add-on.

Risk Analyzer is another tool which is similar to Crystal Ball and is capable of performing most of the same actions. However, this is paid software.

There are many free Excel plug-ins available to do Monte Carlo simulation, and we can also build our own simulation macros using Excel.

SYSTEM DYNAMICS

Systems Thinking is the basis behind System Dynamics models. We all operate in a system which continuously has some inputs and outputs and is influenced by many factors. Our system is represented as stocks and flows, which keep changing state over time.

The model is made of causal loops which represent the system. It is often difficult to interpret a causal loop and its behaviour in the system without adequate simulation. System Dynamics offers us the ability to model the system, simulate it and study its behaviour. The model is used for strategic decision making and planning.

System Dynamics is a deterministic model to which noise can be added, and using relevant software we can model the stochastic behaviour. Here we don't study the behaviour of elements in an event, but aggregates at time slices. Considering that our engagements are now becoming big and our organizations need decision-making tools, System Dynamics is an important technique on offer to model the dynamic behaviour of our systems. A basic study of causal loops, flows, stocks and auxiliary variables is recommended for the audience. However, modern tools have made it a less complex modelling technique.

SYSTEM DYNAMICS - STEPS

We are going to explain System Dynamics using “Insightmaker”, a free online tool which gives all the basic functionality expected at no cost. However, the same can be modelled using “Vensim”, which is another powerful tool.

A) Understand the problem or goal which we want to study and identify the causes relevant to it. Developing a basic causal loop diagram is recommended; however, for the tool it is not mandatory.

B) Every system has inputs (here, “Flows”), and these inputs are there to be processed (here we say “Stock”) at the initial stage. A system should have at least one flow and one stock.

C) As a method, we can keep building stocks and flows to model the system, with auxiliaries as intermediate state variables representing the in-between states and influencing factors. Connectors/Links are the way in which variables are connected.

D) In this tool, create a new Insight and draw your system components using the tools given. Each component is highly customizable in terms of size, colour, font, etc.

E) Establish links between the variables and name all the variables in an identifiable manner.

F) We can write the relationships using formulas (in the “Value” option), and we can feed in noise using the “Distributions” of the data studied. The tool offers most mathematical and logical formulas. The formulas are normally simple, and anyone with an understanding of Excel can easily write the relationships. If we use Vensim, the tool dynamically validates the links between variables and the usage of those variables in the current formula.


G) Once all the relevant relationships are established, we can simulate the model. Here simulation refers to the time period over which we study the deterministic behaviour of the system. For example, if we configure daily data in the system, then we may simulate up to a month or a quarter. This helps us to understand whether the system is accurate. Use a method like MAPE to understand how far the predicted and actual values differ.

H) Monte Carlo simulation, which is required to study the stochastic behaviour of the system, is available under sensitivity analysis. In Vensim, this is available in the Professional version (& PLE Plus). This is important for us to understand the confidence level of meeting the target and also the variation of the factors/variables (which will help in sub process management). We can get the data and graphs from the tool.

I) Goal optimization is part of this online tool: we can select the variable we want to maximize or minimize, with the relevant constraints on other variables. The optimization results are helpful in composing the process and fixing the goals for sub processes. Vensim offers calibration and policy fixing options under optimization.

J) The models created using Insightmaker remain available on their site and can be access controlled; good revision management and cloning features are available to share models with others online.
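What the tool does when it simulates a stock and flow model can be sketched in plain Python. Below, the backlog is the stock, incoming incidents are the inflow and closed incidents are the outflow, stepped one day at a time. All the numbers (starting backlog, arrivals, team size, productivity) are assumptions for illustration, and the optional noise is a simple normal variation on arrivals.

```python
import random

def simulate_backlog(days=30, initial_backlog=50.0, arrivals_per_day=100.0,
                     team_size=15, closed_per_person_per_day=6.5,
                     noise_sd=0.0, seed=0):
    """Step a one-stock model day by day and return the backlog history."""
    rng = random.Random(seed)
    backlog = initial_backlog                       # the stock
    history = [backlog]
    for _ in range(days):
        # Inflow: incidents arriving today (deterministic when noise_sd is 0).
        inflow = max(0.0, rng.gauss(arrivals_per_day, noise_sd))
        # Outflow: limited by team capacity and by the work available.
        capacity = team_size * closed_per_person_per_day
        outflow = min(backlog + inflow, capacity)
        backlog = backlog + inflow - outflow
        history.append(backlog)
    return history

deterministic = simulate_backlog()
print(f"Backlog after 30 days (no noise): {deterministic[-1]:.1f}")

# Adding noise and repeating the run gives a range instead of a single value.
finals = [simulate_backlog(noise_sd=10.0, seed=s)[-1] for s in range(200)]
print(f"With noise: {min(finals):.1f} to {max(finals):.1f}")
```

Repeating the noisy run many times is the same idea as the Monte Carlo option under sensitivity analysis.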

SYSTEM DYNAMICS - SAMPLE

Assume a case where incident tickets closed per person is the productivity measure considered and backlog is targeted in a maintenance project. The incidents exhibit a significant difference based on priority (P1, P2 & P3, P4 as two groups) in terms of effort taken and time spent. We are requested by management to understand the behaviour of these variables and optimize the factors…

Blue boxes are stocks and pink ellipses are variables. There are links which connect them. The backlog and productivity are highlighted in separate colours.

Sensitivity and optimization techniques are available in the drop-down.

Simulation here refers to time-based variation. The tool gives the various factors and their values.

The variation shown here is for backlog. In sensitivity analysis the tool applies Monte Carlo simulation and gives confidence levels with values.

Optimization results for productivity in terms of the identified variables. Here it is effort by sub process by skill.

Simple formula window for writing formulas and logic.

SYSTEM DYNAMICS TOOLS

Having seen the Insightmaker.com online tool and its functions, we can look at other tools. The first one which comes to mind is Vensim. The tool has a free version, Vensim PLE, which doesn't have Monte Carlo simulation and optimization. We would recommend using Vensim Professional, which is paid software but has all the possible formulas and relevant support groups to resolve our doubts. This is one of the best software packages available in the market today.

We have tried Simantics, which is a free, open-source, code-based tool. We were happy with its results. However, the formulas for probability distributions are not available by default. The tool uses the Modelica language, so if you have a developer free for a few days, you can really make this wonderful with relevant algorithms. The tool has Monte Carlo simulation for sensitivity analysis.

We have worked on STELLA, and this is also good software. However, we couldn't find the simulation and optimization parts. We have also used AnyLogic, but these tools require better understanding to make them work for System Dynamics, though they have lots of options.

Apart from these, Powersim, GoldSim and many other tools are available in the market. However, from our understanding, the Insightmaker online tool and Vensim Professional are the two we can consider from a CMMI point of view.

MODEL SELECTION - THOUGHT

[Matrix: candidate models by QPPO (Time, Defect, Effort, Resource), each with a “D” column, split by level (PRJ, ORG).]

Models shown in the matrix, in slide order: Queuing Models, Neural Networks, Bayesian Belief, Regression Model, Bayesian Belief, Simulation Models, Other Statistical Models, Reliability Models, Fuzzy Logic, System Dynamics, Neural Networks, Regression Model, Bayesian Belief, Simulation Models, Other Statistical Models, Neural Networks, Simulation Models, System Dynamics, Regression Models, Other Statistical Models, Regression Models, Bayesian Belief, Queuing System, Simulation Models, Fuzzy Logic, Bayesian Model, Simulation Model.

D is based on:

• High frequency of data in project

• Seasonal variation or time dependency

• Project characteristic differentiates from others

Linear & non-linear models are to be selected based on the linearity of “Y”.

KEY CHARACTERISTICS TO DETERMINE MODEL

Robustness of model

Prediction Accuracy of model

Flexibility in varying the factors in model

Calibration abilities of the model

Availability of relevant tool for building the model

Availability of data in the prescribed manner

Data type of the variable and factors involved in the model

Ability to include all critical factors in the primary data type (not to convert in to a different scale)


TEAM COMPRISES OF

Thirumal Shunmugaraj

Sunil Shirurkar

Snehal Pardhe

Thanks To: Koel Bhattacharya (System Dynamics)

SCREENSHOTS CONTRIBUTION FROM

Minitab 17

Processmodel 5.5

BayesiaLab

SAS JMP 11.0

Qtfuzzylite

Crystal Ball 11

Insightmaker.com

CONTACT US

https://www.dropbox.com/s/3p4eknmgbn6x4o7/Datafiles.7z?dl=0

mailto:[email protected]



