
NAMP Module 17: “Introduction to Multivariate Analysis” Tier 3, Rev.: 4

Program for North American Mobility in Higher Education

Introducing Process Integration for Environmental Control in Engineering Curricula

MODULE 17: “Introduction to Multivariate Analysis”

Created at: Ecole Polytechnique de Montreal & North Carolina State University, 2003.

NC STATE UNIVERSITY


2.4: Example (3)

Shorter Timescales


Shorter timescales

The previous two examples used daily averages for the 130 process variables. However, we could just as easily have chosen weekly averages, monthly averages, or several other options.

We could also have chosen shorter timescales, such as 8-hour averages or 30-minute averages. Obviously, at some point the number of observations will become unmanageable. For instance, a spreadsheet with 3 years’ worth of 1-minute averages would have over a million lines.

    

Simply by choosing the timescale, you are already influencing your MVA results.
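To make the effect of the timescale concrete, here is a minimal sketch of averaging the same readings over two different block lengths. The function name `block_average` and the readings are invented for illustration; they are not part of the mill dataset.

```python
from statistics import mean

def block_average(values, block_size):
    """Average consecutive, non-overlapping blocks of `block_size` readings.

    A trailing partial block is dropped so every average covers the same span.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    n_blocks = len(values) // block_size
    return [
        mean(values[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]

# Hypothetical readings logged every 10 seconds (2 minutes' worth).
readings = [10, 12, 11, 13, 12, 14, 30, 12, 11, 13, 12, 10]

one_minute = block_average(readings, 6)   # 6 x 10 s = 1 minute
two_minute = block_average(readings, 12)  # 12 x 10 s = 2 minutes
print(one_minute, two_minute)
```

The single spike (30) is clearly visible in the raw readings, raises one of the 1-minute averages, and is almost invisible in the 2-minute average. The data are the same; only the timescale changed.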


[Figure: timescale axis marked 10 s, 1 min, 10 min, 1 h, 8 h, 24 h, 1 wk., 1 mo., 1 year. Annotations: pulp sampled every 2 hours; chips sampled every 8 hours.]

Choosing a timescale

The first thing we need to understand is what timescales are available. For the TMP process we have been studying, the shortest possible time period between two logged values is 10 seconds (note that not all tags are updated this frequently).

Several key values, such as wood and pulp characteristics, are only measured every few hours as shown above. These tags will be of little or no use at a very short timescale.

IMPORTANT CONCEPT: Some variables can only be studied at longer timescales, others at shorter timescales, depending on their sampling/logging frequency.


Shortest possible timescale

For the purposes of illustration, we will use the shortest possible timescale in this example, namely 10 seconds. Because some tags are updated less frequently, we will use interpolated values for all variables, which may or may not represent reality.
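As a rough illustration of what interpolation does to a slowly sampled tag, the sketch below spreads three 2-hourly readings over a 10-second grid. The helper `interpolate_to_grid` and the freeness values are assumptions made for this example; the mill's data historian may well interpolate differently.

```python
def interpolate_to_grid(sample_times, sample_values, step, end):
    """Linearly interpolate sparse samples onto a regular grid.

    sample_times: increasing times in seconds; sample_values: readings at
    those times. Returns (grid_times, grid_values) from the first sample
    time up to `end`, holding the last reading constant past the last sample.
    """
    grid_times, grid_values = [], []
    j = 0  # index of the latest sample at or before the current time
    t = sample_times[0]
    while t <= end:
        while j + 1 < len(sample_times) and sample_times[j + 1] <= t:
            j += 1
        if j + 1 < len(sample_times):
            t0, t1 = sample_times[j], sample_times[j + 1]
            v0, v1 = sample_values[j], sample_values[j + 1]
            value = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        else:
            value = sample_values[-1]
        grid_times.append(t)
        grid_values.append(value)
        t += step
    return grid_times, grid_values

# Hypothetical pulp readings taken every 2 hours (times in seconds).
pulp_times = [0, 7200, 14400]
pulp_freeness = [120.0, 150.0, 135.0]

grid_times, grid_values = interpolate_to_grid(
    pulp_times, pulp_freeness, step=10, end=14400
)
print(len(grid_values), grid_values[360], grid_values[1080])
```

Only 3 of the 1,441 grid values are real measurements. The rest are straight-line guesses, which is why the interpolated values may or may not represent reality.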


To keep the size of the dataset manageable, we have taken these data over a 24-hour period, which corresponds to around 9,000 observations. Because we have over 100 tags, the resulting dataset has about one million values.
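The arithmetic behind these figures can be checked in a few lines. This sketch assumes the 130 process variables quoted earlier in this example and ignores leap days; the function name is our own.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def n_observations(period_seconds, interval_seconds):
    """Number of logged rows in a period at a fixed logging interval."""
    return period_seconds // interval_seconds

obs_per_day = n_observations(SECONDS_PER_DAY, 10)   # 10-second logging
values_per_day = obs_per_day * 130                  # 130 process variables

# Three years of 1-minute averages (ignoring leap days).
rows_three_years = n_observations(3 * 365 * SECONDS_PER_DAY, 60)

print(obs_per_day, values_per_day, rows_three_years)
```

This gives 8,640 observations per day, about 1.1 million values per day, and about 1.6 million spreadsheet rows for three years of 1-minute averages.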

A million values per day, for only one section of the papermaking process. If we were to include the entire industrial plant over several years, we would have to analyse billions of datapoints.


PCA of entire 24-hour period

[Figure: bar chart of R2X(cum) and Q2(cum) versus Comp No. (Comp[1] to Comp[3]), vertical axis 0.00 to 1.00. Plot title: Jun 20 02(1). 10 seconds COMPLETE WITH 45 min LAG.M1 (PCA-X), Untitled. Annotation: Simca found numerous components; retained 3.]

The PCA for the entire 24-hour period shows quite a strong model, with a cumulative R2 over 60%. This is misleading, however. As shown on the score plot, there is a major process excursion which has totally skewed the MVA results.


Major process excursion

Major process excursion from 8h15 to 8h45

A review of the original data indicates that production dropped below 10 t/d during a ten-minute period (8:15 to 8:25). The cause was a major refiner blockage known as a “feedguard event”, which makes the refiner motor shut down.


Exclude process excursion

The process excursion sticks out like a sore thumb on the score plot. This means that the process temporarily went to a radically different “place” or operating regime, where relationships between the variables are different.

Trying to do PCA on several different operating regimes all at once is a waste of time. The software will try to establish the correlations between the different variables, and if these correlations change abruptly the results will be useless. The way to get around this problem is to divide the observations into different operating regimes, and study each regime separately.

In this case we will remove the low production period to prevent it from skewing the rest of the results.

Sticking out like a sore thumb…or a solar flare


We removed the entire period when the process was perturbed (8:10 to 8:45) and did a PCA on the rest of the observations.
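Outside the MVA software, removing such a period amounts to a simple time filter. The sketch below is one way to do it, assuming each observation is a (timestamp, value) pair; the rows, the date and the production figures are invented.

```python
from datetime import datetime, time

def outside_window(timestamp, start, end):
    """True if the timestamp's time of day falls outside [start, end]."""
    return not (start <= timestamp.time() <= end)

def remove_excursion(observations, start=time(8, 10), end=time(8, 45)):
    """Keep only (timestamp, value) rows logged outside the excursion window."""
    return [row for row in observations if outside_window(row[0], start, end)]

# Hypothetical rows: (timestamp, production in t/d). Values are invented.
rows = [
    (datetime(2002, 6, 20, 8, 5, 0), 210.0),
    (datetime(2002, 6, 20, 8, 15, 0), 8.0),
    (datetime(2002, 6, 20, 8, 45, 0), 150.0),
    (datetime(2002, 6, 20, 8, 50, 0), 205.0),
]
kept = remove_excursion(rows)
print([ts.strftime("%H:%M") for ts, _ in kept])
```

Filtering on the time window rather than on a production threshold also removes the recovery period, when production is back up but the process has not yet settled.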

Interestingly, the R2 values went down slightly. This is because many of the variables changed abruptly all together when the process was shut down, making it look like they were “correlated” with each other.
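This effect is easy to reproduce with made-up data. The sketch below uses NumPy rather than the MVA software, builds ten variables that are unrelated by construction, and then makes them all drop together for a short period. The size of the effect is deliberately exaggerated compared with the mill data, and the function name `cumulative_r2x` is our own.

```python
import numpy as np

def cumulative_r2x(X, n_components):
    """Cumulative fraction of variance explained by the first PCA components.

    X is autoscaled (mean-centred, unit variance) first, a common
    preprocessing choice assumed here.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    singular_values = np.linalg.svd(Z, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return np.cumsum(explained)[:n_components]

rng = np.random.default_rng(0)
normal = rng.normal(size=(1000, 10))   # 10 unrelated, invented variables

upset = normal.copy()
upset[400:450, :] -= 10.0              # every variable drops together

cleaned = np.delete(upset, np.s_[400:450], axis=0)
r2_with = cumulative_r2x(upset, 3)
r2_without = cumulative_r2x(cleaned, 3)
print("with excursion:   ", r2_with)
print("without excursion:", r2_without)
```

With the excursion included, the first component alone appears to explain most of the variance, even though the variables have nothing to do with each other during normal operation.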

Remember, MVA knows nothing about the process, and just uses the data as it is.

PCA with process excursion removed

[Figure: bar chart of R2X(cum) and Q2(cum) versus Comp No. (Comp[1] to Comp[3]), vertical axis 0.00 to 1.00. Plot title: Jun 20 02(1). 10 seconds COMPLETE WITH 45 min LAG.M2 (PCA-X), Extreme outliers removed.]


Score plot of normal operation

Now that we have removed the process upset, the score plot takes on an entirely different character.

There is now an obvious time trend. During our 24-hour period, the process “snakes” around in multi-dimensional space. It is a moving target.

Almost all process data show this characteristic, because a real process is never really in steady state. The process control systems are constantly responding to outside perturbations, like changes in feed material quality. Operator intervention is another source of perturbation. There are many others. One operating goal is to maintain the “snake” within a certain desirable zone.

Whereas score plots for longer, averaged periods generally resemble clouds, score plots for short timescales resemble snakes.


[Figure annotations: Start: 01:00. End: 00:59. Obvious time trend…]

Score plot showing time trend


What is the significance?

This “snaking” of the process at short timescales is highly significant. This was not seen when using the daily averages.

By looking at which variables are changing with time, we can get tremendous insight into the process dynamics. One way to do this is to compare the contribution plots (as we saw in Example 2) at different times.

Contribution plots for the start and end points of our 24-hour period are shown on the next page. Obviously it is impossible to read the names of all the variables, but that is not the point. Just look at the bar graphs. They are very different, indicating a continuous change in operating regime from start to finish.
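For readers who want to see what lies behind each bar, here is a minimal sketch of one common way to compute a contribution of the "observation minus average" type. The weighting convention (combining the first two loadings) is an assumption for illustration and is not necessarily the exact formula the MVA software uses; the tag names and numbers are invented.

```python
from math import sqrt

def score_contributions(observation, means, stds, p1, p2):
    """Weighted, scaled difference between one observation and the average.

    Assumed convention: contribution_k = ((x_k - mean_k) / std_k) * w_k,
    with w_k = sqrt(p1_k**2 + p2_k**2) combining the first two loadings.
    """
    contributions = []
    for x, m, s, a, b in zip(observation, means, stds, p1, p2):
        weight = sqrt(a * a + b * b)
        contributions.append((x - m) / s * weight)
    return contributions

# Invented numbers for three hypothetical tags.
tags = ["tag_A", "tag_B", "tag_C"]
means = [50.0, 200.0, 7.0]
stds = [5.0, 20.0, 0.5]
p1 = [0.6, 0.0, 0.8]
p2 = [0.8, 0.0, 0.6]

start = score_contributions([60.0, 260.0, 7.0], means, stds, p1, p2)
end = score_contributions([45.0, 200.0, 8.0], means, stds, p1, p2)
for tag, c_start, c_end in zip(tags, start, end):
    print(f"{tag}: start {c_start:+.2f}, end {c_end:+.2f}")
```

Note that tag_B is far from its average at the start but shows no bar at all, because its loadings on the first two components are zero. A contribution plot only shows deviations that matter to the components being plotted.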


[Figure: score contribution plot, Score Contrib(Obs 457 - Average), Weight=p1p2, plotted against Var ID (Primary). Plot title: Jun 20 02(1). 10 seconds COMPLETE WITH 45 min LAG.M3 (PCA-X), More extreme outliers removed. Individual tag names are not legible.]

[Figure: score contribution plot, Score Contrib(Obs 7910 - Average), Weight=p1p2, plotted against Var ID (Primary). Plot title: Jun 20 02(1). 10 seconds COMPLETE WITH 45 min LAG.M3 (PCA-X), More extreme outliers removed. Individual tag names are not legible.]

Time trend within the process

[Panel time labels: 01:00 and 00:59]

Contribution plots…


Studying the “snake”

To gain further insight, we can colour-code the observations on the score plot. We did something similar in Example 1, when we colour-coded the days to show the seasons. This is very easy to do with modern MVA software.

Here, we have modified the score plot to show which range each observation falls in for one of the variables. We have chosen “freeness”, an important pulp quality parameter which the process control systems try to maintain at a constant value. We could have chosen any variable.

Note that during the course of our 24-hour period, the freeness starts high, then gets lower, then goes back up again. Someone with an intimate knowledge of the process could gain insight from this result.
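Colour-coding of this kind boils down to sorting each observation into a range. A minimal sketch, with invented freeness readings, range boundaries and colours (the MVA software chooses its own):

```python
from bisect import bisect_right

def colour_for(value, edges, colours):
    """Pick a colour by the range `value` falls in.

    `edges` are increasing range boundaries; `colours` needs one more
    entry than `edges` (below first edge, between edges, above last edge).
    """
    if len(colours) != len(edges) + 1:
        raise ValueError("need len(edges) + 1 colours")
    return colours[bisect_right(edges, value)]

# Invented freeness readings and range boundaries, for illustration only.
freeness = [152.0, 141.0, 128.0, 119.0, 131.0, 149.0]
edges = [125.0, 145.0]
colours = ["blue", "green", "red"]   # low, middle, high

point_colours = [colour_for(v, edges, colours) for v in freeness]
print(point_colours)
```

Read in time order, the colours go from high to low and back to high, which is the same kind of pattern described above for the 24-hour period.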


Exactly the same score plot, coloured for pulp “freeness”


Score plot coloured for “freeness”


[Figure: same plot, showing 3rd component. Axes: Component 1, Component 2, Component 3.]

Score plot in 3-D


MVA “foresight”

Another powerful use of MVA over short timescales is to predict problems before they become more widely visible.

The residuals plot on the next page tells the whole story. Remember we said that the refiner shut down at 8:15 due to a blockage? It is obvious that the process started to move away from normal operation well before then. The operators tend to look at a handful of key variables when monitoring the process, but MVA looks at all the variables at the same time and is therefore much more sensitive.

An analogy would be a seismometer being used to predict volcanic eruptions.


A seismometer is extremely sensitive to the slightest vibrations.
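The sensitivity of a residuals plot can be sketched with NumPy. The example below fits a small PCA model on invented "normal operation" data, then measures how far a new reading lies from the model. It is a simplified distance-to-model, not the exact DModX statistic reported by the MVA software, and all names and numbers are assumptions.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return (mean, std, loadings) of a PCA fitted on autoscaled X."""
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    _, _, vt = np.linalg.svd((X - mean) / std, full_matrices=False)
    return mean, std, vt[:n_components].T

def residual_distance(X, mean, std, loadings):
    """Root-mean-square residual of each row after projection onto the model."""
    Z = (X - mean) / std
    residual = Z - Z @ loadings @ loadings.T
    return np.sqrt(np.mean(residual**2, axis=1))

rng = np.random.default_rng(1)
# Invented "normal operation": 6 variables driven by 2 hidden factors.
factors = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
normal = factors @ mixing + 0.1 * rng.normal(size=(500, 6))

mean, std, loadings = fit_pca(normal, n_components=2)
baseline = residual_distance(normal, mean, std, loadings)

# A new reading in which one variable has drifted by two standard
# deviations while the variables it normally follows have not moved.
odd = normal[0].copy()
odd[0] += 2.0 * std[0]
odd_distance = residual_distance(odd[None, :], mean, std, loadings)[0]
print(baseline.max(), odd_distance)
```

A two-standard-deviation move in a single variable would be easy to miss on its own trend chart. The residual picks it up because the variable has stopped moving with its usual companions.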


Residuals plot showing MVA “foresight”


Build-up to 8h15 – something is happening to the process!


Using shorter timescales

By now it should be clear that doing MVA at shorter timescales is totally different from studying averages taken over longer timespans. Once again, we conclude that the best solution is to try many different approaches. No single MVA approach will provide all the answers we are seeking.

Part of the power of this technique is the way completely different results can be obtained from exactly the same database, simply by “slicing and dicing” the data in various ways:

• Longer vs. shorter timescales
• More vs. fewer variables
• PCA vs. PLS

MVA is just a “black box”. Its use MUST be driven by an understanding of the process being studied, otherwise it is just meaningless number-crunching.


“Number Cruncher”


End of Example 3:

One step at a time…


End of Tier 2

Congratulations!

This is the end of Tier 2. Obviously the details of these examples are hard to grasp for a first-timer, but hopefully some of the overall patterns are starting to emerge. A true understanding of MVA can only come by actually doing it on your own, which is the purpose of Tier 3.

All that is left is to complete the short quiz that follows…


Tier 2 Quiz

Question 1:

What is the difference between a tag and a variable?

a) The words “tag” and “variable” are synonyms.
b) A tag is an identity label or address, while a variable is an attribute of the process.
c) Tags change with time, but variables are fixed.
d) Variables measure similar attributes, while tags measure dissimilar attributes.
e) Answers (b) and (c).


Question 2:

Does averaging reduce or increase noise?

a) Averaging increases noise significantly.
b) Averaging increases noise, but only slightly.
c) Averaging does not affect noise.
d) Averaging reduces noise.
e) Averaging reduces noise, but increases the likelihood of outliers.


Question 3:

What is the danger of interpolating between readings that are far apart in time?

a) The interpolation will give far more weight to these individual readings than they deserve.
b) The interpolated values will indicate slow upward and downward trends where there are none.
c) The effect of outliers will be enhanced many-fold.
d) The engineer will have the false sense of comparing variables that are similar, when in fact they are very different.
e) All of the above.


Question 4:

If interpolation is such a problem, then why can’t we just use the discrete values instead?

a) This would give far too much weight to periods with a large number of discrete values.
b) Discrete values must be averaged to have meaning.
c) No tag is ever truly discrete.
d) Discrete values have no time signature.
e) Answers (b) and (c).


Question 5:

What is the difference between a process lag and a delayed reading?

a) One is caused by the process itself, the other by the measurement instruments.
b) They are the same thing.
c) A process lag is due to residence time, while a delayed reading is due to the time required for sampling, measurement and recording.
d) One is much longer than the other.
e) Answers (a) and (c).


Question 6:

Why does the MVA software reject variables that do not change enough with time?

a) Only variables which are part of the “experiment” are permitted.
b) Tags change with time, but these variables are fixed.
c) There are insufficient data points.
d) If a variable does not change with time, then it cannot be correlated to any other variables.
e) None of the above.


Question 7:

What should you do if your initial PCA gives a score plot with two distinct and separate data clouds?

a) Study each data cloud separately.
b) Try to determine what these two clouds represent.
c) Ignore the first component, which is probably being artificially induced by the two clouds.
d) Do an MVA on the entire dataset.
e) Answers (a), (b) and (c).


Question 8:

Your residual (“DModX”) plot shows several moderate outliers. What should you do?

a) Remove them and continue.
b) Leave them in and continue.
c) Study their contribution plots.
d) Look at the original data to try to determine the cause.
e) Answers (c) and (d).


Question 9:

Two variables are located in opposite corners of your PCA loadings plot (components 1 and 2). What do you conclude?

a) These variables are uncorrelated with each other.
b) These variables are negatively correlated with each other.
c) These variables contribute to both the first and second components.
d) These variables contribute to neither the first nor the second component.
e) Answers (b) and (c).


Question 10:

Theoretically, on average what proportion of residuals should be above the 95% confidence line? (the red line on the “DModX” plot)

a) Exactly 0.05%.
b) Exactly 5%.
c) More than 5%.
d) Less than 5%.
e) Depends on the dataset.


TIER 3:

Open-Ended Problem


Tier 3: Statement of intent:

The goal of Tier 3 is to finally allow the student to do MVA independently, though in a controlled context. At the end of Tier 3, the student should know how to do the following:

• Prepare a spreadsheet for use in MVA
• Import spreadsheet into MVA software
• Set up dataset within MVA software
• Create simple PCA plots
• Identify and investigate major and moderate outliers
• Create and interpret more elaborate PCA plots

In order to avoid losing the student along the way, each of these steps is broken down into a series of sub-steps with clear instructions.


Tier 3 is broken down into four sections:

3.1 Problem Statement and Dataset

3.2 Preparing and Importing the Spreadsheet

3.3 Initial MVA Results

3.4 Outliers and More Elaborate MVA plots

Unlike the previous two sections, Tier 3 has no quiz. The student must submit the results of the above work in a succinct project report (10-15 pages).

Tier 3: Contents


3.1: Problem Statement and Dataset


Problem Statement

You are the process engineer at the TMP mill from the Tier 2 examples. Your boss, the plant manager, wants to know why the pulp has different properties in the summer than in the winter.

You decide to start by generating PCA results for two different datasets, one taken during the summer, the other during the winter, and then comparing them to each other.


Summer/Winter datasets

After talking to the operators, you decide to take two full weeks of data for 15 key tags, using 1-hour averages.

Your data have already been imported by an IT technician into a standard spreadsheet program. The two files are:

• Summerdata.xls

• Winterdata.xls

Open these files, and have a look at the data. Can you tell anything about the summer/winter question just by looking?

Of course not!

These are the actual data files you are going to use!


3.2: Preparing and Importing the Spreadsheet


Preparing the spreadsheet

As you can see, the spreadsheet has two names for each variable:
• long descriptive name, and
• short “tag” for easy identification on the MVA graphs.

We want to do something similar with the individual observations. The full time signature is too long, and will make the score plots impossible to read. Besides, we already know which year and month it is. This is not useful information. We therefore want to insert a column to the right of the time signature, which gives the number of hours from the start of the two-week period.

Do this now, for both spreadsheets. When you are done, save them under a new name.
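The new column is simple to fill in with a spreadsheet formula. For those who prefer a script, the same calculation can be sketched as follows, assuming the time signatures have already been read in as Python `datetime` values (the timestamps here are invented):

```python
from datetime import datetime

def hours_from_start(timestamps):
    """Whole hours elapsed since the first timestamp in the list."""
    start = timestamps[0]
    return [int((ts - start).total_seconds() // 3600) for ts in timestamps]

# Invented 1-hour timestamps; the real files hold two full weeks.
stamps = [
    datetime(2002, 7, 1, 0, 0),
    datetime(2002, 7, 1, 1, 0),
    datetime(2002, 7, 1, 2, 0),
    datetime(2002, 7, 2, 0, 0),
]
print(hours_from_start(stamps))  # [0, 1, 2, 24]
```

With two full weeks of 1-hour averages, the column should run from 0 to 335 if no hours are missing (14 × 24 = 336 rows).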


Importing the spreadsheet

Now we are ready to open the MVA software. Do it now.

The first thing we need to do is import the data. Go to “File: import data”, and select your newly renamed file for summer.

The software will ask you a series of questions. Answer them according to the instructions on Page 2 of the spreadsheet file. One of these steps involves saving the new dataset as an MVA file.

Repeat this operation for the winter spreadsheet.


3.3: Initial MVA Results


Initial MVA results

Re-open the summer file, and create the following plot:
• Model bar chart

How many components does the software suggest? Usually for this kind of initial exercise, keeping 3 components is normal. Eliminate the components you do not intend to use.

Now create the following basic PCA plot:
• Score plot: t(1) vs. t(2)

What do you notice about the results? Right! There are major outliers.

Now do the same for the winter dataset.

Copy each plot by right-clicking it and paste it into your word processor file. All these plots must appear in your report.


3.4: Outliers and More Elaborate MVA Plots


Investigating Outliers

The summer data contains a major process excursion that is clearly visible on the score plot. Looking at the original data, try to determine the cause.

Once you are satisfied, remove the outliers and save the new model.

The winter data looks OK on the score plot, but that is not the entire story. Generate the following residuals plot:
• DModX

What do you notice? Right! There is one major outlier. Create a contribution plot to investigate:
• Contribution plot

What do you conclude? Remove this point and continue.


Comparing Summer and Winter

Now we are ready to compare the summer and winter results. Create the following basic PCA plots:
• Score plots: t(1) vs. t(2); t(1) vs. t(3); 3-D plot
• Loadings plots: p(1) vs. p(2); p(1) vs. p(3); 3-D plot

Do you notice any major differences between summer and winter?

Of course you do! What are they?

And what does this imply about the cause of the summer/winter process differences?


Drawing your conclusions

Now you have something to report to your boss…


More Elaborate MVA Plots

To get familiar with some of the other MVA outputs, create the following for the final summer and winter datasets:
• DModX
• X/Y Contribution Plot
• Residuals distribution
• …
• …

What do these plots indicate to you? Don’t worry about finding the “right” answer, just try to figure out what these plots are trying to tell us. However, you must justify your answers. Don’t just guess.

Don’t just guess!


End of Tier 3

Congratulations!

This is the end of Module 17. Please submit your report to your professor for grading.

We are always interested in suggestions on how to improve the course. You may contact us at www.namppimodule.org