
  • A nonparametric change point model for multivariate phase-II statistical process control

    Mark Holland, Douglas Hawkins

    School of Statistics, University of Minnesota

    May 24, 2011

    Mark Holland (UMN) Nonparametric change point model 1

  • Statistical Process Control (SPC) definitions

    Statistical Process Control refers to a collection of tools designed to detect a shift in distribution of a sequence of observations.

    Phase-I SPC: Analysis is performed on a fixed set of historical data.

    Phase-II SPC: Ongoing analysis is performed on a possibly never-ending stream of observations.

    Common cause variability is inherent variability in a process, even when running as designed.

    Special cause variability is not a normal part of the process, but is the result of the intrusion of an unexpected factor.

    A process is in control when only common cause variability exists, but is out of control when special cause variability is introduced.

  • Statistical Process Control (SPC) applications

    Traditionally used in manufacturing settings, but developments in modern industries have created demand for new monitoring techniques

    Health care (Thor et al. 2007)
      - Laboratory setting, e.g. chemical assay methods
      - Direct patient care, e.g. ICU vital signs

    Post-market product performance

    Groundwater and air quality

    Many current applications require multivariate nonparametric methods

    Several measurements must be monitored simultaneously

    Multivariate normal distribution rarely applies

    Difficult to check if a data set follows multivariate normal distribution

  • Aluminum Smelter Data

    Aluminum smelting refers to an electrolysis process to reduce refined aluminum ore into metallic aluminum.

    Data set consists of alumina (Al2O3) content of a smelter feed along with several impurities: silica (SiO2), ferric oxide (Fe2O3), magnesium oxide (MgO), and calcium oxide (CaO).

    As expected with compositional data, the contents of the compounds are negatively correlated.

    Monitor for change in composition of alumina or any of the impurities.

  • Standard SPC tools

    Some traditional phase-II SPC methods include

    Shewhart chart, cumulative sum (CUSUM), exponentially weighted moving average (EWMA)

    Limitations of traditional methods

    In-control distribution including all parameters must be known.

    In practice, parameter estimates from a phase-I training sample are typically substituted for the truth.

    In some applications a large historical training sample is not available, so monitoring must begin shortly after data collection begins.
      - ICU vital signs
      - Pollution control monitoring

    Must be “tuned” to detect a specific size of shift.

  • Change point approach to phase-II SPC

    Hawkins, Qiu, and Kang (2003) proposed a change point model for phase-II SPC, which does not require knowledge of in- or out-of-control process parameters.

    Skeleton of change point approach:

    1. Choose a two-sample test statistic for comparing left and right segments of process readings, $\{X_1, \ldots, X_k\}$ and $\{X_{k+1}, \ldots, X_n\}$.

    2. Apply the test for all possible split points, $k = 1, 2, \ldots, n - 1$.

    3. If the maximum test statistic value is outside of control limits, signal that a shift has occurred. Otherwise, collect another observation and repeat.

    Originally implemented with likelihood ratio test for shift in mean for univariate normal data

    Zamba and Hawkins (2006) extended using likelihood ratio test for shift in multivariate normal data

    Deng (2009) extended using univariate Wilcoxon-Mann-Whitney nonparametric test for difference in location
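
    The three-step skeleton can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `two_sample_stat`, `scan_split_points`, and `monitor` are hypothetical, the statistic is a squared pooled two-sample t statistic used as a stand-in (not any of the tests cited on this slide), and the control limit is passed in as a user-supplied function rather than derived.

```python
import numpy as np

def two_sample_stat(left, right):
    """Squared pooled two-sample t statistic (illustrative stand-in only)."""
    n1, n2 = len(left), len(right)
    resid = np.concatenate([left - left.mean(), right - right.mean()])
    var = resid @ resid / (n1 + n2 - 2)  # pooled variance estimate
    return (left.mean() - right.mean()) ** 2 / (var * (1 / n1 + 1 / n2))

def scan_split_points(x):
    """Step 2: evaluate the statistic at every split k = 1, ..., n-1."""
    n = len(x)
    stats = [two_sample_stat(x[:k], x[k:]) for k in range(1, n)]
    best = int(np.argmax(stats))
    return stats[best], best + 1  # (maximum statistic, split point k)

def monitor(stream, limit, start=3):
    """Step 3: add one observation at a time; signal when max exceeds limit(n)."""
    seen = []
    for obs in stream:
        seen.append(obs)
        n = len(seen)
        if n < start:  # too few observations to estimate a variance
            continue
        stat, k = scan_split_points(np.asarray(seen, dtype=float))
        if stat > limit(n):
            return n, k  # (observation at which the signal occurs, estimated split)
    return None  # no signal within the available data

# Hypothetical data: 20 in-control readings, then a shift of about 5 units.
readings = [0.1 if i % 2 == 0 else -0.1 for i in range(20)] + [5.1, 4.9, 5.1]
result = monitor(readings, limit=lambda n: 25.0, start=10)
```

    The constant limit of 25 is arbitrary; in the procedures discussed here the limits depend on n and are chosen to control the false alarm rate.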

  • Rank based multivariate change point model

    We used an existing hypothesis test proposed by Choi and Marden (1997) to design a change point model for phase-II SPC use.

    We observe n random vectors from a multivariate location family distribution

    $$X_1, X_2, \ldots, X_k \sim F(\mu)$$
    $$X_{k+1}, X_{k+2}, \ldots, X_n \sim F(\mu + \delta)$$

    and we wish to test

    $$H_0: \delta = 0 \quad \text{vs.} \quad H_a: \delta \neq 0$$

  • Multivariate nonparametric test (Choi and Marden 1997)

    Suppose we observe a sample of $p \times 1$ random vectors $X_1, \ldots, X_n$. For $1 \le i, j \le n$, define

    $$D_{ij} = \frac{X_i - X_j}{\|X_i - X_j\|}$$

    and for $1 \le i \le n$, define

    $$R_n(X_i) = \sum_{j=1}^{n} D_{ij}.$$

    Then, $R_n(X_i)$ is the centered directional rank vector of $X_i$.
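
    A minimal NumPy sketch of the directional rank computation follows. The function name `directional_ranks` is hypothetical, and the code assumes that terms with $X_i = X_j$ (including $j = i$, where the ratio is 0/0) contribute zero to the sum; the slide does not state a convention for that case.

```python
import numpy as np

def directional_ranks(X):
    """Row i of the result is R_n(X_i) = sum_j (X_i - X_j) / ||X_i - X_j||."""
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]   # diffs[i, j] = X_i - X_j, shape (n, n, p)
    norms = np.linalg.norm(diffs, axis=2)   # shape (n, n)
    # Assumption: coincident points (norm 0) contribute a zero vector.
    safe = np.where(norms > 0, norms, 1.0)
    return (diffs / safe[:, :, None]).sum(axis=1)  # shape (n, p)

# In one dimension each unit vector is a sign, so the ranks are centered ranks.
R = directional_ranks([[3.0], [1.0], [2.0]])
```

    With p = 1 and no ties the result equals 2 * rank - n - 1, and in any dimension the rank vectors sum to zero because $D_{ij} = -D_{ji}$.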

  • Multivariate nonparametric test (cont’d)

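    Next, let

    $$\bar{R}_n^{(k)} = \frac{1}{k} \sum_{i=1}^{k} R_n(X_i)$$

    and define the covariance matrix estimator

    $$\hat{\Sigma}_{R_{k,n}} = \frac{n - k}{(n - 1)\,n\,k} \sum_{i=1}^{n} R_n(X_i) R_n(X_i)'.$$

    Finally, define the test statistic

    $$R_{k,n} = \bar{R}_n^{(k)\prime}\, \hat{\Sigma}_{R_{k,n}}^{-1}\, \bar{R}_n^{(k)}.$$

    Under mild conditions, $R_{k,n}$ has asymptotic null distribution $\chi^2_p$.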
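
    The statistic for a single split point can be sketched directly from these definitions. The names `directional_ranks` and `rank_statistic` are hypothetical, and coincident points are assumed to contribute zero to the rank sums; this is an illustration, not the authors' implementation.

```python
import numpy as np

def directional_ranks(X):
    """Row i is the sum over j of unit vectors pointing from X_j to X_i."""
    diffs = X[:, None, :] - X[None, :, :]
    norms = np.linalg.norm(diffs, axis=2)
    safe = np.where(norms > 0, norms, 1.0)  # assumption: 0/0 terms count as zero
    return (diffs / safe[:, :, None]).sum(axis=1)

def rank_statistic(X, k):
    """R_{k,n}: quadratic form in the mean rank vector of the first k observations."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    R = directional_ranks(X)
    r_bar = R[:k].mean(axis=0)                           # mean rank of left segment
    sigma_hat = (n - k) / ((n - 1) * n * k) * (R.T @ R)  # covariance estimator
    return float(r_bar @ np.linalg.solve(sigma_hat, r_bar))

# Hypothetical example: 60 trivariate normal vectors with a mean shift after k = 30.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
X[30:] += 3.0
stat = rank_statistic(X, 30)
```

    Because the covariance estimator uses all n rank vectors, only the leading scalar factor changes with k.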

  • Multivariate nonparametric change point model

    Test statistic for existence of a change point

    $$R_{\max,n} = \max_{1 \le k \le n-1} R_{k,n}$$

    Estimate of the location of the change point

    $$\hat{\tau}_{R,n} = \arg\max_{1 \le k \le n-1} R_{k,n}$$
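
    One way to evaluate every split point at once is to compute the rank vectors a single time and use cumulative sums, as in the sketch below. The function names are hypothetical, coincident points are assumed to contribute zero, and the shortcut relies on the covariance estimator differing across k only by the scalar (n - k)/k.

```python
import numpy as np

def directional_ranks(X):
    """Row i is the sum over j of unit vectors pointing from X_j to X_i."""
    diffs = X[:, None, :] - X[None, :, :]
    norms = np.linalg.norm(diffs, axis=2)
    safe = np.where(norms > 0, norms, 1.0)  # assumption: 0/0 terms count as zero
    return (diffs / safe[:, :, None]).sum(axis=1)

def scan_change_point(X):
    """Return (R_max, tau_hat, all R_{k,n} for k = 1, ..., n-1)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    R = directional_ranks(X)
    S_inv = np.linalg.inv(R.T @ R / ((n - 1) * n))  # k-free part of the estimator
    ks = np.arange(1, n)
    r_bar = np.cumsum(R, axis=0)[:-1] / ks[:, None]  # mean rank of first k rows
    quad = np.einsum("ij,jk,ik->i", r_bar, S_inv, r_bar)
    stats = quad * ks / (n - ks)                     # apply the factor k / (n - k)
    best = int(np.argmax(stats))
    return float(stats[best]), int(ks[best]), stats

# Hypothetical example: bivariate normal data with a large mean shift after k = 40.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
X[40:] += 4.0
r_max, tau_hat, all_stats = scan_change_point(X)
```

    The pairwise differences still cost O(n^2 p) per call, so a streaming implementation would likely want to update the ranks incrementally; that refinement is not shown here.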

  • Fixed-sample size simulation results

    When both $k$ and $n - k$ are large, the distribution of $R_{k,n}$ is approximately $\chi^2_p$, as expected ($k = 100$, $n - k = 50$).

    When $k$ or $n - k$ is small, the distributions of $R_{k,n}$, $R_{\max,n}$, and $\hat{\tau}_{R,n}$ are affected by the dependence structure of the simulated data.

    The following plots show the estimated distribution of the location of the maximum $R_{k,n}$ value for a sample of $n = 200$ equicorrelated MVN random vectors with $\rho = 0, 0.9$.

  • Distribution of $\hat{\tau}_T$ and $\hat{\tau}_R$

    [Figure: four panels, each plotting proportion against k for p = 10: ρ = 0; ρ = 0.9; ρ = 0 (zoomed in, k = 190 to 198); ρ = 0.9 (zoomed in, k = 190 to 198). Legend: $R_{k,n}$, $T^2_{k,n}$.]

  • Quarantine

    Problem: Distribution of $\hat{\tau}$ depends on dependence structure of data
      - Distribution only depends on dependence structure when split point is near the boundary of the sequence of data

    Solution: Quarantine, that is, restrict search for a change point to interior of sequence.

  • Quarantined Phase-II SPC procedure

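    To use $R_{k,n}$ for phase-II SPC:

    Collect observation $X_n$ and compute

    $$R_{\max,n,c} = \max_{c < k < n-c} R_{k,n}.$$

    Signal that a shift has occurred if $R_{\max,n,c} > h_{n,\alpha,p,c}$, where the control limits satisfy

    $$P\left[R_{\max,n,c} > h_{n,\alpha,p,c} \mid R_{\max,j,c} \le h_{j,\alpha,p,c};\ j < n\right] = \alpha.$$

    Use Monte Carlo simulation to obtain sequence of control limits, $\{h_{n,\alpha,p,c}\}$.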

  • Control limits

    [Figure: Control limits for phase-II directional rank procedure (p = 5, c = 15). Control limit $h_\alpha$ plotted against n (33 to 500), with one curve for each of α = 1/100, 1/200, 1/500, 1/1000, 1/2000.]

  • Average run length (ARL) as a performance metric

    The average run length (ARL) of a phase-II SPC procedure is the average number of observations collected before the first signal occurs.

    Design the phase-II SPC procedure so that the in-control ARL is held to a specified minimum value, 1/α.

    Subject to the constraint on in-control (IC) ARL, we would like to minimize out-of-control (OOC) ARL.

    Similar to common goal in hypothesis testing
      - Minimize Type-II error rate given that Type-I error rate is controlled to level α.
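
    The link between α and the in-control ARL can be made explicit with a short derivation. It assumes that, while the process is in control, the conditional probability of a signal at each observation, given no earlier signal, is exactly α. The run length N, counted from the start of monitoring, is then geometric:

$$P(N = n) = \alpha (1 - \alpha)^{n-1}, \qquad n = 1, 2, \ldots$$

$$\mathrm{ARL}_{\mathrm{IC}} = E[N] = \sum_{n=1}^{\infty} n\,\alpha (1 - \alpha)^{n-1} = \alpha \cdot \frac{1}{\alpha^{2}} = \frac{1}{\alpha}.$$

    For example, α = 1/500 corresponds to an in-control ARL of 500 observations.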

  • In control ARL simulation results

    Simulated equicorrelated data with correlation ρ = 0, 0.5, 0.9

    Default quarantine values: c = 9 for p = 2; c = 15 for p = 5, 10

    Multivariate Normal Data:
      - Default quarantine is sufficient to achieve IC ARL within 10% of nominal for all values of p and ρ considered

    Multivariate Gamma Data:
      - Positive, right-skewed distribution. Not elliptically symmetric.
      - Default quarantine is sufficient to achieve IC ARL within 10% of nominal, except when p = 5, 10 and ρ = 0.9

    Multivariate Cauchy Data:
      - Symmetric distribution, much heavier tails than MVN distribution.
      - Default quarantine is sufficient to achieve IC ARL within 10% of nominal, except when p = 10 and ρ = 0.5, 0.9

  • Out of control simulation methodology

    1. Simulate n = 32 equicorrelated in-control observations from the multivariate normal distribution with p = 5 and mean vector µ = 0.

    2. Introduce mean vector shift $\boldsymbol{\delta} = (\delta, \ldots, \delta)^T$ and begin monitoring with quarantine c = 15 at observation n = 33.

    3. Simulate data sequence until signal occurs using control limits chosen to achieve in-control ARL 1/α = 500.

    4. Record run length = number of observations collected since monitoring began.

    5. Repeat for 100,000 simulated data sequences and compute ARL.
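
    The data-generation part of steps 1 and 2 can be sketched as below. All function names are hypothetical, unit variances are assumed for the equicorrelated covariance matrix, and a fixed number of shifted observations is drawn for illustration; the monitoring loop and the control limits needed for steps 3 to 5 are not shown.

```python
import numpy as np

def equicorrelated_cov(p, rho):
    """Unit variances (an assumption) with common correlation rho between components."""
    return (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))

def simulate_shifted_sequence(n_ic, n_ooc, p, rho, delta, rng):
    """n_ic in-control rows with mean 0, then n_ooc rows with mean (delta, ..., delta)."""
    cov = equicorrelated_cov(p, rho)
    in_control = rng.multivariate_normal(np.zeros(p), cov, size=n_ic)
    shifted = rng.multivariate_normal(np.full(p, delta), cov, size=n_ooc)
    return np.vstack([in_control, shifted])

def shift_length(p, rho, delta):
    """Mahalanobis length sqrt(d' Sigma^{-1} d) of the shift d = (delta, ..., delta)."""
    d = np.full(p, delta)
    return float(np.sqrt(d @ np.linalg.solve(equicorrelated_cov(p, rho), d)))

rng = np.random.default_rng(0)
X = simulate_shifted_sequence(n_ic=32, n_ooc=50, p=5, rho=0.5, delta=1.0, rng=rng)
```

    For this shift direction the Mahalanobis length simplifies to δ·sqrt(p / (1 + (p - 1)ρ)), so the same δ corresponds to a shorter shift as ρ grows.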

  • Effect of quarantine on out of control ARL

    [Figure: Quarantined directional rank OOC ARL, p = 5. log(ARL) plotted against shift vector length (0 to 3), with one curve for each of c = 0, 3, 9, 15.]

  • Performance comparison with parametric method

    [Figure: $R_{k,n}$ vs. ZH OOC ARL, p = 5. log(ARL) plotted against shift vector length (Mahalanobis distance, 0 to 3). Curves: $R_{k,n}$ with ρ = 0, c = 15; $R_{k,n}$ with ρ = 0.9, c = 15; ZH parametric.]

  • Diagnostic to select degree of quarantine

    Based on copula function: A copula is a p-dimensional distribution function on $[0, 1]^p$ with uniform univariate marginal distributions.

    Sklar's theorem: any p-dimensional distribution function with continuous marginal distributions is associated with a unique copula function.

    Copula can therefore be used to characterize the dependence between the components of a random vector.

    Diagnostic based on Anderson-Darling test for goodness of fit of multivariate normal copula.

  • Analysis of Aluminum Smelter Data

    [Figure: Analysis of Aluminum Smelter Data. $R_{\max}$ and control limit $h_n$ plotted against observation number (40 to 90).]

    In-control ARL: 1/α = 500

    Control limit exceeded at observation n = 71

    Estimated shift location $\hat{\tau}_{R,n} = 55$

  • Analysis of Aluminum Smelter Data

    [Figure: five panels, each plotting content (%) against observation number (0 to 70): Al2O3, SiO2, Fe2O3, MgO, CaO.]

  • Summary

    Traditional SPC methods are not suitable for some modern applications

    Change point model for phase-II SPC does not require phase-I training sample

    Nonparametric multivariate change point model:
      - Does not require assumption of multivariate normality
      - Outperforms parametric method for small to moderate shift sizes, even when data follow multivariate normal distribution
      - Detects large shifts more slowly than parametric method

  • References

    Choi, K. and Marden, J. (1997). An approach to multivariate rank tests in multivariate analysis of variance. Journal of the American Statistical Association 92(440), pp. 1581-1590.

    Deng, Q. (2009). A nonparametric change-point model for phase II analysis. PhD thesis. University of Minnesota.

    Hawkins, D. M., Qiu, P., and Kang, C. W. (2003). The Changepoint Model for Statistical Process Control. Journal of Quality Technology 35(4), pp. 355-366.

    Thor, J., Lundberg, J., Ask, J., Olsson, J., Carli, C., Harenstam, K., and Brommels, M. (2007). Application of statistical process control in healthcare improvement: systematic review. Quality and Safety in Health Care 16, pp. 387-399.

    Zamba, K. D. and Hawkins, D. M. (2006). A multivariate change-point model for statistical process control. Technometrics 48(4), pp. 539-549.

  • Assumptions required for asymptotic result for Choi and Marden (1997) test statistic:

    Under the null hypothesis,

    $$\Lambda = \mathrm{cov}(D_{ij}) \quad \text{and} \quad \Omega = \mathrm{cov}(D_{ij}, D_{il})$$

    are finite and positive definite when $i$, $j$, and $l$ are all distinct.

    $$k/n \to \lambda_0 \in (0, 1)$$

  • Multivariate gamma distribution

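    Let $Y_0, Y_1, \ldots, Y_p$ be independent gamma random variables with pdfs

    $$p_{Y_i}(y_i) = \frac{1}{\Gamma(\theta_i)}\, e^{-y_i}\, y_i^{\theta_i - 1}, \qquad y_i > 0,\ \theta_i > 0.$$

    Define $X = (Y_0 + Y_1, Y_0 + Y_2, \ldots, Y_0 + Y_p)^T$.

    Marginal distribution of each $X_i$ is a univariate gamma distribution with shape parameter $\theta_0 + \theta_i$.

    $$\rho_{ij} = \mathrm{corr}(X_i, X_j) = \frac{\theta_0}{\sqrt{(\theta_0 + \theta_i)(\theta_0 + \theta_j)}}.$$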
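
    This shared-component construction is straightforward to simulate. The sketch below uses hypothetical function names and unit scale parameters, matching the pdf above; the parameter values in the example are chosen only so that the implied correlation is 0.9.

```python
import numpy as np

def multivariate_gamma(theta0, thetas, size, rng):
    """Rows are X = (Y0 + Y1, ..., Y0 + Yp) with independent unit-scale gammas."""
    thetas = np.asarray(thetas, dtype=float)
    y0 = rng.gamma(shape=theta0, size=(size, 1))           # shared component Y0
    y = rng.gamma(shape=thetas, size=(size, len(thetas)))  # Y1, ..., Yp
    return y0 + y

def implied_correlation(theta0, thetas):
    """Correlation matrix with entries theta0 / sqrt((theta0+theta_i)(theta0+theta_j))."""
    v = theta0 + np.asarray(thetas, dtype=float)
    corr = theta0 / np.sqrt(np.outer(v, v))
    np.fill_diagonal(corr, 1.0)  # the formula applies to i != j only
    return corr

rng = np.random.default_rng(0)
X = multivariate_gamma(theta0=9.0, thetas=[1.0, 1.0], size=20000, rng=rng)
sample_corr = np.corrcoef(X, rowvar=False)
target = implied_correlation(9.0, [1.0, 1.0])
```

    Equal values of θ_1, ..., θ_p give an equicorrelated vector, and ρ = 0 corresponds to dropping the shared component altogether.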

  • Multivariate Cauchy distribution

    Let $Y \sim N_p(\mu, \Sigma)$ and let $w \sim \chi^2_\nu$, independent of $Y$. Define

    $$X = \frac{1}{\sqrt{w/\nu}}\, Y.$$

    Then, $X$ follows the multivariate t distribution.

    If $\nu = 1$, $X$ follows the multivariate Cauchy distribution.
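
    A sampler based on this normal/chi-square ratio might look like the following. The function name is hypothetical, and the sketch makes one assumption that differs from the slide's notation: the normal vector is drawn with mean zero and the location is added after scaling, which is the usual way to obtain a t distribution centered at a given point.

```python
import numpy as np

def multivariate_t(location, cov, df, size, rng):
    """Rows are location + Z / sqrt(w / df), with Z ~ N_p(0, cov) and w ~ chi2(df)."""
    location = np.asarray(location, dtype=float)
    z = rng.multivariate_normal(np.zeros(len(location)), cov, size=size)
    w = rng.chisquare(df, size=(size, 1))  # one independent scale draw per row
    return location + z / np.sqrt(w / df)

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
# df = 1 gives multivariate Cauchy data.
X = multivariate_t(location=[0.0, 0.0], cov=cov, df=1, size=20000, rng=rng)
```

    With df = 1 the sample mean is not a useful summary because the Cauchy distribution has no finite mean; the componentwise median is a safer check of the location.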
