[] statistics - statistical methods for data analytic

Upload: siwi-awalian

Post on 04-Jun-2018


TRANSCRIPT

  • 8/13/2019 [] Statistics - Statistical Methods for Data Analytic

    Chapter 3
    Statistical Methods

    Paul C. Taylor
    University of Hertfordshire
    28th March 2001

    3.1 Introduction

    - Generalized Linear Models
    - Special Topics in Regression Modelling
    - Classical Multivariate Analysis
    - Summary

    3.2 Generalized Linear Models

    - Regression
    - Analysis of Variance
    - Log-linear Models
    - Logistic Regression
    - Analysis of Survival Data

    The fitting of generalized linear models is currently the most frequently applied statistical technique. Generalized linear models are used to describe the relationship between the mean, sometimes called the trend, of one variable and the values taken by several other variables.

    3.2.1 Regression

    How is a variable, $y$, related to one, or more, other variables, $x_1$, $x_2$, ..., $x_m$?

    Names for $y$: response; dependent variable; output.
    Names for the $x_j$s: regressors; explanatory variables; independent variables; inputs.

    Here, we will use the terms output and inputs.

    Common reasons for doing a regression analysis include:

    - the output is expensive to measure, but the inputs are not, and so cheap predictions of the output are sought;
    - the values of the inputs are known earlier than the output is, and a working prediction of the output is required;
    - we can control the values of the inputs, we believe there is a causal link between the inputs and the output, and so we want to know what values of the inputs should be chosen to obtain a particular target value for the output;
    - it is believed that there is a causal link between some of the inputs and the output, and we wish to identify which inputs are related to the output.

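    The (general) linear model is

    $$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi} + \varepsilon_i, \qquad i = 1, \ldots, n \tag{3.1}$$

    where the $\varepsilon_i$s are independently and identically distributed as $N(0, \sigma^2)$ and $n$ is the number of data points.

    The model is linear in the $\beta$s:

    $$E(y_i) = \beta_0 + \sum_{j=1}^{m} \beta_j x_{ji} \tag{3.2}$$

    (A weighted sum of the $\beta$s.)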

    The main reasons for the use of the linear model:

    - The maximum likelihood estimators of the $\beta$s are the same as the least squares estimators; see Section 2.4 of Chapter 2.
    - There are explicit formulae, and rapid, reliable numerical methods, for finding the least squares estimators of the $\beta$s.
    - Many problems can be framed as general linear models. For example, the model in (3.3) can be converted to the form of (3.1) by a suitable choice of inputs.
    - Even when the linear model is not strictly appropriate, there is often a way to transform the output and/or the inputs, so that a linear model can provide useful information.
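As a hedged illustration of the "explicit formulae" for the least squares estimators, the sketch below solves the normal equations $(X^T X)\hat\beta = X^T y$ for a small made-up data set. The data, the helper names `solve` and `least_squares`, and the use of plain Gaussian elimination are assumptions for illustration only; a statistical package would normally use a more numerically stable method.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def least_squares(inputs, y):
    """Least squares estimates for y_i = b0 + b1*x_1i + ... + bm*x_mi + e_i."""
    X = [[1.0] + list(row) for row in inputs]  # leading 1 carries the intercept
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y_i for r, y_i in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical, noise-free data generated from y = 1 + 2*x1 - 3*x2.
inputs = [(0, 1), (1, 0), (2, 3), (3, 1), (4, 4), (5, 2)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in inputs]
beta_hat = least_squares(inputs, y)
print(beta_hat)
```

Because the illustrative data contain no noise, the estimates should reproduce the generating coefficients up to rounding error.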

    Non-linear Regression

    Two examples are given in (3.4) and (3.5), where the $\beta$s and $\varepsilon_i$ are as in (3.1).

    Problems

    1. Estimation is carried out using iterative methods which require good choices of starting values, might not converge, might converge to a local optimum rather than the global optimum, and will require human intervention to overcome these difficulties.
    2. The statistical properties of the estimates and predictions from the model are not known, so we cannot perform statistical inference for non-linear regression.

    Generalized Linear Models

    The generalization is in two parts.

    1. The distribution of the output does not have to be the normal, but can be any of the distributions in the exponential family.
    2. Instead of the expected value of the output being a linear function of the $\beta$s, we have

    $$g(E(y_i)) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi} \tag{3.6}$$

    where $g(\cdot)$ is a monotone differentiable function. The function $g(\cdot)$ is called the link function.

    There is a reliable general algorithm for fitting generalized linear models.
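To make the role of the link function concrete, here is a minimal sketch that maps a linear combination of inputs back to a mean through the inverse of three commonly used links. The dictionary `LINKS`, the function name `mean_from_inputs`, and the example coefficients are illustrative assumptions, not part of the original slides.

```python
import math

# Each entry holds (link g, inverse link g^{-1}); all three links are monotone
# and differentiable on their natural domains.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
}


def mean_from_inputs(link, beta, x):
    """Return E(y) = g^{-1}(beta_0 + beta_1*x_1 + ... + beta_m*x_m)."""
    eta = beta[0] + sum(b * x_j for b, x_j in zip(beta[1:], x))
    _, inverse_link = LINKS[link]
    return inverse_link(eta)


# Hypothetical coefficients and inputs.
print(mean_from_inputs("identity", [1.0, 2.0], [3.0]))  # ordinary linear model
print(mean_from_inputs("log", [0.0, 1.0], [0.0]))       # log-linear model
print(mean_from_inputs("logit", [0.0, 1.0], [0.0]))     # logistic regression
```

The identity link recovers the ordinary linear model, while the log and logit links correspond to the log-linear and logistic models treated later in the section.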

    Generalized Additive Models

    Generalized additive models are a generalization of generalized linear models. The generalization is that $g(E(y_i))$ need not be a linear function of a set of $\beta$s, but has the form

    $$g(E(y_i)) = \beta_0 + g_1(x_{1i}) + \cdots + g_m(x_{mi}) \tag{3.7}$$

    where the $g_j$s are arbitrary, usually smooth, functions.

    An example of the model produced using a type of scatterplot smoother is shown in Figure 3.1.

    [Figure 3.1: Diabetes Data, Spline Smooth (df = 3); Log C-peptide plotted against Age.]

    Methods for fitting generalized additive models exist and are generally reliable. The main drawback is that the framework of statistical inference that is available for generalized linear models has not yet been developed for generalized additive models. Despite this drawback, generalized additive models can be fitted by several of the major statistical packages already.

    3.2.2 Analysis of Variance

    The analysis of variance, or ANOVA, is primarily a method of identifying which of the $\beta$s in a linear model are non-zero. This technique was developed for the analysis of agricultural field experiments, but is now used quite generally.

    Example 27 Turnips for Winter Fodder. The data in Table 3.1 are from an experiment to investigate the growth of turnips. These types of turnip would be grown to provide food for farm animals in winter. The turnips were harvested and weighed by staff and students of the Departments of Agriculture and Applied Statistics of The University of Reading, in October, 1990.

    The linear model (3.8), or an equivalent one, could be fitted to these data. The inputs take the values 0 or 1 and are usually called dummy or indicator variables.

    At first sight, (3.8) should also include two further $\beta$s (one more for the treatments and one more for the blocks), but we do not need them.

    The first question that we would try to answer about these data is

        Does a change in treatment produce a change in the turnip yield?

    which is equivalent to asking

        Are any of the treatment $\beta$s non-zero?

    which is the sort of question that can be answered using ANOVA.

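    This is how the ANOVA works. Recall the general linear model of (3.1),

    $$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi} + \varepsilon_i$$

    The estimate of $\beta_j$ is $\hat\beta_j$.

    Fitted values:

    $$\hat y_i = \hat\beta_0 + \sum_{j=1}^{m} \hat\beta_j x_{ji} \tag{3.9}$$

    Residuals:

    $$e_i = y_i - \hat y_i \tag{3.10}$$

    The size of the residuals is related to the size of $\sigma^2$, the variance of the $\varepsilon_i$s. It turns out that we can estimate $\sigma^2$ by

    $$s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - (1 + m)} \tag{3.11}$$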

    The key facts about $s^2$, the estimate of $\sigma^2$, that allow us to compare different linear models are:

    - if the fitted model is adequate ("the right one"), then $s^2$ is a good estimate of $\sigma^2$;
    - if the fitted model includes redundant terms (that is, includes some $\beta$s that are really zero), then $s^2$ is still a good estimate of $\sigma^2$;
    - if the fitted model does not include one or more inputs that it ought to, then $s^2$ will tend to be larger than the true value of $\sigma^2$.

    So if we omit a useful input from our model, the estimate of $\sigma^2$ will shoot up, whereas if we omit a redundant input from our model, the estimate of $\sigma^2$ should not change much. Note that omitting one of the inputs from the model is equivalent to forcing the corresponding $\beta$ to be zero.

    Example 28 Turnips for Winter Fodder continued. Let the first model be the one at (3.8), and let the second be the model (3.18). So, the second model is the special case of the first in which all of the treatment $\beta$s are zero.

    Table 3.2

                Df   Sum of Sq    Mean Sq    F Value       Pr(F)
    block        3     163.737   54.57891   2.278016  0.08867543
    Residuals   60    1437.538   23.95897

    Table 3.3

                Df   Sum of Sq    Mean Sq    F Value        Pr(F)
    block        3     163.737   54.57891   5.690430  0.002163810
    treat       15    1005.927   67.06182   6.991906  0.000000171
    Residuals   45     431.611    9.59135
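The treatment row of Table 3.3 can be checked with a few lines of arithmetic. The sketch below uses the usual F ratio for comparing nested linear models (extra sum of squares per extra degree of freedom, divided by the residual mean square of the larger model); the function name `f_statistic` is an illustrative assumption, and the numbers are the residual rows of Tables 3.2 and 3.3.

```python
def f_statistic(rss_small, df_small, rss_big, df_big):
    """F ratio for a smaller model nested inside a bigger one.

    rss_* are residual sums of squares, df_* the residual degrees of freedom.
    """
    extra_ss = rss_small - rss_big   # sum of squares explained by the extra terms
    extra_df = df_small - df_big     # number of extra parameters
    s2_big = rss_big / df_big        # estimate of sigma^2 from the bigger model
    return (extra_ss / extra_df) / s2_big


# Residual rows: blocks only (Table 3.2) and blocks plus treatments (Table 3.3).
s2_blocks_only = 1437.538 / 60
s2_full = 431.611 / 45
f_treat = f_statistic(1437.538, 60, 431.611, 45)

print(round(s2_blocks_only, 5), round(s2_full, 5), round(f_treat, 4))
```

Dropping the treatment inputs makes the estimate of $\sigma^2$ rise from about 9.59 to about 23.96, which is the "shoot up" behaviour expected when useful inputs are omitted.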

    Table 3.4 shows the ANOVA that would usually be produced for the turnip data. Notice that the block and Residuals rows are the same as in Table 3.3. The basic difference between Tables 3.3 and 3.4 is that the treatment information is broken down into its constituent parts in Table 3.4.

    Table 3.4

                             Df  Sum of Sq    Mean Sq   F Value      Pr(F)
    block                     3   163.7367    54.5789   5.69043  0.0021638
    variety                   1    83.9514    83.9514   8.75282  0.0049136
    sowing                    1   233.7077   233.7077  24.36650  0.0000114
    density                   3   470.3780   156.7927  16.34730  0.0000003
    variety:sowing            1    36.4514    36.4514   3.80045  0.0574875
    variety:density           3     8.6467     2.8822   0.30050  0.8248459
    sowing:density            3   154.7930    51.5977   5.37960  0.0029884
    variety:sowing:density    3    17.9992     5.9997   0.62554  0.6022439
    Residuals                45   431.6108     9.5914

    3.2.3 Log-linear Models

    The data shown in Table 3.7 show the sort of problem attacked by log-linear modelling. There are five categorical variables displayed in Table 3.7:

    - centre: one of three health centres for the treatment of breast cancer;
    - age: the age of the patient when her breast cancer was diagnosed;
    - survived: whether the patient survived for at least three years from diagnosis;
    - appear: appearance of the patient's tumour, either malignant or benign;
    - inflam: amount of inflammation of the tumour, either minimal or greater.

    Table 3.7

                                                     State of Tumour
                                       Minimal Inflammation      Greater Inflammation
                                       Malignant     Benign      Malignant     Benign
    Centre     Age         Survived    Appearance  Appearance    Appearance  Appearance
    Tokyo      Under 50    No                   9           7            4           3
                           Yes                 26          68           25           9
               50-69       No                   9           9           11           2
                           Yes                 20          46           18           5
               70 or over  No                   2           3            1           0
                           Yes                  1           6            5           1
    Boston     Under 50    No                   6           7            6           0
                           Yes                 11          24            4           0
               50-69       No                   8          20            3           2
                           Yes                 18          58           10           3
               70 or over  No                   9          18            3           0
                           Yes                 15          26            1           1
    Glamorgan  Under 50    No                  16           7            3           0
                           Yes                 16          20            8           1
               50-69       No                  14          12            3           0
                           Yes                 27          39           10           4
               70 or over  No                   3           7            3           0
                           Yes                 12          11            4           1

    For these data, the output is the number of patients in each cell. The model is

    $$y_i \sim \text{Pois}(\mu_i), \qquad \log(\mu_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi} \tag{3.21}$$

    Since all the variables of interest are categorical, we need to use indicator variables as inputs in the same way as in (3.8).

    Table 3.8

    Terms added sequentially (first to last)

                      Df   Deviance  Resid. Df  Resid. Dev    Pr(Chi)
    NULL                                    71    860.0076
    centre             2     9.3619         69    850.6457  0.0092701
    age                2   105.5350         67    745.1107  0.0000000
    survived           1   160.6009         66    584.5097  0.0000000
    inflam             1   291.1986         65    293.3111  0.0000000
    appear             1     7.5727         64    285.7384  0.0059258
    centre:age         4    76.9628         60    208.7756  0.0000000
    centre:survived    2    11.2698         58    197.5058  0.0035711
    centre:inflam      2    23.2484         56    174.2574  0.0000089
    centre:appear      2    13.3323         54    160.9251  0.0012733
    age:survived       2     3.5257         52    157.3995  0.1715588
    age:inflam         2     0.2930         50    157.1065  0.8637359
    age:appear         2     1.2082         48    155.8983  0.5465675
    survived:inflam    1     0.9645         47    154.9338  0.3260609
    survived:appear    1     9.6709         46    145.2629  0.0018721
    inflam:appear      1    95.4381         45     49.8248  0.0000000
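The columns of Table 3.8 are linked by simple arithmetic: each term lowers the residual degrees of freedom and residual deviance by its own Df and Deviance, and Pr(Chi) is the upper tail of a chi-square distribution evaluated at the deviance. The sketch below re-derives the first few rows under that reading; the function name `chi2_upper_tail` is an illustrative assumption, and it only covers the closed-form cases df = 1, 2 and 4 that those rows need.

```python
import math


def chi2_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x), closed forms for df = 1, 2, 4."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 4:
        return math.exp(-x / 2) * (1 + x / 2)
    raise ValueError("this sketch only handles df in {1, 2, 4}")


# (term, Df, Deviance) for the first rows of Table 3.8.
rows = [("centre", 2, 9.3619), ("age", 2, 105.5350), ("survived", 1, 160.6009),
        ("inflam", 1, 291.1986), ("appear", 1, 7.5727), ("centre:age", 4, 76.9628)]

resid_df, resid_dev = 71, 860.0076  # NULL row
table = []
for term, df, dev in rows:
    resid_df -= df
    resid_dev -= dev
    table.append((term, resid_df, round(resid_dev, 4), chi2_upper_tail(dev, df)))

for row in table:
    print(row)
```

Small differences in the last decimal place of the residual deviance are to be expected, because the tabulated deviances are themselves rounded.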

    To summarise this model, I would construct its conditional independence graph and present tables corresponding to the interactions. Tables are in the book. The conditional independence graph is shown in Figure 3.2.

    [Figure 3.2: conditional independence graph with nodes age, centre, survived, inflam and appear.]

    3.2.4 Logistic Regression

    In logistic regression, the output is the number of successes out of a number of trials, each trial resulting in either a success or failure. For the breast cancer data, we can regard each patient as a trial, with success corresponding to the patient surviving for three years. The output would simply be given as the number of successes, either 0 or 1, for each of the 764 patients involved in the study.

    The model that we will fit is

    $$y_i \sim \text{Bin}(1, p_i) \quad \text{and} \quad \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi} \tag{3.22}$$

    Again, the inputs here will be indicators for the breast cancer data, but this is not generally true; there is no reason why any of the inputs should not be quantitative.

    Table 3.15

                               Df   Deviance  Resid. Df  Resid. Dev    Pr(Chi)
    NULL                                            763    898.5279
    centre                      2   11.26979        761    887.2582  0.0035711
    age                         2    3.52566        759    883.7325  0.1715588
    appear                      1    9.69100        758    874.0415  0.0018517
    inflam                      1    0.00653        757    874.0350  0.9356046
    centre:age                  4    7.42101        753    866.6140  0.1152433
    centre:appear               2    1.08077        751    865.5332  0.5825254
    centre:inflam               2    3.39128        749    862.1419  0.1834814
    age:appear                  2    2.33029        747    859.8116  0.3118773
    age:inflam                  2    0.06318        745    859.7484  0.9689052
    appear:inflam               1    0.24812        744    859.5003  0.6184041
    centre:age:appear           4    2.04635        740    857.4540  0.7272344
    centre:age:inflam           4    7.04411        736    850.4099  0.1335756
    centre:appear:inflam        2    5.07840        734    845.3315  0.0789294
    age:appear:inflam           2    4.34374        732    840.9877  0.1139642
    centre:age:appear:inflam    3    0.01535        729    840.9724  0.9994964

    The fitted model is simple enough in this case for the parameter estimates to be included here; they are shown in the form that a statistical package would present them in Table 3.16.

    Table 3.16

    Coefficients:
    (Intercept)     centre2     centre3     appear
       1.080257  -0.6589141  -0.4944846  0.5157151

    Using the estimates given in Table 3.16, the fitted model is

    $$\log\left(\frac{\hat p_i}{1 - \hat p_i}\right) = 1.080257 - 0.6589141\,\text{centre2}_i - 0.4944846\,\text{centre3}_i + 0.5157151\,\text{appear}_i \tag{3.23}$$
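The estimates in Table 3.16 can be turned into fitted three-year survival probabilities by inverting the logit link. The sketch below assumes that `centre2`, `centre3` and `appear` enter as 0/1 indicator inputs, as the text describes for these data; which centre or appearance category each indicator marks is not stated in this transcript, so the function arguments are deliberately generic and the function name is hypothetical.

```python
import math

# Coefficient estimates copied from Table 3.16.
COEF = {
    "(Intercept)": 1.080257,
    "centre2": -0.6589141,
    "centre3": -0.4944846,
    "appear": 0.5157151,
}


def survival_probability(centre2=0, centre3=0, appear=0):
    """Fitted probability of success, assuming 0/1 indicator inputs."""
    eta = (COEF["(Intercept)"]
           + COEF["centre2"] * centre2
           + COEF["centre3"] * centre3
           + COEF["appear"] * appear)
    return 1 / (1 + math.exp(-eta))  # inverse of the logit link


baseline = survival_probability()
print(round(baseline, 3))
print(round(survival_probability(centre2=1), 3))
print(round(survival_probability(appear=1), 3))
```

A negative coefficient lowers the fitted probability relative to the baseline combination, and a positive one raises it.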

    3.2.5 Analysis of Survival Data

    Survival data are data concerning how long it takes for a particular event to happen. In many medical applications the event is death of a patient with an illness, and so we are analysing the patient's survival time. In industrial applications the event is often failure of a component in a machine.

    The output in this sort of problem is the survival time. As with all the other problems that we have seen in this section, the task is to fit a regression model to describe the relationship between the output and some inputs. In the medical context, the inputs are usually qualities of the patient, such as age and sex, or are determined by the treatment given to the patient.

    We will skip this topic.

    3.3 Special Topics in Regression Modelling

    - Multivariate Analysis of Variance
    - Repeated Measures Data
    - Random Effects Models

    The topics in this section are special in the sense that they are extensions to the basic idea of regression modelling. The techniques have been developed in response to methods of data collection in which the usual assumptions of regression modelling are not justified.

    3.3.1 Multivariate Analysis of Variance

    Model

    $$\underset{p \times 1}{\mathbf{y}_i} = \boldsymbol\beta_0 + \boldsymbol\beta_1 x_{1i} + \cdots + \boldsymbol\beta_m x_{mi} + \boldsymbol\varepsilon_i, \qquad i = 1, \ldots, n \tag{3.26}$$

    where the $\boldsymbol\varepsilon_i$s are independently and identically distributed as $N_p(\mathbf{0}, \Sigma)$ and $n$ is the number of data points. The $p \times 1$ under $\mathbf{y}_i$ indicates the dimensions of the vector, in this case $p$ rows and 1 column; the $\boldsymbol\beta$s are also vectors.

    This model can be fitted in exactly the same way as a linear model (by least squares estimation). One way to do this fitting would be to fit a linear model to each of the $p$ dimensions of the output, one-at-a-time.

    Having fitted the model, we can obtain fitted values

    $$\hat{\mathbf{y}}_i = \hat{\boldsymbol\beta}_0 + \sum_{j=1}^{m} \hat{\boldsymbol\beta}_j x_{ji}$$

    and hence residuals

    $$\mathbf{y}_i - \hat{\mathbf{y}}_i$$

    The analogue of the residual sum of squares from the (univariate) linear model is the matrix of residual sums of squares and products for the multivariate linear model. This matrix is defined to be

    $$R = \sum_{i=1}^{n} (\mathbf{y}_i - \hat{\mathbf{y}}_i)(\mathbf{y}_i - \hat{\mathbf{y}}_i)^T$$

    3.3.2 Repeated Measures Data

    Repeated measures data are generated when the output variable is observed at several points in time, on the same individuals. Usually, the covariates are also observed at the same time points as the output; so the inputs are time-dependent too. Thus, as in Section 3.3.1 the output is a vector of measurements. In principle, we can simply apply the techniques of Section 3.3.1 to analyse repeated measures data. Instead, we usually try to use the fact that we have the same set of variables (output and inputs) at several times, rather than a collection of different variables making up a vector output.

    Repeated measures data are often called longitudinal data, especially in the social sciences. The term cross-sectional is often used to mean "not longitudinal".

    3.3.3 Random Effects Models

    Overdispersion

    In a logistic regression we might replace (3.22) with

    $$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi} + U_i \tag{3.29}$$

    where the $U_i$s are independently and identically distributed as $N(0, \sigma_U^2)$. We can think of $U_i$ as representing either the effect of the missing input on $p_i$ or simply as random variation in the success probabilities for individuals that have the same values for the input variables.

    Hierarchical models

    In the turnip experiment, the growth of the turnips is affected by the different blocks, but the effects (the $\beta$s) for each block are likely to be different in different years. So we could think of the $\beta$s for each block as coming from a population of $\beta$s for blocks. If we did this, then we could replace the model in (3.8) with the model (3.30), in which the four block $\beta$s are independently and identically distributed normal random variables.

    3.4 Classical Multivariate Analysis

    - Principal Components Analysis
    - Correspondence Analysis
    - Multidimensional Scaling
    - Cluster Analysis and Mixture Decomposition
    - Latent Variable and Covariance Structure Models

    3.4.1 Principal Components Analysis

    Principal components analysis is a way of transforming a set of $p$-dimensional vector observations, $\mathbf{x}_1$, $\mathbf{x}_2$, ..., $\mathbf{x}_n$, into another set of $p$-dimensional vectors, $\mathbf{y}_1$, $\mathbf{y}_2$, ..., $\mathbf{y}_n$. The $\mathbf{y}$s have the property that most of their information content is stored in the first few dimensions (features).

    This will allow dimensionality reduction, so that we can do things like:

    - obtaining (informative) graphical displays of the data in 2-D;
    - carrying out computer intensive methods on reduced data;
    - gaining insight into the structure of the data, which was not apparent in $p$ dimensions.

  • 8/13/2019 [] Statistics - Statistical Methods for Data Analytic

    40/66

    [Figure 3.3: Fisher's Iris Data (collected by Anderson). Scatterplot matrix of Sepal L., Sepal W., Petal L. and Petal W.]


    The main idea behind principal components analysis is that high information corresponds to high variance.

    So, if we wanted to reduce the $x$'s to a single dimension we would transform $x$ to $y = a^T x$, choosing the unit-length vector $a$ so that $y$ has the largest variance possible.

    It turns out that $a$ should be the eigenvector corresponding to the largest eigenvalue of the variance (covariance) matrix of $x$, $\Sigma$.

    It is also possible to show that, of all the directions orthogonal to the direction of highest variance, the (second) highest variance is in the direction parallel to the eigenvector of the second largest eigenvalue of $\Sigma$. These results extend all the way to $p$ dimensions.


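    The estimate of $\Sigma$ is

    $$S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T \qquad (3.31)$$

    where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.

    The eigenvalues of $S$ are $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.

    The eigenvectors of $S$ corresponding to $\lambda_1, \lambda_2, \ldots, \lambda_p$ are $v_1, v_2, \ldots, v_p$, respectively.

    The vectors $v_1, v_2, \ldots, v_p$ are called the principal axes. ($v_1$ is the first principal axis, etc.)

    The $p \times p$ matrix whose $i$th column is $v_i$ will be denoted as $V$.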


    The principal axes (can be and) are chosen so that they are of length 1 and are orthogonal (perpendicular). Algebraically, this means that

    $$v_i^T v_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases} \qquad (3.32)$$

    The vector $y$ defined as

    $$y = V^T x = (v_1^T x, \; \ldots, \; v_p^T x)^T$$

    is called the vector of principal component scores of $x$. The $i$th principal component score of $x$ is $y_i = v_i^T x$; sometimes the principal component scores are referred to as the principal components.


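    1. The elements of $y$ are uncorrelated and the sample variance of the $i$th principal component score is $\lambda_i$. In other words the sample variance matrix of $y$ is

    $$\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p).$$

    2. The sum of the sample variances for the principal components is equal to the sum of the sample variances for the elements of $x$. That is,

    $$\sum_{i=1}^{p} \lambda_i = \sum_{i=1}^{p} s_i^2$$

    where $s_i^2$ is the sample variance of the $i$th element of $x$.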
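    The two properties above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `principal_components` and the synthetic data are illustrative assumptions (the Iris data are not used here).

```python
import numpy as np

def principal_components(X):
    """Return eigenvalues (largest first), principal axes (as columns) and scores."""
    X = np.asarray(X, dtype=float)
    S = np.cov(X, rowvar=False)            # sample variance matrix, divisor n - 1
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh suits symmetric S; ascending order
    order = np.argsort(eigvals)[::-1]      # reorder so the largest eigenvalue is first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = X @ eigvecs                   # row i holds the scores of observation i
    return eigvals, eigvecs, scores

rng = np.random.default_rng(0)
# Synthetic correlated data (an assumption made only for this illustration)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

eigvals, V, Y = principal_components(X)
score_cov = np.cov(Y, rowvar=False)            # should be diagonal (property 1)
total_var_x = X.var(axis=0, ddof=1).sum()      # should equal eigvals.sum() (property 2)
```

    Here the off-diagonal entries of `score_cov` are zero up to rounding error, and its diagonal reproduces the eigenvalues.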


    [Figure 3.4: Principal component scores for Fisher's Iris Data. Scatterplot matrix of y1, y2, y3 and y4. Compare with Figure 3.3.]


    Effective Dimensionality

    1. The proportion of variance accounted for. Take the first $r$ principal components and add up their variances. Divide by the sum of all the variances, to give

    $$\frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{p} \lambda_i}$$

    which is called the proportion of variance accounted for by the first $r$ principal components.

    Usually, projections accounting for over 75% of the total variance are considered to be good. Thus, a 2-D picture will be considered a reasonable representation if

    $$\frac{\lambda_1 + \lambda_2}{\sum_{i=1}^{p} \lambda_i} > 0.75.$$


    2. The size of important variance. The idea here is to consider the variance if all directions were equally important. In this case the variances would be approximately

    $$\bar{\lambda} = \frac{1}{p} \sum_{i=1}^{p} \lambda_i.$$

    The argument runs:

    If $\lambda_i < \bar{\lambda}$, then the $i$th principal direction is less interesting than average,

    and this leads us to discard principal components that have sample variances below $\bar{\lambda}$.

    3. Scree diagram. A scree diagram is an index plot of the principal component variances. In other words it is a plot of $\lambda_i$ against $i$. An example of a scree diagram, for the Iris Data, is shown in Figure 3.5.
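    The first two criteria reduce to simple arithmetic on the eigenvalues. A minimal sketch, assuming NumPy; the function names and the example eigenvalues are hypothetical and are not taken from the Iris analysis.

```python
import numpy as np

def proportion_of_variance(eigvals, r):
    """Proportion of total variance accounted for by the r largest eigenvalues."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return eigvals[:r].sum() / eigvals.sum()

def n_components_above_average(eigvals):
    """Number of components whose variance is not below the mean eigenvalue."""
    eigvals = np.asarray(eigvals, dtype=float)
    return int(np.sum(eigvals >= eigvals.mean()))

# Hypothetical principal component variances, chosen only for illustration
example_eigvals = np.array([5.0, 2.5, 0.25, 0.25])

prop_2d = proportion_of_variance(example_eigvals, 2)     # (5.0 + 2.5) / 8.0
n_kept = n_components_above_average(example_eigvals)     # mean eigenvalue is 2.0
```

    With these made-up values a 2-D picture would pass the 75% rule, and the mean-eigenvalue rule would also retain two components.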


    Normalising

    The data can be normalised by carrying out the following steps.

    Centre each variable. In other words subtract the mean of each variable to give

    $$\tilde{x}_i = x_i - \bar{x}.$$

    Divide each element of $\tilde{x}_i$ by its standard deviation; as a formula this means calculate

    $$z_{ij} = \frac{\tilde{x}_{ij}}{s_j}$$

    where $s_j$ is the sample standard deviation of the $j$th variable.
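    The two normalising steps can be sketched as follows, assuming NumPy. The function name `normalise` and the synthetic two-column data are assumptions for illustration; the second data set multiplies one column by 5 to show that normalised values do not depend on the units of measurement.

```python
import numpy as np

def normalise(X):
    """Centre each column, then divide it by its sample standard deviation."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)            # step 1: subtract the column means
    return centred / X.std(axis=0, ddof=1)  # step 2: divide by sample std (n - 1)

rng = np.random.default_rng(1)
# Synthetic data with two variables on different scales (illustrative only)
X = rng.normal(loc=[3.0, 4.0], scale=[0.5, 1.5], size=(50, 2))
X_rescaled = X * np.array([1.0, 5.0])       # second variable in different units

Z = normalise(X)
Z_rescaled = normalise(X_rescaled)
```

    After normalising, `Z` and `Z_rescaled` agree, so a principal components analysis of the normalised data is unaffected by the change of units.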


    [Figure 3.6: If we don't normalise. Two scatterplots of Sepal W. and Petal L., both on axes from -10 to 15: "Mean Centred Data" (using Petal L.) and "Scaled Data" (using 5 x Petal L.).]


    Interpretation

    The final part of a principal components analysis is to inspect the eigenvectors in the hope of identifying a meaning for the (important) principal components.

    See the book for an interpretation for Fisher's Iris Data.


    3.4.2 Correspondence Analysis

    Correspondence analysis is a way to represent the structure within incidence matrices. Incidence matrices are also called two-way contingency tables. An example of a $5 \times 4$ incidence matrix, with marginal totals, is shown in Table 3.17.

    Table 3.17

                                Smoking Category
    Staff Group         None   Light   Medium   Heavy   Total
    Senior Managers        4       2        3       2      11
    Junior Managers        4       3        7       4      18
    Senior Employees      25      10       12       4      51
    Junior Employees      18      24       33      13      88
    Secretaries           10       6        7       2      25
    Total                 61      45       62      25     193


    Two Stages

    Transform the values in a way that relates to a test for association between rows and columns (chi-squared test).

    Use a dimensionality reduction method to allow us to draw a picture of the relationships between rows and columns in 2-D.

    Details are like principal components analysis mathematically; see the book.
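    The two stages can be sketched on the counts of Table 3.17. This is a minimal sketch of one common formulation (standardised residuals followed by a singular value decomposition), assuming NumPy; it is not necessarily the derivation given in the book, and all variable names are illustrative.

```python
import numpy as np

# Counts from Table 3.17 (rows: staff groups; columns: None, Light, Medium, Heavy)
N = np.array([[4, 2, 3, 2],
              [4, 3, 7, 4],
              [25, 10, 12, 4],
              [18, 24, 33, 13],
              [10, 6, 7, 2]], dtype=float)

n = N.sum()
P = N / n                      # correspondence matrix of proportions
r = P.sum(axis=1)              # row masses
c = P.sum(axis=0)              # column masses

# Stage 1: standardised residuals, as in the chi-squared test of association
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Stage 2: dimensionality reduction by singular value decomposition
U, sing, Vt = np.linalg.svd(S, full_matrices=False)
total_inertia = np.sum(sing ** 2)
row_coords = (U * sing) / np.sqrt(r)[:, None]      # row points; plot first two columns
col_coords = (Vt.T * sing) / np.sqrt(c)[:, None]   # column points

# Chi-squared statistic computed directly, for comparison with n * total_inertia
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = np.sum((N - expected) ** 2 / expected)
```

    The total inertia multiplied by the sample size equals the chi-squared statistic, which is the link between the two stages.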


    3.4.3 Multidimensional Scaling

    Multidimensional scaling is the process of converting a set of pairwise dissimilarities for a set of points, into a set of co-ordinates for the points.

    Examples of dissimilarities could be:

    the price of an airline ticket between pairs of cities;
    road distances between towns (as opposed to straight-line distances);
    a coefficient indicating how different the artefacts found in pairs of tombs within a graveyard are.


    Classical Scaling

    Classical scaling is also known as metric scaling and as principal co-ordinates analysis. The name metric scaling is used because the dissimilarities are assumed to be distances, or in mathematical terms the measure of dissimilarity is the euclidean metric. The name principal co-ordinates analysis is used because there is a link between this technique and principal components analysis. The name classical is used because it was the first widely used method of multidimensional scaling, and pre-dates the availability of electronic computers.

    The derivation of the method used to obtain the configuration is given in the book.
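    As an illustration only, a common way to compute a classical scaling configuration is to double-centre the squared dissimilarities and take the leading eigenvectors. The sketch below assumes NumPy and that the dissimilarities are euclidean distances; the function name `classical_scaling` and the five test points are hypothetical.

```python
import numpy as np

def classical_scaling(D, dims=2):
    """Return an n x dims configuration from an n x n matrix of distances D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:dims]     # keep the dims largest
    lam = np.clip(eigvals[order], 0.0, None)     # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)

# Hypothetical points in the plane, used to build exact euclidean distances
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0], [1.0, 1.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

config = classical_scaling(D, dims=2)
D_hat = np.linalg.norm(config[:, None, :] - config[None, :, :], axis=2)
```

    The recovered configuration reproduces the distances but not the original orientation: it is determined only up to rotation, reflection and translation.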


    The results of applying classical scaling to British road distances are shown in Figure 3.7. These road distances correspond to the routes recommended by the Automobile Association; these recommended routes are intended to give the minimum travelling time, not the minimum journey distance.

    An effect of this, that is visible in Figure 3.7, is that the towns and cities have lined up in positions related to the motorway network.

    The map also features distortions from the geographical map, such as the position of Holyhead (holy), which appears to be much closer to Liverpool (lver) and Manchester than it really is, and the position of the Cornish peninsula (the part ending at Penzance, penz), which is further from Carmarthen (carm) than it is physically.


    [Figure 3.7: Plot with axes Component 1 and Component 2; the points are towns labelled by four-letter abbreviations such as abdn, carm, holy, lver, manc, penz and lond.]


    Ordinal Scaling

    Ordinal scaling is used for the same purposes as classical scaling, but for dissimilarities that are not metric, that is, they are not what we would think of as distances. Ordinal scaling is sometimes called non-metric scaling, because the dissimilarities are not metric. Some people call it Shepard-Kruskal scaling, because Shepard and Kruskal are the names of two pioneers of ordinal scaling.

    In ordinal scaling, we seek a configuration in which the pairwise distances between points have the same rank order as the corresponding dissimilarities. So, if $\delta_{ij}$ is the dissimilarity between points $i$ and $j$, and $d_{ij}$ is the distance between the same points in the derived configuration, then we seek a configuration in which

    $$d_{ij} \le d_{kl} \quad \text{if} \quad \delta_{ij} \le \delta_{kl}.$$


    3.4.4 Cluster Analysis and Mixture Decomposition

    Cluster analysis and mixture decomposition are both techniques to do with identification of concentrations of individuals in a space.


    Cluster Analysis

    Cluster analysis is used to identify groups of individuals in a sample. The groups are not pre-defined, nor, usually, is the number of groups. The groups that are identified are referred to as clusters.

    hierarchical
        agglomerative
        divisive
    non-hierarchical


    Minimum distance or single-link

    Maximum distance or complete-link

    Average distance

    Centroid distance defines the distance between two clusters as the squared distance between the mean vectors (that is, the centroids) of the two clusters.

    Sum of squared deviations defines the distance between two clusters as the sum of the squared distances of individuals from the joint centroid of the two clusters minus the sum of the squared distances of individuals from their separate cluster means.
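    The between-cluster distances listed above can be written as short functions. This is a sketch assuming NumPy and euclidean distance between individuals; the function names are hypothetical, and "average distance" is taken here to mean the mean of all between-cluster pairwise distances, which is an assumption since the slide does not define it.

```python
import numpy as np

def pairwise(A, B):
    """Euclidean distances between every row of A and every row of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):
    return pairwise(A, B).min()        # minimum distance

def complete_link(A, B):
    return pairwise(A, B).max()        # maximum distance

def average_distance(A, B):
    return pairwise(A, B).mean()       # assumed: mean of between-cluster distances

def centroid_distance(A, B):
    # squared distance between the two mean vectors
    return float(np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2))

def sum_of_squared_deviations(A, B):
    # squared deviations about the joint centroid minus those about separate means
    def within(C):
        return float(np.sum((C - C.mean(axis=0)) ** 2))
    return within(np.vstack([A, B])) - within(A) - within(B)

# Two small hypothetical clusters in the plane
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [3.0, 2.0]])
```

    For these two clusters the single-link distance is 3, the complete-link distance is the square root of 13, and both the centroid distance and the sum of squared deviations equal 9.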


    [Figure 3.8: Usual way to present results of hierarchical clustering. Tree diagram of individuals 1 to 9; axis: Distance between clusters.]


    Non-hierarchical clustering is essentially trying to partition the sample so as to optimize some measure of clustering.

    The choice of measure of clustering is usually based on properties of sums of squares and products matrices, like those met in Section 3.3.1, because the aim in the MANOVA is to measure differences between groups.

    The main difficulty here is that there are too many different ways to partition the sample for us to try them all, unless the sample is very small (around about ... or smaller). Thus our only way, in general, of guaranteeing that the global optimum is achieved is to use a method such as branch-and-bound.

    One of the best known non-hierarchical clustering methods is the $k$-means method.
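    A minimal sketch of the k-means idea, assuming NumPy and the common alternating scheme (assign each individual to the nearest centre, then recompute the centres). It finds a local optimum only, in line with the difficulty described above; the function name, starting rule and synthetic data are assumptions for illustration.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Partition the rows of X into k clusters; returns (labels, centres)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct individuals chosen at random (one of many possible rules)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                  # nearest centre for each row
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):       # centres stable: stop
            break
        centres = new_centres
    # Final assignment, so that labels always match the returned centres
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centres

rng = np.random.default_rng(2)
# Two well-separated synthetic groups of 20 individuals each (illustrative only)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
labels, centres = k_means(X, k=2)
```

    In practice the procedure is usually repeated from several random starts and the partition with the best value of the clustering measure is kept.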


    Mixture Decomposition

    Mixture decomposition is related to cluster analysis in that it is used to identify concentrations of individuals. The basic difference between cluster analysis and mixture decomposition is that there is an underlying statistical model in mixture decomposition, whereas there is no such model in cluster analysis. The probability density that has generated the sample data is assumed to be a mixture of several underlying distributions. So we have

    $$f(x) = \sum_{j=1}^{k} \pi_j f_j(x; \theta_j)$$

    where $k$ is the number of underlying distributions, the $f_j$'s are the densities of the underlying distributions, the $\theta_j$'s are the parameters of the underlying distributions, the $\pi_j$'s are positive and sum to one, and $f$ is the density from which the sample has been generated.

    Details in one of Hand's books.
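    To make the mixture density concrete, the sketch below evaluates a two-component univariate normal mixture, assuming NumPy. The choice of normal components, the weights and the parameter values are all assumptions for illustration; estimating them from data (for example by maximum likelihood) is not shown.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, weights, means, sds):
    """Weighted sum of normal densities; weights must be positive and sum to one."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights <= 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("mixing weights must be positive and sum to one")
    x = np.asarray(x, dtype=float)
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))

# Hypothetical mixture: 30% from N(0, 1) and 70% from N(5, 1.5^2)
grid = np.linspace(-10.0, 15.0, 5001)
density = mixture_pdf(grid, weights=[0.3, 0.7], means=[0.0, 5.0], sds=[1.0, 1.5])

# Numerical check that the mixture is itself a density (area close to one)
area = np.sum(density) * (grid[1] - grid[0])
```

    Because the weights sum to one and each component integrates to one, the mixture integrates to one as well.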


    3.4.5 Latent Variable and Covariance Structure Models

    I have never used the techniques in this section, so I do not consider myself expert enough to give a presentation on them.

    Not enough time to cover everything.


    3.5 Summary

    The techniques presented in this chapter do not form anything like an exhaustive list of useful statistical methods. These techniques were chosen because they are either widely used or ought to be widely used. The regression techniques are widely used, though there is some reluctance amongst researchers to make the jump from linear models to generalized linear models.

    The multivariate analysis techniques ought to be used more than they are. One of the main obstacles to the adoption of these techniques may be that [...] are in linear algebra.

    I feel the techniques presented in this chapter, and their extensions, w[...] or become the most widely used statistical techniques. This is why they were chosen for this chapter.