statistical multiple decision makingpeople.stat.sc.edu/pena/talkspresented/penamusttalk2012.pdf ·...

Statistical Multiple Decision Making

Edsel A. PenaUniversity of South CarolinaColumbia, South Carolina(E-Mail: [email protected])

MUST Colloquium TalkMarch 2012

Edsel A. Pena University of South Carolina Columbia, South Carolina (E-Mail: [email protected])Statistical Multiple Decision Making

Outline

▶ Some Motivating Problems.

▶ Multiple Decision Problems.

▶ Mathematical Framework (Decision Functions, Losses, Risks).

▶ Special Case: Optimal Choice Between Two Actions.

▶ Multiple Decision Processes.

▶ Multiple Decision Size Function.

▶ Class of FWER-Controlling MDFs.

▶ Class of FDR-Controlling MDFs.

▶ An Application to a Microarray Data Set.

▶ Towards Optimal MDFs.

▶ Applicability and Some Comparisons.


Some Motivating Questions and Areas of Relevance

▶ Microarray data analysis: Which genes are relevant?

▶ Variable selection: Which of many predictors are relevant?

▶ Survival analysis: Which predictors affect a lifetime variable?

▶ Reliability/Engineering: Which components in a system arerelevant?

▶ Epidemiology: Spread of a disease in a geographical area.

▶ Oil (mineral) exploration: Where to dig?

▶ Business: Locations of business ventures.


A Microarray Data: HeatMap of Gene Expression Levels

First 100 genes out of 41267 genes in a colon cancer study at USC(M Pena’s Lab). Three groups (Control; 9 Days; 2 Weeks) with 6replicates each.

Co1

Co2

Co3

Co4

Co5

Co6

9D1

9D2

9D3

9D4

9D5

9D6

MTS1

MTS2

MTS3

MTS4

MTS5

MTS6

DtnbAK083162AW822216LOC383462Spnb2Bsdc1Ube1dc14930544G21RikEda2610029G23RikMlc1Thap4Bai3Mylc2plD10Ertd322ePhf8Prkag3TC1048469Scn2a1Clic3Praf1Zfhx1bMuc1Trim7BB438848H1fx2410018C17RikHist2h4Cnot4Olfr1325Nup205Slc16a13Timm17b4932416A15NAP059572−1AK076557Tde2lSfrp4PnnSnip1Dnaja2Nfatc2ip2610024E20RikCul54930553M18RikRsbn1Rbm13C430010P07RikCfhSpecc1UtxNudt161200015N20Rik3110002L15RikFndc3bLrrc40BtrcSnap29Rnpc1LOC545814Ascc3l15730442P18RikA530023G15RikBzrap1Ntan12300010F08RikSlc9a7Ipo7D930036F22Rik1810058N15RikGsk3bShprhMyripD5Bwg0860eCcr92810429K17RikAtad3aMyh92810422B04RikZfpn1a1Olfr745E430029F06SolhAtp11bMfap1St6galnac5Nxph4C030039E19RikTrim30Slc39a8XM_148854Lair1Dpy19l1TeraD030004A10RikPargAsh1l1110064P04Rik

HeatMap of First 100 Genes


A Typical Variable Selection Problem

▶ Model.

Y = �0 +M∑

j=1

�jXj + �

▶ M is large, but many �js are equal to zero.

▶ Observed Data: For j = 1, 2, . . . , n,

(Zj , �j ,X1j ,X2j , . . . ,XMj)

withZj = min(Yj ,Cj) and �j = I{Yj ≤ Cj}

▶ Goal: To select the relevant predictor variables.


A Reliability (or Biological Pathways) Problem

▶ System is composed of components.

▶ Structure function, �, relates components to system: series,parallel, series-parallel, etc.

▶ M potential components that could constitute a system. Wedo not know which components are relevant nor do we knowthe structure function.

▶ Question: Given data regarding the states or lifetimes of thesystem and components, how could we determine whichcomponents are relevant for this system?

▶ Component lifetimes may be censored by system lifetime.

▶ Highly nonlinear types of relationships.


The General Decision Problem

▶ We would like to discover the value of a parameter

� = (�1, �2, . . . , �M) ∈ Θ = {0, 1}M

▶ �m = 1 means mth component is relevant; �m = 0 means mthcomponent is not relevant.

▶ Want to choose an action

a = (a1, a2, . . . , aM) ∈ A = {0, 1}M

▶ am = 1 means we declare that �m = 1, called a discovery;am = 0 means we declare that �m = 0, a non-discovery.


Assessing our Actions: Losses

▶ Family-wise error indicator:

L0(a, �) = I

{

M∑

m=1

am(1− �m) > 0

}

▶ False Discovery Proportion:

L1(a, �) =

∑Mm=1 am(1− �m)

max{∑Mm=1 am, 1}

▶ Missed Discovery Proportion:

L2(a, �) =

∑Mm=1(1− am)�m

max{∑Mm=1 �m, 1}


If Only We Still Have Paul, the Oracle!


Sadly (or, Gladly), Revert to Being Statisticians!

▶ Obtain a BIG data (e.g., microarrays, Netflix):

X ∈ X

▶ Probabilistic Structure:

X ∼ P , P is a Probability Measure

▶ Marginal Components:

Xm = zm(X ) ∈ Xm and Xm ∼ Pm = Pz−1m

▶ Parameters of Interest:

�m = �m(Pm)

▶ Example:

�m = 1 ⇐⇒ Pm ∈ {N(�, �2) : � ≥ 0, �2 > 0}


Multiple Decision Functions

▶ Multiple Decision Function: � : X → A

▶ Components: � = (�1, �2, . . . , �M)

�m : X → {0, 1}

Remark: �m may use the whole data, not just Xm.

▶ D: space of multiple decision functions.

▶ ℳ0 = {m : �m = 0} and ℳ1 = {m : �m = 1}▶ Structure: {�m(X ) : m ∈ ℳ0} is an independent collection,

and is independent of {�m(X ) : m ∈ ℳ1}.▶ {�m(X ) : m ∈ ℳ1} need NOT be an independent collection.


Risk Functions: Averaged Losses

▶ Given a � ∈ D:

▶ Family-Wise Error Rate (FWER):

R0(�,P) = E [L0(�(X ), �(P))]

▶ False Discovery Rate (FDR):

R1(�,P) = E [L1(�(X ), �(P))]

▶ Missed Discovery Rate (MDR):

R2(�,P) = E [L2(�(X ), �(P))]

▶ Expectations are with respect to X ∼ P .

▶ Goal: Choose � ∈ D with small risks, whatever P is.


Special Case: A Pair of Choices (M = 1)

▶ � ∈ Θ = {0, 1}▶ a ∈ A = {0, 1}▶ L0(a, �) = L1(a, �) = aI (� = 0)

▶ L2(a, �) = (1− a)I (� = 1)

▶ X ∼ P with P ∈ {P0,P1}

▶ R0(�, �) = R1(�, �) = P0(�(X ) = 1)I (� = 0)

▶ R2(�, �) = [1− P1(�(X ) = 1)]I (� = 1)

▶ Assume P0 and P1 have respective densities:

f0(x) and f1(x)


Types I and II Errors, Power, and Optimality

▶ R0(�, �) : Type I error probability.

▶ R2(�, �) : Type II error probability.

▶ NoteR2(�, � = 1) = 1− �(�)

where�(�) = P1(�(X ) = 1) = POWER of �.

▶ Desired Goal: Given Type I level � ∈ [0, 1], find �∗(⋅;�) with

R0(�∗, �) ≤ �, for all �,

andR1(�

∗, �) ≤ R1(�, �), for all �,

for any other � with R1(�, �) ≤ �,∀�.


Neyman-Pearson MP Test �∗�

▶ Neyman and Pearson (1933) obtained the optimal [mostpowerful] decision function to be of form

�∗�(x) =

⎧

⎨

⎩

1 if f1(x) > c(�)f0(x) (x) if f1(x) = c(�)f0(x)0 if f1(x) < c(�)f0(x)

where c(�) and (�) satisfy

R0(�∗�, � = 0) = �.

▶ Remark: Depends on �, hence power depends on �.

▶ Leads to the notion of a decision process.


Concrete Example of a Decision Process

▶ Model: X = (X1,X2, . . . ,Xn)IID∼ N(�, �2).

▶ Problem: Test H0 : � ≤ �0 [� = 0] vs H1 : � > �0 [� = 1]

▶ Decision Function: t-test of size � given by

�(X ;�) = I

{√n(X − �0)

S≥ tn−1;�

}

▶ Decision function depends on the size index �.

▶ Decision Process:

Δ = (�(�) ≡ �(⋅;�) : � ∈ [0, 1])

▶ Size Condition:

sup{EP [�(X ;�)] : �(P) = 0} ≤ �


Multiple Decision Process

▶ Consider a multiple decision problem with M components.

▶ Multiple Decision Process:

Δ = (Δm : m ∈ ℳ = {1, 2, . . . ,M})

▶ Decision Process for mth Component:

Δm = (�m(�) : � ∈ [0, 1])

▶ Example: t-test decision process for each component.

▶ Usual Approach: Pick a �m from Δm using the same �.

▶ Common Choices for �: (weak) FWER Threshold of q use:

Bonferroni: � = q/M

Sidak: � = 1− (1− q)1/M


Notion of Size Functions

▶ A size function is a function

A : [0, 1] → [0, 1]

which is continuous, strictly increasing, A(0) = 0 andA(1) ≤ 1, and possibly differentiable.

▶ Bonferroni size function: A(�) = �/M

▶ Sidak size function: A(�) = 1− (1− �)1/M

▶ S: collection of possible size functions.

▶ Given a decision process Δ and a size function A, we choosethe decision function from Δ according to

�[A(�)].


Multiple Decision Size Function

▶ For a multiple decision problem with M components, amultiple decision size function is

A = (Am : m ∈ ℳ) with Am ∈ S.

▶ Condition:1−

∏

m∈ℳ

[1− Am(�)] ≤ �

▶ Given a Δ = (Δm : m ∈ ℳ) and an A = (Am : m ∈ ℳ),multiple decision function is chosen according to

�(�) = (�m[Am(�)] : m ∈ ℳ)

▶ Weak FWER of �(�):

R0(�(�),P) = 1−∏

[1− Am(�)] ≤ �


Neyman-Pearson Paradigm

▶ Control Type I error rate; minimize Type II error rate.

▶ Desired Type I error threshold: q ∈ (0, 1)

▶ Weak Control: For P with �m(P) = 0 for all m, want a � with

R0(�,P) ≤ q or R1(�,P) ≤ q.

▶ Strong Control: Whatever P is, want a � such that

R0(�,P) ≤ q or R1(�,P) ≤ q.

▶ And, if above Type I error control is achieved, we want tohave R2(�,P) small, if not optimal.


Towards Strong FWER Control

Given a MDP Δ = (Δm) and MDS A = (Am), for the chosen � at�, its FWER is

R0(�,P) = EP

{

I(

∑

�m[Am(�)][1 − �m(P)] > 0)}

= P

⎧

⎨

⎩

∑

ℳ0

�m[Am(�)] > 0

⎫

⎬

⎭

= 1−∏

ℳ0

[1− Am(�)]

= 1−∏

[1− Am(�)]1−�m(P)

Question: Given a threshold of q, what is the best �?


‘Best’ Choice of �

▶ Oracle Paul’s Choice:

�†(q;P) = inf{

� ∈ [0, 1] :∏

[1− Am(�)]1−�m(P) < 1− q

}

▶ But, P is unknown, hence �m(P) is also unknown.

▶ However, we could estimate �m(P) by

�m[Am(�)−].

▶ The Oracle’s choice is then estimated by

�†(q) = inf{

� ∈ [0, 1] :∏

[1− Am(�)]1−�m [Am(�)−] < 1− q

}


Strong FWER-Controlling MDF

▶ Chosen Multiple Decision Function:

�†(q) =(

�m[Am(�†(q))] : m ∈ ℳ

)

▶ Theorem (Pena, Habiger, Wu, 2011, Ann Stat)

Given a Δ = (Δm) and an A = (Am), the �†(q) defined above has

R0(�†(q),P) ≤ q,

whatever P is. Thus, it is an MDF achieving strong FWER control

at level q.


Towards FDR Control

▶ Given MDP Δ = (Δm) and MDS A = (Am), the MDF

�(�) = (�m[Am(�)] : m ∈ ℳ)

has FDR

R1(�(�),P) = EP

{∑

�m[Am(�)](1− �m(P))∑

�m[Am(�)]

}

▶ Observe:

EP

{

∑

�m[Am(�)](1− �m(P))}

≤∑

Am(�)


‘Best’ Choice of �

▶ Preceding considerations heuristically suggest the �:

�∗(q) = sup{

� ∈ [0, 1] :∑

Am(�) ≤ q∑

�m[Am(�)]}

▶ Chosen Multiple Decision Function:

�∗(q) = (�m[Am(�∗(q))] : m ∈ ℳ)

▶ Theorem (Pena, et al, 2011, Ann Stat)

Given a pair (Δ,A), the MDF �∗(q) achieves FDR control at level

q in that

R1(�∗(q),P) ≤ q,

whatever P is.


Classes of MDFs Controlling FWER and FDR

▶ A class of strong FWER-controlling MDFs at threshold q is:

D† =

{

�†(q;Δ,A) : Δ ∈ D,A ∈ S

}

▶ A class of FDR-controlling MDFs at threshold q is:

D∗ = {�∗(q;Δ,A) : Δ ∈ D,A ∈ S}

▶ Remark: Sidak’s sequential step-down strong FWERcontrolling MDF belongs to D

†.

▶ Remark: Benjamini-Hochberg’s step-up FDR controlling MDFbelongs to D

∗.

▶ Potential Utility: May choose best MDF in D† or D∗ wrt the

missed discovery rate.


Recalling BH FDR-Controlling MDF

▶ Benjamini-Hochberg (JRSS B, ’95) paper. Most well-knownFDR-controlling procedure.

▶ Let P1,P2, . . . ,PM be the ordinary P-values from the M tests.

▶ Let P(1) < P(2) < . . . < P(M) be the ordered P-values.

▶ For FDR-threshold equal to q, define

K = max

{

k ∈ {0, 1, 2, . . . ,M} : P(k) ≤qk

M

}

.

▶ BH MDF �BH(q) = (�BHm : m ∈ ℳ) has

�BHm (X ) = I{

Pm ≤ P(K)

}

, m ∈ ℳ.

▶ Simple and easy-to-implement, but is it the BEST?


Applying BH Procedure to a Two-Group Microarray Data

▶ Agilent Technology microarray data set from M. Pena’s lab.Jim Ryan of NOAA did the microarray analysis.

▶ M = 41267 genes.

▶ 2 groups, each group with 5 replicates.

▶ Applied t-test for each gene, using logged expression values.P-values obtained.

▶ Applied Benjamini-Hochberg Procedure with q = .15 to pickout the significant genes from the M = 41267 genes.

▶ Procedure picked out 2599 significant genes.

▶ Further analyzed the top (wrt to their p-values) 200 genesfrom these selected genes.

▶ Performed a cluster analysis on these 200 genes.


Histogram of the P-Values from the t-Tests

Histogram of data$P.CTFL

P−Values for CT versus FL

Freq

uenc

y

0.0 0.2 0.4 0.6 0.8 1.0

010

0020

0030

0040

0050

0060

00


Scatterplot of the Pairwise Gene Means

2 4 6 8 10 12

24

68

1012

Significant and Chosen Genes

logCTmean

logFL

mea

n

A GeneSigGeneChosen


Heatmap of the 200 Top Genes

CT1

CT2

CT3

CT4

CT5

FL1

FL2

FL3

FL4

FL5

Bcl11b0710005M24RikTrim66Slc15a3Trim66Ap3b2Sdk1Csf2Cpb1Cldn23Snap25Atp2b3SacyGrhl2BC0047286330404F12RikLgr6A130039I20RikE330032C10Add2Lrp4Lpin3BU531427Car9Gng2Rab15Btbd11Rab11fip1Tmcc3Cntnap4LpxnWdr18Grin3bRhcedTnfsf13bSlc38a1Gng2TC1019243Actn3LOC236749Dysf1700027L20RikChrna1PgfGdpd3Lgr6Add2G431002C21RikIvns1abpTC1087988Mtap1aKctd12bAdam12Slc1a3Ces14921504I05RikBbs1Tph2ArhgdigD630002G06Pdgfb8430408G22RikChst7Abca13Adam12Herc3Sfmbt2CsenCpxm12510009E07RikCav20610009O20RikNanos1D930005D10Rik6720457D02RikBC049806Eef1d4833417J20RikAcss2Ablim1AK028065Asb51700001L05RikAK036268C730048C13RikPrkcmVax2SetmarWnt6Ablim1AHFAI646023Tcrb−V13Tcrb−V13Tcrb−V13Parc7530403E16RikMtap6Mtap6Anxa85730410E15RikRgs17PtprgMafKcnab2Llglh2C130065N10RikHist1h2bcSpnb2Zfp87Dusp181300007F04RikTC1086095NAP046356−12010007H06RikPtgdsFgd3Plagl1CalcrlSpata9Fabp3Rab3il1Prss16Esm1D17H6S56E−5Prrx2Trim47PmvkSqleDekCrip2PmvkHmgcs1Plac9C230078M08RikOmdAcss2NAP034365−1BambiS100a150610031P221110020G09RikMgst35430405N12RikD14Ertd449eHsd17b71700025G04RikFnbp16230416A05RikAA986860Slc9a3r1CryabMtap6Kif1bD930005D10RikPlekha6Insig1Nsd1Anxa6AfpMdm2Lrp2bp1190002H23RikIfi27Ifi27Ifi271810015M01RikRdh11LssTbcdCrip1Fdft1HrblKhkSt3gal5Ninj1Foxc12010111I01RikIqgap1NsdhlCyp51MvkPcdh18Stard4D5Bwg0834eOact19030611O19RikNM_029423Unc5aSnta1Acss2Efnb21810020D17Rik1110020G09RikHabp4E2f2Mgst35033430I15RikD5Bwg0834eB230205M03


Pictorial Depiction of Gene Clusters of Top 200 Genes

2 4 6 8 10

24

68

10Clusters in CT vs FL Space

Means of Log CT

Mean

s of L

og FL

1234567891011121314151617181920


Can We Obtain a Better MDF than BH?

▶ IDEA: Given MDP Δ = (Δm : m ∈ ℳ), we find the optimalMDS A∗ ≡ A∗(Δ) ∈ S achieving smallest MDR

R2[(Δ ∘ A)(�),P1] =1

M

∑

{1− �m [Am(�)]} .

▶ �m(�) = POWER of �m(�)

▶ FWER-controlling MDF:

�†(q) = �†(q;Δ,A∗(Δ))

▶ FDR-controlling MDF:

�∗(q) = �∗(q;Δ,A∗(Δ))

▶ Use the best MDP Δ, e.g., MPs; UMPs; UMPUs; UMPIs.


Role of Power or ROC Functions

▶ P-value based procedures ignore differences in powers.

▶ Neyman and Pearson: power germane in search for optimality.

▶ Power of mth Test: �m(�) = EPm1{�m(X ;�)}

▶ ROC Function for mth Decision Process Δm:

� 7→ �m(�)

▶ ROC functions in the missed discovery rate.

▶ Enables exploiting differences in the ROC functions.

▶ Why Power or ROC Differences? Different effect sizes,decision processes, or dispersion parameters.

▶ EXCHANGEABILITY: EXCEPTION rather than RULE!


Case with Simple Nulls and Simple Alternatives

▶ Neyman-Pearson Most Powerful Decision Process for each m.

▶ ROC Functions:� 7→ �m(�)

▶ ROC functions are concave, continuous, and increasing.

▶ Assume that they are also twice-differentiable.

Theorem (Pena, et al, 2011, Ann Stat)

Multiple decision size function (� 7→ Am(�) : m ∈ ℳ) is optimal

if it satisfies the M + 1 equilibrium conditions

∀m ∈ ℳ : �′m(Am)(1− Am) = � for some � ∈ ℜ;

∑

ℳ

log(1− Am) = log(1− �).


Example: Optimal Multiple Decision Size Function

▶ M = 2000

▶ For each m: Xm ∼ N(�m, � = 1)

▶ Multiple Decision Problem: To test

Hm0 : �m = 0 versus Hm1 : �m = m.

▶ Effect Sizes: mIID∼ ∣N(0, 3)∣

▶ For each m, Neyman-Pearson MP decision process.

Δm = (�m(�) : � ∈ [0, 1])

�m(xm;�) = I{xm ≥ Φ−1(1− �)}▶ Power or ROC Function for the mth NP MP Decision Process:

� 7→ �m(�) = 1− Φ[

Φ−1(1 − �)− m]


Optimal Test Sizes vs Effect Sizes

Effect Size

De

nsi

ty H

isto

gra

m

0 2 4 6 8 10 12 14

0.0

00

.10

0.2

0

0 2 4 6 8 10 12 14

0e

+0

04

e−

05

8e

−0

5

Effect Size

Op

tima

l Te

st S

ize

0e+00 2e−05 4e−05 6e−05 8e−05

0.0

0.2

0.4

0.6

0.8

1.0

Optimal Test Size

Op

tima

l Te

st P

ow

er

0 2 4 6 8 10 12 14

0.0

0.2

0.4

0.6

0.8

1.0

Effect Size

Po

we

r (B

lue

=O

ptim

al;

Re

d=

Sid

ak)


Economic Aspect: A Size-Investing Strategy

▶ Do not invest your size on those where you will not makediscoveries (small power) or those that you will certainly makediscoveries (high power)!

▶ Rather, concentrate on those where it is a bit uncertain, sinceyour differential gain in overall discovery rate would begreater!

▶ Some Wicked Consequences▶ Departmental Merit Systems.▶ Graduate Student Advising.


BH MDF versus �∗(q): q∗ = .1; M = 20; 1000 Reps

� p �∗F -FDR �∗F -MDR∗ �BH -FDR �BH -MDR∗

1 0.1 8.03 70.80 8.43 72.641 0.2 7.55 79.64 8.77 81.991 0.4 6.05 77.47 6.65 80.30

2 0.1 7.70 54.42 8.43 55.802 0.2 7.39 56.32 7.59 57.312 0.4 6.47 47.82 6.21 49.38

4 0.1 9.14 8.62 9.48 10.304 0.2 7.80 7.34 6.97 9.204 0.4 6.15 3.58 5.65 5.53


BH MDF versus �∗(q): q∗ = .1; M = 100; 1000 Reps

� p �∗F -FDR �∗F -MDR∗ �BH -FDR �BH -MDR∗

1 0.1 9.14 87.10 9.02 90.021 0.2 8.21 84.05 8.78 87.381 0.4 5.92 80.12 5.88 83.73

2 0.1 9.79 66.10 9.24 67.932 0.2 7.68 58.25 7.94 59.932 0.4 5.74 49.29 6.10 50.90

4 0.1 8.37 10.44 8.62 12.364 0.2 7.72 5.93 7.81 8.224 0.4 5.69 3.80 6.14 5.72


Potential Applications and Concluding Remarks

▶ Microarray data analysis: which genes are important?

▶ Systems analysis (Biological Pathways?): which components(subsystems of genes) are relevant?

▶ Variable selection: which predictor variables are important?

▶ For each gene, component, or predictor variable, apply adecision function to decide whether, say, independence ordependence holds with respect to the response variable.

▶ Test for Independence: Kendall’s procedure, for example.

▶ Use MDFs �†(q) or �∗(q).

▶ Issues of determining effect sizes to determine power or ROCfunctions still need further studies.

▶ Comparison with other methods, such as those usingregularization?


Acknowledgements

▶ Wensong Wu, my PhD student at USC Stat; Asst Prof at FIU.

▶ Josh Habiger, my PhD student at USC Stat; Asst Prof atOklahoma State.

▶ Marge Pena (Biology), Yu Zhang (Biology), and James Ryan(NOAA).

▶ NSF and NIH for research support.


statistical multiple decision makingpeople.stat.sc.edu/pena/talkspresented/penamusttalk2012.pdf ·...

Documents