Statistical Multiple Decision Making
Edsel A. Pena, University of South Carolina, Columbia, South Carolina (E-Mail: [email protected])
MUST Colloquium Talk, March 2012
Outline
▶ Some Motivating Problems.
▶ Multiple Decision Problems.
▶ Mathematical Framework (Decision Functions, Losses, Risks).
▶ Special Case: Optimal Choice Between Two Actions.
▶ Multiple Decision Processes.
▶ Multiple Decision Size Function.
▶ Class of FWER-Controlling MDFs.
▶ Class of FDR-Controlling MDFs.
▶ An Application to a Microarray Data Set.
▶ Towards Optimal MDFs.
▶ Applicability and Some Comparisons.
Some Motivating Questions and Areas of Relevance
▶ Microarray data analysis: Which genes are relevant?
▶ Variable selection: Which of many predictors are relevant?
▶ Survival analysis: Which predictors affect a lifetime variable?
▶ Reliability/Engineering: Which components in a system are relevant?
▶ Epidemiology: Spread of a disease in a geographical area.
▶ Oil (mineral) exploration: Where to dig?
▶ Business: Locations of business ventures.
A Microarray Data: HeatMap of Gene Expression Levels
First 100 genes out of 41267 genes in a colon cancer study at USC (M. Pena’s Lab). Three groups (Control; 9 Days; 2 Weeks) with 6 replicates each.
[Figure: HeatMap of First 100 Genes. Columns: samples Co1-Co6, 9D1-9D6, MTS1-MTS6; rows: gene names.]
A Typical Variable Selection Problem
▶ Model:
$$Y = \beta_0 + \sum_{j=1}^{M} \beta_j X_j + \epsilon$$
▶ $M$ is large, but many $\beta_j$s are equal to zero.
▶ Observed Data: For $j = 1, 2, \ldots, n$,
$$(Z_j, \delta_j, X_{1j}, X_{2j}, \ldots, X_{Mj})$$
with $Z_j = \min(Y_j, C_j)$ and $\delta_j = I\{Y_j \le C_j\}$.
▶ Goal: To select the relevant predictor variables.
A Reliability (or Biological Pathways) Problem
▶ System is composed of components.
▶ Structure function, $\phi$, relates components to system: series, parallel, series-parallel, etc.
▶ $M$ potential components that could constitute a system. We do not know which components are relevant nor do we know the structure function.
▶ Question: Given data regarding the states or lifetimes of the system and components, how could we determine which components are relevant for this system?
▶ Component lifetimes may be censored by system lifetime.
▶ Highly nonlinear types of relationships.
The General Decision Problem
▶ We would like to discover the value of a parameter
$$\theta = (\theta_1, \theta_2, \ldots, \theta_M) \in \Theta = \{0, 1\}^M$$
▶ $\theta_m = 1$ means the $m$th component is relevant; $\theta_m = 0$ means the $m$th component is not relevant.
▶ Want to choose an action
$$a = (a_1, a_2, \ldots, a_M) \in \mathcal{A} = \{0, 1\}^M$$
▶ $a_m = 1$ means we declare that $\theta_m = 1$, called a discovery; $a_m = 0$ means we declare that $\theta_m = 0$, a non-discovery.
Assessing our Actions: Losses
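▶ Family-wise error indicator:
$$L_0(a, \theta) = I\left\{ \sum_{m=1}^{M} a_m (1 - \theta_m) > 0 \right\}$$
▶ False Discovery Proportion:
$$L_1(a, \theta) = \frac{\sum_{m=1}^{M} a_m (1 - \theta_m)}{\max\left\{\sum_{m=1}^{M} a_m, 1\right\}}$$
▶ Missed Discovery Proportion:
$$L_2(a, \theta) = \frac{\sum_{m=1}^{M} (1 - a_m)\, \theta_m}{\max\left\{\sum_{m=1}^{M} \theta_m, 1\right\}}$$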
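The three losses can be computed directly from an action vector and a parameter vector. The following is a minimal Python sketch; the function name `losses` and the example vectors are illustrative assumptions, not part of the talk.

```python
def losses(a, theta):
    """Return (L0, L1, L2) for a 0/1 action vector a and a 0/1 parameter vector theta.

    L0: family-wise error indicator, L1: false discovery proportion,
    L2: missed discovery proportion, as defined on the slide.
    """
    if len(a) != len(theta):
        raise ValueError("a and theta must have the same length M")
    # Number of discoveries made where the component is actually not relevant.
    false_discoveries = sum(am * (1 - tm) for am, tm in zip(a, theta))
    # Number of relevant components that were not declared.
    missed_discoveries = sum((1 - am) * tm for am, tm in zip(a, theta))
    l0 = 1 if false_discoveries > 0 else 0
    l1 = false_discoveries / max(sum(a), 1)
    l2 = missed_discoveries / max(sum(theta), 1)
    return l0, l1, l2


if __name__ == "__main__":
    theta = [1, 1, 0, 0, 0]   # hypothetical truth: first two components relevant
    a = [1, 0, 1, 0, 0]       # hypothetical action: one true and one false discovery
    print(losses(a, theta))
```

The `max{., 1}` in the denominators keeps the proportions defined (equal to 0) when there are no discoveries or no relevant components.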
If Only We Still Have Paul, the Oracle!
Sadly (or, Gladly), Revert to Being Statisticians!
▶ Obtain BIG data (e.g., microarrays, Netflix):
$$X \in \mathcal{X}$$
▶ Probabilistic Structure:
$$X \sim P, \quad P \text{ is a probability measure}$$
▶ Marginal Components:
$$X_m = z_m(X) \in \mathcal{X}_m \quad \text{and} \quad X_m \sim P_m = P z_m^{-1}$$
▶ Parameters of Interest:
$$\theta_m = \theta_m(P_m)$$
▶ Example:
$$\theta_m = 1 \iff P_m \in \{N(\mu, \sigma^2) : \mu \ge 0, \sigma^2 > 0\}$$
Multiple Decision Functions
▶ Multiple Decision Function: $\delta : \mathcal{X} \to \mathcal{A}$
▶ Components: $\delta = (\delta_1, \delta_2, \ldots, \delta_M)$
$$\delta_m : \mathcal{X} \to \{0, 1\}$$
Remark: $\delta_m$ may use the whole data, not just $X_m$.
▶ $\mathcal{D}$: space of multiple decision functions.
▶ $\mathcal{M}_0 = \{m : \theta_m = 0\}$ and $\mathcal{M}_1 = \{m : \theta_m = 1\}$
▶ Structure: $\{\delta_m(X) : m \in \mathcal{M}_0\}$ is an independent collection, and is independent of $\{\delta_m(X) : m \in \mathcal{M}_1\}$.
▶ $\{\delta_m(X) : m \in \mathcal{M}_1\}$ need NOT be an independent collection.
Risk Functions: Averaged Losses
▶ Given a $\delta \in \mathcal{D}$:
▶ Family-Wise Error Rate (FWER):
$$R_0(\delta, P) = E[L_0(\delta(X), \theta(P))]$$
▶ False Discovery Rate (FDR):
$$R_1(\delta, P) = E[L_1(\delta(X), \theta(P))]$$
▶ Missed Discovery Rate (MDR):
$$R_2(\delta, P) = E[L_2(\delta(X), \theta(P))]$$
▶ Expectations are with respect to $X \sim P$.
▶ Goal: Choose $\delta \in \mathcal{D}$ with small risks, whatever $P$ is.
Special Case: A Pair of Choices (M = 1)
▶ $\theta \in \Theta = \{0, 1\}$
▶ $a \in \mathcal{A} = \{0, 1\}$
▶ $L_0(a, \theta) = L_1(a, \theta) = a\, I(\theta = 0)$
▶ $L_2(a, \theta) = (1 - a)\, I(\theta = 1)$
▶ $X \sim P$ with $P \in \{P_0, P_1\}$
▶ $R_0(\delta, \theta) = R_1(\delta, \theta) = P_0(\delta(X) = 1)\, I(\theta = 0)$
▶ $R_2(\delta, \theta) = [1 - P_1(\delta(X) = 1)]\, I(\theta = 1)$
▶ Assume $P_0$ and $P_1$ have respective densities $f_0(x)$ and $f_1(x)$.
Types I and II Errors, Power, and Optimality
▶ $R_0(\delta, \theta)$: Type I error probability.
▶ $R_2(\delta, \theta)$: Type II error probability.
▶ Note
$$R_2(\delta, \theta = 1) = 1 - \pi(\delta)$$
where $\pi(\delta) = P_1(\delta(X) = 1)$ = POWER of $\delta$.
▶ Desired Goal: Given Type I level $\alpha \in [0, 1]$, find $\delta^*(\cdot; \alpha)$ with
$$R_0(\delta^*, \theta) \le \alpha, \text{ for all } \theta,$$
and
$$R_2(\delta^*, \theta) \le R_2(\delta, \theta), \text{ for all } \theta,$$
for any other $\delta$ with $R_0(\delta, \theta) \le \alpha$, $\forall \theta$.
Neyman-Pearson MP Test $\delta^*_\alpha$
▶ Neyman and Pearson (1933) obtained the optimal [most powerful] decision function to be of form
$$\delta^*_\alpha(x) = \begin{cases} 1 & \text{if } f_1(x) > c(\alpha) f_0(x) \\ \gamma(x) & \text{if } f_1(x) = c(\alpha) f_0(x) \\ 0 & \text{if } f_1(x) < c(\alpha) f_0(x) \end{cases}$$
where $c(\alpha)$ and $\gamma(\alpha)$ satisfy
$$R_0(\delta^*_\alpha, \theta = 0) = \alpha.$$
▶ Remark: Depends on $\alpha$, hence power depends on $\alpha$.
▶ Leads to the notion of a decision process.
Concrete Example of a Decision Process
▶ Model: $X = (X_1, X_2, \ldots, X_n) \overset{IID}{\sim} N(\mu, \sigma^2)$.
▶ Problem: Test $H_0 : \mu \le \mu_0$ [$\theta = 0$] vs $H_1 : \mu > \mu_0$ [$\theta = 1$]
▶ Decision Function: t-test of size $\alpha$ given by
$$\delta(X; \alpha) = I\left\{ \frac{\sqrt{n}(\bar{X} - \mu_0)}{S} \ge t_{n-1;\alpha} \right\}$$
▶ Decision function depends on the size index $\alpha$.
▶ Decision Process:
$$\Delta = (\delta(\alpha) \equiv \delta(\cdot; \alpha) : \alpha \in [0, 1])$$
▶ Size Condition:
$$\sup\{E_P[\delta(X; \alpha)] : \theta(P) = 0\} \le \alpha$$
Multiple Decision Process
▶ Consider a multiple decision problem with M components.
▶ Multiple Decision Process:
$$\Delta = (\Delta_m : m \in \mathcal{M} = \{1, 2, \ldots, M\})$$
▶ Decision Process for $m$th Component:
$$\Delta_m = (\delta_m(\alpha) : \alpha \in [0, 1])$$
▶ Example: t-test decision process for each component.
▶ Usual Approach: Pick a $\delta_m$ from $\Delta_m$ using the same $\alpha$.
▶ Common Choices for $\alpha$: for a (weak) FWER threshold of $q$, use:
Bonferroni: $\alpha = q/M$
Sidak: $\alpha = 1 - (1 - q)^{1/M}$
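The two common choices can be compared numerically. This is a minimal sketch: the function names are illustrative, and `weak_fwer` assumes all M nulls are true and the M tests are independent and each have exact size alpha.

```python
def bonferroni_size(q, M):
    """Per-test size under Bonferroni for a weak FWER threshold q."""
    return q / M


def sidak_size(q, M):
    """Per-test size under Sidak for a weak FWER threshold q."""
    return 1.0 - (1.0 - q) ** (1.0 / M)


def weak_fwer(alpha, M):
    """FWER when all M nulls hold and the M size-alpha tests are independent."""
    return 1.0 - (1.0 - alpha) ** M


if __name__ == "__main__":
    q, M = 0.05, 100  # hypothetical threshold and number of components
    for name, size in (("Bonferroni", bonferroni_size(q, M)),
                       ("Sidak", sidak_size(q, M))):
        print(name, size, weak_fwer(size, M))
```

Under these assumptions Sidak spends the threshold q exactly, while Bonferroni is slightly conservative.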
Notion of Size Functions
▶ A size function is a function
$$A : [0, 1] \to [0, 1]$$
which is continuous, strictly increasing, with $A(0) = 0$ and $A(1) \le 1$, and possibly differentiable.
▶ Bonferroni size function: $A(\alpha) = \alpha/M$
▶ Sidak size function: $A(\alpha) = 1 - (1 - \alpha)^{1/M}$
▶ $\mathcal{S}$: collection of possible size functions.
▶ Given a decision process $\Delta$ and a size function $A$, we choose the decision function from $\Delta$ according to
$$\delta[A(\alpha)].$$
Multiple Decision Size Function
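▶ For a multiple decision problem with $M$ components, a multiple decision size function is
$$A = (A_m : m \in \mathcal{M}) \text{ with } A_m \in \mathcal{S}.$$
▶ Condition:
$$1 - \prod_{m \in \mathcal{M}} [1 - A_m(\alpha)] \le \alpha$$
▶ Given a $\Delta = (\Delta_m : m \in \mathcal{M})$ and an $A = (A_m : m \in \mathcal{M})$, the multiple decision function is chosen according to
$$\delta(\alpha) = (\delta_m[A_m(\alpha)] : m \in \mathcal{M})$$
▶ Weak FWER of $\delta(\alpha)$:
$$R_0(\delta(\alpha), P) = 1 - \prod_{m \in \mathcal{M}} [1 - A_m(\alpha)] \le \alpha$$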
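As a quick check, both size functions from the previous slide satisfy this condition when every component uses the same size function, that is, $A_m = A$ for all $m$. The worked computation below is an added illustration; the Bernoulli-inequality step $(1 - x)^M \ge 1 - Mx$ for $x \in [0, 1]$ is not stated in the slides.

```latex
\begin{align*}
\text{Sidak, } A(\alpha) = 1 - (1-\alpha)^{1/M}: \quad
  & 1 - \prod_{m=1}^{M} \left[ 1 - A(\alpha) \right]
    = 1 - \left[ (1-\alpha)^{1/M} \right]^{M} = \alpha, \\
\text{Bonferroni, } A(\alpha) = \alpha/M: \quad
  & 1 - \left( 1 - \frac{\alpha}{M} \right)^{M}
    \le 1 - \left( 1 - M \cdot \frac{\alpha}{M} \right) = \alpha .
\end{align*}
```

So Sidak meets the condition with equality, while Bonferroni leaves some of the size budget unused.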
Neyman-Pearson Paradigm
▶ Control Type I error rate; minimize Type II error rate.
▶ Desired Type I error threshold: $q \in (0, 1)$
▶ Weak Control: For $P$ with $\theta_m(P) = 0$ for all $m$, want a $\delta$ with
$$R_0(\delta, P) \le q \quad \text{or} \quad R_1(\delta, P) \le q.$$
▶ Strong Control: Whatever $P$ is, want a $\delta$ such that
$$R_0(\delta, P) \le q \quad \text{or} \quad R_1(\delta, P) \le q.$$
▶ And, if above Type I error control is achieved, we want to have $R_2(\delta, P)$ small, if not optimal.
Towards Strong FWER Control
Given an MDP $\Delta = (\Delta_m)$ and MDS $A = (A_m)$, for the chosen $\delta$ at $\alpha$, its FWER is
$$\begin{aligned}
R_0(\delta, P) &= E_P\left\{ I\left( \sum_{m \in \mathcal{M}} \delta_m[A_m(\alpha)][1 - \theta_m(P)] > 0 \right) \right\} \\
&= P\left\{ \sum_{m \in \mathcal{M}_0} \delta_m[A_m(\alpha)] > 0 \right\} \\
&= 1 - \prod_{m \in \mathcal{M}_0} [1 - A_m(\alpha)] \\
&= 1 - \prod_{m \in \mathcal{M}} [1 - A_m(\alpha)]^{1 - \theta_m(P)}
\end{aligned}$$
Question: Given a threshold of $q$, what is the best $\alpha$?
‘Best’ Choice of $\alpha$
▶ Oracle Paul’s Choice:
$$\alpha^\dagger(q; P) = \inf\left\{ \alpha \in [0, 1] : \prod_{m \in \mathcal{M}} [1 - A_m(\alpha)]^{1 - \theta_m(P)} < 1 - q \right\}$$
▶ But, $P$ is unknown, hence $\theta_m(P)$ is also unknown.
▶ However, we could estimate $\theta_m(P)$ by
$$\delta_m[A_m(\alpha)-].$$
▶ The Oracle’s choice is then estimated by
$$\alpha^\dagger(q) = \inf\left\{ \alpha \in [0, 1] : \prod_{m \in \mathcal{M}} [1 - A_m(\alpha)]^{1 - \delta_m[A_m(\alpha)-]} < 1 - q \right\}$$
Strong FWER-Controlling MDF
▶ Chosen Multiple Decision Function:
$$\delta^\dagger(q) = \left( \delta_m[A_m(\alpha^\dagger(q))] : m \in \mathcal{M} \right)$$
▶ Theorem (Pena, Habiger, Wu, 2011, Ann Stat)
Given a $\Delta = (\Delta_m)$ and an $A = (A_m)$, the $\delta^\dagger(q)$ defined above has
$$R_0(\delta^\dagger(q), P) \le q,$$
whatever $P$ is. Thus, it is an MDF achieving strong FWER control at level $q$.
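A later slide remarks that Sidak's sequential step-down procedure is one member of this class. As an illustration, here is a minimal Python sketch of that classical step-down procedure. It assumes each decision process is p-value based (reject when the p-value is at most the size) and uses the Sidak size function for every component. It implements the step-down rule directly rather than evaluating the infimum above, and the function name is hypothetical.

```python
def sidak_step_down(pvalues, q):
    """Classical Sidak step-down procedure at FWER threshold q.

    Returns a 0/1 list of decisions in the original order of pvalues
    (1 = discovery). Assumes p-value based tests; this is a sketch of
    the classical procedure, not the general construction in the talk.
    """
    M = len(pvalues)
    # Indices of the components, from smallest to largest p-value.
    order = sorted(range(M), key=lambda m: pvalues[m])
    decisions = [0] * M
    for rank, m in enumerate(order):
        remaining = M - rank  # components not yet declared discoveries
        threshold = 1.0 - (1.0 - q) ** (1.0 / remaining)
        if pvalues[m] > threshold:
            break  # stop at the first p-value that fails its threshold
        decisions[m] = 1
    return decisions


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    print(sidak_step_down([0.001, 0.5, 0.012, 0.03, 0.9], q=0.05))
```

Each discovery shrinks the set of components still treated as nulls, so the per-test threshold grows from $1 - (1-q)^{1/M}$ towards $q$ as discoveries accumulate.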
Towards FDR Control
▶ Given MDP $\Delta = (\Delta_m)$ and MDS $A = (A_m)$, the MDF
$$\delta(\alpha) = (\delta_m[A_m(\alpha)] : m \in \mathcal{M})$$
has FDR
$$R_1(\delta(\alpha), P) = E_P\left\{ \frac{\sum_{m \in \mathcal{M}} \delta_m[A_m(\alpha)](1 - \theta_m(P))}{\sum_{m \in \mathcal{M}} \delta_m[A_m(\alpha)]} \right\}$$
▶ Observe:
$$E_P\left\{ \sum_{m \in \mathcal{M}} \delta_m[A_m(\alpha)](1 - \theta_m(P)) \right\} \le \sum_{m \in \mathcal{M}} A_m(\alpha)$$
‘Best’ Choice of $\alpha$
▶ Preceding considerations heuristically suggest the $\alpha$:
$$\alpha^*(q) = \sup\left\{ \alpha \in [0, 1] : \sum_{m \in \mathcal{M}} A_m(\alpha) \le q \sum_{m \in \mathcal{M}} \delta_m[A_m(\alpha)] \right\}$$
▶ Chosen Multiple Decision Function:
$$\delta^*(q) = (\delta_m[A_m(\alpha^*(q))] : m \in \mathcal{M})$$
▶ Theorem (Pena, et al, 2011, Ann Stat)
Given a pair $(\Delta, A)$, the MDF $\delta^*(q)$ achieves FDR control at level $q$ in that
$$R_1(\delta^*(q), P) \le q,$$
whatever $P$ is.
Classes of MDFs Controlling FWER and FDR
▶ A class of strong FWER-controlling MDFs at threshold $q$ is:
$$\mathcal{D}^\dagger = \left\{ \delta^\dagger(q; \Delta, A) : \Delta \in \mathcal{D}, A \in \mathcal{S} \right\}$$
▶ A class of FDR-controlling MDFs at threshold $q$ is:
$$\mathcal{D}^* = \left\{ \delta^*(q; \Delta, A) : \Delta \in \mathcal{D}, A \in \mathcal{S} \right\}$$
▶ Remark: Sidak’s sequential step-down strong FWER-controlling MDF belongs to $\mathcal{D}^\dagger$.
▶ Remark: Benjamini-Hochberg’s step-up FDR-controlling MDF belongs to $\mathcal{D}^*$.
▶ Potential Utility: May choose best MDF in $\mathcal{D}^\dagger$ or $\mathcal{D}^*$ wrt the missed discovery rate.
Recalling BH FDR-Controlling MDF
▶ Benjamini-Hochberg (JRSS B, ’95) paper. Most well-known FDR-controlling procedure.
▶ Let $P_1, P_2, \ldots, P_M$ be the ordinary P-values from the $M$ tests.
▶ Let $P_{(1)} < P_{(2)} < \ldots < P_{(M)}$ be the ordered P-values.
▶ For FDR-threshold equal to $q$, define
$$K = \max\left\{ k \in \{0, 1, 2, \ldots, M\} : P_{(k)} \le \frac{qk}{M} \right\}.$$
▶ BH MDF $\delta^{BH}(q) = (\delta^{BH}_m : m \in \mathcal{M})$ has
$$\delta^{BH}_m(X) = I\left\{ P_m \le P_{(K)} \right\}, \quad m \in \mathcal{M}.$$
▶ Simple and easy-to-implement, but is it the BEST?
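The BH step-up rule above translates almost line by line into code. The sketch below is a plain-Python illustration: the function name and the example p-values are assumptions, and $K = 0$ is treated as "no discoveries".

```python
def benjamini_hochberg(pvalues, q):
    """Benjamini-Hochberg step-up procedure at FDR threshold q.

    Returns a 0/1 list of decisions in the original order of pvalues
    (1 = discovery).
    """
    M = len(pvalues)
    sorted_p = sorted(pvalues)
    # K = largest k in {1, ..., M} with P_(k) <= q * k / M; K = 0 if none.
    K = 0
    for k in range(1, M + 1):
        if sorted_p[k - 1] <= q * k / M:
            K = k
    if K == 0:
        return [0] * M
    cutoff = sorted_p[K - 1]  # P_(K)
    return [1 if p <= cutoff else 0 for p in pvalues]


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    print(benjamini_hochberg([0.04, 0.05, 0.9], q=0.1))
```

Because the procedure is step-up, a p-value that misses its own threshold can still be declared a discovery when a larger p-value meets a later threshold, as in the usage example.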
Applying BH Procedure to a Two-Group Microarray Data
▶ Agilent Technology microarray data set from M. Pena’s lab. Jim Ryan of NOAA did the microarray analysis.
▶ $M = 41267$ genes.
▶ 2 groups, each group with 5 replicates.
▶ Applied t-test for each gene, using logged expression values. P-values obtained.
▶ Applied Benjamini-Hochberg Procedure with $q = .15$ to pick out the significant genes from the $M = 41267$ genes.
▶ Procedure picked out 2599 significant genes.
▶ Further analyzed the top 200 genes (with respect to their p-values) from these selected genes.
▶ Performed a cluster analysis on these 200 genes.
Histogram of the P-Values from the t-Tests
[Figure: Histogram of data$P.CTFL. x-axis: P-Values for CT versus FL (0.0 to 1.0); y-axis: Frequency (0 to 6000).]
Scatterplot of the Pairwise Gene Means
[Figure: Significant and Chosen Genes. x-axis: logCTmean (2 to 12); y-axis: logFLmean (2 to 12); legend: A Gene, SigGene, Chosen.]
Heatmap of the 200 Top Genes
[Figure: heatmap of the 200 top genes. Columns: samples CT1-CT5, FL1-FL5; rows: gene names.]
Pictorial Depiction of Gene Clusters of Top 200 Genes
[Figure: Clusters in CT vs FL Space. x-axis: Means of Log CT (2 to 10); y-axis: Means of Log FL (2 to 10); legend: clusters 1-20.]
Can We Obtain a Better MDF than BH?
▶ IDEA: Given MDP $\Delta = (\Delta_m : m \in \mathcal{M})$, we find the optimal MDS $A^* \equiv A^*(\Delta) \in \mathcal{S}$ achieving smallest MDR
$$R_2[(\Delta \circ A)(\alpha), P_1] = \frac{1}{M} \sum_{m \in \mathcal{M}} \left\{ 1 - \pi_m[A_m(\alpha)] \right\}.$$
▶ $\pi_m(\alpha)$ = POWER of $\delta_m(\alpha)$
▶ FWER-controlling MDF:
$$\delta^\dagger(q) = \delta^\dagger(q; \Delta, A^*(\Delta))$$
▶ FDR-controlling MDF:
$$\delta^*(q) = \delta^*(q; \Delta, A^*(\Delta))$$
▶ Use the best MDP $\Delta$, e.g., MPs; UMPs; UMPUs; UMPIs.
Role of Power or ROC Functions
▶ P-value based procedures ignore differences in powers.
▶ Neyman and Pearson: power germane in search for optimality.
▶ Power of $m$th Test: $\pi_m(\alpha) = E_{P_{m1}}\{\delta_m(X; \alpha)\}$
▶ ROC Function for $m$th Decision Process $\Delta_m$:
$$\alpha \mapsto \pi_m(\alpha)$$
▶ ROC functions enter the missed discovery rate.
▶ Enables exploiting differences in the ROC functions.
▶ Why Power or ROC Differences? Different effect sizes, decision processes, or dispersion parameters.
▶ EXCHANGEABILITY: EXCEPTION rather than RULE!
Case with Simple Nulls and Simple Alternatives
▶ Neyman-Pearson Most Powerful Decision Process for each m.
▶ ROC Functions: $\alpha \mapsto \pi_m(\alpha)$
▶ ROC functions are concave, continuous, and increasing.
▶ Assume that they are also twice-differentiable.
Theorem (Pena, et al, 2011, Ann Stat)
Multiple decision size function $(\alpha \mapsto A_m(\alpha) : m \in \mathcal{M})$ is optimal if it satisfies the $M + 1$ equilibrium conditions
$$\forall m \in \mathcal{M} : \pi_m'(A_m)(1 - A_m) = \lambda \text{ for some } \lambda \in \mathbb{R};$$
$$\sum_{m \in \mathcal{M}} \log(1 - A_m) = \log(1 - \alpha).$$
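One way to see where these conditions could come from is a Lagrange-multiplier argument. The sketch below writes $\pi_m$ for the ROC function of the $m$th decision process and assumes the size budget is used in full (equality in the FWER condition). It is a heuristic reading of the theorem, not the proof from the cited paper. Minimizing the MDR $\frac{1}{M}\sum_m \{1 - \pi_m(A_m)\}$ is the same as maximizing total power:

```latex
\begin{align*}
&\text{maximize } \sum_{m \in \mathcal{M}} \pi_m(A_m)
  \quad \text{subject to} \quad
  \sum_{m \in \mathcal{M}} \log(1 - A_m) = \log(1 - \alpha), \\
&\mathcal{L}(A, \lambda) = \sum_{m \in \mathcal{M}} \pi_m(A_m)
  + \lambda \Big[ \sum_{m \in \mathcal{M}} \log(1 - A_m) - \log(1 - \alpha) \Big], \\
&\frac{\partial \mathcal{L}}{\partial A_m}
  = \pi_m'(A_m) - \frac{\lambda}{1 - A_m} = 0
  \;\Longrightarrow\; \pi_m'(A_m)\,(1 - A_m) = \lambda .
\end{align*}
```

Concavity of the ROC functions is what makes a stationary point of this kind a natural candidate for the optimum.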
Example: Optimal Multiple Decision Size Function
▶ M = 2000
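▶ For each $m$: $X_m \sim N(\mu_m, \sigma = 1)$
▶ Multiple Decision Problem: To test
$$H_{m0} : \mu_m = 0 \quad \text{versus} \quad H_{m1} : \mu_m = \gamma_m.$$
▶ Effect Sizes: $\gamma_m \overset{IID}{\sim} |N(0, 3)|$
▶ For each $m$, Neyman-Pearson MP decision process:
$$\Delta_m = (\delta_m(\alpha) : \alpha \in [0, 1])$$
$$\delta_m(x_m; \alpha) = I\{x_m \ge \Phi^{-1}(1 - \alpha)\}$$
▶ Power or ROC Function for the $m$th NP MP Decision Process:
$$\alpha \mapsto \pi_m(\alpha) = 1 - \Phi\left[ \Phi^{-1}(1 - \alpha) - \gamma_m \right]$$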
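For this Gaussian example the equilibrium conditions from the preceding theorem can be solved numerically. The sketch below is one possible approach using only the Python standard library, not the authors' implementation. It writes `gamma` for the effect size and uses the closed form $\pi_m'(a) = \exp(\gamma_m z_a - \gamma_m^2/2)$ with $z_a = \Phi^{-1}(1 - a)$, which follows from differentiating the ROC function above. The nested bisection, the bracketing constants, and all function names are assumptions made for illustration.

```python
from math import log1p
from statistics import NormalDist

STD_NORMAL = NormalDist()


def roc(alpha, gamma):
    """Power 1 - Phi(Phi^{-1}(1 - alpha) - gamma) of the size-alpha NP MP test."""
    z = -STD_NORMAL.inv_cdf(alpha)  # equals Phi^{-1}(1 - alpha)
    return 1.0 - STD_NORMAL.cdf(z - gamma)


def log_equilibrium(a, gamma):
    """log of pi'(a) * (1 - a), the left side of the first equilibrium condition."""
    z = -STD_NORMAL.inv_cdf(a)
    return gamma * z - 0.5 * gamma * gamma + log1p(-a)


def size_for_multiplier(log_lam, gamma, iters=100):
    """Solve log_equilibrium(a, gamma) = log_lam for a by bisection.

    The left side is decreasing in a, so move right while it is too large.
    """
    lo, hi = 1e-15, 1.0 - 1e-12
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if log_equilibrium(mid, gamma) > log_lam:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def optimal_sizes(gammas, alpha, iters=100):
    """Sizes A_m meeting both equilibrium conditions at overall level alpha."""
    target = log1p(-alpha)  # log(1 - alpha)
    lo, hi = -50.0, 50.0    # assumed bracket for log(lambda)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        sizes = [size_for_multiplier(mid, g) for g in gammas]
        total = sum(log1p(-a) for a in sizes)
        if total < target:  # sizes too large overall: increase lambda
            lo = mid
        else:
            hi = mid
    return [size_for_multiplier(0.5 * (lo + hi), g) for g in gammas]


if __name__ == "__main__":
    gammas = [0.5, 2.0, 6.0]  # hypothetical small, moderate, large effect sizes
    sizes = optimal_sizes(gammas, alpha=0.05)
    for g, a in zip(gammas, sizes):
        print(g, a, roc(a, g))
```

In this small illustration the moderate effect size receives the largest share of the size budget. When all effect sizes are equal, the solution reduces to the Sidak size function.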
Optimal Test Sizes vs Effect Sizes
[Figure, four panels: (1) density histogram of the effect sizes (x-axis: Effect Size, 0 to 14); (2) Optimal Test Size versus Effect Size; (3) Optimal Test Power versus Optimal Test Size; (4) Power versus Effect Size (Blue = Optimal; Red = Sidak).]
Economic Aspect: A Size-Investing Strategy
▶ Do not invest your size on those where you will not make discoveries (small power) or those that you will certainly make discoveries (high power)!
▶ Rather, concentrate on those where it is a bit uncertain, since your differential gain in overall discovery rate would be greater!
▶ Some Wicked Consequences
  ▶ Departmental Merit Systems.
  ▶ Graduate Student Advising.
BH MDF versus $\delta^*(q)$: $q^* = .1$; $M = 20$; 1000 Reps
| � | p | $\delta^*_F$-FDR | $\delta^*_F$-MDR* | $\delta_{BH}$-FDR | $\delta_{BH}$-MDR* |
| 1 | 0.1 | 8.03 | 70.80 | 8.43 | 72.64 |
| 1 | 0.2 | 7.55 | 79.64 | 8.77 | 81.99 |
| 1 | 0.4 | 6.05 | 77.47 | 6.65 | 80.30 |
| 2 | 0.1 | 7.70 | 54.42 | 8.43 | 55.80 |
| 2 | 0.2 | 7.39 | 56.32 | 7.59 | 57.31 |
| 2 | 0.4 | 6.47 | 47.82 | 6.21 | 49.38 |
| 4 | 0.1 | 9.14 | 8.62 | 9.48 | 10.30 |
| 4 | 0.2 | 7.80 | 7.34 | 6.97 | 9.20 |
| 4 | 0.4 | 6.15 | 3.58 | 5.65 | 5.53 |
BH MDF versus $\delta^*(q)$: $q^* = .1$; $M = 100$; 1000 Reps
| � | p | $\delta^*_F$-FDR | $\delta^*_F$-MDR* | $\delta_{BH}$-FDR | $\delta_{BH}$-MDR* |
| 1 | 0.1 | 9.14 | 87.10 | 9.02 | 90.02 |
| 1 | 0.2 | 8.21 | 84.05 | 8.78 | 87.38 |
| 1 | 0.4 | 5.92 | 80.12 | 5.88 | 83.73 |
| 2 | 0.1 | 9.79 | 66.10 | 9.24 | 67.93 |
| 2 | 0.2 | 7.68 | 58.25 | 7.94 | 59.93 |
| 2 | 0.4 | 5.74 | 49.29 | 6.10 | 50.90 |
| 4 | 0.1 | 8.37 | 10.44 | 8.62 | 12.36 |
| 4 | 0.2 | 7.72 | 5.93 | 7.81 | 8.22 |
| 4 | 0.4 | 5.69 | 3.80 | 6.14 | 5.72 |
Potential Applications and Concluding Remarks
▶ Microarray data analysis: which genes are important?
▶ Systems analysis (Biological Pathways?): which components (subsystems of genes) are relevant?
▶ Variable selection: which predictor variables are important?
▶ For each gene, component, or predictor variable, apply a decision function to decide whether, say, independence or dependence holds with respect to the response variable.
▶ Test for Independence: Kendall’s procedure, for example.
▶ Use MDFs $\delta^\dagger(q)$ or $\delta^*(q)$.
▶ Issues of determining effect sizes to determine power or ROC functions still need further studies.
▶ Comparison with other methods, such as those using regularization?
Acknowledgements
▶ Wensong Wu, my PhD student at USC Stat; Asst Prof at FIU.
▶ Josh Habiger, my PhD student at USC Stat; Asst Prof at Oklahoma State.
▶ Marge Pena (Biology), Yu Zhang (Biology), and James Ryan (NOAA).
▶ NSF and NIH for research support.