![Page 1: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/1.jpg)
Required Sample size for Bayesian network Structure learning
Samee Ullah Khan
and
Kwan Wai Bong Peter
![Page 2: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/2.jpg)
Outline
Motivation
Introduction
Sample Complexity
– Sanjoy Dasgupta
– Russell Greiner
– Nir Friedman
– David Haussler
Summary
Conclusion
![Page 3: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/3.jpg)
Motivation
John works at a pharmaceutical company.
What is the optimal sample size of a clinical trial? It is a function of both the statistical significance of the difference and the magnitude of the apparent difference between performances.
Purpose: a tool (measure) for public and commercial vendors to plan clinical trials.
Looking for: acceptance from potential users and statistically significant evidence.
![Page 4: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/4.jpg)
Motivation: Solution
Optimize the difference between the performances of both treatments.
Let C= diff (expected cost of new treatment –expected cost of old treatment)
![Page 5: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/5.jpg)
Motivation
C=0, m= users, is the difference in performance
![Page 6: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/6.jpg)
Motivation
C>0
![Page 7: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/7.jpg)
Motivation
C<0
![Page 8: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/8.jpg)
Motivation: Conclusion
Actual improvement in performance is known It may be extended to the uncertainty about the
amount of improvement. It is also possible to shift the functions 1` or
2`to right. Where ` is standard deviation of the posterior
distribution of unknown parameter .
![Page 9: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/9.jpg)
Motivation: Model
Paired Observations (X1,Y1),(X2,Y2)……..Xi is new clinical outcome Yi is old clinical outcomeLet Z be the objective function Zi=Xi-Yi (i=1,2,3……….)Assume that has normal density N(,2)Formulating our previous knowledge about assume a
prior density N(,2).Under the assumptions is a sufficient statistics for the
parameter .
![Page 10: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/10.jpg)
Introduction
Efficient learning: more accurate models with less data
– Compare: P(A) and P(B) vs joint P(A,B); the former requires less data!
– Discover structural properties of the domain
– Identifying independencies in the domain helps to
• Order events that occur sequentially
• Sensitivity analysis and inference
Predict effect of actions
– Involves learning causal relationships among variables
![Page 11: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/11.jpg)
Introduction
Why Struggle for Accurate Structure
![Page 12: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/12.jpg)
Introduction
Adding an Arc
– Increases the number of parameters to be fitted
– Wrong assumptions about causality and domain structure
![Page 13: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/13.jpg)
Introduction
Deleting an Arc
– Cannot be compensated by accurate fitting of parameters
– Also misses causality and domain structure
![Page 14: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/14.jpg)
Introduction
Approaches to Learning Structure
– Constraint based
• Perform tests of conditional independence
• Search for a network that is consistent with the observed dependencies and independencies
– Score based
• Define a score that evaluates how well the (in)dependencies in a structure match the observations
• Search for a structure that maximizes the score
![Page 15: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/15.jpg)
Introduction
Constraints versus Scores
– Constraint based
• Intuitive, follows closely the definition of BNs
• Separates structure construction from the form of the independence tests
• Sensitive to errors in individual tests
– Score based
• Statistically motivated
• Can make compromises
– Both
• Consistent: with sufficient amounts of data and computation, they learn the correct structure
![Page 16: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/16.jpg)
Dasgupta’s model
Haussler’s extension of the PAC framework
Situation: fixed network structure
Goal: to learn the conditional probability functions accurately
![Page 17: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/17.jpg)
Dasgupta’s model
A learning algorithm A:
– Given:
1) An approximation parameter $\epsilon > 0$
2) A confidence parameter $0 < \delta < 1$
3) Variables drawn from an instance space X: x1, x2, …, xn
4) An oracle that randomly generates instances of X according to some unknown distribution P, which we want to learn
5) Some hypothesis class H
![Page 18: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/18.jpg)
Dasgupta’s model
– Output: a hypothesis $h \in H$ such that, with probability $> 1 - \delta$,
$$d(P, h) \le d(P, h_{opt}) + \epsilon$$
where
d(.,.) is a distance measure
$h_{opt}$ is the concept $h' \in H$ that minimizes $d(P, h')$
![Page 19: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/19.jpg)
Dasgupta’s model: Distance measure
Most intuitive: L1 norm
Most popular: Kullback-Leibler divergence (relative entropy)
Minimizing dKL with respect to the empirically observed distribution is equivalent to solving the maximum likelihood problem
$$d_{KL}(P, h) = \sum_{x \in X} P(x) \log \frac{P(x)}{h(x)}$$
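The two distance measures named on this slide can be compared on a toy example. The following is a minimal sketch, assuming discrete distributions stored as Python dictionaries; the function names and the distributions `P` and `H` are illustrative and not taken from the slides.

```python
import math

def kl_divergence(p, h):
    """d_KL(P, h) = sum_x P(x) * log(P(x) / h(x)), using the natural log."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # a term with P(x) = 0 contributes nothing
        hx = h.get(x, 0.0)
        if hx == 0.0:
            return math.inf  # unbounded: h gives zero mass where P does not
        total += px * math.log(px / hx)
    return total

def l1_distance(p, h):
    """L1 norm: sum_x |P(x) - h(x)|."""
    keys = set(p) | set(h)
    return sum(abs(p.get(x, 0.0) - h.get(x, 0.0)) for x in keys)

# Illustrative distributions over two outcomes (assumed values).
P = {"a": 0.5, "b": 0.5}
H = {"a": 0.9, "b": 0.1}

print(kl_divergence(P, H), l1_distance(P, H))
```

The L1 distance never exceeds 2, whereas the KL divergence becomes infinite as soon as the hypothesis assigns zero probability to an outcome that P can produce.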
![Page 20: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/20.jpg)
Dasgupta’s model: Distance measure
Disadvantage of dKL: unbounded
So, the measure adopted in this model is relative entropy by replacing log with ln.
![Page 21: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/21.jpg)
Dasgupta’s model
The algorithm, given m samples drawn from some distribution P, finds the best fitting hypothesis by evaluating each h(,)H(,) by computing the empirical log loss E(-ln h(,)) and returning the hypothesis with the smallest value, where H(,)H, called an (,)-bounded approximation of H.
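The selection step described here, evaluating each candidate by its empirical log loss and returning the smallest, can be sketched as follows. This is an illustration under stated assumptions: the hypothesis class is a small finite list of distributions with probabilities bounded away from zero, and the names `empirical_log_loss`, `best_hypothesis`, and the sample data are hypothetical.

```python
import math

def empirical_log_loss(h, samples):
    """Average of -ln h(x) over the observed samples."""
    return sum(-math.log(h[x]) for x in samples) / len(samples)

def best_hypothesis(hypotheses, samples):
    """Return the hypothesis with the smallest empirical log loss."""
    return min(hypotheses, key=lambda h: empirical_log_loss(h, samples))

# Assumed toy data: four draws from an unknown distribution over {a, b}.
samples = ["a", "a", "a", "b"]

# Assumed finite hypothesis class; every probability is bounded away from 0,
# so the log loss of each hypothesis stays finite.
hypotheses = [
    {"a": 0.5, "b": 0.5},
    {"a": 0.75, "b": 0.25},
    {"a": 0.1, "b": 0.9},
]

chosen = best_hypothesis(hypotheses, samples)
print(chosen)
```

Keeping every probability bounded away from zero is what makes the log loss of each candidate finite, which is the role the bounded approximation of H plays in the text.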
![Page 22: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/22.jpg)
Dasgupta’s model
By using Hoeffding and Chernoff bounds, the number of samples needed is bounded by
Lower bound:
)31ln(218ln)
31(ln
22
222288 nnn kkn
32)(n
![Page 23: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/23.jpg)
Russell Greiner’s claim
Many learning algorithms that determine which Bayesian network is optimal are based on measures such as log-likelihood, MDL, or BIC. These typical measures are independent of the queries that will be posed.
Learning algorithms should consider the distribution of queries as well as the underlying distribution of events, and seek the BN with the best performance over the query distribution rather than the one that appears closest to the underlying event distribution.
![Page 24: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/24.jpg)
Russell Greiner’s model
Let
V: set of the N variables
SQ: set of all possible legal statistical queries
sq(x; y): a distribution over SQ
Suppose we fix a network B over V, and let B(x|y) be the real-valued probability that B returns for this assignment. Given a distribution sq(.;.) over SQ, the “score” of B is
$$err_{sq,p}(B) = \sum_{x,y} sq(x; y)\,[B(x|y) - p(x|y)]^2$$
We write err(B) = $err_{sq,p}(B)$ when sq and p are clear from context.
![Page 25: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/25.jpg)
Russell Greiner’s model
Observation:– Any Bayesian network B* that encodes the underlying
distribution p(.), will in fact produce the optimal performance; i.e. err(B*) will be optimal
– This means that if we have a learning algorithm that produces better approximations to p(.) as it sees more training examples, then in the limit the sq(.) distribution becomes irrelevant.
![Page 26: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/26.jpg)
Russell Greiner’s model
Given a set of labeled statistical queries Q={<xi;yi;pi>}i, let
$$\widehat{err}_Q(B) = \frac{1}{|Q|} \sum_{\langle x; y; p \rangle \in Q} [B(x|y) - p]^2$$
be the empirical score of the Bayesian net.
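A small numeric illustration of the empirical score may help. This sketch assumes the score is the mean squared difference between the network's answer B(x|y) and the labeled probability p of each query, in line with the err(B) definition. The network is modeled as a plain Python callable, and all names and values are hypothetical.

```python
def empirical_err(network, queries):
    """Mean of (B(x|y) - p)^2 over labeled queries (x, y, p)."""
    return sum((network(x, y) - p) ** 2 for x, y, p in queries) / len(queries)

# Hypothetical network that answers 0.5 for every conditional query.
def constant_network(x, y):
    return 0.5

# Assumed labeled queries: (query variable, evidence, true probability).
labeled_queries = [
    ("x1", "y1", 0.5),
    ("x2", "y2", 0.7),
]

score = empirical_err(constant_network, labeled_queries)
print(score)
```

A network that reproduces every labeled probability exactly gets a score of zero, so lower is better.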
![Page 27: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/27.jpg)
Russell Greiner’s model
Compute err(B):– #P-hard to compute the estimate of
err(B) from general statistical queries If we know that all queries encountered sq(x;y),
satisfy p(y) for some >0, then we only need
complete event examples, withexample queries to obtain an -close estimate, with probability at least 1-.
)(BerrQ
4
ln2
2),( SQM
}4
ln2
8],
4ln
2ln
2
8[
2max{
SQMSQMSQM
![Page 28: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/28.jpg)
Nir Friedman’s model
Review
– A BN is composed of two parts.
• DAG
• Parameter encoding
Setup
• Let B* be a BN that describes the target distribution of the training samples.
• Entropy Distance (Kullback-Leibler)
$$d_{KL}(P, h) = \sum_{x \in X} P(x) \log \frac{P(x)}{h(x)}$$
• Learn from Random Variables, decrease with N.
![Page 29: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/29.jpg)
Nir Friedman’s model: Learning
Criteria:
– Error threshold $\epsilon$
– Confidence threshold $\delta$
$N(\epsilon, \delta)$: sample size
If the sample size is larger than $N(\epsilon, \delta)$, then
$$\Pr(D(P_{Lrn(\cdot)} \| P) > \epsilon) < \delta$$
where Lrn(.) represents the learning routine.
If $N(\epsilon, \delta)$ is minimal, then it is called the sample complexity.
![Page 30: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/30.jpg)
Nir Friedman’s model:Notations
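Vector-valued U = {X1, X2, …, Xn}
– X, Y, Z: variables
– x, y, z: values
So B = <G, θ>
– G is a DAG
– θ is the set of parameters, $\theta_{x_i|\Pi_{x_i}} = P(x_i | \Pi_{x_i})$, where $\Pi_{x_i}$ denotes the parents of $X_i$ in G
BN is minimal
$$P_B(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P_B(X_i | \Pi_{X_i}) = \prod_{i=1}^{n} \theta_{X_i|\Pi_{X_i}}$$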
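The factorization of the joint distribution into one conditional term per variable can be checked on a tiny network. This is a sketch, assuming a hypothetical two-node network A → B over binary variables; the conditional probability tables are made-up values for illustration.

```python
from itertools import product

# Hypothetical structure: A has no parents, B has parent A.
parents = {"A": (), "B": ("A",)}

# Assumed parameters: cpt[var][parent values][value] = P(var = value | parents).
cpt = {
    "A": {(): {0: 0.7, 1: 0.3}},
    "B": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.4, 1: 0.6}},
}

def joint_probability(assignment):
    """P_B(x_1, ..., x_n) as the product of each variable's conditional term."""
    prob = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        prob *= cpt[var][pa_values][assignment[var]]
    return prob

# Summing the joint over all assignments should give 1.
total = sum(
    joint_probability({"A": a, "B": b}) for a, b in product((0, 1), repeat=2)
)
print(joint_probability({"A": 1, "B": 1}), total)
```

Here the joint over two binary variables needs three free parameters; the saving grows quickly as more variables have few parents.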
![Page 31: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/31.jpg)
Nir Friedman’s model:Learning
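Given a training set $\omega_N = \{u_1, \ldots, u_N\}$ of U, find the B that best matches the training set.
The log-likelihood of B:
$$LL_N(B) = \log P_B(u_1, \ldots, u_N) = \sum_{j=1}^{N} \log(P_B(u_j))$$
Decomposing the log-likelihood according to structure:
$$\hat{P}_N(A) = \frac{1}{N} \sum_{j=1}^{N} 1_A(u_j), \quad \text{where } 1_A(u) = \begin{cases} 1 & \text{if } u \in A \\ 0 & \text{if } u \notin A \end{cases}$$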
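Both quantities on this slide, the log-likelihood of a training set and the empirical probability of an event, are short to compute. The sketch below assumes i.i.d. samples represented as tuples and a model given as a callable returning P_B(u); the sample values and the uniform model are hypothetical.

```python
import math

def log_likelihood(model, samples):
    """Sum over samples of log P_B(u_j), assuming independent draws."""
    return sum(math.log(model(u)) for u in samples)

def empirical_probability(event, samples):
    """Fraction of samples that fall in the event A (indicator average)."""
    return sum(1 for u in samples if event(u)) / len(samples)

# Assumed training set over two binary variables (A, B).
training_set = [(0, 0), (0, 1), (1, 1), (1, 1)]

# Hypothetical model: uniform over the four joint assignments.
def uniform_model(u):
    return 0.25

ll = log_likelihood(uniform_model, training_set)
p_a1 = empirical_probability(lambda u: u[0] == 1, training_set)
print(ll, p_a1)
```

The empirical probability is just a frequency count, which is why the log-likelihood can later be rewritten in terms of counts for each variable and its parents.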
![Page 32: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/32.jpg)
Nir Friedman’s model:Learning
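So we can derive
$$LL_N(B) = N \sum_i \sum_{x_i, \Pi_{x_i}} \hat{P}_N(x_i, \Pi_{x_i}) \log \theta_{x_i|\Pi_{x_i}}$$
where $\Pi_{x_i}$ ranges over the values of the parents of $X_i$.
Assume G has a fixed structure; the optimal parameters are
$$\theta_{x_i|\Pi_{x_i}} = \hat{P}_N(x_i | \Pi_{x_i})$$
The argument is that large networks are not desirable, so the score penalizes network size:
$$S(G, \omega_N) = LL_N(G) - \psi(N)\,|G|$$
where $|G|$ is the number of parameters of G and $\psi(N)$ is a penalty weight.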
![Page 33: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/33.jpg)
Nir Friedman’s model: PSM
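Penalized weighting function: $\psi(N)$
MDL principle:
– Total description length of data: $S(G, \omega_N)$
– AIC: $\psi(N) = c$
– BIC: $\psi(N) = \frac{1}{2} \log N$
The learned structure maximizes the score:
$$S(Gph(Lrn(\omega_N)), \omega_N) = \max_G S(G, \omega_N)$$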
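To see how a penalized score trades fit against network size, here is a minimal sketch. It assumes the score has the form "maximized log-likelihood minus penalty weight times number of parameters", with the BIC weight (1/2) log N, natural logarithms, and binary variables. The data set, the two candidate structures, and all function names are hypothetical.

```python
import math
from collections import Counter

def max_log_likelihood(data, parents):
    """Log-likelihood with parameters set to empirical conditional frequencies."""
    ll = 0.0
    for var, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        for (pa_values, _), count in joint.items():
            ll += count * math.log(count / marginal[pa_values])
    return ll

def num_parameters(parents, arity=2):
    """Free parameters: (arity - 1) per parent configuration of each variable."""
    return sum((arity - 1) * arity ** len(pa) for pa in parents.values())

def penalized_score(data, parents, penalty_weight):
    """Assumed form: LL - penalty_weight(N) * number of parameters."""
    n = len(data)
    return max_log_likelihood(data, parents) - penalty_weight(n) * num_parameters(parents)

def bic_weight(n):
    return 0.5 * math.log(n)

# Assumed data: A and B agree in 80 of 100 samples.
data = (
    [{"A": 0, "B": 0}] * 40 + [{"A": 1, "B": 1}] * 40
    + [{"A": 0, "B": 1}] * 10 + [{"A": 1, "B": 0}] * 10
)

empty_graph = {"A": (), "B": ()}   # no arc
arc_graph = {"A": (), "B": ("A",)}  # arc A -> B

score_empty = penalized_score(data, empty_graph, bic_weight)
score_arc = penalized_score(data, arc_graph, bic_weight)
print(score_empty, score_arc)
```

With this data the extra parameter of the arc is worth its penalty, so the structure with the arc scores higher; with nearly independent data the penalty would tip the choice toward the empty graph.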
![Page 34: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/34.jpg)
Nir Friedman’s model: Sample Complexity
Sample complexity
– Log-likelihood and penalty term
– Random noise
Entropy distance
$$d_{KL}(P, h) = \sum_{x \in X} P(x) \log \frac{P(x)}{h(x)}$$
$$\|P - Q\|_1 = \sum_x |P(x) - Q(x)|$$
![Page 35: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/35.jpg)
Nir Friedman’s model: Sample Complexity
Idealized case*),*(),*( GSGS NN
)21()(
),1ˆ(),2ˆ( GGN
NNGPNPDNGPNPD
GG
NN
*
)(
y x
xyyxy
logthen,log2 if ,4Let
![Page 36: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/36.jpg)
Nir Friedman’s model: Sample Complexity
Sub-sampling strategies in learning)(),()( ˆˆˆ iXNPiXiNPiXiNP HXHXH
m
NX
NXHXNPH
1log3
122
)1())()(ˆPr(
![Page 37: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/37.jpg)
Nir Friedman’s model: Summary
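An upper bound can be shown on the sample complexity of learning a BN using MDL:
$$N(\epsilon, \delta) = O\left(\left(\frac{1}{\epsilon}\right)^{4/3} \log\frac{1}{\epsilon} \, \log\frac{1}{\delta} \, \log\log\frac{1}{\delta}\right)$$
– The bound is loose
– Searching for an optimal structure is NP-hard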
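To get a feel for how such a bound scales, the sketch below evaluates the growth term, assuming the bound has the form (1/ε)^(4/3) · log(1/ε) · log(1/δ) · log log(1/δ). The hidden constant of the O(·) is unknown, so only ratios between values are meaningful; the function name is hypothetical.

```python
import math

def bound_growth(eps, delta):
    """Growth term of the assumed bound, without the unknown constant factor.

    Requires 0 < eps < 1 and 0 < delta < 1/e so that every log is positive.
    """
    inv_eps = 1.0 / eps
    inv_delta = 1.0 / delta
    return (
        inv_eps ** (4.0 / 3.0)
        * math.log(inv_eps)
        * math.log(inv_delta)
        * math.log(math.log(inv_delta))
    )

# Halving the error threshold at a fixed confidence level.
coarse = bound_growth(0.1, 0.05)
fine = bound_growth(0.05, 0.05)
print(fine / coarse)
```

Under this form, tightening the error threshold costs far more samples than tightening the confidence threshold, since δ enters only through logarithms.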
![Page 38: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/38.jpg)
David Haussler’s model
The model is based on prediction. The learner attempts to infer an unknown target concept f chosen from a concept class F of {0, 1}-valued functions.
For any given instance xi, the learner predicts the value of f(xi).
After the prediction, the learner is told the correct answer, which it uses to improve its result.
![Page 39: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/39.jpg)
David Haussler’s model
Criteria for sample bounds:
– Probability of f(xm+1) over (x1, f(x1)), …, (xm, f(xm))
– Cumulative mistakes made over m trials
The model uses VC dimension
![Page 40: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/40.jpg)
VC
General condition for uniform convergence:
Definition:– Shattered set. Let X be the instance space and C the
concept class– SX, shattered by C– S’ S, c C which contains all S’ and none of S-S’– SX, C(S) S
] with consistent is )(|[Pr ShhDerrorChm
Ds
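The shattering definition can be checked by brute force on a small finite example. The sketch below assumes concepts are represented as sets of instances and that a set S is shattered when every subset of S is picked out by some concept. The interval concept class and all names are hypothetical illustrations.

```python
from itertools import combinations

def is_shattered(points, concepts):
    """True if every subset of `points` equals (concept ∩ points) for some concept."""
    points = frozenset(points)
    realized = {frozenset(c) & points for c in concepts}
    return len(realized) == 2 ** len(points)

def vc_dimension(instance_space, concepts):
    """Largest size of a shattered subset of a finite instance space."""
    best = 0
    for k in range(1, len(instance_space) + 1):
        if any(is_shattered(s, concepts) for s in combinations(instance_space, k)):
            best = k
        else:
            break  # if no set of size k is shattered, no larger set can be
    return best

# Assumed example: instance space {1, 2, 3, 4}; concepts are the empty set
# and all contiguous intervals {i, ..., j}.
instance_space = [1, 2, 3, 4]
intervals = [set()] + [
    set(range(i, j + 1)) for i in range(1, 5) for j in range(i, 5)
]

print(vc_dimension(instance_space, intervals))
```

Intervals can realize every labeling of two points but cannot pick the two outer points of a triple while excluding the middle one, so no set of three points is shattered.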
![Page 41: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/41.jpg)
David Haussler’s model
Information Gain– At instance m, the learner has observed f(x1),
…,f(xm) labels predict f(xm+1)
)(1log
)(
)(1log
]1),()(ˆ|)1()1(ˆ[ˆPrlog
)(1),(1
fm
fmV
fmV
miixfixfmxfmxfmPf
fmIfxP
mI
![Page 42: Required Sample size for Bayesian network Structure learning](https://reader035.vdocument.in/reader035/viewer/2022062321/568135f2550346895d9d62b9/html5/thumbnails/42.jpg)
David Haussler’s model
)(/)()(),( 111 fVfVffx mmmPm
)](log[E)]([E 11 ffI mPfmPf
))](1())(1())(()([E
))](([E
1111
1
fGffGf
fG
mmmmPf
mPf