A Tutorial on Inference and Learning in Bayesian Networks
Irina Rish Moninder Singh
IBM T.J. Watson Research Center
rish,moninder@us.ibm.com
“Road map”
Introduction: Bayesian networks
  What are BNs: representation, types, etc.
  Why use BNs: applications (classes) of BNs
  Information sources, software, etc.
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary
Bayesian Networks
BN = (G, Θ): conditional independencies => efficient representation.
Nodes: Visit to Asia (A), Smoking (S), Tuberculosis (T), Lung Cancer (L), Bronchitis (B), Chest X-ray (C), Dyspnoea (D), with CPDs P(A), P(S), P(T|A), P(L|S), P(B|S), P(C|T,L), P(D|T,L,B).
P(A, S, T, L, B, C, D) = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)
CPD P(D|T,L,B):
T L B | D=0 D=1
0 0 0 | 0.1 0.9
0 0 1 | 0.7 0.3
0 1 0 | 0.8 0.2
0 1 1 | 0.9 0.1
...
[Lauritzen & Spiegelhalter, 95]
Bayesian Networks
Structured, graphical representation of probabilistic relationships between several random variables
Explicit representation of conditional independencies: missing arcs encode conditional independence
Efficient representation of the joint pdf
Allows arbitrary queries to be answered, e.g. P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
Example: Printer Troubleshooting (Microsoft Windows 95)
[Figure: troubleshooter network with nodes Application Output OK, GDI Data Input OK, Correct Driver, Correct Driver Settings, Uncorrupted Driver, GDI Data Output OK, Print Spooling On, Spooled Data OK, Spool Process OK, Net/Local Printing, Network Up, Net Path OK, Net Cable Connected, Local Path OK, Local Cable Connected, Correct Local Port, Correct Printer Path, Correct Printer Selected, PC to Printer Transport OK, Printer On and Online, Printer Data OK, Print Data OK, Printer Memory Adequate, Paper Loaded, Local Disk Space Adequate, Print Output OK]
[Heckerman, 95]
Example: Microsoft Pregnancy and Child Care
[Heckerman, 95]
Independence Assumptions
Head-to-tail: Visit to Asia → Tuberculosis → Chest X-ray
Tail-to-tail: Lung Cancer ← Smoking → Bronchitis
Head-to-head: Lung Cancer → Dyspnoea ← Bronchitis
Independence Assumptions
Nodes X and Y are d-connected by nodes in Z along a trail from X to Y if:
  every head-to-head node along the trail is in Z or has a descendant in Z, and
  every other node along the trail is not in Z.
Nodes X and Y are d-separated by Z if they are not d-connected by Z along any trail from X to Y.
If X and Y are d-separated by Z, then X and Y are conditionally independent given Z.
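The d-separation test above can be checked mechanically with the standard two-phase reachability algorithm: first collect Z and its ancestors, then search over (node, direction) pairs to look for an active trail. A sketch, assuming the DAG is given as a dict mapping each node to its list of parents:

```python
from collections import deque

def d_separated(parents, x, y, z):
    """True iff x and y are d-separated given the set z in the DAG
    represented as {node: list of parents}."""
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)

    # Phase 1: z and all its ancestors (nodes with a descendant in z).
    ancestors = set()
    frontier = deque(z)
    while frontier:
        n = frontier.popleft()
        if n not in ancestors:
            ancestors.add(n)
            frontier.extend(parents[n])

    # Phase 2: search over (node, direction) pairs; 'up' means the trail
    # entered n from a child, 'down' means it entered n from a parent.
    visited = set()
    frontier = deque([(x, 'up')])
    while frontier:
        n, direction = frontier.popleft()
        if (n, direction) in visited:
            continue
        visited.add((n, direction))
        if n == y:
            return False          # active trail found: d-connected
        if direction == 'up' and n not in z:
            for p in parents[n]:  # continue upward (chain) or fork
                frontier.append((p, 'up'))
            for c in children[n]:
                frontier.append((c, 'down'))
        elif direction == 'down':
            if n not in z:        # chain downward through n
                for c in children[n]:
                    frontier.append((c, 'down'))
            if n in ancestors:    # head-to-head node activated by z
                for p in parents[n]:
                    frontier.append((p, 'up'))
    return True
```

On the Asia network, for instance, Visit to Asia and Smoking are d-separated given nothing, but become d-connected once Dyspnoea (a head-to-head node) is observed.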
Independence Assumptions
A variable (node) is conditionally independent of its non-descendants given its parents.
[Figure: the Asia network — Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Chest X-ray, Dyspnoea]
Independence Assumptions
[Figure: network with nodes Age, Gender, Exposure to Toxins, Smoking, Diet, Cancer, Serum Calcium, Lung Tumor]
Cancer is independent of Diet given Exposure to Toxins and Smoking.
[Breese & Koller, 97]
Independence Assumptions
This means the joint pdf can be represented as a product of local distributions:
P(A,S,T,L,B,C,D) = P(A) · P(S|A) · P(T|A,S) · P(L|A,S,T) · P(B|A,S,T,L) · P(C|A,S,T,L,B) · P(D|A,S,T,L,B,C)
                 = P(A) · P(S) · P(T|A) · P(L|S) · P(B|S) · P(C|T,L) · P(D|T,L,B)
Independence Assumptions
Thus, the general product rule for Bayesian networks is
P(X1, X2, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))
where Pa(Xi) is the set of parents of Xi.
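The product rule can be evaluated directly once each CPD is stored as a lookup table. A minimal sketch on a two-node fragment (Visit to Asia → Tuberculosis); the CPT numbers are made up for illustration:

```python
def joint_prob(cpds, parents, assignment):
    """P(x1,...,xn) = prod_i P(xi | Pa(xi)); each CPD is a dict mapping
    (value, parent values...) -> probability."""
    p = 1.0
    for var in cpds:
        key = (assignment[var],) + tuple(assignment[pa] for pa in parents[var])
        p *= cpds[var][key]
    return p

# Assumed toy CPTs: P(A) and P(T|A), both variables binary.
cpds = {'A': {(1,): 0.01, (0,): 0.99},
        'T': {(1, 1): 0.05, (0, 1): 0.95, (1, 0): 0.01, (0, 0): 0.99}}
parents = {'A': [], 'T': ['A']}
p = joint_prob(cpds, parents, {'A': 1, 'T': 1})  # = P(A=1) * P(T=1|A=1)
```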
The Knowledge Acquisition Task
Variables:
  collectively exhaustive, mutually exclusive values
  clarity test: value should be knowable in principle
Structure:
  if data available, can be learned
  constructed by hand (using “expert” knowledge)
  variable ordering matters: causal knowledge usually simplifies
Probabilities:
  can be learned from data
  second decimal usually does not matter; relative probs
  sensitivity analysis
The Knowledge Acquisition Task
Variable order is important.
Causal knowledge simplifies construction.
[Figure: two networks over Battery, TurnOver, Fuel, Gauge, Start, built with different variable orderings]
The Knowledge Acquisition Task
Naive Bayesian classifiers [Duda & Hart; Langley 92]
Selective naive Bayesian classifiers [Langley & Sage 94]
Conditional trees [Geiger 92; Friedman et al 97]
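The naive Bayesian classifier is the simplest of these structures: the class node is the sole parent of every feature node, so P(class | features) ∝ P(class) · ∏_i P(feature_i | class). A frequency-count sketch on made-up data:

```python
from collections import defaultdict

def train_nb(examples):
    """Estimate P(class) and P(feature_i = v | class) by frequency counts.
    examples: list of (feature_tuple, class_label)."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(int)          # (class, i, value) -> count
    for feats, c in examples:
        class_counts[c] += 1
        for i, v in enumerate(feats):
            feat_counts[(c, i, v)] += 1
    return class_counts, feat_counts

def classify_nb(model, feats):
    """argmax_c P(c) * prod_i P(feature_i | c)."""
    class_counts, feat_counts = model
    n = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / n
        for i, v in enumerate(feats):
            p *= feat_counts[(c, i, v)] / cc
        if p > best_p:
            best, best_p = c, p
    return best

# Made-up training data: (weather, temperature) -> class.
examples = [(('sun', 'hot'), 'yes'), (('sun', 'cold'), 'yes'),
            (('rain', 'cold'), 'no'), (('rain', 'hot'), 'no')]
model = train_nb(examples)
```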
The Knowledge Acquisition Task
Selective Bayesian Networks [Singh & Provan, 95;96]
What are BNs useful for?
Diagnosis: P(cause | symptom) = ?
Prediction: P(symptom | cause) = ?
Classification: max_class P(class | data)
Decision-making (given a cost function)
Data mining: induce the best model from data
Applications: medicine, bio-informatics, computer troubleshooting, stock market, text classification, speech recognition.
What are BNs useful for?
Predictive inference: Cause → Effect
Diagnostic reasoning: Effect → Cause
Decision making (maximize expected utility): known predisposing factors, imperfect observations, an unknown but important variable, a decision node, and a value node.
What are BNs useful for?
Troubleshooting loop: salient observations → assignment of belief (probability of fault “i”: Fault 1, Fault 2, Fault 3, …) → halt? If yes, act now; if no, pick the next best observation (value of information), obtain the new observation, and repeat.
[Figure: expected utility of “Do nothing”, “Action 1”, and “Action 2” as a function of the probability of fault “i”; the gap indicates the value of information]
Why use BNs?
Explicit management of uncertainty
Modularity implies maintainability
Better, flexible and robust decision making (MEU, VOI)
Can be used to answer arbitrary queries (multiple fault problems)
Easy to incorporate prior knowledge
Easy to understand
Application Examples
Intellipath
  commercial version of Pathfinder
  lymph-node diseases (60), 100 findings
APRI system developed at AT&T Bell Labs
  learns & uses Bayesian networks from data to identify customers liable to default on bill payments
NASA Vista system
  predicts failures in propulsion systems
  considers time criticality & suggests highest-utility action
  dynamically decides what information to show
Application Examples
Answer Wizard in MS Office 95 / MS Project
  Bayesian-network-based free-text help facility
  uses naive Bayesian classifiers
Office Assistant in MS Office 97
  extension of Answer Wizard; uses naïve Bayesian networks
  help based on past experience (keyboard/mouse use) and the task the user is doing currently
  this is the “smiley face” you get in your MS Office applications
Application Examples
Microsoft Pregnancy and Child-Care
  available on MSN in the Health section
  frequently occurring children’s symptoms are linked to expert modules that repeatedly ask parents relevant questions
  asks the next best question based on provided information
  presents articles that are deemed relevant based on the information provided
Application Examples
Printer troubleshooting
  HP bought a 40% stake in HUGIN; developing printer troubleshooters for HP printers
  Microsoft has 70+ online troubleshooters on their web site; they use Bayesian networks (multiple-fault models, incorporate utilities)
Fax machine troubleshooting
  Ricoh uses Bayesian-network-based troubleshooters at call centers
  enabled Ricoh to answer twice the number of calls in half the time
Application Examples
[Figures: application screenshots, three pages]
Online/print resources on BNs
Conferences & Journals: UAI, ICML, AAAI, AISTAT, KDD; MLJ, DM&KD, JAIR, IEEE KDD, IJAR, IEEE PAMI
Books and Papers:
  Bayesian Networks without Tears by Eugene Charniak. AI Magazine, Winter 1991.
  Probabilistic Reasoning in Intelligent Systems by Judea Pearl. Morgan Kaufmann, 1988.
  Probabilistic Reasoning in Expert Systems by Richard Neapolitan. Wiley, 1990.
  CACM special issue on real-world applications of BNs, March 1995.
Online/Print Resources on BNs
Wealth of online information at www.auai.org. Links to:
  electronic proceedings of the UAI conferences
  other sites with information on BNs and reasoning under uncertainty
  several tutorials and important articles
  research groups & companies working in this area
  other societies, mailing lists and conferences
Publicly available s/w for BNs
List of BN software maintained by Russell Almond at bayes.stat.washington.edu/almond/belief.html
  several free packages: generally research only
  commercial packages: the most powerful (& expensive) is HUGIN; others include Netica and Dxpress
  we are working on a Java-based BN toolkit here at Watson; it will also work within ABLE
“Road map”
Introduction: Bayesian networks
  What are BNs: representation, types, etc.
  Why use BNs: applications (classes) of BNs
  Information sources, software, etc.
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary
Probabilistic Inference Tasks
Belief updating:
  BEL(Xi) = P(Xi = xi | evidence)
Finding the most probable explanation (MPE):
  x* = argmax_x P(x, e)
Finding the maximum a-posteriori hypothesis (A ⊆ X: hypothesis variables):
  (a1*, …, ak*) = argmax_a Σ_{X/A} P(x, e)
Finding the maximum-expected-utility (MEU) decision (D ⊆ X: decision variables; U(x): utility function):
  (d1*, …, dk*) = argmax_d Σ_{X/D} P(x, e) U(x)
Belief Updating
[Figure: network with nodes Smoking, Lung Cancer, Bronchitis, X-ray, Dyspnoea]
P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
Belief updating: P(X|evidence) = ?
Variable elimination on the network with “moral” graph over A, B, C, D, E:
P(a | e=0) ∝ P(a, e=0) = Σ_{b,c,d,e=0} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c)
  = P(a) Σ_{e=0} Σ_d Σ_c P(c|a) Σ_b P(b|a) P(d|b,a) P(e|b,c)
  = P(a) Σ_{e=0} Σ_d Σ_c P(c|a) h_B(a, d, c, e)
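The elimination step (multiply the functions that mention a variable, then sum it out) can be sketched directly with factors stored as dicts. The CPT numbers below are made up; evidence e = 0 is absorbed by restricting P(e|b,c) to a factor over (b, c):

```python
from itertools import product

def eliminate(factors, var):
    """Sum out `var`: multiply the factors that mention it and marginalize.
    A factor is (vars_tuple, table) with table[assignment_tuple] = value."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    new_vars = tuple(sorted({v for vs, _ in touching for v in vs} - {var}))
    new_table = {}
    for vals in product([0, 1], repeat=len(new_vars)):   # binary variables
        asg = dict(zip(new_vars, vals))
        total = 0.0
        for x in [0, 1]:
            asg[var] = x
            term = 1.0
            for vs, table in touching:
                term *= table[tuple(asg[v] for v in vs)]
            total += term
        new_table[vals] = total
    return rest + [(new_vars, new_table)]

# The five-variable network from the slide with assumed CPTs.  Evidence
# e = 0 becomes the restricted factor gE(b, c) = P(e=0 | b, c).
pA = (('A',), {(0,): 0.4, (1,): 0.6})
pB = (('A', 'B'), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
pC = (('A', 'C'), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.3})
pD = (('A', 'B', 'D'), {k: 0.5 for k in product([0, 1], repeat=3)})
gE = (('B', 'C'), {(0, 0): 0.9, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.1})

factors = [pA, pB, pC, pD, gE]
for v in ['D', 'B', 'C']:            # eliminate everything but the query A
    factors = eliminate(factors, v)
unnorm = {a: 1.0 for a in [0, 1]}    # combine the remaining factors over A
for vs, table in factors:
    for a in [0, 1]:
        unnorm[a] *= table[(a,)]
bel = {a: unnorm[a] / sum(unnorm.values()) for a in unnorm}  # P(a | e=0)
```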
Bucket elimination
Algorithm elim-bel (Dechter 1996):
bucket B: P(b|a), P(d|b,a), P(e|b,c)
bucket C: P(c|a) || h^B(a, d, c, e)
bucket D: || h^C(a, d, e)
bucket E: e=0 || h^D(a, e)
bucket A: P(a) || h^E(a)
Elimination operator: Σ_b (sum out the bucket’s variable).
Result: P(a | e=0). W* = 4, the “induced width” (max clique size).
Finding the MPE: algorithm elim-mpe (Dechter 1996)
MPE = max_x P(x); Σ is replaced by max:
MPE = max_{a,e,d,c,b} P(a) P(c|a) P(b|a) P(d|b,a) P(e|b,c)
bucket B: P(b|a), P(d|b,a), P(e|b,c)
bucket C: P(c|a) || h^B(a, d, c, e)
bucket D: || h^C(a, d, e)
bucket E: e=0 || h^D(a, e)
bucket A: P(a) || h^E(a)
Elimination operator: max_b. Result: MPE. W* = 4, the “induced width” (max clique size).
Generating the MPE-tuple
Given the buckets B: P(b|a), P(d|b,a), P(e|b,c); C: P(c|a) || h^B(a,d,c,e); D: || h^C(a,d,e); E: e=0 || h^D(a,e); A: P(a) || h^E(a), assign values in reverse elimination order:
1. a' = argmax_a P(a) · h^E(a)
2. e' = 0
3. d' = argmax_d h^C(a', d, e')
4. c' = argmax_c P(c|a') · h^B(a', d', c, e')
5. b' = argmax_b P(b|a') · P(d'|b, a') · P(e'|b, c')
Return (a', b', c', d', e')
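For a network this small, the elim-mpe result can be cross-checked by brute-force enumeration of all assignments. A sketch with an assumed two-node network and made-up CPTs:

```python
from itertools import product

def brute_force_mpe(cpds, parents, variables, evidence):
    """Maximize prod_i P(xi | pa_i) over all assignments consistent
    with the evidence (feasible only for tiny networks)."""
    best, best_p = None, -1.0
    for vals in product([0, 1], repeat=len(variables)):
        asg = dict(zip(variables, vals))
        if any(asg[v] != e for v, e in evidence.items()):
            continue
        p = 1.0
        for var in variables:
            key = (asg[var],) + tuple(asg[pa] for pa in parents[var])
            p *= cpds[var][key]
        if p > best_p:
            best, best_p = asg, p
    return best, best_p

# Assumed toy network A -> B.
cpds = {'A': {(1,): 0.6, (0,): 0.4},
        'B': {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.2, (0, 0): 0.8}}
parents = {'A': [], 'B': ['A']}
mpe, p = brute_force_mpe(cpds, parents, ['A', 'B'], {})
```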
Complexity of inference
O(n exp(w*(d))), where w*(d) is the induced width of the moral graph along ordering d.
The effect of the ordering on the “moral” graph over A, B, C, D, E:
  bucket order B, C, D, E, A: w*(d1) = 4
  bucket order E, D, C, B, A: w*(d2) = 2
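The induced width of an ordering can be computed by simulating elimination on the moral graph: remove nodes in bucket-processing order, connect each removed node's remaining neighbors, and track the largest neighbor set. A sketch; the edge list below is the moral graph of the five-variable network from the preceding slides:

```python
def induced_width(edges, ordering):
    """Induced width of the graph along `ordering`: nodes are eliminated
    from the END of the ordering backwards (the bucket-processing order);
    when a node is eliminated, its earlier neighbors are connected."""
    adj = {v: set() for v in ordering}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    pos = {v: i for i, v in enumerate(ordering)}
    width = 0
    for v in reversed(ordering):
        earlier = {u for u in adj[v] if pos[u] < pos[v]}
        width = max(width, len(earlier))
        for u in earlier:                    # connect remaining neighbors
            for w in earlier:
                if u != w:
                    adj[u].add(w)
    return width

# Moral graph of P(a), P(b|a), P(c|a), P(d|b,a), P(e|b,c):
# moralization adds B-C (E's parents married).
edges = [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'D'),
         ('B', 'C'), ('B', 'E'), ('C', 'E')]
```

Eliminating B first (ordering ending in B) reproduces the slide's w* = 4; eliminating E first yields w* = 2.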
Other tasks and algorithms
MAP and MEU tasks:
  similar bucket-elimination algorithms: elim-map, elim-meu (Dechter 1996)
  elimination operation: either summation or maximization
  restriction on variable ordering: summation must precede maximization (i.e., hypothesis or decision variables are eliminated last)
Other inference algorithms:
  join-tree clustering
  Pearl’s poly-tree propagation
  conditioning, etc.
Relationship with join-tree clustering
Ordering: A, B, C, D, E (buckets processed from E down to A):
bucket(E): P(e|b,c)
bucket(D): P(d|a,b)
bucket(C): P(c|a) || h(b,c)
bucket(B): P(b|a) || h(a,b), h(a,b)
bucket(A): P(a) || h(a)
Join-tree clusters: ABD, ABC, BCE. A cluster is a set of buckets (a “super-bucket”).
Relationship with Pearl’s belief propagation in poly-trees
Pearl’s belief propagation for a single-root query: “causal support” (π) messages propagate toward the query, “diagnostic support” (λ) messages propagate from observed descendants, e.g. λ_{Z1}(u1) = P(z1|u1) and λ_{Y1}(x1), combined into π(x1).
This is equivalent to elim-bel using a topological ordering and super-buckets for families.
Elim-bel, elim-mpe, and elim-map are linear for poly-trees.
[Figure: poly-tree with U1, U2, U3 parents of X1, each Ui with child Zi, and X1 with child Y1]
“Road map”
Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary
Inference is NP-hard => approximations
Exact inference is O(n exp(w*)).
Approximations:
  local inference
  stochastic simulations
  variational approximations
  etc.
Local Inference Idea
Bucket-elimination approximation: “mini-buckets”
Local inference idea: bound the size of recorded dependencies.
Computation in a bucket is time and space exponential in the number of variables involved.
Therefore, partition the functions in a bucket into “mini-buckets” over smaller numbers of variables.
Mini-bucket approximation: MPE task
Split a bucket into mini-buckets => bound complexity.
Processing two mini-buckets separately (producing functions h^X and g^X) upper-bounds the exact bucket function and gives an exponential complexity decrease: O(e^n) → O(e^r) + O(e^{n−r}).
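The partitioning step can be sketched as a greedy first-fit pass over a bucket's functions, each represented by its variable scope. The first-fit rule is an assumption for illustration; the slides do not fix a particular partitioning heuristic:

```python
def mini_buckets(scopes, i):
    """Greedy first-fit partition of a bucket's functions (given by their
    variable scopes) into mini-buckets covering at most i variables."""
    buckets = []                      # list of [combined_scope, scopes]
    for scope in scopes:
        for mb in buckets:
            if len(mb[0] | scope) <= i:   # fits without exceeding i vars
                mb[0] |= scope
                mb[1].append(scope)
                break
        else:                             # no mini-bucket fits: open a new one
            buckets.append([set(scope), [scope]])
    return buckets

# Bucket B from the slides: scopes of P(b|a), P(d|b,a), P(e|b,c).
bucket_B = [{'A', 'B'}, {'A', 'B', 'D'}, {'B', 'C', 'E'}]
```

With i = 3 the bucket splits into two mini-buckets ({P(b|a), P(d|b,a)} and {P(e|b,c)}); with i = 5 it stays whole and the computation is exact.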
Approx-mpe(i)
Input: i, the max number of variables allowed in a mini-bucket
Output: [lower bound (P of a sub-optimal solution), upper bound]
Example: approx-mpe(3) versus elim-mpe: effective width 2 instead of w* = 4.
Properties of approx-mpe(i)
Complexity: O(exp(2i)) time and O(exp(i)) space.
Accuracy: determined by the upper/lower (U/L) bound ratio.
As i increases, both accuracy and complexity increase.
Possible uses of mini-bucket approximations:
  as anytime algorithms (Dechter and Rish, 1997)
  as heuristics in best-first search (Kask and Dechter, 1999)
Other tasks: similar mini-bucket approximations for belief updating, MAP and MEU (Dechter and Rish, 1997)
Anytime Approximation
anytime-mpe(ε):
Initialize: i = 1
While time and space resources are available:
  U ← upper bound computed by approx-mpe(i)
  L ← lower bound computed by approx-mpe(i); keep the best solution found so far
  if U/L ≤ 1 + ε, return the solution
  i ← i + 1
end
Return the largest L and the smallest U found so far.
Empirical Evaluation (Dechter and Rish, 1997; Rish, 1999)
Comparing approx-mpe and anytime-mpe versus elim-mpe on:
  randomly generated networks (uniform random probabilities, random noisy-OR)
  CPCS networks
  probabilistic decoding
Random networks
Uniform random: 60 nodes, 90 edges (200 instances)
  in 80% of cases, 10-100 times speed-up while U/L < 2
Noisy-OR: even better results
  exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
Noisy-OR CPD: P(x = 0 | y1, …, yn) = Π_{i: yi = 1} qi, where qi is a random noise parameter.
CPCS networks: medical diagnosis (noisy-OR model)
[Figure: anytime-mpe(0.0001) U/L error vs. time on cpcs360b and cpcs422b, for i = 1 to 21]
Test case: no evidence. Time (sec):
Algorithm            cpcs360   cpcs422
elim-mpe             115.8     1697.6
anytime-mpe(10^-4)   70.3      505.2
anytime-mpe(10^-1)   70.3      110.5
Effect of evidence
Likely evidence versus random (unlikely) evidence:
[Figure: log(U/L) histograms for i = 10 on 1000 instances of random evidence and 1000 instances of likely evidence]
More likely evidence => higher MPE => higher accuracy (why?)
Probabilistic decoding
Error-correcting linear block code
State-of-the-art: an approximate algorithm, iterative belief propagation (IBP) (Pearl’s poly-tree algorithm applied to loopy networks)
approx-mpe vs. IBP
Bit error rate (BER) as a function of noise (sigma):
  IBP is better on randomly generated (high-w*) codes
  approx-mpe is better on low-w* codes
Mini-buckets: summary
Mini-buckets: a local inference approximation
Idea: bound the size of recorded functions
Approx-mpe(i): a mini-bucket algorithm for MPE
  better results for noisy-OR than for random problems
  accuracy increases with decreasing noise
  accuracy increases for likely evidence
  sparser graphs -> higher accuracy
  coding networks: approx-mpe outperforms IBP on low-induced-width codes
“Road map”
Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
    Local inference
    Stochastic simulations
    Variational approximations
Learning Bayesian Networks
Summary
Approximation via Sampling
1. Generate samples (s_1, …, s_N) from P(X), where s_i = (x_1^i, …, x_n^i).
2. Estimate probabilities by frequencies: P(Y = y) ≈ (# samples with Y = y) / N.
3. How to handle evidence E?
  acceptance-rejection (e.g., forward sampling)
  “clamping” evidence nodes to their values:
    likelihood weighing
    Gibbs sampling (MCMC)
Forward Sampling (logic sampling (Henrion, 1988))
Input: evidence E, # of samples N
Output: N samples consistent with E
1. Find an ancestral ordering X_1, …, X_n of the nodes.
2. For sample = 1 to N:
3.   For i = 1 to n: sample x_i from P(x_i | pa_i).
4.   If X_i ∈ E and x_i is inconsistent with the evidence, reject the sample;
5.   go to step 2.
Forward sampling (example)
Network: X1 → X2, X1 → X3, (X2, X3) → X4, with CPDs P(x1), P(x2|x1), P(x3|x1), P(x4|x3,x2).
Evidence: x3 = 0.
// generate sample k:
1. Sample x1 from P(x1).
2. Sample x2 from P(x2|x1).
3. Sample x3 from P(x3|x1).
4. If x3 ≠ 0, reject the sample and start again from 1; otherwise
5. sample x4 from P(x4|x3,x2).
Drawback: high rejection rate!
![Page 62: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/62.jpg)
Likelihood Weighing (Fung and Chang, 1990; Shachter and Peot, 1990)

"Clamping" evidence + forward sampling + weighing samples by evidence likelihood

1. For each Xᵢ ∈ E, assign xᵢ = eᵢ
2. Find an ancestral ordering X₁, ..., Xₙ of the nodes
3. For sample = 1 to N:
4.   For each Xᵢ ∉ E:
5.     sample xᵢ from P(xᵢ | paᵢ)
6.   score(sample) = ∏_{Xᵢ ∈ E} P(eᵢ | paᵢ)
7. Normalize scores. Then P(Y = y | E) ≈ Σ score(samples where Y = y)

Works well for likely evidence!
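The same toy network used for forward sampling (X1 → X2, illustrative CPT values, not from the tutorial) makes the contrast concrete: the evidence node is clamped instead of sampled, and every sample is kept but weighted.

```python
import random

def likelihood_weighing(n_samples, x2_obs=1, seed=0):
    """Estimate P(X1=1 | X2=x2_obs) with evidence clamping + sample weights."""
    rng = random.Random(seed)
    num, den = 0.0, 0.0
    for _ in range(n_samples):
        x1 = 1 if rng.random() < 0.6 else 0       # sample non-evidence node X1
        p_x2 = 0.9 if x1 == 1 else 0.2            # P(X2 = 1 | X1)
        w = p_x2 if x2_obs == 1 else 1.0 - p_x2   # score = P(evidence | parents)
        num += w * x1                             # X2 is clamped, never sampled
        den += w
    return num / den                              # normalized weighted average

est = likelihood_weighing(10000)
# exact posterior P(X1=1 | X2=1) ≈ 0.871; no samples are wasted on rejection
```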
![Page 63: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/63.jpg)
Gibbs Sampling (Geman and Geman, 1984)

Markov Chain Monte Carlo (MCMC): create a Markov chain of samples

1. For each Xᵢ ∈ E, assign xᵢ = eᵢ
2. For each Xᵢ ∉ E, assign xᵢ a random value
3. For sample = 1 to N:
4.   For each Xᵢ ∉ E:
5.     sample xᵢ from P(xᵢ | X \ {Xᵢ})

Advantage: guaranteed to converge to P(X)
Disadvantage: convergence may be slow
![Page 64: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/64.jpg)
Gibbs Sampling (cont’d)(Pearl, 1988)
ij chX
jjiiii paxPpaxPXXxP )|()|(}){\|(
:locally computed is }){\|( :Important ii XXxP
iX )()( jj chX
jiii pachpaXM
Markov blanket:
nodesother all oft independen is parents), their andchildren, (parents,
Given
iX
blanketMarkov
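A minimal Gibbs sketch on the same illustrative two-node network (X1 → X2, made-up CPTs, evidence X2 = 1). With a single free variable the Markov-blanket conditional is exact, so the chain mixes immediately; the point is only to show the mechanics of steps 1–5.

```python
import random

def gibbs(n_samples, burn_in=100, seed=0):
    """Gibbs chain over X1 with evidence X2 = 1 clamped."""
    rng = random.Random(seed)
    p_x1 = {0: 0.4, 1: 0.6}          # prior P(X1)
    p_x2_1 = {0: 0.2, 1: 0.9}        # P(X2 = 1 | X1)
    x1 = rng.choice([0, 1])          # step 2: random initial value
    total, kept = 0, 0
    for t in range(burn_in + n_samples):
        # Markov-blanket conditional: P(x1 | x2=1) ∝ P(x1) P(x2=1 | x1)
        w1 = p_x1[1] * p_x2_1[1]
        w0 = p_x1[0] * p_x2_1[0]
        x1 = 1 if rng.random() < w1 / (w1 + w0) else 0
        if t >= burn_in:             # discard burn-in, then collect the chain
            total += x1
            kept += 1
    return total / kept

est = gibbs(10000)
# the chain's stationary distribution is P(X1 | X2=1); exact value ≈ 0.871
```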
![Page 65: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/65.jpg)
“Road map”
Introduction: Bayesian networks Probabilistic inference
Exact inference Approximate inference
Local inference Stochastic simulations Variational approximations
Learning Bayesian Networks Summary
![Page 66: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/66.jpg)
Variational Approximations
Idea: variational transformation of CPDs simplifies inference

Advantages: computes upper and lower bounds on P(Y); usually faster than sampling techniques

Disadvantages: more complex and less general; must be derived for each particular form of CPD functions
![Page 67: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/67.jpg)
Variational bounds: example
log(x) ≤ λx − log(λ) − 1
log(x) = min_λ {λx − log(λ) − 1}

λ: variational parameter
This approach can be generalized for any concave (convex) function in order to compute its upper (lower) bounds: convex duality (Jaakkola and Jordan, 1997)
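The bound log(x) ≤ λx − log(λ) − 1 can be checked numerically. This small sketch (not from the tutorial) verifies the bound for several λ and confirms that minimizing over λ recovers log(x), with the optimum at λ = 1/x.

```python
import math

def upper_bound(x, lam):
    """Variational upper bound on log(x): lam*x - log(lam) - 1."""
    return lam * x - math.log(lam) - 1.0

x = 3.0
# the bound holds for every positive lambda ...
for lam in (0.1, 1.0 / 3.0, 1.0, 2.0):
    assert math.log(x) <= upper_bound(x, lam) + 1e-12
# ... and is tight at the minimizing lambda = 1/x, recovering log(x)
assert abs(upper_bound(x, 1.0 / x) - math.log(x)) < 1e-12
```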
![Page 68: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/68.jpg)
Convex duality (Jaakkola and Jordan, 1997)

1. If f(x) is concave, it has a dual (conjugate) function f*(λ) such that:

   f(x) = min_λ {λᵀx − f*(λ)}
   f*(λ) = min_x {λᵀx − f(x)}

   and we get upper bounds: f(x) ≤ λᵀx − f*(λ)

2. For convex f(x), we get lower bounds.
![Page 69: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/69.jpg)
Example: QMR-DT network
(Quick Medical Reference – Decision-Theoretic; Shwe et al., 1991)

600 diseases: d₁, d₂, ..., d_k
4000 findings: f₁, f₂, f₃, ..., fₙ

Noisy-OR model:

P(fᵢ = 0 | d) = (1 − q_{i0}) ∏_{j ∈ paᵢ} (1 − q_{ij})^{dⱼ}

P(fᵢ = 0 | d) = e^{−θ_{i0} − Σ_{j ∈ paᵢ} θ_{ij} dⱼ}, where θ_{ij} = −log(1 − q_{ij})
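The noisy-OR CPD is easy to compute directly. A minimal sketch (the q values are made up for illustration) that also confirms the product form and the exponential form from the slide agree:

```python
import math

def noisy_or_p_f0(d, q0, q):
    """P(f = 0 | d) = (1 - q0) * prod_j (1 - q[j]) ** d[j]."""
    p = 1.0 - q0                          # "leak" term q_i0
    for dj, qj in zip(d, q):
        p *= (1.0 - qj) ** dj             # each present disease may trigger f
    return p

q0, q = 0.05, [0.8, 0.3]
p_off = noisy_or_p_f0([1, 1], q0, q)      # both parent diseases present

# equivalent exponential form: exp(-theta_0 - sum_j theta_j * d_j)
theta0 = -math.log(1.0 - q0)
theta = [-math.log(1.0 - qj) for qj in q]
p_off_exp = math.exp(-theta0 - sum(t * dj for t, dj in zip(theta, [1, 1])))
assert abs(p_off - p_off_exp) < 1e-12
```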
![Page 70: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/70.jpg)
Inference in QMR-DT

Inference: P(d₁ | f) ∝ Σ_{d₂,...,d_k} P(d₁, f)

P(d, f) = P(f | d) P(d) = [∏_{fᵢ=1} P(fᵢ = 1 | d)] [∏_{fᵢ=0} P(fᵢ = 0 | d)] P(d)

Negative evidence is factorized:

∏_{fᵢ=0} P(fᵢ = 0 | d) = ∏_{fᵢ=0} e^{−θ_{i0} − Σ_{j ∈ paᵢ} θ_{ij} dⱼ}

Positive evidence "couples" the disease nodes:

∏_{fᵢ=1} P(fᵢ = 1 | d) = ∏_{fᵢ=1} [1 − e^{−θ_{i0} − Σ_{j ∈ paᵢ} θ_{ij} dⱼ}]

Inference complexity: O(exp(min{p, k})), where p = # of positive findings, k = max family size
(Heckerman, 1989 ("Quickscore"); Rish and Dechter, 1998)
![Page 71: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/71.jpg)
Variational approach to QMR-DT (Jaakkola and Jordan, 1997)

f(x) = ln(1 − e⁻ˣ) is concave and has a dual f*(λ) = −λ ln λ + (λ + 1) ln(λ + 1)

Then P(fᵢ = 1 | d) = 1 − e^{−θ_{i0} − Σ_{j ∈ paᵢ} θ_{ij} dⱼ} can be bounded by:

P(fᵢ = 1 | d) ≤ e^{λᵢ(θ_{i0} + Σ_{j ∈ paᵢ} θ_{ij} dⱼ) − f*(λᵢ)} = e^{λᵢ θ_{i0} − f*(λᵢ)} ∏_{j ∈ paᵢ} [e^{λᵢ θ_{ij}}]^{dⱼ}

The effect of positive evidence is now factorized (diseases are "decoupled")
![Page 72: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/72.jpg)
Variational approximations

Bounds on local CPDs yield a bound on the posterior

Two approaches: sequential and block
- Sequential: applies the variational transformation to (a subset of) nodes sequentially during inference, using a heuristic node ordering; then optimizes across variational parameters
- Block: selects in advance the nodes to be transformed, then selects variational parameters minimizing the KL-distance between the true and approximate posteriors
![Page 73: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/73.jpg)
Block approach

P(Y | E): exact posterior of Y given evidence E
Q(Y | E, λ): approximation after replacing some CPDs with their variational bounds

Find λ* = argmin_λ D(Q || P), where D(Q || P) is the Kullback-Leibler (KL) distance:

D(Q || P) = Σ_S Q(S) log [Q(S) / P(S)]
Inference in BN: summary

- Exact inference is often intractable => need approximations
- Approximation principles:
  - Approximating elimination: local inference, bounding the size of dependencies among variables (cliques in a problem's graph). Mini-buckets, IBP
  - Other approximations: stochastic simulations, variational techniques, etc.
- Further research:
  - Combining "orthogonal" approximation approaches
  - Better understanding of "what works well where": which approximation suits which problem structure
  - Other approximation paradigms (e.g., other ways of approximating probabilities, constraints, cost functions)
![Page 75: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/75.jpg)
“Road map”
Introduction: Bayesian networks Probabilistic inference
Exact inference Approximate inference
Learning Bayesian Networks Learning parameters Learning graph structure
Summary
![Page 76: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/76.jpg)
Why learn Bayesian networks?

- Combining domain expert knowledge with data
- Efficient representation and inference
- Incremental learning: P(H)
- Learning causal relationships
- Handling missing data, e.g.:
  <1.3  2.8  ??  0   1 >
  <9.7  0.6  8   14  18>
  <0.2  1.3  5   ??  ??>
  <1.3  2.8  ??  0   1 >
  <??   5.6  0   10  ??>
  ...
![Page 77: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/77.jpg)
Learning Bayesian Networks

Known graph (learn parameters):
- Complete data: parameter estimation (ML, MAP)
- Incomplete data: non-linear parametric optimization (gradient descent, EM)

Unknown graph (learn graph and parameters):
- Complete data: optimization (search in the space of graphs), Ĝ = argmax_G Score(G)
- Incomplete data: EM plus Multiple Imputation, structural EM, mixture models
![Page 78: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/78.jpg)
Learning Parameters: complete data

ML-estimate: max_Θ log P(D | Θ) (decomposable!)

MAP-estimate (Bayesian statistics): max_Θ log P(D | Θ) P(Θ)

Conjugate priors: Dirichlet Dir(θ_{x|pa_X} | α_{1,pa_X}, ..., α_{m,pa_X})

Multinomial counts N_{x,pa_X} for P(x | pa_X):

θ̂_{x|pa_X} (ML) = N_{x,pa_X} / Σ_x N_{x,pa_X}

θ̂_{x|pa_X} (MAP) = (N_{x,pa_X} + α_{x,pa_X}) / Σ_x (N_{x,pa_X} + α_{x,pa_X})

Equivalent sample size (prior knowledge)
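The ML and MAP formulas above are just (smoothed) count ratios. A minimal sketch for a single CPT P(X | Pa); the tiny dataset and the uniform Dirichlet prior are illustrative assumptions:

```python
from collections import Counter

# (parent value, x value) pairs; the dataset is made up for illustration
data = [("s", 1), ("s", 1), ("s", 0), ("ns", 0), ("ns", 0), ("ns", 1)]

counts = Counter(data)                        # N_{x, pa_X}
pa_totals = Counter(pa for pa, _ in data)     # N_{pa_X} = sum_x N_{x, pa_X}

def theta_ml(x, pa):
    """ML estimate: pure relative frequency."""
    return counts[(pa, x)] / pa_totals[pa]

def theta_map(x, pa, alpha=1.0, n_states=2):
    """MAP estimate with a Dirichlet(alpha, ..., alpha) prior;
    alpha plays the role of the equivalent sample size."""
    return (counts[(pa, x)] + alpha) / (pa_totals[pa] + alpha * n_states)

p_ml = theta_ml(1, "s")      # 2/3
p_map = theta_map(1, "s")    # (2+1)/(3+2) = 0.6, pulled toward the prior
```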
![Page 79: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/79.jpg)
Learning graph structure

Find Ĝ = argmax_G Score(G): NP-hard optimization

Heuristic search over local moves, e.g.: add S→B, delete S→B, reverse S→B

Complete data: local computations
Incomplete data (score non-decomposable): stochastic methods

Constraint-based methods: data impose independence relations (constraints)
![Page 80: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/80.jpg)
Learning BNs: incomplete data Learning parameters
EM algorithm [Lauritzen, 95] Gibbs Sampling [Heckerman, 96] Gradient Descent [Russell et al., 96]
Learning both structure and parameters Sum over missing values [Cooper &
Herskovits, 92; Cooper, 95] Monte-Carlo approaches [Heckerman, 96] Gaussian approximation [Heckerman, 96] Structural EM [Friedman, 98] EM and Multiple Imputation [Singh 97,98,00]
![Page 81: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/81.jpg)
Learning Parameters: incomplete data

Non-decomposable marginal likelihood (hidden nodes)

EM-algorithm: iterate until convergence
- Start with initial parameters Θ⁰ for the current model (G, Θ)
- Expectation: inference, e.g. P(S | X=0, D=1, C=0, B=1), gives expected counts
- Maximization: update parameters (ML, MAP)

Data (with missing values):
S X D C B
<? 0 1 0 1>
<1 1 ? 0 1>
<0 0 0 ? ?>
<? ? 0 ? 1>
...

Expected counts (completed data):
S X D C B
1 0 1 0 1
1 1 1 0 1
0 0 0 0 0
1 0 0 0 1
...
![Page 82: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/82.jpg)
Learning Parameters: incomplete data (Lauritzen, 95)

Complete-data log-likelihood is Σ_{ijk} N_{ijk} log θ_{ijk}

E step: compute E(N_{ijk} | Y_obs, Θ)

M step: compute θ_{ijk} = E(N_{ijk} | Y_obs, Θ) / E(N_{ij} | Y_obs, Θ)
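The E/M steps can be sketched on a deliberately tiny model (not the tutorial's network): X → Y, both binary, with some X values missing. To keep the sketch short, P(Y | X) is held fixed at made-up values and only θ = P(X = 1) is re-estimated.

```python
# (x, y) pairs; None marks a missing X value; all numbers are illustrative
data = [(1, 1), (1, 0), (None, 1), (None, 1), (0, 0), (None, 0)]
p_y1_given_x = {0: 0.3, 1: 0.8}              # P(Y = 1 | X), assumed known

theta = 0.5                                  # initial guess for P(X = 1)
for _ in range(50):                          # iterate until (near) convergence
    expected_n1 = 0.0                        # E step: expected count of X = 1
    for x, y in data:
        if x is not None:
            expected_n1 += x                 # observed case: count directly
        else:                                # missing case: weight by P(X=1 | y)
            l1 = theta * (p_y1_given_x[1] if y else 1 - p_y1_given_x[1])
            l0 = (1 - theta) * (p_y1_given_x[0] if y else 1 - p_y1_given_x[0])
            expected_n1 += l1 / (l1 + l0)
    theta = expected_n1 / len(data)          # M step: ML update from counts
```

Each iteration replaces the hard counts N with expected counts E(N | Y_obs, Θ) and then applies the usual ML update, exactly the E/M pattern above.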
![Page 83: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/83.jpg)
Learning structure: incomplete data
Depends on the type of missing data: missing completely at random, independent of anything else (MCAR), or missing based on the values of other variables (MAR)
While MCAR can be handled by decomposable scores, MAR cannot
For likelihood-based methods, there is no need to explicitly model the missing-data mechanism
Very few attempts at MAR: stochastic methods
![Page 84: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/84.jpg)
Learning structure: incomplete data
Approximate EM by using Multiple Imputation to yield an efficient Monte-Carlo method [Singh 97, 98, 00]:
- trade-off between performance & quality; learned network almost optimal
- approximates the complete-data log-likelihood function using Multiple Imputation
- yields a decomposable score, dependent only on each node & its parents
- converges to local maxima of the observed-data likelihood
![Page 85: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/85.jpg)
Learning structure: incomplete data
![Page 86: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/86.jpg)
Scoring functions: Minimum Description Length (MDL)

Learning ⇔ data compression

MDL(BN | D) = DL(Data | model) + DL(Model)
            = −log P(D | G, Θ) + (|Θ| / 2) log N

Other:
- MDL = −BIC (Bayesian Information Criterion)
- Bayesian score (BDe): asymptotically equivalent to MDL
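The MDL formula reduces, for a single multinomial variable with no parents, to a short computation. A minimal sketch on a made-up binary dataset:

```python
import math
from collections import Counter

def mdl_score(data):
    """MDL = -log P(D | theta_ML) + (|Theta| / 2) * log N, one variable."""
    n = len(data)
    counts = Counter(data)
    # ML log-likelihood: sum_x N_x * log(N_x / N)
    loglik = sum(c * math.log(c / n) for c in counts.values())
    n_params = len(counts) - 1          # free parameters of the multinomial
    return -loglik + 0.5 * n_params * math.log(n)

data = [1, 1, 1, 0, 1, 0, 1, 1]
score = mdl_score(data)   # lower is better: data-fit term + complexity penalty
```

For a full network the fit term decomposes over families (node plus parents), which is what makes MDL-based structure search with local moves practical.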
![Page 87: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/87.jpg)
Learning Structure plus Parameters

p(Y | D) = Σ_M p(Y | M, D) p(M | D)

The number of models is super-exponential.
Alternatives: Model Selection or Model Averaging
![Page 88: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/88.jpg)
Model Selection

Generally, choose a single model M*. Equivalent to saying P(M* | D) = 1:

p(Y | D) ≈ p(Y | M*, D)

The task is now to:
1) define a metric to decide which model is best
2) search for that model through the space of all models
![Page 89: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/89.jpg)
One Reasonable Score: Posterior Probability of a Structure

p(Sʰ | D) ∝ p(Sʰ) p(D | Sʰ)
          = p(Sʰ) ∫ p(D | θ_s, Sʰ) p(θ_s | Sʰ) dθ_s

where p(Sʰ) is the structure prior, p(θ_s | Sʰ) the parameter prior, and p(D | θ_s, Sʰ) the likelihood.
![Page 90: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/90.jpg)
Global and Local Predictive Scores [Spiegelhalter et al 93]

Global score:

log p(D | Sʰ) = Σ_{l=1}^{m} log p(x_l | x₁, ..., x_{l−1}, Sʰ)
              = log p(x₁ | Sʰ) + log p(x₂ | x₁, Sʰ) + log p(x₃ | x₁, x₂, Sʰ) + ...

Bayes' factor: p(D | Sʰ) / p(D | S₀ʰ)

Local is useful for diagnostic problems
![Page 91: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/91.jpg)
Local Predictive Score
Spiegelhalter et al. (1993)

pred(Sʰ) = Σ_{l=1}^{m} log p(y_l | x_l, d₁, ..., d_{l−1}, Sʰ)

Y: disease; X₁, X₂, ..., Xₙ: symptoms
![Page 92: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/92.jpg)
Exact computation of p(D | Sʰ) [Cooper & Herskovits, 92]

Assumptions:
- no missing data
- cases are independent, given the model
- uniform priors on parameters
- discrete variables

p(D | Sʰ) = ∏_{i=1}^{n} g(i, paᵢ)
![Page 93: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/93.jpg)
Bayesian Dirichlet Score
Cooper and Herskovits (1991)

p(D | Sʰ) = ∏_{i=1}^{n} ∏_{j=1}^{qᵢ} [Γ(αᵢⱼ) / Γ(αᵢⱼ + Nᵢⱼ)] ∏_{k=1}^{rᵢ} [Γ(αᵢⱼₖ + Nᵢⱼₖ) / Γ(αᵢⱼₖ)]

where:
Nᵢⱼₖ: # cases where Xᵢ = xᵢᵏ and Paᵢ = paᵢʲ
rᵢ: number of states of Xᵢ
qᵢ: number of instances of the parents of Xᵢ
Nᵢⱼ = Σ_{k=1}^{rᵢ} Nᵢⱼₖ,  αᵢⱼ = Σ_{k=1}^{rᵢ} αᵢⱼₖ
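One factor of the product above (a single parentless node, so one j per i) can be computed in log space with `math.lgamma`. A minimal sketch assuming a uniform prior αᵢⱼₖ = 1 and a made-up four-case dataset:

```python
import math
from collections import Counter

def bde_root_node(data, r):
    """log p(D | S) term for one parentless node with r states, alpha_ijk = 1."""
    n = len(data)
    counts = Counter(data)
    alpha_ij = float(r)                          # alpha_ij = sum_k alpha_ijk
    score = math.lgamma(alpha_ij) - math.lgamma(alpha_ij + n)
    for k in range(r):
        score += math.lgamma(1.0 + counts.get(k, 0)) - math.lgamma(1.0)
    return score

log_p = bde_root_node([1, 0, 1, 1], r=2)
# with uniform priors this reduces to Cooper & Herskovits' closed form
# (r-1)! * prod_k N_k! / (N + r - 1)! = 1! * 3! * 1! / 5! = 0.05
assert abs(log_p - math.log(0.05)) < 1e-12
```

Working in log space avoids the overflow that Γ(αᵢⱼ + Nᵢⱼ) would cause for realistic sample sizes.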
![Page 94: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/94.jpg)
Learning BNs without specifying an orderingLearning BNs without specifying an ordering
n! ordering; ordering greatly affects the quality of n! ordering; ordering greatly affects the quality of network learned.network learned.
use conditional independence tests, and d-use conditional independence tests, and d-separation to get an orderingseparation to get an ordering
[Singh & Valtorta’ 95]
![Page 95: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/95.jpg)
Learning BNs via the MDL principleLearning BNs via the MDL principle
Idea: best model is that which gives the most Idea: best model is that which gives the most compact representation of the datacompact representation of the data
So, encode the data using the model plus encode So, encode the data using the model plus encode the model. Minimize this.the model. Minimize this.
[Lam & Bacchus, 93]
![Page 96: A Tutorial on Inference and Learning in Bayesian Networks Irina Rish Moninder Singh IBM T.J.Watson Research Center rish,moninder@us.ibm.com](https://reader033.vdocument.in/reader033/viewer/2022061305/5513f2eb55034674748b5cf6/html5/thumbnails/96.jpg)
Learning BNs: summary

- Bayesian Networks: graphical probabilistic models
- Efficient representation and inference
- Expert knowledge + learning from data
- Learning:
  - parameters (parameter estimation, EM)
  - structure (optimization w/ score functions, e.g., MDL)
- Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
- Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.