Bayesian Networks – (Structure) Learning
Machine Learning – CSE446 Carlos Guestrin University of Washington
May 31, 2013 ©Carlos Guestrin 2005-2013 1
Review
- Bayesian Networks
  - Compact representation for probability distributions
  - Exponential reduction in number of parameters
- Fast probabilistic inference
  - As shown in demo examples
  - Compute P(X|e)
- Today
  - Learn BN structure

[Figure: example BN over Flu, Allergy, Sinus, Headache, Nose]
Learning Bayes nets
- Data: x(1), …, x(m)
- Learn: structure and parameters
- CPTs: P(Xi | PaXi)
Learning the CPTs
- Data: x(1), …, x(m)
- For each discrete variable Xi, the maximum-likelihood estimate is the empirical conditional frequency:

  \hat P(X_i = a \mid \mathbf{Pa}_{X_i} = \mathbf{b}) = \frac{\mathrm{Count}(X_i = a, \mathbf{Pa}_{X_i} = \mathbf{b})}{\mathrm{Count}(\mathbf{Pa}_{X_i} = \mathbf{b})}
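The counting step can be sketched in a few lines; this is a minimal illustration of frequency counting (the function `mle_cpt` and its interface are my own, not from the lecture):

```python
from collections import Counter

def mle_cpt(samples, child, parents):
    """Maximum-likelihood CPT P(child | parents) by counting.

    samples: list of tuples, one discrete value per variable.
    Returns a dict mapping (parent_values, child_value) -> probability.
    """
    joint = Counter()   # counts of (parent assignment, child value)
    marg = Counter()    # counts of parent assignment alone
    for s in samples:
        pa = tuple(s[p] for p in parents)
        joint[(pa, s[child])] += 1
        marg[pa] += 1
    return {(pa, v): c / marg[pa] for (pa, v), c in joint.items()}

# Toy data over (X0, X1): P(X1 = 1 | X0 = 0) should come out 1/2.
samples = [(0, 0), (0, 1), (1, 1), (1, 1)]
cpt = mle_cpt(samples, child=1, parents=[0])
```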
Information-theoretic interpretation of maximum likelihood 1
- Given structure, log likelihood of data:

  \log \hat P(\mathcal{D} \mid \theta_G, G) = \sum_{j=1}^{m} \log \hat P(\mathbf{x}^{(j)}) = \sum_{j=1}^{m} \sum_i \log \hat P\big(x_i^{(j)} \mid \mathbf{pa}_{X_i}^{(j)}\big)
Information-theoretic interpretation of maximum likelihood 2
- Given structure, log likelihood of data, regrouped by variable and by assignment:

  \log \hat P(\mathcal{D} \mid \theta_G, G) = \sum_i \sum_{x_i, \mathbf{pa}_{X_i}} \mathrm{Count}(x_i, \mathbf{pa}_{X_i}) \log \hat P(x_i \mid \mathbf{pa}_{X_i}) = m \sum_i \sum_{x_i, \mathbf{pa}_{X_i}} \hat P(x_i, \mathbf{pa}_{X_i}) \log \hat P(x_i \mid \mathbf{pa}_{X_i})
Information-theoretic interpretation of maximum likelihood 3
- Given structure, log likelihood of data, in terms of empirical mutual information and entropy:

  \log \hat P(\mathcal{D} \mid \theta_G, G) = m \sum_i \hat I(X_i; \mathbf{Pa}_{X_i}) - m \sum_i \hat H(X_i)

- The entropy term does not depend on the graph, so maximizing likelihood over structures means choosing parents that maximize \sum_i \hat I(X_i; \mathbf{Pa}_{X_i})
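This identity can be checked numerically. The snippet below is my own illustration (not from the slides): for a two-variable network X0 → X1 it computes the log likelihood both directly from the MLE CPTs and via the mutual-information form, and the two agree.

```python
import numpy as np

def entropy(p):
    """Empirical entropy (in nats) of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Two-variable network X0 -> X1, binary toy data.
rng = np.random.default_rng(1)
m = 500
x0 = rng.integers(0, 2, m)
x1 = np.where(rng.random(m) < 0.8, x0, 1 - x0)  # X1 copies X0 w.p. 0.8

# Direct log likelihood with MLE CPTs: sum_j [log P(x0) + log P(x1 | x0)]
p0 = np.array([np.mean(x0 == 0), np.mean(x0 == 1)])
cpt = np.array([[np.mean(x1[x0 == a] == b) for b in (0, 1)] for a in (0, 1)])
ll_direct = np.sum(np.log(p0[x0])) + np.sum(np.log(cpt[x0, x1]))

# Information-theoretic form: m * I(X0;X1) - m * (H(X0) + H(X1))
joint = np.array([[np.mean((x0 == a) & (x1 == b)) for b in (0, 1)]
                  for a in (0, 1)])
p1 = joint.sum(axis=0)
mi = np.sum(joint * np.log(joint / np.outer(p0, p1)))
ll_info = m * mi - m * (entropy(p0) + entropy(p1))
```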
Decomposable score
- Log data likelihood
- Decomposable score:
  - Decomposes over families in BN (node and its parents)
  - Will lead to significant computational efficiency!!!
  - Score(G : D) = ∑i FamScore(Xi | PaXi : D)
How many trees are there? By Cayley's formula, n^{n−2} labeled trees on n nodes — super-exponentially many. Nonetheless, an efficient optimal algorithm finds the best tree
Scoring a tree 1: equivalent trees — since Î(Xi; Xj) is symmetric, the score depends only on the undirected skeleton: all directed trees with the same skeleton (any choice of root) score identically
Scoring a tree 2: similar trees — for a tree, each node has at most one parent, so the score is m ∑_{(i,j)∈T} Î(Xi; Xj) − m ∑i Ĥ(Xi); the entropy term is the same for every tree, so trees are compared by the total mutual information on their edges
Chow-Liu tree learning algorithm 1
- For each pair of variables Xi, Xj
  - Compute empirical distribution: \hat P(x_i, x_j) = \mathrm{Count}(x_i, x_j) / m
  - Compute mutual information: \hat I(X_i; X_j) = \sum_{x_i, x_j} \hat P(x_i, x_j) \log \frac{\hat P(x_i, x_j)}{\hat P(x_i)\,\hat P(x_j)}
- Define a graph
  - Nodes X1, …, Xn
  - Edge (i, j) gets weight \hat I(X_i; X_j)
Chow-Liu tree learning algorithm 2
- Optimal tree BN
  - Compute maximum weight spanning tree
  - Directions in BN: pick any node as root; breadth-first search defines directions
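The two steps can be sketched end to end. This is my own minimal implementation (function names are assumptions, not from the lecture), using Kruskal's algorithm with a union-find for the maximum-weight spanning tree:

```python
import numpy as np
from itertools import combinations

def empirical_mi(x, y):
    """Empirical mutual information (in nats) between two discrete arrays."""
    mi = 0.0
    for a in np.unique(x):
        p_a = np.mean(x == a)
        for b in np.unique(y):
            p_b = np.mean(y == b)
            p_ab = np.mean((x == a) & (y == b))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(X):
    """Skeleton of the Chow-Liu tree: maximum-weight spanning tree
    under pairwise mutual-information edge weights (Kruskal)."""
    n = X.shape[1]
    weights = sorted(((empirical_mi(X[:, i], X[:, j]), i, j)
                      for i, j in combinations(range(n), 2)), reverse=True)
    parent = list(range(n))          # union-find forest
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = []
    for w, i, j in weights:          # heaviest edges first
        ri, rj = find(i), find(j)
        if ri != rj:                 # skip edges that would close a cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges

# Toy chain X0 -> X1 -> X2: the learned skeleton should be {0-1, 1-2}.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 2000)
x1 = np.where(rng.random(2000) < 0.9, x0, 1 - x0)
x2 = np.where(rng.random(2000) < 0.9, x1, 1 - x1)
tree = chow_liu_tree(np.column_stack([x0, x1, x2]))
```

Edge directions then follow by picking any root and orienting edges away from it with breadth-first search, as the slide says.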
Structure learning for general graphs
- In a tree, a node only has one parent
- Theorem: the problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d > 1
- Most structure learning approaches use heuristics
  - (Quickly) describe the two simplest heuristics
Learn BN structure using local search
- Starting point: Chow-Liu tree
- Local search, possible moves:
  - Add edge
  - Delete edge
  - Invert edge
- Score using BIC
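The BIC score referenced above takes the standard form, penalizing the number of free parameters Dim(G):

```latex
\mathrm{Score}_{\mathrm{BIC}}(G : \mathcal{D})
  = \log \hat P(\mathcal{D} \mid \hat\theta, G)
  - \frac{\log m}{2}\,\mathrm{Dim}(G)
```

Like the likelihood, this score decomposes over families, so each local move only requires rescoring the families whose parent sets changed.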
Learn Graphical Model Structure using LASSO
- Graph structure is about selecting parents
- With no independence assumptions, each CPT depends on all other variables
- With independence assumptions, it depends only on a few key variables
- One approach for structure learning: sparse (L1-regularized) logistic regression!
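One way to realize this with off-the-shelf tools: fit an L1-penalized logistic regression of each variable on all the others, and keep the variables with nonzero coefficients as its candidate parents/neighbors. The sketch below is my own (the function name and the regularization strength C are assumptions, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_parents(X, target, C=0.1):
    """Candidate parents of one binary variable via L1-regularized
    logistic regression on all other variables.  Returns the nonzero
    coefficients keyed by column index.  C=0.1 is an assumed default."""
    others = [j for j in range(X.shape[1]) if j != target]
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X[:, others], X[:, target])
    return {j: float(w) for j, w in zip(others, clf.coef_.ravel())
            if abs(w) > 1e-6}

# Toy data: X2 is a noisy copy of X0 and independent of X1,
# so the L1 fit should select X0 and drop X1.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 2000)
x1 = rng.integers(0, 2, 2000)
x2 = np.where(rng.random(2000) < 0.9, x0, 1 - x0)
parents = select_parents(np.column_stack([x0, x1, x2]), target=2)
```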
What you need to know about learning BN structures
- Decomposable scores
  - Maximum likelihood
  - Information-theoretic interpretation
- Best tree (Chow-Liu)
- Beyond tree-like models is NP-hard
- Use heuristics, such as:
  - Local search
  - LASSO
Markov Decision Processes (MDPs)
Machine Learning – CSE446 Carlos Guestrin University of Washington
June 3, 2013 ©Carlos Guestrin 2005-2013 18
Reinforcement Learning
training by feedback
Learning to act
- Reinforcement learning
- An agent
  - Makes sensor observations
  - Must select action
  - Receives rewards
    - positive for "good" states
    - negative for "bad" states

[Ng et al. '05]
Markov Decision Process (MDP) Representation
- State space:
  - Joint state x of entire system
- Action space:
  - Joint action a = {a1, …, an} for all agents
- Reward function:
  - Total reward R(x,a)
    - sometimes reward can depend on action
- Transition model:
  - Dynamics of the entire system P(x'|x,a)
Discount Factors
People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is:

(reward now) + γ (reward in 1 time step) + γ² (reward in 2 time steps) + γ³ (reward in 3 time steps) + … (infinite sum)
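As a quick sanity check (my own snippet, not from the slides): for a constant reward r and γ < 1, the infinite sum is the geometric series r/(1 − γ).

```python
def discounted_sum(rewards, gamma=0.9):
    """r0 + gamma*r1 + gamma^2*r2 + ... over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Constant reward 10 forever with gamma = 0.9: the limit is 10 / 0.1 = 100.
approx = discounted_sum([10.0] * 1000, gamma=0.9)
```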
The Academic Life
Define:
VA = expected discounted future rewards starting in state A
VB = expected discounted future rewards starting in state B
VT = expected discounted future rewards starting in state T
VS = expected discounted future rewards starting in state S
VD = expected discounted future rewards starting in state D
How do we compute VA, VB, VT, VS, VD ?
[Figure: Markov chain over five states — A. Assistant Prof (reward 20), B. Assoc. Prof (reward 60), T. Tenured Prof (reward 400), S. On the Street (reward 10), D. Dead (reward 0) — with discount factor γ = 0.9 and transition probabilities (0.7, 0.6, 0.3, 0.2, …) labeled on the edges]
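The V's satisfy the linear system V = R + γPV, so they can be computed with a single linear solve. The rewards and γ below come from the slide; the transition matrix is an assumed illustration only, since the figure's exact edge probabilities did not survive the transcript:

```python
import numpy as np

# States: A, B, T, S, D with the slide's rewards and gamma = 0.9.
R = np.array([20.0, 60.0, 400.0, 10.0, 0.0])
gamma = 0.9
# ASSUMED transition matrix, for illustration only -- the slide's
# exact edge probabilities are not recoverable from the transcript.
P = np.array([
    [0.6, 0.2, 0.0, 0.2, 0.0],  # A: stay, promoted to B, or out to S
    [0.0, 0.6, 0.2, 0.2, 0.0],  # B: stay, tenured, or out to S
    [0.0, 0.0, 0.7, 0.0, 0.3],  # T: stay or die
    [0.0, 0.0, 0.0, 0.7, 0.3],  # S: stay or die
    [0.0, 0.0, 0.0, 0.0, 1.0],  # D: absorbing
])
# V = R + gamma * P V   <=>   (I - gamma * P) V = R
V = np.linalg.solve(np.eye(5) - gamma * P, R)
```

Note that V_D = 0 + γ·V_D forces V_D = 0, a useful check on any solution.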