Source: comp562fall18.web.unc.edu/files/2018/09/Comp562_Lec9.pdf
COMP 562: Introduction to Machine Learning
Lecture 9 : Information Theory, Bayesian Networks
Mahmoud Mostapha
Department of Computer Science
University of North Carolina at Chapel Hill
September 26, 2018
COMP562 - Lecture 9
Plan for today:
- Basics of Information Theory
- Conditional Independence
- Bayesian Networks
Information Theory
The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.
(Claude Shannon, 1948)
Information Theory tackles problems such as
- communication over noisy channels
- compression
Information Theory
Suppose we have a noisy channel that flips bits with probability f. Let x be the message sent over the noisy channel and y the received message.
Basics of Information Theory
Suppose you communicate to your friends using one of the four messages a, b, c, and d.
You might opt to encode messages as¹
- Enc(a) = 00
- Enc(b) = 01
- Enc(c) = 10
- Enc(d) = 11
Note that the length of each message is 2 bits (notation: |Enc(a)| = 2).
Regardless of the message you want to communicate, we always have to transmit 2 bits.
¹ This table is called a codebook.
Expected/average message length
The expected length of a message is

E_p[|Enc(M)|] = Σ_{m ∈ {a,b,c,d}} p(M = m) |Enc(m)|

Since |Enc(m)| = 2 for all m, on average each message uses 2 bits.
Basics of Information Theory
Suppose you knew something extra: the probability that a particular message will need to be transmitted.

p(M) = 1/2 for M = a, 1/4 for M = b, 1/8 for M = c, 1/8 for M = d
Could you then take advantage of this information?
Basics of Information Theory
Use short codewords for frequent messages and longer codewords for infrequent messages:
- p(M = a) = 1/2, Enc(a) = 0
- p(M = b) = 1/4, Enc(b) = 10
- p(M = c) = 1/8, Enc(c) = 110
- p(M = d) = 1/8, Enc(d) = 111
and the expected message length is

1/2 · 1 + 1/4 · 2 + 1/8 · 3 + 1/8 · 3 = 1.75.

Hence, on average we save 0.25 bits per message using the above codebook.
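The saving can be checked directly in a few lines of Python (a minimal sketch; the dictionaries just transcribe the codebook above):

```python
import math

# Message probabilities and the variable-length codebook from the slide.
p = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
enc = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Expected message length: E_p[|Enc(M)|] = sum_m p(m) * |Enc(m)|
expected_len = sum(p[m] * len(enc[m]) for m in p)
print(expected_len)  # 1.75 bits, 0.25 bits shorter than the fixed 2-bit code
```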
Entropy
You can show that the codeword length assignment that minimizes the expected message length is |Enc(m)| = − log2 p(m).

Given a random variable M distributed according to p, the entropy is

H(M) = Σ_m p(M = m) [− log2 p(M = m)]

The entropy can be interpreted as the number of bits per message spent in communication using the optimal codebook.

For simplicity, we may write H(p) when it is clear which random variable we are considering.
Conditional entropy
H(X) = Σ_x p(X = x) [− log2 p(X = x)]

If we have access to extra information Y, then we can compute the conditional entropy

H(X|Y) = Σ_y Σ_x p(X = x, Y = y) [− log2 p(X = x | Y = y)]
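Conditioning on Y can only reduce the entropy of X. A small numerical check (the joint table below is made up for illustration; it makes X and Y correlated):

```python
import math

# Joint distribution p(x, y) over two binary variables (made-up numbers).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal p(y)
p_y = {y: sum(p for (x2, y2), p in p_xy.items() if y2 == y) for y in (0, 1)}

# H(X|Y) = sum_{x,y} p(x, y) * [-log2 p(x|y)]
h_x_given_y = sum(pxy * -math.log2(pxy / p_y[y]) for (x, y), pxy in p_xy.items())
print(round(h_x_given_y, 3))  # ≈ 0.722, less than H(X) = 1 bit
```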
Entropy – characteristics
H(p) = Σ_m p(m) [− log2 p(m)]

Entropy is always non-negative.

The more uniform the distribution, the higher the entropy (I need 2 bits for 4 messages with prob. 1/4 each).

The less uniform the distribution, the lower the entropy (I need 0 bits if I always send the same message).
Entropy
[Figure: entropy (in bits) of the Bernoulli distribution [p, 1 − p] as a function of p.]

Maximized for the uniform distribution p = 1/2:

H([1/2, 1/2]) = 1/2 (− log2 1/2) + 1/2 (− log2 1/2) = 1/2 log2 2 + 1/2 log2 2 = 1
Entropy
[Figure: entropy of the distribution [p1, p2, 1 − p1 − p2] as a function of p1 and p2.]

H([p1, p2, 1 − p1 − p2]) = −p1 log2 p1 − p2 log2 p2 − (1 − p1 − p2) log2(1 − p1 − p2)

For which p1 and p2 is the entropy largest? What is the value of the entropy for that distribution?
Entropy recap
- Entropy is evaluated on probability distributions

H(p) = − Σ_m p(M = m) log2 p(M = m)

- Given a distribution of messages, entropy tells us the least number of bits we would need to encode messages to communicate efficiently.
- The length of a code for a message should be − log2 p(m). Sometimes this cannot be achieved.²
- Entropy is maximized for uniform distributions

H([1/K, ..., 1/K]) = log2 K

² If p(m) ≠ 2^(−c) for an integer c, we get fractional code lengths.
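A small entropy helper makes these facts concrete (a minimal sketch; the distributions are the ones used earlier on these slides):

```python
import math

def entropy(probs):
    """H(p) = -sum_m p(m) log2 p(m); terms with p(m) = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75 bits
print(entropy([1/4] * 4))             # 2.0 = log2(4): uniform maximizes entropy
print(entropy([1.0, 0.0, 0.0, 0.0]))  # 0.0: a certain message needs no bits
```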
Cross-entropy
Suppose we were given message probabilities q.
We build our codebook based on q, and the optimal average message length is H(q).

But it turns out that q is incorrect and the true message probabilities are p.

Cross-entropy is the average message length under these circumstances:

H(p, q) = − Σ_m p(m) log2 q(m)
Cross-entropy
Cross-entropy

H(p, q) = − Σ_m p(m) log2 q(m)

is always greater than or equal to the entropy

H(p) = − Σ_m p(m) log2 p(m).

Using the wrong codebook always incurs a cost in communication: we end up assigning codes that are too long to messages that are actually frequent.
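This can be seen numerically with the four-message example from before: building a codebook for a uniform q (a fixed 2-bit code) when messages actually follow p costs exactly the 0.25 bits we saved earlier.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Average code length when codes are built for q but messages follow p.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/4, 1/8, 1/8]   # true message probabilities
q = [1/4, 1/4, 1/4, 1/4]   # assumed (wrong) probabilities -> fixed 2-bit code

print(entropy(p))           # 1.75
print(cross_entropy(p, q))  # 2.0: the wrong codebook costs 0.25 bits/message
```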
Kullback-Leibler divergence
Had we known the true distribution p, our average message length would be

H(p) = − Σ_m p(m) log2 p(m)

We didn't, and we are now on average using longer messages:

H(p, q) = − Σ_m p(m) log2 q(m).

How much longer?

H(p, q) − H(p) = − Σ_m p(m) log2 q(m) + Σ_m p(m) log2 p(m)

This difference is called the Kullback-Leibler divergence

KL(p||q) = H(p, q) − H(p) = − Σ_m p(m) log2 q(m) + Σ_m p(m) log2 p(m)
KL divergence
For discrete distributions:

KL(p||q) = Σ_x p(x) log (p(x) / q(x))

For continuous pdfs:

KL(p||q) = ∫_x p(x) log (p(x) / q(x)) dx
A couple of observations about KL-divergence
1. KL(p||q) ≥ 0
2. KL(p||q) = 0 if and only if p = q
3. It is not symmetric: KL(p||q) ≠ KL(q||p)
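All three observations can be verified on small examples (a minimal sketch; q is an arbitrary second distribution chosen for illustration):

```python
import math

def kl(p, q):
    # KL(p||q) = sum_x p(x) log2(p(x)/q(x)), with 0 log 0 = 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/4, 1/8, 1/8]
q = [0.7, 0.1, 0.1, 0.1]   # arbitrary second distribution

print(kl(p, p))            # 0.0: divergence of a distribution from itself
print(kl(p, q), kl(q, p))  # both positive, and not equal: KL is not symmetric
```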
Summary
- Entropy as a measure of the number of bits needed to communicate efficiently.
- KL-divergence as the number of bits that could be saved by using the right distribution.
- KL-divergence as a distance between distributions.
Maximum likelihood and KL minimization
Delta function or indicator function:

[x] = 1 if x is true, 0 otherwise

Data/empirical distribution:

f(x) = (1/N) Σ_{i=1}^N [x_i = x]

Importantly:

∫_x f(x) g(x) dx = ∫_x ( (1/N) Σ_{i=1}^N [x_i = x] ) g(x) dx = (1/N) Σ_{i=1}^N g(x_i)
Maximizing likelihood is equivalent to minimizing KL
ALL(θ) = Σ_{i=1}^N (1/N) log p(x_i | θ)

KL divergence between the empirical distribution f(x) and p(x|θ):

KL(f||p) = ∫_x f(x) log (f(x) / p(x)) dx
         = ∫_x ( (1/N) Σ_{i=1}^N [x_i = x] ) log (f(x) / p(x)) dx
         = Σ_{i=1}^N (1/N) log (f(x_i) / p(x_i))
         = − Σ_{i=1}^N (1/N) log p(x_i)  +  Σ_{i=1}^N (1/N) log f(x_i)

The first term is −ALL(θ); the second term (with f(x_i) = 1/N for distinct samples) is log(1/N), a constant in θ.
Maximizing likelihood is equivalent to minimizing KL
KL(f||p) = −ALL(θ) + const.

and hence

argmin_θ KL(f||p) = argmax_θ ALL(θ) = argmax_θ LL(θ)
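This equivalence can be checked numerically for a Bernoulli model (a sketch with a made-up sample of three heads and one tail; natural log is used, which does not change the arg max):

```python
import math

# Empirical distribution of N = 4 coin flips: three heads, one tail.
f = {1: 3/4, 0: 1/4}

def avg_log_lik(theta):
    # ALL(theta) = sum_x f(x) log p(x|theta) for a Bernoulli(theta) model
    return f[1] * math.log(theta) + f[0] * math.log(1 - theta)

def kl_from_empirical(theta):
    # KL(f||p) = sum_x f(x) log(f(x) / p(x|theta))
    return f[1] * math.log(f[1] / theta) + f[0] * math.log(f[0] / (1 - theta))

thetas = [i / 100 for i in range(1, 100)]
best_ll = max(thetas, key=avg_log_lik)
best_kl = min(thetas, key=kl_from_empirical)
print(best_ll, best_kl)  # both 0.75: maximizing ALL == minimizing KL
```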
Mutual information
Mutual information between random variables is defined as

I(X;Y) = KL(p(X,Y) || p(X)p(Y)) = Σ_x Σ_y p(x, y) log (p(x, y) / (p(x) p(y)))

Mutual information can be expressed as a difference of entropy and conditional entropy:

I(X;Y) = H(X) − H(X|Y)

Note that mutual information is symmetric:

I(X;Y) = I(Y;X) = H(Y) − H(Y|X)

We will use mutual information in your homework to learn Bayesian networks.
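Both forms of I(X;Y) can be computed and compared on a small joint table (the numbers are made up for illustration; the same correlated joint as in the conditional-entropy example):

```python
import math

# Joint distribution over two correlated binary variables (made-up numbers).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (x2, y), p in p_xy.items() if x2 == x) for x in (0, 1)}
p_y = {y: sum(p for (x, y2), p in p_xy.items() if y2 == y) for y in (0, 1)}

# I(X;Y) = KL(p(X,Y) || p(X)p(Y))
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

# I(X;Y) = H(X) - H(X|Y)
h_x = -sum(p * math.log2(p) for p in p_x.values())
h_x_given_y = sum(p * -math.log2(p / p_y[y]) for (x, y), p in p_xy.items())
print(round(mi, 3), round(h_x - h_x_given_y, 3))  # both ≈ 0.278
```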
Information theory
Relevant concepts:
- Entropy
- Cross-entropy
- Kullback-Leibler divergence
- Mutual information
Graphical models
Suppose we observe multiple correlated variables, such as words in a document or pixels in an image.
- How can we compactly represent the joint distribution p(x|θ)?
- How can we use this distribution to infer one set of variables given another in a reasonable amount of computation time?
- How can we learn the parameters of this distribution with a reasonable amount of data?

Graphical models provide an intuitive way of representing and visualising the relationships between many variables.
Representing knowledge through graphical models
- Nodes correspond to random variables
- Edges represent statistical dependencies between the variables

A graphical model is a way to represent a joint distribution by making conditional independence assumptions.

Graphical models = statistics × graph theory × computer science
Conditional independence
Marginal independence, which you are already familiar with:

p(X|Y) = p(X)

p(X,Y) = p(X)p(Y)

Conditional independence:

p(X|Y,Z) = p(X|Z)

p(X,Y|Z) = p(X|Z)p(Y|Z)

The shorthand for this relationship is

X ⊥ Y | Z
Note that the relationship is symmetric.
Examples of conditional independence
Shoe size ⊥ Gray hair | Age
Temperature inside ⊥ Temperature outside | Air conditioning is working
Pepsi or Coke ⊥ Sweetness | Restaurant chain
Federal funds rate ⊥ State of economy | Federal Reserve meeting notes
Dice roll outcome ⊥ Previous dice rolls | Dice is not loaded
Benefits of capturing conditional independence
The obvious benefit is in compact representation.

Suppose we have a probability distribution p(X,Y,Z) and the random variables X, Y, Z can each assume 5 states.

To represent the distribution we need to store 5³ probabilities.³

If we know that

X ⊥ Z | Y

then we can write

p(X,Y,Z) = p(X|Y,Z)p(Y|Z)p(Z) = p(X|Y)p(Y|Z)p(Z)

and instead of storing one large table we store 3 significantly smaller tables.⁴

³ OK, 5³ − 1 = 124, because the probabilities need to sum to 1.
⁴ 124 vs 44 entries, alternatively 125 vs. 55.
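The counting in the footnotes can be reproduced directly:

```python
# Number of free parameters for three variables with K = 5 states each.
K = 5

full_joint = K**3 - 1                 # p(X,Y,Z) minus the sum-to-one constraint
factored = K*(K-1) + K*(K-1) + (K-1)  # p(X|Y), p(Y|Z), p(Z): free parameters
print(full_joint, factored)           # 124 vs 44
```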
Benefits of capturing conditional independence
We can also capture such independencies in graphical form.

[Graph: Z → Y → X]

p(X,Y,Z) = p(X|Y)p(Y|Z)p(Z)
Bayesian networks
A directed graphical model is a graphical model whose graph is a directed acyclic graph (DAG). It is also known as a Bayesian network, belief network, or causal network.

In DAGs the nodes can be ordered such that parents come before children. This is called a topological ordering.
Specifying a graphical model
Each node corresponds to a random variable.

[Graph over nodes X1, ..., X7.]

Each node has a conditional probability distribution associated with it,

p(Xi | X_pa(i))

where pa(i) is the list of parent nodes of node i, e.g.

p(X7 | X5, X6)
Specifying a graphical model
[Graph over nodes X1, ..., X7, as above.]

This graphical model specifies a joint distribution

p(X1, X2, X3, X4, X5, X6, X7) = Π_i p(Xi | X_pa(i))
= p(X1) p(X2|X1) p(X3|X2) p(X4|X2) p(X5|X3) p(X6|X4) p(X7|X5,X6)
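Given the parent sets and the conditional probability tables (CPTs), evaluating the joint is just a product of local factors. A minimal sketch, assuming binary variables; all CPT numbers below are made up for illustration, not taken from the lecture:

```python
# pa maps each node to its parents, matching the factorization above.
pa = {1: (), 2: (1,), 3: (2,), 4: (2,), 5: (3,), 6: (4,), 7: (5, 6)}

# cpt[i][parent_values] = p(X_i = 1 | parents); all numbers are illustrative.
cpt = {
    1: {(): 0.6},
    2: {(0,): 0.3, (1,): 0.8},
    3: {(0,): 0.5, (1,): 0.2},
    4: {(0,): 0.1, (1,): 0.7},
    5: {(0,): 0.4, (1,): 0.9},
    6: {(0,): 0.5, (1,): 0.5},
    7: {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.95},
}

def joint(x):
    """p(x1..x7) = prod_i p(x_i | x_pa(i)) for an assignment dict x."""
    prob = 1.0
    for i, parents in pa.items():
        p1 = cpt[i][tuple(x[j] for j in parents)]
        prob *= p1 if x[i] == 1 else 1.0 - p1
    return prob

x = {1: 1, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 1}
print(joint(x))
```

Summing `joint` over all 2⁷ assignments gives 1, confirming the factorization defines a valid distribution.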
Another example of a graphical model
[Graph: a tree with root X1 and children X2 and X3; X2 has children X4 and X5, X3 has children X6 and X7.]

p(X1, X2, X3, X4, X5, X6, X7) = Π_i p(Xi | X_pa(i))
= p(X1) p(X2|X1) p(X3|X1) p(X4|X2) p(X5|X2) p(X6|X3) p(X7|X3)
Example: naive Bayes classifiers as Bayesian networks
A naive Bayes classifier represented as a directed graphical model. We assume there are D = 4 features, for simplicity. Shaded nodes are observed, unshaded nodes are hidden.

p(y, x) = p(y) Π_{j=1}^D p(x_j | y)

A tree-augmented naive Bayes (TAN) classifier for D = 4 features. In general, the tree topology can change depending on the value of y.
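The naive Bayes factorization turns directly into a toy classifier via Bayes' rule; the class prior and per-feature probabilities below are made-up numbers for illustration:

```python
# Naive Bayes joint p(y, x) = p(y) * prod_j p(x_j | y), with made-up
# parameters for a binary class and D = 4 binary features.
p_y = {0: 0.5, 1: 0.5}
p_xj_given_y = {  # p(x_j = 1 | y), one entry per feature j
    0: [0.1, 0.2, 0.3, 0.4],
    1: [0.9, 0.8, 0.7, 0.6],
}

def joint(y, x):
    prob = p_y[y]
    for j, xj in enumerate(x):
        pj = p_xj_given_y[y][j]
        prob *= pj if xj == 1 else 1.0 - pj
    return prob

x = [1, 1, 0, 1]
# Posterior over the class by Bayes' rule: p(y|x) proportional to p(y, x)
z = joint(0, x) + joint(1, x)
print(joint(1, x) / z)  # class 1 is far more likely for this x
```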
Example: Bayesian network for medical diagnosis
Asia Bayesian Network (Interactive Demo)
Example: Bayesian network on Instagram hashtags
We downloaded images and captions from Instagram that were tagged #buterfly.

These captions were converted into binary vectors of hashtag presence/absence.

A Bayesian network was constructed on these data.

For HW2, you will do that and much more on a different Instagram dataset.
Today
- Information Theory
- Bayesian Networks