TRANSCRIPT
Entropy & Information Jilles Vreeken
29 May 2015
Question of the day
What is
information?
(and what do talking drums have to do with it?)
Bits and Pieces
What are: information, a bit, entropy, mutual information, divergence, information theory, …
Information Theory
Field founded by Claude Shannon in 1948, ‘A Mathematical Theory of Communication’
a branch of statistics that is essentially about
uncertainty in communication
not what you say, but what you could say
The Big Insight
Communication is a series of discrete messages
each message reduces
the uncertainty of the recipient of a) the series and b) that message
by how much
is the amount of information
Uncertainty
Shannon showed that uncertainty can be quantified, linking physical entropy to messages
and defined the entropy of
a discrete random variable 𝑋 as
H(X) = -\sum_{i} P(x_i) \log P(x_i)
Optimal prefix-codes
Shannon showed that uncertainty can be quantified, linking physical entropy to messages
A (key) result of Shannon entropy is that
-\log_2 P(x_i)
gives the length in bits of the optimal prefix code
for a message 𝑥𝑖
Codes and Lengths
A code 𝐶 maps a set of messages 𝑋 to a set of code words 𝑌
L_C(\cdot) is a code length function for C,
with L_C(x) = |C(x)| the length in bits of the code word y \in Y that C assigns to symbol x \in X.
Efficiency
Not all codes are created equal. Let C_1 and C_2 be two codes for a set of messages X.
1. We call C_1 more efficient than C_2 if for all x \in X, L_1(x) \le L_2(x), while for at least one x \in X, L_1(x) < L_2(x).
2. We call a code C for set X complete if there does not exist a code C' that is more efficient than C.
A code is complete when it does not waste any bits
The Most Important Slide
We only care about code lengths
The Most Important Slide
Actual code words are of no interest to us whatsoever.
The Most Important Slide
Our goal is to measure complexity,
not to instantiate an actual compressor
My First Code
Let us consider a sequence S over a discrete alphabet X = \{x_1, x_2, \ldots, x_m\}.
As code C for S we can instantiate a block code, identifying the value of s_i \in S by an index over X, which requires a constant number of \log_2 |X| bits per message in S, i.e., L(x_i) = \log_2 |X|
We can always instantiate a prefix-free code with code words of lengths L(x_i) = \log_2 |X|
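To make the block-code cost concrete, here is a minimal Python sketch (the alphabet and sequence are made up for illustration); for an alphabet whose size is not a power of two, an actual code would round \log_2 |X| up to an integer.

```python
import math

# Hypothetical alphabet and sequence: under a block code every message costs
# the same number of bits, no matter how often it occurs.
X = ['a', 'b', 'c', 'd']           # alphabet, |X| = 4
S = ['a', 'a', 'b', 'a', 'c']      # a sequence of messages over X

bits_per_message = math.ceil(math.log2(len(X)))   # log2 |X| = 2 bits
total_bits = bits_per_message * len(S)

print(bits_per_message, total_bits)   # 2 bits per message, 10 bits for S
```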
Codes in a Tree
[Figure: a block code as a binary tree: from the root, branch 0 leads to code words 00 and 01, branch 1 leads to code words 10 and 11]
Beyond Uniform
What if we know the distribution P(x_i \in X) over S and it is not uniform?
We do not want to waste any bits, so using block codes is a bad idea.
We do not want to introduce any undue bias, so
we want an efficient code that is uniquely decodable without having to use arbitrary-length stop words.
We want an optimal prefix-code.
Prefix Codes
A code C is a prefix code iff there is no code word C(x) that is an extension of another code word C(x').
Or, in other words, C defines a binary tree with the leaves as the code words.
How do we find the optimal tree?
[Figure: a prefix-code tree with code words 1, 00, and 01]
Shannon Entropy
Let P(x_i) be the probability of x_i \in X in S, then
H(S) = -\sum_{x_i \in X} P(x_i) \log P(x_i)
is the Shannon entropy of S (wrt X)
(see Shannon 1948)
P(x_i): the 'weight', how often we see x_i
-\log P(x_i): the number of bits needed to identify x_i under P
H(S): the average number of bits needed per message s_i \in S
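As a quick illustration (not from the slides), a small Python sketch that computes the empirical Shannon entropy of a sequence; the example strings are made up.

```python
import math
from collections import Counter

def shannon_entropy(S):
    """Empirical Shannon entropy of a sequence S, in bits per message."""
    counts = Counter(S)
    n = len(S)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# a fair coin gives 1 bit per message, a biased one strictly less
print(shannon_entropy("ababababab"))   # 1.0
print(shannon_entropy("aaaaaaaaab"))   # ~0.469
```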
Optimal Prefix Code Lengths What if the distribution of 𝑋 in 𝑆 is not uniform?
Let 𝑃(𝑥𝑖) be the probability of 𝑥𝑖 in 𝑆, then
L(x_i) = -\log P(x_i)
is the length of the optimal prefix code for message 𝑥𝑖 knowing distribution 𝑃
(see Shannon 1948)
Kraft’s Inequality
For any code C for finite alphabet X = \{x_1, \ldots, x_m\}, the code word lengths L_C(\cdot) must satisfy the inequality
\sum_{x_i \in X} 2^{-L_C(x_i)} \le 1
a) when a set of code word lengths satisfies the inequality, there exists a prefix code with these code word lengths,
b) when it holds with strict equality, the code is complete: it does not waste any part of the coding space,
c) when it does not hold, the code is not uniquely decodable
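A tiny sketch (with made-up code word lengths) of how Kraft's inequality separates these three cases:

```python
def kraft_sum(lengths):
    """Sum of 2^-L over the given code word lengths."""
    return sum(2 ** -L for L in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> a complete prefix code exists, e.g. 0, 10, 110, 111
print(kraft_sum([2, 2, 2]))      # 0.75 -> a prefix code exists, but it wastes coding space
print(kraft_sum([1, 1, 2]))      # 1.25 -> no uniquely decodable code with these lengths
```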
What’s a bit?
Binary digit: the smallest and most fundamental piece of information, a yes or no.
Invented by Claude Shannon in 1948, named by John Tukey.
Bits have been in use for a long, long time, though: punch cards (1725, 1804), Morse code (1844), African 'talking drums'.
Morse code
Natural language
Punishes ‘bad’ redundancy: often-used words are shorter
Rewards useful redundancy:
cotxent alolws mishaireng/raeding
African Talking Drums
have used this for efficient, fast, long-distance communication
they mimic vocalized sounds: a tonal language
a very reliable means of communication
Measuring bits
How much information does a given string carry? How many bits?
Say we have a binary string of 100,000 'messages'
1) 00010001000100010001…000100010001000100010001000100010001
2) 01110100110100100110…101011101011101100010110001011011100
3) 00011000001010100000…001000100001000000100011000000100110
4) 0000000000000000000000000000100000000000000000000…0000000
Obviously, all four are 100,000 bits long. But are they worth those 100,000 bits?
So, how many bits?
Depends on the encoding!
What is the best encoding? One that takes the entropy of the data into account:
things that occur often should get a short code,
things that occur seldom should get a long code.
An encoding matching Shannon Entropy is optimal
Tell us! How many bits? Please?
In our simplest example we have
P(1) = 1/100000, P(0) = 99999/100000
|code_1| = -\log_2(1/100000) = 16.61 bits
|code_0| = -\log_2(99999/100000) = 0.0000144 bits
So, knowing P our string contains
1 * 16.61 + 99999 * 0.0000144 \approx 18.05 bits
of information
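For concreteness, a short Python snippet that reproduces the computation above:

```python
import math

n, ones = 100_000, 1          # 100,000 messages, a single '1'
p1, p0 = ones / n, (n - ones) / n

len1 = -math.log2(p1)         # ~16.61 bits per '1'
len0 = -math.log2(p0)         # ~0.0000144 bits per '0'

total = ones * len1 + (n - ones) * len0
print(len1, len0, total)      # ~16.61, ~0.0000144, ~18.05 bits of information
```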
Optimal…
Shannon lets us calculate optimal code lengths, but what about actual codes? 0.0000144 bits?
Shannon and Fano invented a near-optimal encoding in 1948: within one bit of the optimum, but not of lowest expected length.
Fano gave his students an option: take the regular exam, or invent a better encoding.
David Huffman didn't like exams; he invented Huffman codes (1952), optimal for symbol-by-symbol encoding with fixed probabilities.
(arithmetic coding is overall optimal, Rissanen 1976)
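For illustration, a minimal Python sketch of Huffman's construction (computing only code lengths, since we only care about lengths; the symbol frequencies are made up):

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Code word length per symbol for a Huffman code over the given frequencies."""
    # heap entries: (total weight, tie-breaker, {symbol: code length so far})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate case: a single symbol still needs 1 bit
        return {sym: 1 for sym in freqs}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)     # the two least frequent subtrees...
        w2, _, c2 = heapq.heappop(heap)
        merged = {sym: L + 1 for sym, L in {**c1, **c2}.items()}   # ...go one level deeper
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# hypothetical toy example: frequent symbols get short codes, rare ones long codes
freqs = Counter("aaaabbbccd")
print(huffman_code_lengths(freqs))   # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```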
Optimality
To encode optimally, we need optimal probabilities
What happens if we don’t?
Measuring Divergence
Kullback-Leibler divergence from 𝑄 to 𝑃, denoted by 𝐷(𝑃 ‖ 𝑄), measures the number of bits
we ‘waste’ when we use 𝑄 while 𝑃 is the ‘true’ distribution
D(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}
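A small sketch (with made-up distributions) of the divergence as 'bits wasted per message':

```python
import math

def kl_divergence(P, Q):
    """D(P || Q) in bits: expected bits wasted by coding with Q when the data follows P."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

# hypothetical example: true distribution P vs. an assumed uniform Q
P = [0.5, 0.25, 0.125, 0.125]
Q = [0.25, 0.25, 0.25, 0.25]
print(kl_divergence(P, Q))   # 0.25 bits wasted per message
print(kl_divergence(P, P))   # 0.0 -- no waste when the model is exact
```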
Multivariate Entropy
So far we’ve been thinking about a single sequence of messages
How does entropy work for
multivariate data?
Simple!
Towards Mutual Information
Conditional Entropy is defined as
H(Y \mid X) = \sum_{x \in X} P(x)\, H(Y \mid X = x)
'the average number of bits
needed for a message y \in Y knowing X'
Mutual Information
the amount of information shared between two variables X and Y
I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)
= \sum_{x \in X} \sum_{y \in Y} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}
high I(X, Y) implies correlation, low I(X, Y) implies independence
Information is symmetric!
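A minimal sketch (the joint distributions are made up) computing I(X, Y) directly from the double sum above:

```python
import math

def mutual_information(joint):
    """I(X,Y) in bits from a joint distribution given as {(x, y): P(x, y)}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# hypothetical example: Y is a noisy copy of X, so some (but not all) information is shared
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(mutual_information(joint))    # ~0.278 bits

# independent variables share no information
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(indep))    # 0.0 bits
```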
Information Gain (small aside)
Entropy and KL are used in decision trees
What is the best split in a tree?
one that results in label distributions in the sub-nodes that are as homogeneous as possible: minimal entropy
How do we compare over multiple options? Information gain: IG(T, a) = H(T) - H(T \mid a)
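A small sketch (with made-up labels and a made-up split) of the information gain of a single split:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(labels, groups):
    """IG(T, a) = H(T) - H(T | a), with the split 'a' given as a partition of the labels."""
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - conditional

# hypothetical split of 8 labels into two sub-nodes
labels = ['+', '+', '+', '+', '-', '-', '-', '-']
split  = [['+', '+', '+', '-'], ['+', '-', '-', '-']]
print(information_gain(labels, split))   # ~0.189 bits
```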
Goal: find sets of attributes that interact strongly
Task: mine all sets of attributes
such that the entropy over their value instantiations is ≤ 𝜎
Low-Entropy Sets
Theory of Computation | Probability Theory 1 | count
No  | No  | 1887
Yes | No  | 156
No  | Yes | 143
Yes | Yes | 219
entropy: 1.087 bits
(Heikinheimo et al. 2007)
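As a sanity check, the entropy over the four value instantiations above can be recomputed from the counts; it indeed comes out at about 1.087 bits:

```python
import math

# Entropy over the value instantiations of the attribute set
# {Theory of Computation, Probability Theory 1}, from the counts above.
counts = {('No', 'No'): 1887, ('Yes', 'No'): 156,
          ('No', 'Yes'): 143, ('Yes', 'Yes'): 219}
n = sum(counts.values())
H = -sum(c / n * math.log2(c / n) for c in counts.values())
print(round(H, 3))   # 1.087 bits: a low-entropy, strongly interacting attribute set
```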
Low-Entropy Sets
Maturity Test | Software Engineering | Theory of Computation | count
No  | No  | No  | 1570
Yes | No  | No  | 79
No  | Yes | No  | 99
Yes | Yes | No  | 282
No  | No  | Yes | 28
Yes | No  | Yes | 164
No  | Yes | Yes | 13
Yes | Yes | Yes | 170
(Heikinheimo et al. 2007)
Low-Entropy Trees
[Figure: a low-entropy tree over the attributes Scientific Writing, Maturity Test, Software Engineering Project, Theory of Computation, and Probability Theory 1]
(Heikinheimo et al. 2007)
Define the entropy of a tree T = (A, T_1, \ldots, T_k) as
H_U(T) = H(A \mid A_1, \ldots, A_k) + \sum_j H_U(T_j)
The tree T for an itemset A minimizing H_U(T) identifies directional explanations!
H(A) \le H(SW \mid MT, SEP, TC, PT) + H(MT \mid SEP, TC, PT) + H(SEP) + H(TC \mid PT) + H(PT)
Entropy for Continuous Values
So far we only considered discrete-valued data
Lots of data is continuous-valued
(or is it?)
What does this mean for entropy?
Differential Entropy
h(X) = -\int_X f(x) \log f(x) \, dx
(Shannon, 1948)
Differential Entropy
How about… the entropy of Uniform(0,1/2) ?
h(X) = -\int_0^{1/2} 2 \log 2 \, dx = -\log 2 \; (= -1 \text{ bit})
Hm, negative?
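A quick numeric sanity check (a simple Riemann sum, not from the slides) confirms the negative value:

```python
import math

def differential_entropy(f, a, b, steps=100_000):
    """Midpoint Riemann sum approximation of -integral f(x) log2 f(x) dx over (a, b)."""
    dx = (b - a) / steps
    total = 0.0
    for i in range(steps):
        fx = f(a + (i + 0.5) * dx)
        if fx > 0:
            total += fx * math.log2(fx) * dx
    return -total

# Uniform(0, 1/2) has density f(x) = 2 on (0, 1/2)
print(differential_entropy(lambda x: 2.0, 0.0, 0.5))   # -1.0 bit: negative indeed
```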
Differential Entropy
In discrete data the step size 'dx' is trivial.
What is its effect here?
h(X) = -\int_X f(x) \log f(x) \, dx
(Shannon, 1948)
Impossibru?
No.
But you’ll have to wait
till next week for the answer.
Conclusions
Information is related to the reduction in uncertainty of what you could say
Entropy is a core aspect of information theory: lots of nice properties, optimal prefix-code lengths, mutual information, etc.
Entropy for continuous data is… more tricky: differential entropy is a bit problematic
Thank you!