
3.1

How does the iPhone store data efficiently?

MODULE 3: Probability and Information Theory on the iPhone

Information content and likelihood

Probability

Random number generation

Basic information theory

Fundamental theorem of information theory

Entropy

Coding and entropy

Huffman coding


3.2

Nexus

For storing signals on the iPhone (think MP3, JPEG, MPEG, H.264), the enabling technology is compression.

The science of compression is found in information theory.

Information theory requires probability, so we will learn/review some basic concepts in probability.

Much of the communication capability and the signal compression ability of the iPhone hinges on information theory.


3.3

Example: Encoding English text

Two approaches:

- ASCII-like: assign a fixed number of bits to each symbol
- Morse-like: assign shorter sequences to more common symbols

The Morse-like approach leads to a shorter encoding.

But what is the best possible? What's the minimum number of bits per letter?

Should we try all possible encodings? Can we?

Can we quantify the amount of information present in English text?

- Yes!

Information content can be measured!

Information content tells us how much storage is needed by the best encoding method!

Value vs. quantity: the amount of information is not the same as its value. 1 gram of gold is more valuable than 1 gram of iron, but they have the same mass.


3.4

A first step

What is a message?

A message determines a choice of one object from a set of possible objects:

- Chance of rain: 0%, 1%, …, 100%
- RGB color of a pixel in a picture: (0–255, 0–255, 0–255)
- A letter in a text message: A, B, …, Z
- The time that a plane lands: 00:00 – 23:59
- The face that shows when a coin is flipped: H, T
- The number that shows up when a die is rolled: 1, 2, …, 6

How does the number of symbols needed to represent a message depend on the number of possibilities?

- A message indicating who won a competition with 1000 contestants:

  o With 1 bit, we can identify 1 of 2 objects
  o With 2 bits, we can identify 1 of 2 × 2 = 4 objects
  o With n bits, we can identify 1 of 2^n objects
  o So we need 2^n ≥ 1000
  o log2 1000 = 9.966 → I need 10 binary symbols

The number of bits needed is the log of the number of cases.
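As a quick sanity check, here is a short Python sketch (my own illustration; the numbers are just the ones from the example above) that computes how many symbols are needed to single out one of 1000 contestants:

```python
import math

def symbols_needed(num_possibilities, base=2):
    """Smallest number of base-`base` symbols that can distinguish
    `num_possibilities` different objects."""
    return math.ceil(math.log(num_possibilities, base))

print(math.log2(1000))           # 9.965...
print(symbols_needed(1000))      # 10 binary symbols
print(symbols_needed(1000, 10))  # 3 decimal symbols (see the next slide)
```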


3.5

What if we didn't use bits, but base-10?

- log10 1000 = 3 → need 3 base-10 symbols

Information content can be represented in

- Bits: binary digits, log in base 2
- Dits or Hartleys: decimal digits, log in base 10
- Nats: log in base e (natural logarithm)
- …

How do we convert from one unit to another?

- log_m N = log_n N × log_m n
- log2 N = log10 N × log2 10
- Information in bits = Information in Hartleys × 3.32

A grayscale image with 1000 pixels uses 5 gray levels. How many bits do we need?

- N = 5. We need log2 5 = 2.32 → 3 bits/pixel
- 3000 bits total

This is wasteful! What can we do?

- Group pixels together. Encode every 3 pixels.
- N = 5³ = 125. log2 125 = 6.97 → 7 bits per 3 pixels
- (1000/3) × 7 ≈ 2334 bits

This is called block encoding. It is more efficient but also more complex!
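To make the block-encoding arithmetic concrete, here is a small Python sketch (the pixel count, number of gray levels, and block size are the values from the example above; nothing else is assumed):

```python
import math

def bits_per_block(levels, block_size):
    """Bits needed to encode one block of block_size symbols,
    each drawn from `levels` equally likely values."""
    return math.ceil(math.log2(levels ** block_size))

pixels, levels = 1000, 5

naive = pixels * math.ceil(math.log2(levels))                # 3 bits/pixel
blocked = math.ceil(pixels / 3) * bits_per_block(levels, 3)  # 7 bits per 3-pixel block

print(naive)    # 3000 bits
print(blocked)  # 2338 bits (the slide's 2334 comes from (1000/3) * 7 without rounding up)
```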

So far: implicitly assumed all objects are equally likely

What if some options are more common than others?


3.6

Information Theory

Let's discuss Information Theory, founded by Claude Shannon in the 1940s:

"A Mathematical Theory of Communication," C. E. Shannon, The Bell System Technical Journal, Vol. 27, July and October 1948. Cited ≥ 100,000 times.

Much of the communication capability and the signal compression ability of the iPhone hinges on information theory.

How much information (relatively) is conveyed by the following statements?

o Tomorrow, the sun will rise in the east.
o Tomorrow, it will rain in Seattle.
o Tomorrow, it will rain in Phoenix.
o Tomorrow, Betsy DeVos will call you and explain the fast Fourier transform.

[Photograph by Alfred Eisenstaedt]


3.7

We might rank these from low to high in information content, because the probability of each occurring is decreasing…

o Tomorrow, the sun will rise in the east.
  P = 1, no information transferred.
o Tomorrow, it will rain in Seattle.
  P = 158/365 = 0.43; rather likely, could guess either way.
o Tomorrow, it will rain in Phoenix.
  P = 36/365 = 0.10; rather unlikely, significant info.
o Tomorrow, Secretary DeVos will call you and explain the fast Fourier transform.
  P = 0 – this would be a major story!

Conclusion: the mathematical definition of information content is tied to probability.

Science of Information Fundamental Tenet IV:

Information content is inversely proportional to probability.


3.8

Basic Probability

Probability concepts are fundamental in information engineering.

Information content is related to probability of occurrence. The more probable, the less information.

If P = 1, no information is communicated.

If P is near zero, a lot of information is communicated.

To quantify these ideas, we need to understand how to calculate probabilities, and how to combine events that depend upon each other.

We also need some probability theory to understand noise, which plagues our ability to transmit information on the iPhone.


3.9

Engineering definition of probability is based on the idea of relative frequency.

Consider an experiment with more than one outcome.

o Experiment could be a coin toss, card deal, picking a sock from a drawer, sending a message.
o Outcome is the identifiable result:
  A coin toss. Outcomes: H, T
  The roll of a die. Outcomes: 1, 2, 3, 4, 5, 6
  The time that a plane lands. Outcomes: time measured from a given origin
o An event is a collection of outcomes:
  A = {1, 3, 5} for rolling a die
  B = {H}, or C = {H, T} for tossing a coin
o Experiment is performed N times
o The event A occurs N_A times
o Relative frequency is f_A = N_A / N

Ex: flip a coin 10 times, result is HTHHHTTHHH
N = 10, N_H = 7, f_H = 7/10 = 0.7


3.10

Relative frequency is not probability.

Probability of heads for a fair coin should be p(H) = 1/2 = 0.5.

Intuitively, we believe that N_H / N → p(H). We can posit

p(A) = lim_{N→∞} N_A / N

This is the Law of Large Numbers.
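A quick simulation illustrates the idea. This Python sketch (my own illustration, not from the slides) flips a fair virtual coin and watches the relative frequency of heads approach 0.5:

```python
import random

def relative_frequency_of_heads(num_flips, seed=0):
    """Flip a fair coin num_flips times and return N_H / N."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

for n in (10, 100, 10_000, 1_000_000):
    print(n, relative_frequency_of_heads(n))  # tends toward p(H) = 0.5 as N grows
```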

Calculating Probabilities

Symmetric case: the probability of each outcome is 1 / (number of outcomes)

o EX: die roll: p(1) = 1/6
o EX: coin toss: p(H) = 1/2

Non-symmetric case: probability is estimated based on observation

o EX: in English: p(e) = 0.12, p(q) = 0.001
o EX: in the US: p(boy) = 0.51

We assume symmetry, unless otherwise stated.


3.11

The Sum Rule: the probability of an event is the sum of the probabilities of its outcomes.

To calculate p(A), we add up how many different ways A can occur and divide by the number of different possibilities.

EX: If a two-digit number is chosen at random, what is the probability that it is divisible by 4?

o 90 possible choices: 10, 11, …, 99
o A = {12, 16, …, 96}: 22 outcomes in A
o P(A) = 22/90 = 0.24

We need to know how to count!

Permutations

We shuffle five cards. How many possible arrangements are there?

o 5 possibilities for the first card, 4 possibilities for the second card, …, 1 possibility for the fifth card

5 × 4 × 3 × 2 × 1 = 5! = 120

n! = the number of ways n objects can be arranged

Combinations

In how many ways can we pick three out of five cards?


3.12

5 possibilities for the first card, 4 possibilities for the second card, 3 possibilities for the third card:

5 × 4 × 3 = 5!/2! = 60

But order is not important, so we are over-counting:

o e.g., KQA = KAQ = QKA = QAK = AQK = AKQ
o Each case is counted 3! times. Total number of choices = 60/3! = 10

n! / ((n − k)! k!) = (n choose k) = C_k^n = the number of ways k objects can be chosen from n objects

Examples

If we shuffle 5 cards, what is the probability that they end up being sorted?

o Number of possible outcomes: 5! = 120
o Number of desired outcomes: 1
o Probability of being sorted = 1/120

If we pick three cards out of {10, J, Q, K, A} at random, what is the probability we pick K, Q, A?

o Number of possible outcomes = (5 choose 3) = 5!/(3! 2!) = 10
o Number of desired outcomes = 1


3.13

o Probability = 1/10

Odds of winning the Mega Millions Lottery:

o Choose 5 numbers ≤ 70 and 1 number ≤ 25
o One in (70 choose 5) × 25 = 302,575,350
o Probability 3.30 × 10^−9
o If you kept playing, on average you would win 3.30 × 10^−9 × $275,000,000 = $0.91

What is the probability that a deck of 52 cards is in order after a random shuffle?

1/52! = 1.24 × 10^−68

The probability that five cards drawn from a deck of 52 are the A, 2, 3, 4, 5 of hearts?

1 / (52! / (47! × 5!)) = 3.8 × 10^−7


3.14

Monty Hall Problem

Pick door one, two, or three.

One door has a valuable prize, e.g., the Winnebago. The other two have pigs and donkeys behind them.

Monty then opens one of the other doors, revealing a donkey.

Monty always gives you the following option: keep your original choice OR switch to the other closed door.

What's the best strategy?

Could say: P(original) = 1/2, P(other door) = 1/2. Then it doesn't matter!

Wrong!

P(original) = 1/3, P(switch) = 2/3

Try the applet: "Monty Hall Paradox Explore" on iPhone


3.15

Still confused? Let's map it out:

Your original choice has a 1/3 chance; the other two together have 2/3.

Monty always eliminates a bad one, so the remaining door has a 2/3 chance of being the winner.

If you never switch, you only have the chance of picking the right door at first.

Let's do the analysis with 1,000,000 doors. After you choose one, Monty reveals 999,998 bad prizes.

Initial chance with 1,000,000 doors?

Chance after switch?
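If the argument still feels slippery, a quick Monte Carlo check settles it. This Python sketch is my own illustration (the trial count and seed are arbitrary); it estimates the win probability for both strategies:

```python
import random

def monty_hall_win_rate(switch, trials=100_000, doors=3, seed=1):
    """Estimate P(win) for the stay/switch strategies by simulation."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(doors)
        choice = rng.randrange(doors)
        # Monty opens every other door except one; that one is the "switch" option.
        # Switching therefore wins exactly when the first choice was wrong.
        wins += (choice != prize) if switch else (choice == prize)
    return wins / trials

print(monty_hall_win_rate(switch=False))  # about 1/3
print(monty_hall_win_rate(switch=True))   # about 2/3
```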


3.16

Independent Events

An important concept in probability theory is that of independent events.

Two events are statistically independent if the probability of occurrence of one event is not affected by the occurrence of the other event.

Formally, we have P(A|B) = P(A) and P(B|A) = P(B), where P(A|B) means "probability of event A given that event B has occurred."

Consider a deck of cards from which you draw a card at random. The card could be red or black; heart, diamond, club or spade; A, 2, 3, 4, …, Q, or K.

o Some of these characteristics are independent. Some are not.
o If you know your card is a heart, for example, this provides no information about the rank.
o If you know your card is red, then it has to be a heart or diamond, but that tells nothing about the rank.
o Thus, color and rank are independent.
o Suit and rank are independent.
o Color and suit are dependent.


3.17

Important: the joint probability (the probability that event A and event B occur) is the product of the individual probabilities if the events are independent, that is, p(A & B) = p(A)p(B).

EX: p(heart) = 13/52 = 1/4, p(ace) = 4/52 = 1/13

Joint probability: p(heart and ace) = (1/4)(1/13) = 1/52 (suit and rank are independent)

p(red) = 26/52 = 1/2
p(red and heart) = 1/4, not (1/2)(1/4) = 1/8

p(black) = 26/52 = 1/2
p(black and heart) = 0, not (1/2)(1/4) = 1/8

(more in APMA 3100, 3110, including Bayes' Theorem)
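These card probabilities are easy to confirm by brute-force enumeration of a 52-card deck. The sketch below is my own illustration (the helper names are made up for this example):

```python
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = list(product(ranks, suits))  # 52 equally likely cards

def p(event):
    """Probability of an event = fraction of the deck satisfying it."""
    return sum(event(card) for card in deck) / len(deck)

is_heart = lambda card: card[1] == 'hearts'
is_ace = lambda card: card[0] == 'A'
is_red = lambda card: card[1] in ('hearts', 'diamonds')

# Independent: the joint probability equals the product
print(p(lambda c: is_heart(c) and is_ace(c)), p(is_heart) * p(is_ace))  # 1/52 and 1/52
# Dependent: the joint probability does NOT equal the product
print(p(lambda c: is_red(c) and is_heart(c)), p(is_red) * p(is_heart))  # 1/4 and 1/8
```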


3.18

Bernoulli Trials and the Binomial Distribution

Several problems in information science involve Bernoulli trials.

Bernoulli trials are independent experiments that have only two outcomes (i.e., success / failure).

One outcome has probability p and the other 1 − p.

o A coin is flipped: p(H) = p(T) = 1/2
o Bits are transmitted. They may be received correctly or erroneously (the value is flipped):
  p(error) = p, p(no error) = 1 − p
  For example, p = 10^−3.

If we have a message of n bits, the probability of having x errors (for a fixed error rate p) is given by the binomial distribution:

P_n(x) = (n choose x) p^x (1 − p)^(n−x)

where (n choose x) = C_x^n is the number of combinations of n things taken x at a time:

(n choose x) = n! / (x! (n − x)!)
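As a sketch (nothing assumed beyond the formula above), the binomial probability is a one-liner in Python:

```python
import math

def binomial_pmf(n, x, p):
    """P_n(x): probability of exactly x 'successes' in n Bernoulli trials."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# Probability of exactly 2 bit errors in a 1000-bit message when p = 10^-3
print(binomial_pmf(1000, 2, 1e-3))  # about 0.18
```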


3.19

The binomial distribution equation above makes sense. How so? Let's see with an example:

Three bits are sent. The probability of error for each bit is p. What is the probability that two out of three bits are received successfully?

Let's look at the cases and their probabilities:

Event (one error)        Probability
error in the third bit   (1 − p) × (1 − p) × p
error in the second bit  (1 − p) × p × (1 − p)
error in the first bit   p × (1 − p) × (1 − p)

All events with one error have the same probability. So we only need to count how many such events exist.

Out of the three bits, we need to choose the one that has the error:

(3 choose 1) = 3! / (2! 1!) = 3

More generally, out of the n experiments, we need x to be a particular outcome: (n choose x) ways.

For each such event, the probability is p^x (1 − p)^(n−x). So

P_n(x) = (n choose x) p^x (1 − p)^(n−x)


3.20

An old iPhone will erroneously flip 55% of bits received. What's the probability that the iPhone receives 9 of 10 correctly?
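A hedged sketch of the computation the slide asks for (assuming "received correctly" has probability 1 − 0.55 = 0.45 per bit, and using the binomial formula from the previous slides):

```python
import math

p_correct = 1 - 0.55   # the old iPhone flips 55% of bits, so 45% arrive correctly
n, x = 10, 9           # want exactly 9 of 10 bits correct

prob = math.comb(n, x) * p_correct**x * (1 - p_correct)**(n - x)
print(prob)  # about 0.0042
```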

Random Number Generation

In addition to describing noise experienced on the iPhone, we'll also need Random Number Generators (RNGs) for encryption (MODULE 7).

What do you want in an RNG?

o Unbiased: all possible values are equally probable (assuming a uniform distribution)
o Unpredictable: given x_1, x_2, x_3, …, x_n, it should be impossible to predict x_{n+1}, x_{n+2}, etc.
o Irreproducible: two RNGs with the same starting state will have different outputs.

Can be implemented in hardware and software.

o Hardware RNG: use a noise source such as a diode to produce a random number.


3.21

o Software RNG: use an algorithm to generate random numbers. Needs a starting point called a seed.
o Problem: starting with the same seed leads to a reproducible RNG pattern. This makes it a PRNG – a pseudo RNG.

In MATLAB, you can either set the seed yourself or let MATLAB use the current time to determine it.

Here's a common method called the "linear congruential generator" for job security reasons.

It's based on modular arithmetic.

[B] mod N is the remainder after dividing B by N:

o B = I × N + R (where 0 ≤ R < N)
o R = [B] mod N
o Ex: R = [36] mod 10 =

Another way: solve for R = N × (B/N – floor(B/N)) = 10 × (3.6 – 3) =

Nutty question: what about [-36] mod 10 = ?


3.22

PRNG Algorithm

Let X_1, X_2, X_3, … be a sequence of pseudo-random integers.

Rule: X_n = [A × X_{n−1} + B] mod N

where X_{n−1} is the previous number in the sequence.

(Need B… why? Else we might get a string of zeros.)

X_1 is the starting point – the seed.

The possible range is 0 to N − 1, the range of the mod operation.

Let's try it in Mathcad. (*prng.mcd)

Choice of parameters affects performance.

Let's start with N = 1000, A = 41, B = 31 (bad ones to try: A = 40, B = 42).
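Since Mathcad may not be at hand, here is an equivalent Python sketch of the same linear congruential generator, using the parameters suggested on the slide (the seed value 1 is my own choice):

```python
def lcg(seed, a=41, b=31, n=1000, count=10):
    """Linear congruential generator: X_next = (a*X + b) mod n."""
    x, values = seed, []
    for _ in range(count):
        x = (a * x + b) % n
        values.append(x)
    return values

print(lcg(seed=1))  # [72, 983, 334, 725, 756, 27, 138, 689, 280, 511]
# Re-running with the same seed reproduces the sequence -- hence "pseudo" RNG.
```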


3.23

Let's put probability to work… Back to Information Theory.

First, let's make a distinction between signal, message, and information.

Signal is the physical means of information transfer:

o EM waves emitted by your iPhone
o Electrical current from the headphone jack

Message is a string of symbols from an alphabet (e.g., text, 010011).

Information is the "thing" carried by the message.

How do we measure it?

Thus far, we've only said that the information content is related to probability. Lower probability connotes higher information content.


3.24

Information Theory is a systematic way to treat this idea. In IT, we deal with three basic concepts:

o The measure of information
o The capacity of a channel to transfer information
o The construction of messages to fully utilize the channel capacity

We've already touched on the second and third items when discussing bandwidth and coding.

We'll now turn our attention to the first item – a quantitative measure of information content…


3.25

The end result of Information Theory is a remarkable one, known as the Fundamental Theorem of Information Theory.

It's the reason why your iPhone (usually) sounds so good and why your texts and emails come through error-free.

Science of Information Fundamental Tenet V (Fundamental Theorem of Information Theory):

Given an information source and a communication channel, there exists a coding technique such that:

1. The information can be transmitted at any rate lower than the channel capacity.
2. The transmission will have an arbitrarily small number of errors, even if the channel is noisy.


3.26

In essence, this theorem promises error-free transmission in the presence of noise.

Information Theory is thus useful for characterizing and understanding all information systems – it provides answers to questions like:

o How does bandwidth affect information transfer on the iPhone?
o How does noise affect information transfer on the iPhone?
o What are the characteristics of an ideal smart phone?
o How well does our iPhone 9 measure up? How can it be improved?


3.27

Information Content

How helpful is each of the following statements in determining an unknown card?

o Card is rectangular: p = 1
o Card is red: p = 1/2
o Card is a heart: p = 1/4
o Card is a face card: p = 12/52 = 3/13
o Card is a jack: p = 4/52 = 1/13

(How do you rank the information content?)

Let I_A = the information content associated with event A (or message A). (I is sometimes called the "self-information.")

I_A = f(p_A), where p_A is the probability of event A and f() is a function to be determined.

Requirements on f:

o f ≥ 0
o f → 0 as p → 1 (no information if the probability of occurrence is 1)
o If p_A < p_B, then f(p_A) > f(p_B) [decreasing function]
o If A and B are independent, then I_{A and B} = I_A + I_B


3.28

EX: Message 1: "Card is a heart"
Message 2: "Card is a king"
Message 3: Message 1 and Message 2 = "Card is the king of hearts"
Message 4: "Card is the suicide king"

I_3 = I_4, since the information content is the same. Message 4 and the combination of Messages 1 and 2 yield the same information.

If I_{AB} = I_A + I_B (for independent messages/events),
then I_{AB} = f(p_{AB}) = f(p_A) + f(p_B).

For independent events, p_{AB} = p_A p_B.

Bold statement of the day:

o The only function that satisfies all of these criteria is

I_A = f(p_A) = log_b(1/p_A)

b is the base of the logarithm (usually 2 for IT, but any base will satisfy the criteria).

Recall the definition of the logarithm: 2^y = x means y = log2(x).

In our case, x = 1/p and y = I, so 2^I = 1/p.


3.29

Let's see how this definition works out for the last example.

o I_heart = log2(1/p_heart) =
o I_king = log2(1/p_king) =
o I_heart + I_king =
o I_king of hearts = log2(1/p_king of hearts) =
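A small Python sketch of the definition (the card probabilities are the ones listed two slides back; the slide itself leaves the numerical answers blank as an exercise):

```python
import math

def self_information(p):
    """I = log2(1/p), in bits."""
    return math.log2(1 / p)

print(self_information(1/4))   # "card is a heart": 2 bits
print(self_information(1/13))  # "card is a king": about 3.7 bits
print(self_information(1/52))  # "card is the king of hearts": about 5.7 bits
```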


3.30

Do the four criteria check out?

What does 5.7 units mean?

The name given to the unit of information is the bit.

King of hearts: 5.7 bits of information

o This bit is related to the binary digit…

I_A(p_A) = log2(1/p_A)


3.31

Origin of 'bit': suppose we have an unbiased binary choice, like flipping a coin.

p(H) = 1/2, p(T) = 1/2

How much information is conveyed by a single coin flip?

This question is the simplest, most basic information transfer question.

It is the smallest 'bit' of information. It is one bit.

I_H = log2(1/p_H) = log2(2) = 1

I_H = 1 bit
I_T = 1 bit

If we represent H with 1 and T with 0, each binary digit (bit) has one bit of information.

bit = a binary digit, or a unit of information

Paradox: a binary digit might convey less than one bit of information!

o What if we use 000 to represent H?
o And 111 to represent T? For purposes of redundancy…


3.32

o The message 000111 has 6 binary digits but only 2 bits of information! (a crucial distinction)

Self-information is not a complete picture of the information content, since the information system is designed to produce only certain messages…

Suppose we write the daily weather in Phoenix as a binary sequence: 1 for rain, 0 for sun.

o p(1) = p(Rain) = 0.1, p(0) = p(Sun) = 0.9
o I_1 = 3.32 bits, I_0 = 0.15 bits
o ex: 000100000101000001000000
o 1s have more information than 0s, but what is the average?

For a fair coin toss, intuitively, the average is 1 bit, since for both H and T we get one bit of information.

We need to describe a source of information in terms of average information.

Science of Information Fundamental Tenet VI:

The average information associated with a source is given by the entropy.


3.33

Derivation

Suppose the source can transmit M different symbols.

o Minimum text: M = {space, a–z, A–Z, 0–9, 10 punctuation marks} = about 73
o Digital binary system: M = 2

The message has N symbols, any of which can be any of the M possible symbols.

And we know the probability of each symbol:

o p_1 = probability for symbol 1
o p_2 = probability for symbol 2
o …
o p_M = probability for symbol M

My greatest concern was what to call it. I thought of calling it 'information,' but the word was overly used, so I decided to call it 'uncertainty.' When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, 'You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.'

Claude Shannon, Scientific American (1971), volume 225, page 180.


3.34

If symbol j is transmitted, it conveys I_j amount of information.

In a long message, the symbol j appears N·p_j times.

Total amount of information in the message = sum of the information conveyed by each symbol in the sequence:

Total information = N p_1 I_1 + N p_2 I_2 + … + N p_j I_j + … + N p_M I_M = N ∑_{j=1}^{M} p_j I_j

To find the average information, we just divide N ∑_{j=1}^{M} p_j I_j by N.

Let's define the entropy of the source:

Entropy H = (Total information in message) / (message length)


3.35

H = ∑_{j=1}^{M} p_j I_j   (units = bits/symbol)

Expanding, we have:

H = ∑_{j=1}^{M} p_j log2(1/p_j) = −∑_{j=1}^{M} p_j log2(p_j)

What does it mean? ANS: On average, we receive H bits of information from each symbol.
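A direct transcription of the entropy formula into Python (a sketch; the two test distributions are the fair coin and the Phoenix rain/sun source from the earlier slides):

```python
import math

def entropy(probs):
    """H = sum of p_j * log2(1/p_j), in bits per symbol (terms with p = 0 contribute 0)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit/symbol
print(entropy([0.1, 0.9]))  # Phoenix rain/sun: about 0.47 bits/symbol
```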

Limits on Entropy

The minimum value of H is zero: no information in each bit.

o If one symbol has probability one, then there's no information and no need to transmit.

The maximum value of H occurs when all symbols are equally likely (i.e., p_j = 1/M).


3.36

o In this case, H = ∑_{j=1}^{M} p_j log2(1/p_j) = M × (1/M) log2(M) = log2(M)

Binary Source Example

Suppose M = 2. Symbols are 0 and 1. The probabilities are not necessarily equal.

Consider a scenario where we have 8 cards: 2 black, 6 red.

o Black = 0, Red = 1

Possible messages: 00111111, 011111101, 11011101, etc.

Bits do not convey equal information – we expect 1's because red cards are more likely…

More generally, let p_0 = p, p_1 = 1 − p.

What is the source entropy? (What is the average information per symbol?)


3.37

H = ∑_{j=0}^{1} p_j log2(1/p_j)
  = p_0 log2(1/p_0) + p_1 log2(1/p_1)
  = p log2(1/p) + (1 − p) log2(1/(1 − p))

So, one binary digit can convey at most one bit of information.

p(black) = 2/8 = 1/4 = p
p(red) = 6/8 = 3/4 = 1 − p

H(p) = p log2(1/p) + (1 − p) log2(1/(1 − p))


3.38

H =

If we have an entropy of less than one, what does that mean? Does it mean that using 1's and 0's is overkill? How could we do any better?

In image processing within the iPhone, this occurrence is quite common…


3.39

Closeup of Eye:


3.40

31 × 24 pixels = 744 pixels

Could be coded as 744 bits (1's and 0's).

How many bits of information?

o If black and white were equally probable, H = 1 bit/pixel.
o Using the Camera Pro app on the iPhone (or something similar), we can find that there are 571 black pixels and 173 white pixels.

[Figure: 31 × 24 black-and-white close-up of the eye]


3.41

o So, p_B = 0.767, p_W = 0.233
o What's the entropy?

This suggests that we may be able to code this image with 0.783 × 744 = 583 bits!

HOW?

Black and White can be coded with 2 symbols.

If each level were equally probable, p_j = 1/2; H = 1 bit/symbol.

Using the actual frequencies:
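For reference, a sketch that plugs the measured pixel frequencies into the binary entropy formula from slide 3.37 (the counts 571 and 173 are the ones quoted above):

```python
import math

def binary_entropy(p):
    """H(p) = p*log2(1/p) + (1-p)*log2(1/(1-p)), in bits per symbol."""
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

p_black = 571 / 744
print(binary_entropy(p_black))        # about 0.783 bits/pixel
print(binary_entropy(p_black) * 744)  # about 583 bits for the whole image
```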


3.42

Huffman Code

The Huffman code is used in making JPEGs (along with the frequency domain) on the iPhone.

I'll briefly explain the algorithm in words and equations, BUT you will understand it after a simple visual example.

Suppose we have an image with pixels having gray-level values from 0 to K′ − 1 (these could be the m and r values in a fax).

Suppose that we also know the probability of each gray level.

• Algorithm: form a binary tree with branches labeled by the gray levels k_i and their probabilities p(k_i):

(0) Eliminate from consideration any k_i where p(k_i) = 0.
(1) Find the 2 smallest probabilities p_i = p(k_i) and p_j = p(k_j).
(2) Replace them by p_ij = p_i + p_j (form a node; reduce the list by one).
(3) Label the branch for k_i with '1' and the branch for k_j with '0'.
(4) Until the list has only 1 element (the root is reached), return to (1).

• In step (3), the values '1' and '0' are assigned to element pairs (k_i, k_j), element triples, etc. as the process progresses.
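For readers who prefer code to prose, here is a compact Python sketch of the same greedy procedure (a heap-based illustration under the assumptions noted in the comments, not the course's official implementation):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Huffman codewords for a {symbol: probability} dict.
    Assumes at least two symbols have nonzero probability; p = 0 symbols are dropped (step 0)."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items() if p > 0]
    heapq.heapify(heap)
    while len(heap) > 1:
        p_i, _, codes_i = heapq.heappop(heap)  # the two smallest probabilities (step 1)
        p_j, _, codes_j = heapq.heappop(heap)
        merged = {s: "1" + c for s, c in codes_i.items()}        # '1' branch (step 3)
        merged.update({s: "0" + c for s, c in codes_j.items()})  # '0' branch
        heapq.heappush(heap, (p_i + p_j, next(tiebreak), merged))  # new node (step 2)
    return heap[0][2]

probs = {0: 0.4, 1: 0.08, 2: 0.08, 3: 0.2, 4: 0.12, 5: 0.08, 6: 0.04, 7: 0.0}
codes = huffman_code(probs)
print(codes)  # exact bit patterns depend on tie-breaking, but the lengths match the tree below
print(sum(probs[k] * len(codes[k]) for k in codes))  # average length: 2.48 bits/pixel
```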


3.43

Example

• Let's pretend the iPhone only has 8 possible intensities. Their associated probabilities in one image are:

p(0) = 0.4    p(4) = 0.12
p(1) = 0.08   p(5) = 0.08
p(2) = 0.08   p(6) = 0.04
p(3) = 0.2    p(7) = 0.0

• There are K′ = 8 values {0, …, 7} to be assigned codewords (p(7) = 0 is eliminated in step (0)).

• The process can be represented by a tree, where the values '1' and '0' are placed on the right and left branches at each stage. The resulting codewords and lengths L(k):

k         0     1     2     3     4     5     6
p(k)      0.4   0.08  0.08  0.2   0.12  0.08  0.04
codeword  1     0111  0110  010   001   0001  0000
L(k)      1     4     4     3     3     4     4

BPP = 2.48 bits
Entropy = 2.42 bits
CR = 1.2:1


3.44

Decoding

• The Huffman code is a uniquely decodable code. This means that there is only one interpretation for a series of codewords, which is just a series of bits.

• Decoding progresses by traversing the tree.

Example

• In the first example, this sequence of bit values is received:

00010110101110000010000100010110111010

• The bit sequence is examined until a codeword is identified. This process continues until all are identified:

0001 0110 1 0111 0000 010 0001 0001 0110 1 1 1 010

• The decoded sequence (into gray levels) is:

5 2 0 1 6 3 5 5 2 0 0 0 3
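Because a Huffman code is a prefix code, no separators are needed between codewords and decoding is a simple left-to-right scan. Here is a short sketch using the codeword table from the first example (my own illustration of the traversal the slide describes):

```python
codewords = {"1": 0, "0111": 1, "0110": 2, "010": 3,
             "001": 4, "0001": 5, "0000": 6}  # gray level for each codeword

def huffman_decode(bits, codewords):
    """Scan the bits left to right, emitting a symbol whenever the buffer matches a codeword."""
    decoded, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in codewords:
            decoded.append(codewords[buffer])
            buffer = ""
    return decoded

print(huffman_decode("00010110101110000010000100010110111010", codewords))
# [5, 2, 0, 1, 6, 3, 5, 5, 2, 0, 0, 0, 3]
```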


3.45

Compression Measures

• Bits Per Pixel (BPP) is the average number of bits required to store the gray level of each pixel in an image.

• In a non-coded image, BPP = log2(K) = B, where K = the number of allowable gray levels. Usually B = log2(256) = 8.

• The number of bits used to code pixels may vary across a coded image. Let B(i, j) = the number of bits used to code pixel I(i, j). Then, for an image with N × N pixels,

BPP = (1/N²) ∑_{i=0}^{N−1} ∑_{j=0}^{N−1} B(i, j)

• If the total number of bits contained in Code[ ] is B_total, then BPP = B_total / N².

• Compression Ratio (CR) is the ratio CR = B / BPP > 1.

• Both BPP and CR are used frequently.


3.46

Significance of Entropy

BPP = Bits Per Pixel; equivalent to BPS = Bits Per Symbol.

There is an important theorem in coding theory that says how well we can code an image with probabilities p(k) without loss, by assigning variable wordlengths:

BPP ≥ H[ ]

• A variable-wordlength code assumes the probabilities p(k) are known at both the transmitter and the receiver.

• By a code, we mean a uniquely decodable code.

• This is the best reason for using entropy as a (lossless) coding measure.

Science of Information Fundamental Tenet VII:

The lower bound (in bits per symbol) for a lossless variable-wordlength coder is given by the entropy.


3.47

• It tells us that an image (originally coded with B bits per pixel) with a perfectly flat histogram, H[ ] = B, cannot be coded at a reduced BPP by only varying the wordlengths. Fortunately, we can do some other things first.

• It also tells us that a constant image, H[ ] = 0, doesn't even need to be sent!

The Huffman code (remember) is optimal in terms of variable-wordlength (lossless) coding.

The Huffman code is used in JPEG and in MP3, allowing your iPhone to keep all those pictures and songs!


3.48

Revisiting Huffman Example: the optimal code

• A different 3-bit iPhone image has the following probabilities, computed from the histogram of the grayscale image. There are K′ = 8 values {0, …, 7} to be assigned codewords:

p(0) = 1/2    p(4) = 1/16
p(1) = 1/8    p(5) = 1/32
p(2) = 1/8    p(6) = 1/32
p(3) = 1/8    p(7) = 0

• The process can again be represented by a tree, with '1' and '0' on the right and left branches at each stage. The resulting codewords and lengths L(k):

k         0    1    2    3    4     5      6
p(k)      1/2  1/8  1/8  1/8  1/16  1/32   1/32
codeword  1    011  010  001  0001  00001  00000
L(k)      1    3    3    3    4     5      5

BPP = Entropy = 2.1875 bits
CR = 1.37:1
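Because these probabilities are all powers of 1/2 (a dyadic distribution), the Huffman code meets the entropy bound exactly. A short check using the probabilities and code lengths tabulated above:

```python
import math

probs = [1/2, 1/8, 1/8, 1/8, 1/16, 1/32, 1/32]
lengths = [1, 3, 3, 3, 4, 5, 5]

H = sum(p * math.log2(1 / p) for p in probs)
bpp = sum(p * L for p, L in zip(probs, lengths))

print(H, bpp)   # both 2.1875 bits/symbol
print(3 / bpp)  # compression ratio: about 1.37 (vs. 3 bits/pixel uncoded)
```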