Penn Engineering CIS520 · lectures/info_gain.pdf
TRANSCRIPT
Slide 1
Linear regression is
A) Parametric
B) Non-parametric

K-NN is
A) Parametric
B) Non-parametric
Poll Everywhere Test
• There are lots of office hours!!!!
Slide 2
Decision Trees and Information Theory
Lyle Ungar
University of Pennsylvania
What symptom tells you most about the disease?
S1 S2 S3 D
y  n  n  y
n  y  y  y
n  y  n  n
n  n  n  n
y  y  n  y
A) S1
B) S2
C) S3
Why?
What symptom tells you most about the disease?
S1/D           S2/D           S3/D
     D=y D=n        D=y D=n        D=y D=n
S1=y  2   0    S2=y  2   1    S3=y  1   0
S1=n  1   2    S2=n  1   1    S3=n  2   2

A) S1
B) S2
C) S3
Why?
If you know S1=n, what symptom tells you most about the disease?
S1 S2 S3 D
y  n  n  y
n  y  y  y
n  y  n  n
n  n  n  n
y  y  n  y
A) S1
B) S2
C) S3
Why?
Resulting decision tree:

        S1
      y/  \n
     D     S3
         y/  \n
        D     ~D
The key question: what criterion to use to decide which question to ask?
Slide 7
Copyright © 2001, 2003, Andrew W. Moore
Entropy and Information Gain
Andrew W. MooreCarnegie Mellon University
www.cs.cmu.edu/[email protected]
412-268-7599
Modified by Lyle Ungar
Bits
You observe a set of independent random samples of X.
You see that X has four possible values
So you might see: BAACBADCDADDDA…
You transmit data over a binary serial link. You can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11):
0100001001001110110011111100…
P(X=A) = 1/4 P(X=B) = 1/4 P(X=C) = 1/4 P(X=D) = 1/4
Fewer Bits
Someone tells you that the probabilities are not equal.
It is possible to invent a coding for your transmission that only
uses 1.75 bits on average per symbol. How?
P(X=A) = 1/2 P(X=B) = 1/4 P(X=C) = 1/8 P(X=D) = 1/8
(This is just one of several ways)
P(X=A) = 1/2 P(X=B) = 1/4 P(X=C) = 1/8 P(X=D) = 1/8
A 0
B 10
C 110
D 111
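The 1.75-bit claim can be checked with a short Python sketch (my addition, not from the slides; the `decode` helper illustrates why the code is unambiguous):

```python
# Prefix code from the slide: likelier symbols get shorter codewords.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# Expected bits per symbol = sum of P(symbol) * codeword length.
avg_bits = sum(probs[s] * len(code[s]) for s in code)
print(avg_bits)  # 1.75

# Because no codeword is a prefix of another, the bit stream
# decodes unambiguously without separators.
def decode(bits):
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

print(decode("010110111"))  # ABCD
```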
Fewer Bits
Suppose there are three equally likely values…
Here's a naïve coding, costing 2 bits per symbol:
Can you think of a coding that only needs 1.6 bits per symbol on average?
In theory, it can in fact be done with 1.58496 bits per symbol.
P(X=A) = 1/3 P(X=B) = 1/3 P(X=C) = 1/3
A 00
B 01
C 10
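Two quick Python checks (the five-symbols-per-byte packing is my own illustration, not from the slide): log2(3) gives the theoretical limit, and grouping symbols into blocks gets under 1.6 bits per symbol.

```python
import math

# Theoretical limit: entropy of three equally likely values is log2(3).
print(round(math.log2(3), 5))  # 1.58496

# One way to reach 1.6 bits/symbol: pack five symbols into one byte.
# This works because 3**5 = 243 distinct blocks fit into 2**8 = 256 codes.
print(3 ** 5 <= 2 ** 8)  # True
print(8 / 5)             # 1.6 bits per symbol
```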
Suppose X can have one of m values… V1, V2, … Vm
What’s the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X’s distribution?
It is
H(X) = the entropy of X
• "High entropy" means X is from a uniform (boring) distribution
• "Low entropy" means X is from a varied (peaks and valleys) distribution
General Case: Entropy
P(X=V1) = p1   P(X=V2) = p2   …   P(X=Vm) = pm

H(X) = -p1 log2 p1 - p2 log2 p2 - … - pm log2 pm = -Σj=1..m pj log2 pj
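The entropy formula translates directly into Python (a sketch of mine, not from the slides); the convention 0·log2 0 = 0 is handled by skipping zero probabilities:

```python
import math

def entropy(probs):
    """H(X) = -sum_j p_j log2 p_j, written as sum_j p_j log2(1/p_j);
    zero-probability values contribute nothing and are skipped."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# The two distributions from the earlier slides:
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0  (four equally likely values)
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 (the skewed distribution)
```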
High entropy: a histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place.
Low entropy: a histogram of the frequency distribution of values of X would have many lows and one or two highs, and so the values sampled from it would be more predictable.
Entropy in a nutshell
High entropy: the values (locations of soup) are unpredictable, almost uniformly sampled throughout our dining room.
Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl.
Why does entropy have this form?
H(X) = -p1 log2 p1 - p2 log2 p2 - … - pm log2 pm = -Σj=1..m pj log2 pj
If an event is certain, the entropy is
A) 0
B) between 0 and ½
C) ½
D) between ½ and 1
E) 1
Entropy is the expected value of the information content (surprise) of the message, -log2 pj.
If two events are equally likely, the entropy is
A) 0
B) between 0 and ½
C) ½
D) between ½ and 1
E) 1
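Both poll answers fall out of the formula directly; a quick check of mine in Python:

```python
import math

def entropy(probs):
    # H = sum_j p_j log2(1/p_j); skip zero probabilities.
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([1.0]))       # 0.0  (a certain event carries no surprise)
print(entropy([0.5, 0.5]))  # 1.0  (two equally likely events: one full bit)
```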
Specific Conditional Entropy H(Y|X=v)
Suppose I'm trying to predict output Y and I have input X.
Let’s assume this reflects the true probabilities
e.g. From this data we estimate:
• P(LikeG = Yes) = 0.5
• P(Major = Math & LikeG = No) = 0.25
• P(Major = Math) = 0.5
• P(LikeG = Yes | Major = History) = 0
Note:
• H(X) = 1.5
• H(Y) = 1
X = College Major
Y = Likes "Gladiator"
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes
Definition of Specific Conditional Entropy:
H(Y|X=v) = the entropy of Y among only those records in which X has value v

Example:
• H(Y|X=Math) = 1
• H(Y|X=History) = 0
• H(Y|X=CS) = 0
X = College Major
Y = Likes "Gladiator"
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes
Specific Conditional Entropy H(Y|X=v)
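The example values can be reproduced from the table; a Python sketch of mine, with the slide's eight rows transcribed into a list:

```python
import math
from collections import Counter

# The slide's eight (Major, Likes "Gladiator") records.
data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy_of(labels):
    # H = sum over label values of (count/n) * log2(n/count).
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def specific_cond_entropy(v):
    # Entropy of Y among only those records where X = v.
    return entropy_of([y for x, y in data if x == v])

print(specific_cond_entropy("Math"))     # 1.0 (2 Yes, 2 No: maximally mixed)
print(specific_cond_entropy("History"))  # 0.0 (all No)
print(specific_cond_entropy("CS"))       # 0.0 (all Yes)
```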
Conditional Entropy H(Y|X)
Definition of Conditional Entropy:
H(Y|X) = the average specific conditional entropy of Y
= if you choose a record at random, the expected entropy of Y conditioned on that row's value of X
= expected number of bits to transmit Y if both sides will know the value of X
= Σj P(X=vj) H(Y|X=vj)
X = College Major
Y = Likes "Gladiator"
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes
Example:

vj       Prob(X=vj)  H(Y|X=vj)
Math     0.5         1
History  0.25        0
CS       0.25        0

H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes
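The weighted average can be checked directly; a one-liner sketch of mine over the slide's table of values:

```python
# H(Y|X) = sum_j P(X=v_j) * H(Y|X=v_j), using the slide's numbers.
terms = [(0.5, 1.0),   # Math
         (0.25, 0.0),  # History
         (0.25, 0.0)]  # CS
h_y_given_x = sum(p * h for p, h in terms)
print(h_y_given_x)  # 0.5
```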
Information Gain
Definition of Information Gain:
IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
IG(Y|X) = H(Y) - H(Y|X)
X = College Major
Y = Likes "Gladiator"
Example:
• H(Y) = 1
• H(Y|X) = 0.5
• Thus IG(Y|X) = 1 - 0.5 = 0.5
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes
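Putting the pieces together, the 0.5-bit gain can be computed end-to-end from the table; a Python sketch of mine, not code from the slides:

```python
import math
from collections import Counter

data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy_of(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

# H(Y): entropy of the label column alone.
h_y = entropy_of([y for _, y in data])

# H(Y|X): specific conditional entropies, weighted by P(X=v).
h_y_given_x = sum(
    (cnt / len(data)) * entropy_of([y for x, y in data if x == v])
    for v, cnt in Counter(x for x, _ in data).items()
)

print(h_y, h_y_given_x, h_y - h_y_given_x)  # 1.0 0.5 0.5
```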
Information Gain Example (figure only; not transcribed)

Another example (figure only; not transcribed)
What is Information Gain used for?
If you are going to collect information from someone (e.g. asking questions sequentially in a decision tree), the “best” question is the one with the highest information gain.
Information gain is useful for model selection (later!)
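Applied to the symptom table from the opening poll, information gain picks out S1, matching the poll answer. A Python sketch of mine, with the rows transcribed from the slide:

```python
import math
from collections import Counter

# (S1, S2, S3, D) rows from the symptom/disease poll.
rows = [("y", "n", "n", "y"), ("n", "y", "y", "y"), ("n", "y", "n", "n"),
        ("n", "n", "n", "n"), ("y", "y", "n", "y")]

def entropy_of(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def info_gain(col):
    # IG(D|S) = H(D) - sum_v P(S=v) * H(D|S=v).
    h_d = entropy_of([r[3] for r in rows])
    h_d_given_s = sum(
        (cnt / len(rows)) * entropy_of([r[3] for r in rows if r[col] == v])
        for v, cnt in Counter(r[col] for r in rows).items()
    )
    return h_d - h_d_given_s

gains = {f"S{i + 1}": round(info_gain(i), 3) for i in range(3)}
print(gains)  # S1 has the largest gain, so it is the best first question
```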
What question did we not ask (or answer) about decision trees?
What you should know
• Entropy
• Information Gain
• The standard decision tree algorithm
  • Recursive partition trees
  • Also called: ID3/C4.5/CART/CHAID
How is my speed?
• A) Slow
• B) Good
• C) Fast
What one thing
• Do you like about the course so far?
• Would you improve about the course so far?