probability densities in data mining

86
Copyright © Andrew W. Moore Slide 1 Probability Densities in Data Mining Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm [email protected] 412-268-7599

Upload: nina-barlow

Post on 02-Jan-2016

23 views

Category:

Documents


0 download

DESCRIPTION

Probability Densities in Data Mining. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm [email protected] 412-268-7599. Contenido. Porque son importantes. Notacion y fundamentos de PDF continuas. PDFs multivariadas continuas. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 1

Probability Densities in Data

MiningAndrew W. Moore

ProfessorSchool of Computer ScienceCarnegie Mellon University

www.cs.cmu.edu/[email protected]

412-268-7599

Page 2: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 2

Contenido• Porque son importantes.• Notacion y fundamentos de PDF

continuas.• PDFs multivariadas continuas.• Combinando variables aleatorias

discretas y continuas.

Page 3: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 3

Porque son importantes?• Real Numbers occur in at least 50% of

database records• Can’t always quantize them• So need to understand how to describe

where they come from• A great way of saying what’s a

reasonable range of values• A great way of saying how multiple

attributes should reasonably co-occur

Page 4: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 4

Porque son importantes?• Can immediately get us Bayes

Classifiers that are sensible with real-valued data

• You’ll need to intimately understand PDFs in order to do kernel methods, clustering with Mixture Models, analysis of variance, time series and many other things

• Will introduce us to linear and non-linear regression

Page 5: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 5

A PDF of American Ages in 2000

Page 6: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 6

Poblacion de PR por grupo de edad

group freq midpoint freq.rela0-4 284593 2.5 0.07375165-9 301424 7.5 0.078113310-14 305025 12.5 0.079046515-19 305577 17.5 0.079189520-24 299362 22.5 0.077578925-29 277415 27.5 0.071891430-34 262959 32.5 0.068145235-39 265154 37.5 0.068714040-44 258211 42.5 0.066914745-49 239965 47.5 0.062186350-54 233597 52.5 0.060536155-59 206552 57.5 0.053527460-64 169796 62.5 0.044002265-69 141869 67.5 0.036765070-74 112416 72.5 0.029132375-79 85137 77.5 0.022063080-84 57953 82.5 0.015018485+ 51801 87.5 0.0134241

Page 7: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 7

0 20 40 60 80

0.02

0.04

0.06

0.08

pobpr$midpoint

pobp

r$fre

q.re

lapdf de la edad poblacional en PR en 2000

Page 8: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 8

A PDF of American Ages in 2000

Let X be a continuous random variable.

If p(x) is a Probability Density Function for X then…

b

ax

dxxpbXaP )(

50

30age

age)age(50Age30 dpP

= 0.36

Page 9: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 9

Properties of PDFs

That means…

h

hxX

hxP

xp

22)( lim

0h

b

ax

dxxpbXaP )(

)(xpxXPx

Page 10: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 10

)()()]2/(2/[)()22

(2/

2/

whpwphxhxdttph

xXh

xPhx

hx

Donde x-h/2<w<x+h/2). Luego,

)()2/2/(

wph

hxXhxP

Asi p(w) tiende a p(x) cuando h tiende a cero

Page 11: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 11

x

dttpxXP )()(

h

dttp

h

xXPhxXP

xd

xXdPhx

x

h

)()()(lim

)(

)(0

hxwxwph

whph

),(lim)(

lim0

Notar que p(w) tiende a p(x) cuando h tiende a 0

Funcion de distribucion acumulativa. Esta es una funcion No decreciente

Se ha mostrado que la derivada de la funcion de distribucion da la funcion de densidad

Page 12: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 12

Properties of PDFs

b

ax

dxxpbXaP )(

)(xpxXPx

Therefore…

Therefore…

1)(

x

dxxp

0)(: xpx

La dcerivada de una fucnion no dcecreciente es mayor o igual que cero.

Page 13: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 13

• Cual es el significado de p(x)?Si

p(5.31) = 0.06 and p(5.92) = 0.03

Entonces cuando un valor de X es muestreado de la distribucion, es dos veces mas probable que X este mas cerca a 5.31 que a 5.92.

Page 14: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 14

Yet another way to view a PDF

A recipe for sampling a random age.

1. Generate a random dot from the rectangle surrounding the PDF curve. Call the dot (age,d)

2. If d < p(age) stop and return age

3. Else try again: go to Step 1.

Page 15: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 15

Test your understanding• True or False:

1)(: xpx

0)(: xXPx

Page 16: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 16

ExpectationsE[X] = the expected value of random variable X

= the average value we’d see if we took a very large number of random samples of X

x

dxxpx )(

Page 17: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 17

ExpectationsE[X] = the expected value of random variable X

= the average value we’d see if we took a very large number of random samples of X

x

dxxpx )(

= the first moment of the shape formed by the axes and the blue curve

= the best value to choose if you must guess an unknown person’s age and you’ll be fined the square of your error

E[age]=35.897

Page 18: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 18

Expectation of a function=E[f(X)] = the expected value of f(x) where x is drawn from X’s distribution.

= the average value we’d see if we took a very large number of random samples of f(X)

x

dxxpxf )()(

Note that in general:

])[()]([ XEfxfE

64.1786]age[ 2 E

62.1288])age[( 2 E

Page 19: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 19

Variance2 = Var[X] = the expected squared difference between x and E[X]

x

dxxpx )()( 22

= amount you’d expect to lose if you must guess an unknown person’s age and you’ll be fined the square of your error, and assuming you play optimally

02.498]age[Var

Page 20: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 20

Standard Deviation2 = Var[X] = the expected squared difference between x and E[X]

x

dxxpx )()( 22

= amount you’d expect to lose if you must guess an unknown person’s age and you’ll be fined the square of your error, and assuming you play optimally

= Standard Deviation = “typical” deviation of X from its mean

02.498]age[Var

][Var X

32.22

222 )]([)( XEXE

Page 21: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 21

Estadisticas para PR• E(edad)=35.17• Var(edad)=501.16• Desv.Est(edad)=22.38

Page 22: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 22

Funciones de densidad mas conocidas

Page 23: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 23

Funciones de densidad mas conocidas

• La densidad uniforme o rectangular• La densidad triangular• La densidad exponencial• La densidad Gamma y la Chi-square• La densidad Beta• La densidad Normal o Gaussiana• Las densidades t y F.

Page 24: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 24

La distribucion rectangular

-w/2 0 w/2

1/w

0][ XE12

]Var[2w

X

2

w|x|if0

2

w|x|if

1

)( wxp

Page 25: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 25

La distribucion triangular

0

w|x|

w|x|w

xwxp

if0

if||

)( 2

6]Var[

2wX

0][ XE

w

1

w

w

Page 26: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 26

The Exponential distribution

otherwise

xexpx

if0

0if1

)(/

2]Var[ X

][XE

0 20 40 60 80 100

0.0

00

.02

0.0

40

.06

0.0

80

.10

Densidad exponencial,=.1

x

0.1

* e

xp

(-0

.1 *

x)

Page 27: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 27

La distribucion Normal

Estandar

2exp

2

1)(

2xxp

1]Var[ X

0][ XE

Page 28: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 28

La distribucion Normal General

2

2

2

)(exp

2

1)(

x

xp

2]Var[ X

μXE ][

=100

=15

Page 29: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 29

General Gaussian

2

2

2

)(exp

2

1)(

x

xp

2]Var[ X

μXE ][

=100

=15

Shorthand: We say X ~ N(,2) to mean “X is distributed as a Gaussian with parameters and 2”.

In the above figure, X ~ N(100,152)

Also known as the normal

distribution or Bell-shaped curve

Page 30: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 30

The Error FunctionAssume X ~ N(0,1)

Define ERF(x) = P(X<x) = Cumulative Distribution of X

x

z

dzzpxERF )()(

x

z

dzz

2exp

2

1 2

Page 31: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 31

Using The Error FunctionAssume X ~ N(,2)

P(X<x| ,2) = )(2x

ERF

Page 32: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 32

The Central Limit Theorem• If (X1,X2, … Xn) are i.i.d. continuous

random variables• Then define

• As n-->infinity, p(z)--->Gaussian with mean E[Xi] and variance Var[Xi]

Somewhat of a justification for assuming Gaussian noise is common

n

iin x

nxxxfz

121

1),...,(

Page 33: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 33

Estimadores de funcion de densidad

• Histograms

• K-nearest neighbors:

• Kernel density estimators

knd

kf

2)(ˆ x

nh

kxf )(ˆ h ancho de clase

dk es la distancia hasta el k-esimo vecino

Page 34: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 34

Estimacion de funcion de densidad-histograma

x

Den

sity

0 2 4 6 8

0.00

0.05

0.10

0.15

0.20

0.25

> x=c( 7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)> hist(x,freq=F,main="Estimacion de funcion de densidad-histograma")> rug(x,col=2)

Page 35: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 35

Estimacion de densidad por knn en 20 pts con k=1,3,5,7

0 2 4 6 8

01

23

4

x

fest

0 2 4 6 8

0.1

0.3

x

fest

0 2 4 6 8

0.05

0.20

x

fest

0 2 4 6 8

0.05

0.15

0.25

x

fest

Page 36: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 36

Estimación por kernels de una función de densidad univariada.

En el caso univariado, el estimador por kernels de la función de densidad f(x) se obtiene de la siguiente manera. Consideremos que x1,…xn es una variable aleatoria X con función de densidad f(x), definamos la función de distribución empirica por

el cual es un estimador de la función de distribución acumulada F(x) de X. Considerando que la función de densidad f(x) es la derivada de la función de distribución F y usando aproximación para derivada se tiene que

n

xobsxFn

#)(

Page 37: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 37

donde h es un valor positivo cercano a cero. Lo anterior es equivalente a la proporción de puntos en el intervalo (x-h, x+h) dividido por 2h. La ecuación anterior puede ser escrita como:

donde la función peso K está definida por  0 si |z|>1 K(z)=

1/2 si |z| 1

h

hxFhxFxf nn

2

)()()(ˆ

)(1

)(ˆ1

n

i

i

h

xxK

nhxf

Page 38: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 38

Muestra: 6, 8, 9 12, 20, 25,18, 31

hhhdepuntosenproporcionf 2/)15,15()15(ˆ

32/164/28/)19,11()15(ˆ depuntosenporporcionf

]02/1002/1000)[4*8/(1)15(ˆ f

Page 39: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 39

este es llamado el kernel uniforme y h es llamado el ancho de banda el cual es un parámetro de suavización que indica cuanto contribuye cada punto muestral al estimado en el punto x. En general, K y h deben satisfacer ciertas condiciones de regularidad, tales como:

    K(z) debe ser acotado y absolutamente

integrable en (-,)       Usualmente, pero no siempre, K(z)0 y

simétrico, luego cualquier función de densidad simétrica puede usarse como kernel.

1)( dzzK

0)(lim

nhn

Page 40: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 40

Eleccion del ancho de banda h

2.006.1 snh

Donde n es el numero de datos y s la desviacion estandar de la muestra.

2.0

13 )(79.0 nQQh

Page 41: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 41

EL KERNEL GAUSSIANO  

En este caso el kernel representa una función peso más suave donde todos los puntos contribuyen al estimado de f(x) en x. Es decir,

)2

1exp(

2

1)( 2zzK

n

i

h

xx i

e

nhxf

1

2

)(

2

1)(ˆ

2

2

Page 42: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 42

EL KERNEL TRIANGULAR K(z)=1- |z| para |z|<1, 0 en otro caso.

EL KERNEL "BIWEIGHT"

15/16(1-z2)2 para |z|<1

K(z)=

0 en otro caso

Page 43: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 43

EL KERNEL EPANECHNIKOV para |z|< K(z)= 0 en otro caso

5

)5

1(54

3 2z

Page 44: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 44

Estimacion de densidad en 20 pts usando kernel gaussiano con h=.5,”opt1”,”opt2”, 4

0 2 4 6 8

0.00

0.15

0.30

x

fest

0 2 4 6 8

0.02

0.08

0.14

x

fest

0 2 4 6 8

0.05

0.15

x

fest

0 2 4 6 8

0.04

0.07

x

fest

Page 45: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 45

Variables aleatorias bidimensionales

p(x,y) = probability density of random

variables (X,Y) at location (x,y)

Page 46: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 46

Estimadores de funcion de densidad bi-dimensionales

• Histogramas

• K-nearest neighbors:

• Kernel density estimators

knA

kxf )(ˆ

nA

kxf )(ˆ A area de la clase

Ak es el area ncluyendo hasta el k-esimo vecino

Page 47: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 47

Estimacion de kernel bivariado

)||||

(1

)(ˆ1

2

n

i

i

hK

nhf

xtt

))]()'[((1

)(ˆ 2/11

121

ii

n

iHK

hnhf xtxtt

Sean xi=(x,y) los valores observados y t=(t1,t2) un punto del plano donde se desea estimar la densidad conjunta

2

2

2

1

0

0

h

hH

Si h1=h2=h

Page 48: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 48

• (a1, a2)H-12

2

2

2

2

2

1 ||||

2

1

h

a

h

aa

a

a

2

2

1

/10

0/1

h

hH

donde

Page 49: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 49

Estimacion de densidad-Kernel Gaussiano bivariado

n

i

h

ty

h

tx

e

hnhf

1

2

)(

2

)(

21 2

1)(ˆ

22

22

21

21

t

Page 50: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 50

10

20

30

40

2000

3000

4000

50000 e+00

1 e-05

2 e-05

3 e-05

densidad conjunta estimada por metodo kernel

f1= kde2d(autompg1$V1, autompg1$V5,n=100)persp(f1$x,f1$y,f1$z)

Page 51: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 51

mpg

wei

ght

10 20 30 40

1500

2500

3500

4500

grafica de contorno de la densidad estimada

contour(f1, levels=c(8e-6,2e-5, 2.8e-5),col=c(2,3,4), xlab="mpg",ylab="weight")

Page 52: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 52

In 2 dimensions

Let X,Y be a pair of continuous random variables, and let R be some region of (X,Y) space…

Ryx

dydxyxpRYXP),(

),()),((

Page 53: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 53

In 2 dimensions

Let X,Y be a pair of continuous random variables, and let R be some region of (X,Y) space…

Ryx

dydxyxpRYXP),(

),()),((

P( 20<mpg<30 and 2500<weight<3000) =

volumen under the 2-d surface within the red rectangle

Page 54: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 54

In 2 dimensions

Let X,Y be a pair of continuous random variables, and let R be some region of (X,Y) space…

Ryx

dydxyxpRYXP),(

),()),((

P( [(mpg-25)/10]2 + [(weight-3300)/1500]2

< 1 ) =

volumen under the 2-d surface within the red oval

Page 55: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 55

In 2 dimensions

Let X,Y be a pair of continuous random variables, and let R be some region of (X,Y) space…

Ryx

dydxyxpRYXP),(

),()),((

Take the special case of region R = “everywhere”.

Remember that with probability 1, (X,Y) will be drawn from “somewhere”.

So..

x y

dydxyxp 1),(

Page 56: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 56

In 2 dimensions

Let X,Y be a pair of continuous random variables, and let R be some region of (X,Y) space…

Ryx

dydxyxpRYXP),(

),()),((

20h

2222lim h

hyY

hy

hxX

hxP

),( yxp

Page 57: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 57

In m dimensions

Let (X1,X2,…Xm) be an n-tuple of continuous random variables, and let R be some region of Rm …

)),...,,(( 21 RXXXP m

Rxxx

mm

m

dxdxdxxxxp),...,,(

1221

21

,,...,),...,,(...

Page 58: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 58

Independence

If X and Y are independent then knowing the value of X does not help predict the

value of Y

)()(),( :yx, iff ypxpyxpYX

mpg,weight NOT independent

Page 59: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 59

Independence

If X and Y are independent then knowing the value of X does not help predict the

value of Y

)()(),( :yx, iff ypxpyxpYX

the contours say that acceleration and weight

are independent

Page 60: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 60

Multivariate Expectation

xxxXμX dpE )(][

E[mpg,weight] =(24.5,2600)

The centroid of the cloud

Page 61: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 61

Multivariate Expectation> f1= kde2d(autompg1$mpg, autompg1$weight,n=100)> dx=f1$x[2]-f1$x[1]> dy=f1$y[2]-f1$y[1]> dx[1] 0.379798> dy[1] 35.62626> meanmpg=sum(f1$x*f1$z)*dx*dy[1] 22.48855> meanweight=sum(f1$y*f1$z)*dx*dy[1] 2848.638>#estimated mean> mean(autompg1$weight)[1] 2977.584> mean(autompg1$mpg)[1] 23.44592

Page 62: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 62

Multivariate Expectation

xxxX dpffE )()()]([

Page 63: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 63

Test your understanding? ][][][ does ever) (if When :Question YEXEYXE

•All the time? Siempre

•Only when X and Y are independent?

•It can fail even if X and Y are independent?

Page 64: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 64

Bivariate Expectation

dydxyxpxXE ),(][

dydxyxpyxfyxfE ),(),()],([

dydxyxpyYE ),(][

dydxyxpyxYXE ),()(][

][][][ YEXEYXE

Page 65: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 65

Bivariate Covariance)])([(],Cov[ yxxy YXEYX

])[(][],Cov[ 22xxxx XEXVarXX

])[(][],Cov[ 22yyyy YEYVarYY

Page 66: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 66

Bivariate Covariance)])([(],Cov[ yxxy YXEYX

])[(][],Cov[ 22xxxx XEXVarXX

])[(][],Cov[ 22yyyy YEYVarYY

then, Write

Y

X X

yxy

xyxTxx ))((E 2

2

][] [

ΣμXμXXCov

Page 67: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 67

Covarianza y desviacion estandar estimadas entre mpg y weight

> cov(autompg1[,c(1,5)]) mpg weightmpg 60.91814 -5517.441weight -5517.44070 721484.709> sd(autompg1$mpg)[1] 7.805007> sd(autompg1$weight)[1] 849.4026

Page 68: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 68

Covariance Intuition

E[mpg,weight] =(24.5,2600)

8mpg 8mpg

700weight

700weight

Page 69: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 69

Covariance Intuition

E[mpg,weight] =(24.5,2600)

8mpg 8mpg

700weight

700weight

PrincipalEigenvector

of

Page 70: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 70

Regression Line

)()/(2 x

X

xy

y xxXYE

Notice that the regression line pass trough (x,y)

Page 71: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 71

Regression Line>l1=lm(weight~mpg,data=autompg1)> l1

Call:lm(formula = weight ~ mpg, data = autompg1)

Coefficients:(Intercept) mpg 5101.11 -90.57

>#slope of regression line>slope= -5517.44/60.918[1] -90.571

Page 72: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 72

Primer Principal component> a=cov(autompg1[,c(1,5)])> eigen(a)$values[1] 721526.90386 18.72329

$vectors [,1] [,2][1,] -0.007647317 0.999970759[2,] 0.999970759 0.007647317

#slope of primer principal component

> .99997/-.00764[1] –130.8861

Page 73: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 73

Covariance Fun Facts

yxy

xyxTxx ))((E 2

2

][] [

ΣμXμXXCov

•True or False: If xy = 0 then X and Y are independent. False

•True or False: If X and Y are independent then xy = 0. True

•True or False: If xy = x y then X and Y are deterministically related. True

•True or False: If X and Y are deterministically related then xy = x y.

false

How could you prove or disprove these?

Page 74: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 74

Test your understanding? ][][][ does ever) (if When :Question YVarXVarYXVar

•All the time?

•Only when X and Y are independent? Cierto

•It can fail even if X and Y are independent?

Page 75: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 75

Marginal Distributions

y

dyyxpxp ),()(

Page 76: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 76

Conditional Distributions

yYX

yxp

when of p.d.f.

)|(

)4600weight|mpg( p

)3200weight|mpg( p

)2000weight|mpg( p

Page 77: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 77

Conditional Distributions

yYX

yxp

when of p.d.f.

)|(

)4600weight|mpg( p

)(

),()|(

yp

yxpyxp

Why?

Page 78: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 78

Independence Revisited

It’s easy to prove that these statements are equivalent…

)()(),( :yx, iff ypxpyxpYX

)()|( :yx,

)()|( :yx,

)()(),( :yx,

ypxyp

xpyxp

ypxpyxp

Page 79: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 79

More useful stuff

BayesRule

(These can all be proved from definitions on previous slides)

1)|(

x

dxyxp

)|(

)|,(),|(

zyp

zyxpzyxp

)(

)()|()|(

yp

xpxypyxp

Page 80: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 80

Mixing discrete and continuous variables

h

vAh

xXh

xPvAxp

22),( lim

0h

1),(1

An

v x

dxvAxp

BayesRule

BayesRule)(

)()|()|(

AP

xpxAPAxp

)(

)()|()|(

xp

APAxpxAP

Page 81: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 81

0 5 10 15

0.00

0.05

0.10

0.15

0.20

0.25

x

fest

clase 1

clase 2

Estimacion de funcion de dendidad conjunta mixta

P(educacion,salario>50k)

Page 82: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 82

5 10 15

0.4

0.6

0.8

1.0

1:16

b[, 1

]

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

class 1

class 2

conditional density estimation de educacion por clase

Estimation of the posterior P(Class/Education)

Page 83: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 83

Mixing discrete and continuous variables

P(EduYears,Wealthy)

Page 84: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 84

Mixing discrete and continuous variables

P(EduYears,Wealthy)

P(Wealthy| EduYears)

Page 85: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 85

Mixing discrete and continuous variables

Renorm

aliz

ed

Axes

P(EduYears,Wealthy)

P(Wealthy| EduYears)

P(EduYears|Wealthy)

Page 86: Probability Densities in Data Mining

Copyright © Andrew W. Moore Slide 86

Ejercicios

• Suppose X and Y are independent real-valued random variables distributed between 0 and 1:• What is p[min(X,Y)]? • What is E[min(X,Y)]?

• Prove that E[X] is the value u that minimizes E[(X-u)2]

• What is the value u that minimizes E[|X-u|]?