Probability Densities in Data Mining

Copyright © Andrew W. Moore

Andrew W. Moore
Professor, School of Computer Science, Carnegie Mellon University

www.cs.cmu.edu/[email protected]

412-268-7599


Contents
• Why they are important.
• Notation and fundamentals of continuous PDFs.
• Multivariate continuous PDFs.
• Combining discrete and continuous random variables.


Why are they important?
• Real numbers occur in at least 50% of database records
• Can't always quantize them
• So we need to understand how to describe where they come from
• A great way of saying what's a reasonable range of values
• A great way of saying how multiple attributes should reasonably co-occur


Why are they important?
• Can immediately get us Bayes classifiers that are sensible with real-valued data
• You'll need to intimately understand PDFs in order to do kernel methods, clustering with mixture models, analysis of variance, time series and many other things
• Will introduce us to linear and non-linear regression


A PDF of American Ages in 2000


Population of Puerto Rico by age group

group   freq     midpoint   freq.rela
0-4     284593    2.5       0.0737516
5-9     301424    7.5       0.0781133
10-14   305025   12.5       0.0790465
15-19   305577   17.5       0.0791895
20-24   299362   22.5       0.0775789
25-29   277415   27.5       0.0718914
30-34   262959   32.5       0.0681452
35-39   265154   37.5       0.0687140
40-44   258211   42.5       0.0669147
45-49   239965   47.5       0.0621863
50-54   233597   52.5       0.0605361
55-59   206552   57.5       0.0535274
60-64   169796   62.5       0.0440022
65-69   141869   67.5       0.0367650
70-74   112416   72.5       0.0291323
75-79    85137   77.5       0.0220630
80-84    57953   82.5       0.0150184
85+      51801   87.5       0.0134241


[Figure: pdf of the population age in Puerto Rico in 2000; x-axis: pobpr$midpoint, y-axis: pobpr$freq.rela]


A PDF of American Ages in 2000

Let X be a continuous random variable.

If p(x) is a Probability Density Function for X then…

$$P(a < X \le b) = \int_{x=a}^{b} p(x)\,dx$$

$$P(30 < \text{Age} \le 50) = \int_{\text{age}=30}^{50} p(\text{age})\,d\text{age} = 0.36$$


Properties of PDFs

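$$P(a < X \le b) = \int_{x=a}^{b} p(x)\,dx$$

That means…

$$p(x) = \lim_{h \to 0} \frac{P\left(x - \frac{h}{2} < X \le x + \frac{h}{2}\right)}{h}$$

$$\frac{\partial}{\partial x} P(X \le x) = p(x)$$

To see why, apply the mean value theorem for integrals (p continuous):

$$P\left(x - \tfrac{h}{2} < X \le x + \tfrac{h}{2}\right) = \int_{x-h/2}^{x+h/2} p(t)\,dt = \left[(x + h/2) - (x - h/2)\right] p(w) = h\,p(w)$$

where x - h/2 < w < x + h/2. Then,

$$\frac{P\left(x - \tfrac{h}{2} < X \le x + \tfrac{h}{2}\right)}{h} = p(w)$$

So p(w) tends to p(x) as h tends to zero.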


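$$P(X \le x) = \int_{t=-\infty}^{x} p(t)\,dt$$

This is the cumulative distribution function. It is a non-decreasing function.

$$\frac{d\,P(X \le x)}{dx} = \lim_{h \to 0} \frac{P(X \le x+h) - P(X \le x)}{h} = \lim_{h \to 0} \frac{\int_{x}^{x+h} p(t)\,dt}{h} = \lim_{h \to 0} \frac{h\,p(w)}{h} = \lim_{h \to 0} p(w) = p(x), \qquad x < w < x+h$$

Note that p(w) tends to p(x) as h tends to 0.

This shows that the derivative of the distribution function gives the density function.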


Properties of PDFs

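$$P(a < X \le b) = \int_{x=a}^{b} p(x)\,dx$$

Therefore…

$$\int_{x=-\infty}^{\infty} p(x)\,dx = 1$$

$$\frac{\partial}{\partial x} P(X \le x) = p(x)$$

Therefore…

$$\forall x: p(x) \ge 0$$

The derivative of a non-decreasing function is greater than or equal to zero.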


• What is the meaning of p(x)? If

p(5.31) = 0.06 and p(5.92) = 0.03

then when a value of X is sampled from the distribution, X is twice as likely to fall very close to 5.31 as very close to 5.92.


Yet another way to view a PDF

A recipe for sampling a random age.

1. Generate a random dot from the rectangle surrounding the PDF curve. Call the dot (age,d)

2. If d < p(age) stop and return age

3. Else try again: go to Step 1.
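The recipe above can be sketched in R. As an assumption for illustration, p(age) is taken to be the piecewise-constant curve implied by the Puerto Rico age-group table (each 5-year group spread evenly over its interval, with 85+ treated as 85 to 90); `p_age` and `sample_age` are hypothetical helper names, not part of the original slides.

```r
set.seed(1)  # reproducible draws

# Frequencies of the 18 five-year age groups (0-4, 5-9, ..., 85+) from the table
freq <- c(284593, 301424, 305025, 305577, 299362, 277415, 262959, 265154,
          258211, 239965, 233597, 206552, 169796, 141869, 112416, 85137,
          57953, 51801)
lower <- seq(0, 85, by = 5)        # left edge of each age group
p_bin <- freq / sum(freq) / 5      # density height (per year) within each group

# Assumed density: constant within each group, 85+ treated as [85, 90)
p_age <- function(age) p_bin[findInterval(age, lower)]

sample_age <- function() {
  repeat {
    age <- runif(1, 0, 90)             # Step 1: random dot (age, d) in the rectangle
    d   <- runif(1, 0, max(p_bin))
    if (d < p_age(age)) return(age)    # Step 2: keep the dot if it lies under the curve
  }                                    # Step 3: otherwise try again
}

ages <- replicate(20000, sample_age())
mean(ages)   # should land near the table mean of roughly 35.2
```

The fraction of dots accepted equals the area under the curve divided by the area of the rectangle, so a tight rectangle wastes fewer draws.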


Test your understanding
• True or False:

$$\forall x: p(x) \le 1$$

$$\forall x: P(X = x) = 0$$




Expectations

E[X] = the expected value of random variable X

= the average value we'd see if we took a very large number of random samples of X

$$= \int_{x=-\infty}^{\infty} x\,p(x)\,dx$$

= the first moment of the shape formed by the axes and the blue curve

= the best value to choose if you must guess an unknown person's age and you'll be fined the square of your error

E[age] = 35.897


Expectation of a function

μ = E[f(X)] = the expected value of f(x) where x is drawn from X's distribution.

= the average value we'd see if we took a very large number of random samples of f(X)

$$\mu = \int_{x=-\infty}^{\infty} f(x)\,p(x)\,dx$$

Note that in general:

$$E[f(X)] \ne f(E[X])$$

$$E[\text{age}^2] = 1786.64 \qquad (E[\text{age}])^2 = 1288.62$$


Variance

σ² = Var[X] = the expected squared difference between x and E[X]

$$\sigma^2 = \int_{x=-\infty}^{\infty} (x - \mu)^2\,p(x)\,dx$$

= amount you'd expect to lose if you must guess an unknown person's age and you'll be fined the square of your error, and assuming you play optimally

Var[age] = 498.02


Standard Deviation

σ² = Var[X] = the expected squared difference between x and E[X]

$$\sigma^2 = \int_{x=-\infty}^{\infty} (x - \mu)^2\,p(x)\,dx$$

= amount you'd expect to lose if you must guess an unknown person's age and you'll be fined the square of your error, and assuming you play optimally

Var[age] = 498.02

σ = Standard Deviation = "typical" deviation of X from its mean

$$\sigma = \sqrt{\mathrm{Var}[X]}$$

σ = 22.32

$$\sigma^2 = E(X^2) - [E(X)]^2$$


Statistics for Puerto Rico
• E(age) = 35.17
• Var(age) = 501.16
• Std.Dev(age) = 22.38
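These summary statistics can be reproduced from the age-group table by treating each group's midpoint as the age of everyone in that group. The sketch below is a minimal illustration under that assumption; the variable names are chosen here and are not from the original slides.

```r
# Midpoints and frequencies of the 18 age groups from the Puerto Rico table
midpoint <- seq(2.5, 87.5, by = 5)
freq <- c(284593, 301424, 305025, 305577, 299362, 277415, 262959, 265154,
          258211, 239965, 233597, 206552, 169796, 141869, 112416, 85137,
          57953, 51801)
freq.rela <- freq / sum(freq)                          # relative frequencies

mean_age <- sum(midpoint * freq.rela)                  # E[age]
var_age  <- sum((midpoint - mean_age)^2 * freq.rela)   # Var[age] = E[(age - mu)^2]
sd_age   <- sqrt(var_age)                              # standard deviation

c(mean = mean_age, var = var_age, sd = sd_age)         # roughly 35.2, 501.2, 22.4
```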


Well-known density functions

• The uniform or rectangular density
• The triangular density
• The exponential density
• The Gamma and Chi-square densities
• The Beta density
• The Normal or Gaussian density
• The t and F densities.


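The rectangular distribution

$$p(x) = \begin{cases} \dfrac{1}{w} & \text{if } |x| \le \dfrac{w}{2} \\ 0 & \text{if } |x| > \dfrac{w}{2} \end{cases}$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = \frac{w^2}{12}$$

[Figure: rectangular density of height 1/w on the interval from -w/2 to w/2]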


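The triangular distribution

$$p(x) = \begin{cases} \dfrac{w - |x|}{w^2} & \text{if } |x| \le w \\ 0 & \text{if } |x| > w \end{cases}$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = \frac{w^2}{6}$$

[Figure: triangular density with peak height 1/w at 0, falling to zero at -w and w]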


The Exponential distribution

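$$p(x) = \begin{cases} \dfrac{1}{\mu}\,e^{-x/\mu} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

$$E[X] = \mu \qquad \mathrm{Var}[X] = \mu^2$$

[Figure: exponential density with rate 0.1 (μ = 10), plotted as 0.1 * exp(-0.1 * x) for x from 0 to 100]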


The Standard Normal distribution

$$p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = 1$$


The General Normal (Gaussian) distribution

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

$$E[X] = \mu \qquad \mathrm{Var}[X] = \sigma^2$$

[Figure: Gaussian density with μ = 100 and σ = 15]

Shorthand: We say X ~ N(μ, σ²) to mean "X is distributed as a Gaussian with parameters μ and σ²".

In the above figure, X ~ N(100, 15²)

Also known as the normal distribution or bell-shaped curve


The Error Function

Assume X ~ N(0,1)

Define ERF(x) = P(X<x) = Cumulative Distribution of X

$$ERF(x) = \int_{z=-\infty}^{x} p(z)\,dz = \frac{1}{\sqrt{2\pi}} \int_{z=-\infty}^{x} \exp\left(-\frac{z^2}{2}\right) dz$$


Using The Error Function

Assume X ~ N(μ, σ²)

$$P(X < x \mid \mu, \sigma^2) = ERF\left(\frac{x - \mu}{\sigma}\right)$$
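In R, the cumulative distribution of the standard normal (what these slides call ERF, which is not the same scaling as the conventional mathematical erf) is available as `pnorm`. The sketch below checks the standardization step for the N(100, 15²) example; the evaluation point x = 115 is an arbitrary choice for illustration.

```r
mu <- 100
sigma <- 15
x <- 115                                        # arbitrary illustration point

# Standardize, then use the N(0,1) cumulative distribution
p_standardized <- pnorm((x - mu) / sigma)

# Same probability computed directly from N(mu, sigma^2)
p_direct <- pnorm(x, mean = mu, sd = sigma)

c(p_standardized, p_direct)                     # both about 0.8413
```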


The Central Limit Theorem
• If (X1, X2, …, Xn) are i.i.d. continuous random variables
• Then define

$$z = f(x_1, x_2, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i$$

• As n-->infinity, p(z)--->Gaussian with mean E[Xi] and variance Var[Xi]/n

Somewhat of a justification for assuming Gaussian noise is common
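A small simulation can make the theorem concrete. The sketch below averages n draws from an exponential distribution (chosen here only because it is clearly non-Gaussian) and compares the mean and variance of the averages with E[Xi] and Var[Xi]/n; the sample sizes are arbitrary assumptions.

```r
set.seed(42)
n    <- 50       # number of i.i.d. draws averaged into each z
reps <- 10000    # number of independent z values simulated
rate <- 0.1      # exponential with E[Xi] = 10 and Var[Xi] = 100

# Each z is the average of n exponential draws
z <- replicate(reps, mean(rexp(n, rate = rate)))

c(mean_z = mean(z), var_z = var(z))   # compare with 10 and 100 / 50 = 2
# hist(z, freq = FALSE) would show the roughly bell-shaped distribution of z
```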


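Density function estimators

• Histograms: $\hat f(x) = \dfrac{k}{nh}$, where h is the class (bin) width and k is the number of observations in the class containing x

• K-nearest neighbors: $\hat f(x) = \dfrac{k}{2 n d_k}$, where d_k is the distance to the k-th nearest neighbor

• Kernel density estimators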


Density function estimation: histogram

[Figure: density-scaled histogram of x with a rug of the data points; x-axis: x, y-axis: Density]

> x=c( 7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)
> hist(x,freq=F,main="Estimacion de funcion de densidad-histograma")
> rug(x,col=2)


Density estimation by knn at 20 points with k=1,3,5,7

[Figure: four panels, one per value of k, each plotting fest against x]
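The k-nearest-neighbor estimate k/(2 n d_k) can be sketched in a few lines of R. The data are the ten values used in the histogram example; `knn_density` is a hypothetical helper name, and the grid of 20 evaluation points is an assumption about how the panels were produced.

```r
x <- c(7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)

# k-NN density estimate at a single point t: k / (2 * n * d_k),
# where d_k is the distance from t to its k-th nearest observation
knn_density <- function(t, x, k) {
  d_k <- sort(abs(x - t))[k]
  k / (2 * length(x) * d_k)
}

grid <- seq(0, 8, length.out = 20)               # 20 evaluation points
fest <- sapply(grid, knn_density, x = x, k = 3)  # estimate with k = 3

knn_density(7, x, k = 3)   # the three nearest points lie within 0.3, giving 3 / (2 * 10 * 0.3)
# plot(grid, fest, type = "l") would draw one panel
```

With k = 1 the estimate blows up wherever an evaluation point coincides with an observation (d_k = 0), which is one reason small k gives very spiky curves.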


Kernel estimation of a univariate density function.

In the univariate case, the kernel estimator of the density function f(x) is obtained as follows. Let x1,…,xn be a random sample of a random variable X with density function f(x), and define the empirical distribution function by

$$F_n(x) = \frac{\#\{\text{obs} \le x\}}{n}$$

which is an estimator of the cumulative distribution function F(x) of X. Since the density function f(x) is the derivative of the distribution function F, approximating the derivative by a finite difference gives

$$\hat f(x) = \frac{F_n(x + h) - F_n(x - h)}{2h}$$

where h is a positive value close to zero. This is equivalent to the proportion of points in the interval (x-h, x+h) divided by 2h. The equation above can be written as:

$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where the weight function K is defined by

$$K(z) = \begin{cases} 0 & \text{if } |z| > 1 \\ 1/2 & \text{if } |z| \le 1 \end{cases}$$


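Sample: 6, 8, 9, 12, 20, 25, 18, 31

$$\hat f(15) = \frac{\text{proportion of points in } (15 - h,\ 15 + h)}{2h}$$

With h = 4:

$$\hat f(15) = \frac{\text{proportion of points in } (11, 19)}{8} = \frac{2/8}{8} = \frac{2}{64} = \frac{1}{32}$$

$$\hat f(15) = \frac{1}{8 \cdot 4}\left[0 + 0 + 0 + \tfrac{1}{2} + 0 + 0 + \tfrac{1}{2} + 0\right] = \frac{1}{32}$$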
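The worked example can be checked numerically. The sketch below implements the kernel sum with the uniform weight function and evaluates it at x = 15 with h = 4; `K_unif` and `kde_at` are hypothetical helper names introduced for this illustration.

```r
xs <- c(6, 8, 9, 12, 20, 25, 18, 31)   # the sample from the example

# Uniform kernel: 1/2 inside [-1, 1], 0 outside
K_unif <- function(z) ifelse(abs(z) <= 1, 0.5, 0)

# Kernel estimate at a point t: (1 / (n h)) * sum of K((t - x_i) / h)
kde_at <- function(t, xs, h, K) sum(K((t - xs) / h)) / (length(xs) * h)

f15 <- kde_at(15, xs, h = 4, K = K_unif)
f15            # 0.03125, i.e. 1/32
```

Only 12 and 18 fall inside (11, 19), so exactly two terms of the sum contribute 1/2 each.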


The weight function K above is called the uniform kernel, and h is called the bandwidth, a smoothing parameter that indicates how much each sample point contributes to the estimate at the point x. In general, K and h must satisfy certain regularity conditions, such as:

• $\int_{-\infty}^{\infty} K(z)\,dz = 1$

• K(z) must be bounded and absolutely integrable on (-∞, ∞)

• Usually, but not always, K(z) ≥ 0 and symmetric, so any symmetric density function can be used as a kernel.

• $\lim_{n \to \infty} h(n) = 0$


Choice of the bandwidth h

$$h = 1.06\, s\, n^{-0.2}$$

where n is the number of data points and s is the sample standard deviation.

$$h = 0.79\,(Q_3 - Q_1)\, n^{-0.2}$$

where Q3 - Q1 is the interquartile range of the sample.
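Both rules are one-liners in R. The sketch below applies them to the ten-value sample from the histogram example; note that `IQR` uses R's default quantile definition, which is an assumption about how Q1 and Q3 are meant to be computed here.

```r
x <- c(7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)
n <- length(x)

h_sd  <- 1.06 * sd(x)  * n^(-0.2)   # rule based on the sample standard deviation
h_iqr <- 0.79 * IQR(x) * n^(-0.2)   # rule based on the interquartile range Q3 - Q1

c(h_sd = h_sd, h_iqr = h_iqr)       # roughly 1.58 and 1.23 for this sample
```

The interquartile rule is less sensitive to outliers, which is why it gives a smaller bandwidth for this left-skewed sample.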


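THE GAUSSIAN KERNEL

In this case the kernel is a smoother weight function in which all the points contribute to the estimate of f(x) at x. That is,

$$K(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z^2\right)$$

$$\hat f(x) = \frac{1}{nh\sqrt{2\pi}} \sum_{i=1}^{n} \exp\left(-\frac{(x - x_i)^2}{2h^2}\right)$$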
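A minimal sketch of the Gaussian kernel estimator in R follows, using the ten-value sample from the histogram example and the bandwidth rule h = 1.06 s n^(-0.2). `kde_gauss` is a hypothetical helper name, and the evaluation grid is an assumption made for illustration; R's built-in `density()` offers a production version of the same idea.

```r
x <- c(7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)
n <- length(x)
h <- 1.06 * sd(x) * n^(-0.2)          # bandwidth from the standard-deviation rule

# Gaussian kernel estimate at t: (1 / (n h)) * sum of K((t - x_i) / h),
# where K is the standard normal density (dnorm)
kde_gauss <- function(t, x, h) sum(dnorm((t - x) / h)) / (length(x) * h)

grid <- seq(min(x) - 4 * h, max(x) + 4 * h, length.out = 400)
fest <- sapply(grid, kde_gauss, x = x, h = h)

# A valid density estimate should integrate to about 1 (Riemann sum check)
area <- sum(fest) * (grid[2] - grid[1])
area
# plot(grid, fest, type = "l") would draw the smooth estimate
```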


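THE TRIANGULAR KERNEL

$$K(z) = \begin{cases} 1 - |z| & \text{for } |z| < 1 \\ 0 & \text{otherwise} \end{cases}$$

THE "BIWEIGHT" KERNEL

$$K(z) = \begin{cases} \dfrac{15}{16}\,(1 - z^2)^2 & \text{for } |z| < 1 \\ 0 & \text{otherwise} \end{cases}$$

THE EPANECHNIKOV KERNEL

$$K(z) = \begin{cases} \dfrac{3}{4\sqrt{5}}\left(1 - \dfrac{z^2}{5}\right) & \text{for } |z| < \sqrt{5} \\ 0 & \text{otherwise} \end{cases}$$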


Density estimation at 20 points using a Gaussian kernel with h=.5, "opt1", "opt2", 4

[Figure: four panels, one per bandwidth, each plotting fest against x]


Two-dimensional random variables

p(x,y) = probability density of random variables (X,Y) at location (x,y)


Two-dimensional density function estimators

• Histograms: $\hat f(\mathbf{x}) = \dfrac{k}{nA}$, where A is the area of the class (cell)

• K-nearest neighbors: $\hat f(\mathbf{x}) = \dfrac{k}{n A_k}$, where A_k is the area that reaches out to the k-th nearest neighbor

• Kernel density estimators


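Bivariate kernel estimation

Let x_i = (x_i, y_i), i = 1,…,n, be the observed values and t = (t1, t2) a point of the plane where the joint density is to be estimated.

$$\hat f(\mathbf{t}) = \frac{1}{n h_1 h_2} \sum_{i=1}^{n} K\left(\left[(\mathbf{t} - \mathbf{x}_i)'\,H^{-1}\,(\mathbf{t} - \mathbf{x}_i)\right]^{1/2}\right), \qquad H = \begin{pmatrix} h_1^2 & 0 \\ 0 & h_2^2 \end{pmatrix}$$

If h1 = h2 = h:

$$\hat f(\mathbf{t}) = \frac{1}{n h^2} \sum_{i=1}^{n} K\left(\frac{\|\mathbf{t} - \mathbf{x}_i\|}{h}\right)$$

• For a vector a = (a1, a2):

$$(a_1, a_2)\,H^{-1} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} = \frac{a_1^2}{h_1^2} + \frac{a_2^2}{h_2^2}, \qquad \text{where } H^{-1} = \begin{pmatrix} 1/h_1^2 & 0 \\ 0 & 1/h_2^2 \end{pmatrix}$$

which reduces to ||a||²/h² when h1 = h2 = h.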


Density estimation: bivariate Gaussian kernel

$$\hat f(\mathbf{t}) = \frac{1}{2\pi\, n\, h_1 h_2} \sum_{i=1}^{n} \exp\left(-\frac{(x_i - t_1)^2}{2h_1^2} - \frac{(y_i - t_2)^2}{2h_2^2}\right)$$


[Figure: perspective plot of the joint density estimated by the kernel method]

library(MASS)  # provides kde2d
f1= kde2d(autompg1$V1, autompg1$V5,n=100)
persp(f1$x,f1$y,f1$z)


[Figure: contour plot of the estimated density; x-axis: mpg, y-axis: weight]

contour(f1, levels=c(8e-6,2e-5, 2.8e-5),col=c(2,3,4), xlab="mpg",ylab="weight")


In 2 dimensions

Let X,Y be a pair of continuous random variables, and let R be some region of (X,Y) space…

$$P((X,Y) \in R) = \iint_{(x,y) \in R} p(x,y)\,dy\,dx$$


In 2 dimensions


$$P((X,Y) \in R) = \iint_{(x,y) \in R} p(x,y)\,dy\,dx$$

P( 20<mpg<30 and 2500<weight<3000) =

volume under the 2-d surface within the red rectangle
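The "volume under the surface" can be approximated by summing the density over a fine grid of cells inside the rectangle. Since the autompg data are not reproduced here, the sketch below uses a stand-in density that is purely an assumption: independent Gaussians with means (24.5, 2600) and standard deviations (8, 700). That choice also gives a closed-form answer to check against; with a kernel estimate, the same sum would be taken over the estimated grid values.

```r
# Stand-in joint density (assumption): independent Gaussians for mpg and weight
p_xy <- function(x, y) dnorm(x, 24.5, 8) * dnorm(y, 2600, 700)

# Midpoint-rule grid over the rectangle 20 < mpg < 30, 2500 < weight < 3000
nx <- 200; ny <- 200
dx <- (30 - 20) / nx
dy <- (3000 - 2500) / ny
xm <- 20   + (seq_len(nx) - 0.5) * dx     # cell midpoints along mpg
ym <- 2500 + (seq_len(ny) - 0.5) * dy     # cell midpoints along weight

z <- outer(xm, ym, p_xy)                  # density at every cell midpoint
prob <- sum(z) * dx * dy                  # volume under the surface over R

# Closed form available only because the stand-in density factorizes
exact <- (pnorm(30, 24.5, 8) - pnorm(20, 24.5, 8)) *
         (pnorm(3000, 2600, 700) - pnorm(2500, 2600, 700))
c(prob = prob, exact = exact)
```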


In 2 dimensions


$$P((X,Y) \in R) = \iint_{(x,y) \in R} p(x,y)\,dy\,dx$$

P( [(mpg-25)/10]² + [(weight-3300)/1500]² < 1 ) =

volume under the 2-d surface within the red oval


In 2 dimensions


$$P((X,Y) \in R) = \iint_{(x,y) \in R} p(x,y)\,dy\,dx$$

Take the special case of region R = "everywhere".

Remember that with probability 1, (X,Y) will be drawn from "somewhere".

So..

$$\int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} p(x,y)\,dy\,dx = 1$$


In 2 dimensions


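$$P((X,Y) \in R) = \iint_{(x,y) \in R} p(x,y)\,dy\,dx$$

$$p(x,y) = \lim_{h \to 0} \frac{P\left(x - \frac{h}{2} < X \le x + \frac{h}{2} \ \wedge\ y - \frac{h}{2} < Y \le y + \frac{h}{2}\right)}{h^2}$$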


In m dimensions

Let (X1, X2, …, Xm) be an m-tuple of continuous random variables, and let R be some region of R^m …

$$P((X_1, X_2, \ldots, X_m) \in R) = \int \!\! \int \!\cdots\! \int_{(x_1, x_2, \ldots, x_m) \in R} p(x_1, x_2, \ldots, x_m)\,dx_m \cdots dx_2\,dx_1$$


Independence

If X and Y are independent then knowing the value of X does not help predict the value of Y

$$X \perp Y \iff \forall x,y: p(x,y) = p(x)\,p(y)$$

mpg, weight: NOT independent

The contours say that acceleration and weight are independent


Multivariate Expectation

$$\boldsymbol{\mu}_{\mathbf{X}} = E[\mathbf{X}] = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}$$

E[mpg,weight] = (24.5, 2600)

The centroid of the cloud


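Multivariate Expectation

> f1= kde2d(autompg1$mpg, autompg1$weight,n=100)
> dx=f1$x[2]-f1$x[1]
> dy=f1$y[2]-f1$y[1]
> dx
[1] 0.379798
> dy
[1] 35.62626
> meanmpg=sum(f1$x*f1$z)*dx*dy
[1] 22.48855
> meanweight=sum(f1$y*f1$z)*dx*dy
[1] 2848.638
> # note: f1$y*f1$z recycles y along the rows (the mpg axis) of f1$z;
> # sum(f1$z %*% f1$y)*dx*dy weights each column by its weight value instead
> #estimated mean
> mean(autompg1$weight)
[1] 2977.584
> mean(autompg1$mpg)
[1] 23.44592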


Multivariate Expectation

$$E[f(\mathbf{X})] = \int f(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$


Test your understanding

Question: When (if ever) does E[X + Y] = E[X] + E[Y]?

•All the time? Always

•Only when X and Y are independent?

•It can fail even if X and Y are independent?


Bivariate Expectation

$$E[f(x,y)] = \iint f(x,y)\,p(x,y)\,dy\,dx$$

$$E[X] = \iint x\,p(x,y)\,dy\,dx$$

$$E[Y] = \iint y\,p(x,y)\,dy\,dx$$

$$E[X + Y] = \iint (x + y)\,p(x,y)\,dy\,dx$$

$$E[X + Y] = E[X] + E[Y]$$


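Bivariate Covariance

$$\mathrm{Cov}[X,Y] = \sigma_{xy} = E[(X - \mu_x)(Y - \mu_y)]$$

$$\mathrm{Cov}[X,X] = \sigma_{xx} = \sigma_x^2 = \mathrm{Var}[X] = E[(X - \mu_x)^2]$$

$$\mathrm{Cov}[Y,Y] = \sigma_{yy} = \sigma_y^2 = \mathrm{Var}[Y] = E[(Y - \mu_y)^2]$$

Write $\mathbf{X} = \begin{pmatrix} X \\ Y \end{pmatrix}$, then

$$\mathrm{Cov}[\mathbf{X}] = E[(\mathbf{X} - \boldsymbol{\mu}_x)(\mathbf{X} - \boldsymbol{\mu}_x)^T] = \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}$$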


Estimated covariance and standard deviation between mpg and weight

> cov(autompg1[,c(1,5)])
               mpg      weight
mpg       60.91814   -5517.441
weight -5517.44070  721484.709
> sd(autompg1$mpg)
[1] 7.805007
> sd(autompg1$weight)
[1] 849.4026


Covariance Intuition

[Figure: scatter of mpg against weight, annotated with E[mpg,weight] = (24.5, 2600), σ_mpg = 8, σ_weight = 700, and the principal eigenvector of Σ]


Regression Line

$$E[Y \mid X = x] = \mu_y + \frac{\sigma_{xy}}{\sigma_x^2}\,(x - \mu_x)$$

Notice that the regression line passes through (μx, μy)


Regression Line

> l1=lm(weight~mpg,data=autompg1)
> l1

Call:
lm(formula = weight ~ mpg, data = autompg1)

Coefficients:
(Intercept)          mpg
    5101.11       -90.57

> #slope of regression line
> slope= -5517.44/60.918
[1] -90.571


First Principal Component

> a=cov(autompg1[,c(1,5)])
> eigen(a)
$values
[1] 721526.90386     18.72329

$vectors
             [,1]        [,2]
[1,] -0.007647317 0.999970759
[2,]  0.999970759 0.007647317

#slope of first principal component

> .99997/-.00764
[1] -130.8861
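The relationship between the covariance matrix, the regression slope and the first principal component can be reproduced on any bivariate sample. Because the autompg data are not included here, the sketch below generates hypothetical mpg and weight values (the generating coefficients are assumptions chosen only to resemble the example) and compares the three quantities.

```r
set.seed(7)
# Hypothetical stand-in data, not the autompg values
mpg    <- rnorm(300, mean = 24.5, sd = 8)
weight <- 5100 - 90 * mpg + rnorm(300, sd = 400)

S <- cov(cbind(mpg, weight))                     # 2 x 2 sample covariance matrix

slope_cov <- S["mpg", "weight"] / S["mpg", "mpg"]   # sigma_xy / sigma_x^2
fit       <- lm(weight ~ mpg)
slope_lm  <- unname(coef(fit)["mpg"])               # least-squares slope

v        <- eigen(S)$vectors[, 1]                # principal eigenvector of S
slope_pc <- v[2] / v[1]                          # slope of the first principal component

c(slope_cov = slope_cov, slope_lm = slope_lm, slope_pc = slope_pc)

# The fitted line passes through the point of means
unname(coef(fit)[1] + coef(fit)[2] * mean(mpg)) - mean(weight)   # essentially 0
```

The principal component is steeper than the regression line because it minimizes perpendicular distances rather than vertical ones.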


Covariance Fun Facts

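$$\mathrm{Cov}[\mathbf{X}] = E[(\mathbf{X} - \boldsymbol{\mu}_x)(\mathbf{X} - \boldsymbol{\mu}_x)^T] = \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}$$

•True or False: If σxy = 0 then X and Y are independent. False

•True or False: If X and Y are independent then σxy = 0. True

•True or False: If σxy = σx σy then X and Y are deterministically related. True

•True or False: If X and Y are deterministically related then σxy = σx σy. False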

How could you prove or disprove these?


Test your understanding

Question: When (if ever) does Var[X + Y] = Var[X] + Var[Y]?

•All the time?

•Only when X and Y are independent? It always holds when X and Y are independent; more generally it holds exactly when Cov[X,Y] = 0.

•It can fail even if X and Y are independent?
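A short derivation shows where the covariance term comes from. It uses only the definitions of variance and covariance given earlier and the linearity of expectation:

$$
\begin{aligned}
\mathrm{Var}[X + Y] &= E\left[\left((X - \mu_x) + (Y - \mu_y)\right)^2\right] \\
&= E[(X - \mu_x)^2] + E[(Y - \mu_y)^2] + 2\,E[(X - \mu_x)(Y - \mu_y)] \\
&= \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X,Y]
\end{aligned}
$$

So the variances add exactly when σxy = 0, and independence is one sufficient condition for that.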


Marginal Distributions

$$p(x) = \int_{y=-\infty}^{\infty} p(x,y)\,dy$$


Conditional Distributions

p(x|y) = p.d.f. of X when Y = y

[Figure: slices of the joint density showing p(mpg | weight = 4600), p(mpg | weight = 3200) and p(mpg | weight = 2000)]

$$p(x \mid y) = \frac{p(x,y)}{p(y)}$$

Why?


Independence Revisited

It’s easy to prove that these statements are equivalent…


$$\forall x,y: p(x,y) = p(x)\,p(y)$$

$$\forall x,y: p(x \mid y) = p(x)$$

$$\forall x,y: p(y \mid x) = p(y)$$


More useful stuff

(These can all be proved from definitions on previous slides)

$$\int_{x=-\infty}^{\infty} p(x \mid y)\,dx = 1$$

$$p(x \mid y, z) = \frac{p(x, y \mid z)}{p(y \mid z)}$$

Bayes Rule:

$$p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)}$$


Mixing discrete and continuous variables

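$$p(x, A = v) = \lim_{h \to 0} \frac{P\left(x - \frac{h}{2} < X \le x + \frac{h}{2} \ \wedge\ A = v\right)}{h}$$

$$\sum_{v=1}^{n_A} \int_{x=-\infty}^{\infty} p(x, A = v)\,dx = 1$$

Bayes Rule:

$$p(x \mid A) = \frac{P(A \mid x)\,p(x)}{P(A)}$$

Bayes Rule:

$$P(A \mid x) = \frac{p(x \mid A)\,P(A)}{p(x)}$$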
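The second form of Bayes Rule is what a Bayes classifier evaluates. The sketch below assumes, purely for illustration, two classes with Gaussian class-conditional densities p(x | A = v) and made-up priors P(A = v); none of these numbers come from the slides, and `posterior` is a hypothetical helper name.

```r
# Hypothetical two-class setup (all parameter values are assumptions)
prior <- c(class1 = 0.75, class2 = 0.25)   # P(A = v)
mu    <- c(class1 = 9,    class2 = 13)     # mean of x within each class
sigma <- c(class1 = 2.5,  class2 = 2)      # standard deviation within each class

posterior <- function(x) {
  joint <- dnorm(x, mean = mu, sd = sigma) * prior   # p(x | A = v) P(A = v) = p(x, A = v)
  joint / sum(joint)                                 # divide by p(x) = sum over v of p(x, A = v)
}

posterior(8)    # mostly class 1
posterior(16)   # mostly class 2
```

Summing the joint density over the discrete values gives the marginal p(x), which is why the posterior probabilities always add up to 1.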


[Figure: estimation of the mixed joint density function P(education, salary>50k); fest plotted against x, with one curve for class 1 and one for class 2]


Estimation of the posterior P(Class/Education)

[Figure: conditional density estimation of education by class, for education levels 1 to 16, with one curve for class 1 and one for class 2]



[Figure: P(EduYears,Wealthy), shown together with P(Wealthy| EduYears) and P(EduYears|Wealthy) on renormalized axes]


Exercises

• Suppose X and Y are independent real-valued random variables distributed between 0 and 1:
  • What is p[min(X,Y)]?
  • What is E[min(X,Y)]?

• Prove that E[X] is the value u that minimizes E[(X-u)²]

• What is the value u that minimizes E[|X-u|]?
