The PDF Estimation Problem
Scientific Computing and Numerical Analysis Seminar
October 5, 2010
Outline
- The Big Picture
- Basic Probability Theory
- Hermite Polynomial Interpolation
- Histogram Interpolation
- Kernel Estimation
- Data Regeneration
The Big Picture
Continuum-Microscopic (CM) Method Steps
1. Create a microscopic system
2. Run the microscopic updating scheme for a small number of time steps
3. Average the results and send these values to the macro-scale
4. Run the macroscopic updating scheme
The Big Picture
Goal: Perform Step 1 of the CM Algorithm by utilizing past information from the micro-scale
- Track the evolution of the microscopic variables by tracking their probability distribution functions (PDFs)
- Use these PDFs to predict the PDF of each variable at the desired future point in time
Probability Theory
The Probability Distribution Function (PDF)
- A random variable $X$ is defined by its set of possible values $\Omega$ and its probability distribution function $f(X)$
- The probability that $X$ takes on a value between $x$ and $x + dx$ is given by
$$\int_{x}^{x+dx} f(X)\,dX$$
- $f(X)$ is such that its integral normalizes to 1:
$$\int_{\Omega} f(X)\,dX = 1$$
Probability Theory
Expectation
The expected value (or mean) of a probability distribution function is given by:
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$
More generally, the expectation of any function $g(X)$ of a random variable with PDF $f(X)$ is given by:
$$E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx$$
If $f(X)$ is unknown, the expectation can be approximated by taking the average over the given data values $X_1, \ldots, X_N$:
$$E(g(X)) \approx \frac{\sum_{i=1}^{N} g(X_i)}{N}$$
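The data-average approximation of the expectation can be sketched in a few lines of Python. This is only an illustration: the function name `sample_expectation`, the sample size, and the fixed random seed are assumptions made here, and the data are drawn from a normal distribution with mean 0 and standard deviation 1 so that the exact answers are known.

```python
import random

def sample_expectation(g, samples):
    """Approximate E(g(X)) by the average of g over the data values."""
    return sum(g(x) for x in samples) / len(samples)

# Assumed test data: 100,000 draws from a normal distribution (mean 0, std 1).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# E(X) is exactly 0 and E(X^2) is exactly 1 for this distribution.
mean_est = sample_expectation(lambda x: x, data)
second_moment_est = sample_expectation(lambda x: x * x, data)
```

The estimates approach the exact values as the number of data points grows, with an error that shrinks roughly like one over the square root of the sample size.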
Probability Theory
The Cumulative Distribution Function (CDF)
The CDF is defined as:
$$F(x) = \int_{-\infty}^{x} f(X)\,dX$$
- In words, $F(x)$ represents the probability that $X$ takes on a value between $-\infty$ and $x$
- The CDF will be useful for Data Regeneration
Probability Theory
Joint Probability Distribution Function (JPDF)
- Given random variables $X_1, X_2, \ldots, X_N$, the JPDF $f(X_1, X_2, \ldots, X_N)$ is defined so that the probability that $X_1 \in (x_1, x_1 + dx_1)$, $X_2 \in (x_2, x_2 + dx_2)$, ..., $X_N \in (x_N, x_N + dx_N)$ is given by:
$$\int_{x_1}^{x_1+dx_1} \int_{x_2}^{x_2+dx_2} \cdots \int_{x_N}^{x_N+dx_N} f(X_1, X_2, \ldots, X_N)\,dX_1\,dX_2 \cdots dX_N$$
- The JPDF can be written as a product of single-variable PDFs, $f(x_1)\,f(x_2) \cdots f(x_N)$, if the variables are independent
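As a small worked example of the product rule for independent variables, take two independent variables that are each normally distributed with mean 0 and standard deviation 1. The choice of distribution is an illustrative assumption, not something the talk specifies.

```latex
f(x_1, x_2) = f(x_1)\,f(x_2)
            = \frac{1}{\sqrt{2\pi}}\, e^{-x_1^2/2} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-x_2^2/2}
            = \frac{1}{2\pi}\, e^{-(x_1^2 + x_2^2)/2}
```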
Probability Theory
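Common Distribution Functions
Uniform Distribution:
$$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{for } x < a \text{ or } x > b \end{cases}$$
Normal Distribution:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
[Figure: plots of the uniform and normal distributions]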
Probability Theory
The PDF Estimation Problem
- Classic problem of Probability Theory
- Given a set of data, the goal is to determine the PDF $f(X)$ that produced that data
- Common techniques: Series Expansions, Histogram Interpolation, and Kernel Estimation
Probability Theory
The PDF Estimation Problem
Each technique will be tested on a set of data produced by a normal distribution with mean = 0 and standard deviation = 1
Probability Theory
Error Estimation
Error for each technique will be estimated by computing the Mean Square Error (MSE):
$$E\left((f(x) - \hat{f}(x))^2\right)$$
$E$ indicates expectation or average:
$$E\left((f(x) - \hat{f}(x))^2\right) \approx \frac{1}{n} \sum_{i=1}^{n} \left(f(x_i) - \hat{f}(x_i)\right)^2$$
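A minimal sketch of the MSE computation follows. The talk does not say at which points $x_i$ the error is evaluated, so the evenly spaced grid on $[-3, 3]$ is an assumption, as are the function names and the deliberately imperfect estimate `f_hat` (a normal PDF with standard deviation 1.1) used only to produce a nonzero error.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal PDF with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

def mean_square_error(f, f_hat, points):
    """Average of (f(x_i) - f_hat(x_i))^2 over the evaluation points."""
    return sum((f(x) - f_hat(x)) ** 2 for x in points) / len(points)

# Hypothetical estimate: a normal PDF whose width is slightly wrong.
def f_hat(x):
    return normal_pdf(x, 0.0, 1.1)

# Assumed evaluation points: 101 evenly spaced values on [-3, 3].
points = [-3.0 + 0.06 * i for i in range(101)]
mse = mean_square_error(normal_pdf, f_hat, points)
```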
Hermite Polynomial Expansion
- Goal is to estimate the underlying PDF $f(x)$
- $f(x)$ could be approximated by a truncated series expansion:
$$f(x) \approx \sum_{n=0}^{N} c_n H_n(x)$$
where $c_n$ are coefficients and $H_n(x)$ are a set of basis functions
- For this demonstration, we choose $H_n(x)$ to be the orthogonal Hermite polynomials
Hermite Polynomial Expansion
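The Hermite polynomials are defined as:
$$H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2}$$
The Hermite polynomials are orthogonal on $(-\infty, \infty)$, meaning:
$$\int_{-\infty}^{\infty} H_m(x) H_n(x) e^{-x^2}\,dx = \begin{cases} 0 & \text{if } m \ne n \\ n!\,2^n \sqrt{\pi} & \text{if } m = n \end{cases}$$
The orthogonality of $H_n(x)$ will allow for easy computation of the $c_n$ coefficients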
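For concreteness, applying the definition for small $n$ gives the first few Hermite polynomials, and the $m = n = 1$ case of the orthogonality relation can be checked by hand using the standard integral $\int_{-\infty}^{\infty} x^2 e^{-x^2}\,dx = \sqrt{\pi}/2$:

```latex
H_0(x) = 1, \qquad H_1(x) = 2x, \qquad H_2(x) = 4x^2 - 2, \qquad H_3(x) = 8x^3 - 12x

\int_{-\infty}^{\infty} H_1(x)^2 e^{-x^2}\,dx
  = 4 \int_{-\infty}^{\infty} x^2 e^{-x^2}\,dx
  = 4 \cdot \frac{\sqrt{\pi}}{2}
  = 2\sqrt{\pi}
  = 1!\,2^1 \sqrt{\pi}
```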
Hermite Polynomial Interpolation
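Start from the truncated expansion:
$$\hat{f}(x) = \sum_{n=0}^{N} c_n H_n(x)$$
Multiply both sides by $H_m(x) e^{-x^2}$:
$$\hat{f}(x) H_m(x) e^{-x^2} = \sum_{n=0}^{N} c_n H_n(x) H_m(x) e^{-x^2}$$
Integrate over $(-\infty, \infty)$:
$$\int_{-\infty}^{\infty} \hat{f}(x) H_m(x) e^{-x^2}\,dx = \int_{-\infty}^{\infty} \sum_{n=0}^{N} c_n H_n(x) H_m(x) e^{-x^2}\,dx$$
Exchange the sum and the integral:
$$\int_{-\infty}^{\infty} \hat{f}(x) H_m(x) e^{-x^2}\,dx = \sum_{n=0}^{N} \int_{-\infty}^{\infty} c_n H_n(x) H_m(x) e^{-x^2}\,dx$$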
Hermite Polynomial Expansion
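By orthogonality, only the term with matching index survives. Relabeling that index as $n$:
$$\int_{-\infty}^{\infty} \hat{f}(x) H_n(x) e^{-x^2}\,dx = c_n\, n!\,2^n \sqrt{\pi}$$
$$c_n = \frac{1}{n!\,2^n \sqrt{\pi}} \int_{-\infty}^{\infty} \hat{f}(x) H_n(x) e^{-x^2}\,dx$$
Since $\hat{f}$ plays the role of the PDF, the integral is the expectation of $H_n(x) e^{-x^2}$:
$$c_n = \frac{1}{n!\,2^n \sqrt{\pi}}\, E\left(H_n(x) e^{-x^2}\right)$$
Approximating the expectation by the average over the data values $x_i$:
$$c_n \approx \frac{1}{n!\,2^n \sqrt{\pi}} \cdot \frac{\sum_{i=1}^{N} H_n(x_i) e^{-x_i^2}}{N}$$
In this last formula $N$ is the number of data points, not the number of terms in the expansion.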
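The coefficient formula can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code behind the results in the talk: the function names are invented here, the Hermite polynomials are evaluated with the standard recurrence $H_{n+1}(x) = 2x H_n(x) - 2n H_{n-1}(x)$ instead of the derivative definition, `n_terms` counts the terms $n = 0, \ldots, n_{\text{terms}} - 1$, and the test data are 50,000 seeded draws from a normal distribution with mean 0 and standard deviation 1.

```python
import math
import random

def hermite_values(n_max, x):
    """Return [H_0(x), ..., H_{n_max}(x)] using H_{n+1} = 2x H_n - 2n H_{n-1}."""
    values = [1.0]
    if n_max >= 1:
        values.append(2.0 * x)
    for n in range(1, n_max):
        values.append(2.0 * x * values[n] - 2.0 * n * values[n - 1])
    return values

def hermite_coefficients(data, n_terms):
    """c_n = average of H_n(x_i) exp(-x_i^2), divided by n! 2^n sqrt(pi)."""
    sums = [0.0] * n_terms
    for x in data:
        weight = math.exp(-x * x)
        for n, h in enumerate(hermite_values(n_terms - 1, x)):
            sums[n] += h * weight
    count = len(data)
    return [s / count / (math.factorial(n) * 2.0 ** n * math.sqrt(math.pi))
            for n, s in enumerate(sums)]

def hermite_estimate(coeffs, x):
    """Evaluate the truncated expansion sum_n c_n H_n(x)."""
    return sum(c * h for c, h in zip(coeffs, hermite_values(len(coeffs) - 1, x)))

# Assumed test data: normal distribution with mean 0 and standard deviation 1.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
coeffs = hermite_coefficients(data, 6)
```

For this test distribution the exact value of $c_0$ is $1/\sqrt{3\pi} \approx 0.326$, and the odd coefficients vanish by symmetry, which gives a quick sanity check on the estimated coefficients.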
Hermite Polynomial Expansion
Results for different numbers of terms in the expansion:
[Figure: Hermite expansion estimates using 6, 10, 20, and 40 terms]
Hermite Polynomial Expansion
Number of Terms    MSE
6                  0.03374
10                 0.00109
20                 0.00291
40                 0.01914

- $\hat{f}(x)$ performs poorly at the edges of the domain
- Errors are due to truncation of terms
- Approximation theory says the error between $f(x)$ and $\sum_{n=0}^{N} c_n H_n(x)$ should decrease as $N$ increases if the $c_n$ are computed exactly
Histogram Interpolation
- One of the oldest, most common PDF estimation techniques
- First step is to establish the bins into which data will be sorted
- Given a starting point $x_0$ and bin width $h$, the bins can be established for integer $m$ as:
$$[x_0 + mh,\; x_0 + (m+1)h)$$
- The histogram is then defined as:
$$\hat{f}(x) = \frac{1}{nh}\,(\text{number of } X_i \text{ in the same bin as } x)$$
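The histogram definition translates directly into code. The sketch below is a minimal illustration with invented names (`histogram_estimate`, `f_hat`) and a tiny hypothetical data set. It treats each bin as closed on the left and open on the right so that every point belongs to exactly one bin.

```python
import math

def histogram_estimate(data, x0, h):
    """Return f_hat with f_hat(x) = (number of X_i in the same bin as x) / (n h)."""
    counts = {}
    for xi in data:
        m = math.floor((xi - x0) / h)   # index of the bin containing xi
        counts[m] = counts.get(m, 0) + 1
    n = len(data)

    def f_hat(x):
        m = math.floor((x - x0) / h)    # index of the bin containing x
        return counts.get(m, 0) / (n * h)

    return f_hat

# Hypothetical data: three points fall in [0, 0.5), one in [0.5, 1), one in [1, 1.5).
data = [0.1, 0.2, 0.3, 0.7, 1.4]
f_hat = histogram_estimate(data, x0=0.0, h=0.5)
```

Here `f_hat(0.25)` evaluates to 3 / (5 × 0.5) = 1.2, and the bin heights multiplied by the bin width sum to 1, as a PDF estimate should.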
Histogram Interpolation
- $\hat{f}(x)$ is a piecewise constant estimate of the underlying PDF $f(x)$
- If a continuous function approximation is needed, $\hat{f}(x)$ can be interpolated (e.g. with splines)
- Choice of bin endpoints and width will create different results
- Wide bins: Smooth and blur details in data
- Narrow bins: Not enough data per bin, resulting approximation very spiky
Histogram Interpolation
Results for different bin widths: h = 0.8, 0.5
Histogram Interpolation
Results for different bin widths: h = 0.2, 0.05
Histogram Interpolation
Bin Width    MSE
0.8          3.045e-5
0.5          1.589e-5
0.2          3.753e-5
0.05         2.094e-4

- Optimal bin width can be found by solving an error minimization problem (provided $f(X)$ is known)
- Formulas exist to estimate optimal bin width for data that is close to normally distributed
- Example: Sturges’ formula: $k = \log_2 n + 1$, where $k$ is the number of bins and $n$ is the number of data points
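Sturges' formula is simple enough to sketch directly. Two details below are assumptions not stated in the talk: the result is rounded up so that $k$ is an integer, and the bin width is obtained by dividing the range of the data by $k$. The function names are hypothetical.

```python
import math

def sturges_bins(n):
    """Number of bins k = log2(n) + 1, with log2(n) rounded up to an integer."""
    return math.ceil(math.log2(n)) + 1

def bin_width_from_sturges(data):
    """Assumed conversion to a bin width: data range divided by the bin count."""
    k = sturges_bins(len(data))
    return (max(data) - min(data)) / k
```

For example, 1024 data points give $k = 10 + 1 = 11$ bins.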
Kernel Estimation
- Another very popular PDF estimation technique
- Similar to histograms, but instead of creating separate bins into which data is collected and counted, $\hat{f}(x)$ is computed as a sum of functions centered at each data point:
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$$
- $h$ is still a "width" parameter, and $n$ is the number of data points $X_i$
Kernel Estimation
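The "kernel" function $K$ is usually a symmetric probability distribution function, like a normal distribution with mean 0 and standard deviation 1:
$$K\left(\frac{x - X_i}{h}\right) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - X_i}{h}\right)^2}$$
Combined with the factor $\frac{1}{nh}$ in the estimator, $\hat{f}(x)$ is a sum of normal distributions, each centered at a data point $X_i$ with standard deviation $h$ and weight $\frac{1}{n}$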
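A minimal sketch of the kernel estimator with a normal kernel is shown below. It assumes the kernel is the normal PDF with mean 0 and standard deviation 1, so that the factor $1/(nh)$ in the estimator makes $\hat{f}$ integrate to 1. The function names and the four data points are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Normal PDF with mean 0 and standard deviation 1, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_estimate(data, h):
    """Return f_hat with f_hat(x) = (1 / (n h)) * sum_i K((x - X_i) / h)."""
    n = len(data)

    def f_hat(x):
        return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

    return f_hat

# Hypothetical data set and width parameter.
data = [-1.0, 0.0, 0.5, 2.0]
f_hat = kernel_estimate(data, h=0.5)
```

Each evaluation of `f_hat` costs one kernel evaluation per data point, so evaluating the estimate on a grid scales with the number of grid points times the number of data points.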
Kernel Estimation
- $\hat{f}(X)$ is a smooth, differentiable function
- Do not need to choose where to center bins; here bins are centered at each data point and overlap with one another
- As in histogram interpolation, there are various methods for choosing $h$
- Methods include: Minimization of mean square error (which requires knowledge of $f(X)$) and others such as least squares cross-validation
Kernel Estimation
Results for test case, different h values: h = 0.5, 0.2, 0.1, 0.05
Kernel Estimation
Bin Width    MSE
0.5          2.983e-4
0.2          1.896e-5
0.1          2.345e-5
0.05         5.110e-5

- The estimated function $\hat{f}(X)$ is a smooth function, only requires a choice of bin width, and has low errors
- Conclusions: Kernel Estimation will be used to estimate the PDFs of data from microscopic variables in the CM model
Data Regeneration
- Given a PDF, we need to generate a set of data from it, to assign to the elements or particles in the micro-system
- This is done by computing the cumulative distribution function $F(x) = \int_{-\infty}^{x} f(X)\,dX$
- The values of $F(x)$ range from 0 to 1
- A random number generator is used to pick a value $c \in [0, 1]$
- A root-finding algorithm is then used to solve $F(x) - c = 0$ for $x$ (the desired data point)
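The regeneration steps above can be sketched as follows. The talk does not name a root-finding algorithm, so bisection is an assumption here, as are the function names, the search interval $[-8, 8]$, and the use of a normal CDF (mean 0, standard deviation 1) as the example $F$. In practice $F$ would come from the estimated PDF.

```python
import math
import random

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution, used here as the example F(x)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def solve_cdf(cdf, c, lo, hi, tol=1e-10):
    """Solve F(x) - c = 0 by bisection on [lo, hi]; F must be nondecreasing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def regenerate_data(cdf, count, lo, hi, rng):
    """Draw c uniformly from [0, 1) and invert the CDF once per data point."""
    return [solve_cdf(cdf, rng.random(), lo, hi) for _ in range(count)]

rng = random.Random(0)
samples = regenerate_data(normal_cdf, 2000, -8.0, 8.0, rng)
```

Bisection is slow but robust because $F$ is monotone. A faster root finder could be substituted without changing the overall procedure.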
Data Regeneration
PDF → CDF → Data Set
Summary
- Kernel Estimation will be used to estimate PDFs of various variables of the microscopic system
- These PDFs will be collected over time during the microscopic evolution
- A new PDF at the desired future point in time will be extrapolated from these saved PDFs
- A new micro-system will be created at the future point in time based on these predicted PDFs
References
Silverman, B.W. "Density Estimation for Statistics and Data Analysis", Chapman and Hall, 1986.
Seminar Speakers
We need volunteers to give a talk at this seminar for the following dates: October 20, 27 and November 3, 10