Nonparametric density and regression estimation using interactive graphics
Servei d'Estadística, UAB, December […]
Frederic Udina (udina@upf.es)
Web page: http://gauss.upf.es
Nonparametric estimation, F. Udina, UAB

Outline
• I. Density estimation: histograms and related estimators
  Bin width or number of bins
  Anchor position
  Variants of the histogram
  Interactive graphics: programming techniques
• II. Density estimation: kernel methods
  Choice of the kernel
  Choice of the bandwidth
  Interactive exploratory choice
  Variable bandwidth
• III. Regression estimation by kernel methods
  Locally constant or locally polynomial
  Variable bandwidth

To make a histogram
Given a data set $x_i$, $i = 1, \dots, N$, one must choose:
• An origin for the bin edges, $b_0$: the anchor.
• A bin width $h$ (or a number of bins and the range, or some estimate of it).
• A method of binning:
    counting: $n_j = \#\{i : x_i \in [b_j, b_{j+1})\}$
    linear binning, ...
$b_0$, $h$ and the counts $(\dots, 0, 0, c_1, c_2, \dots, c_K, 0, 0, \dots)$ determine the histogram.
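
The two binning methods can be sketched as follows. This is a minimal illustration, not the author's code: the function names are hypothetical, and linear binning is assumed here to spread each observation over the two nearest grid points $g_j = \text{anchor} + jh$, which is one common convention.

```python
import math

def bin_counts(x, anchor, h):
    """Counting: n_j = #{i : x_i in [b_j, b_{j+1})}, with b_j = anchor + j*h."""
    idx = [math.floor((xi - anchor) / h) for xi in x]
    if min(idx) < 0:
        raise ValueError("anchor must not exceed min(x)")
    counts = [0] * (max(idx) + 1)
    for j in idx:
        counts[j] += 1
    return counts

def linear_bin_counts(x, anchor, h):
    """Linear binning (assumed convention): each point splits a unit mass
    between the two nearest grid points g_j = anchor + j*h."""
    pos = [(xi - anchor) / h for xi in x]
    if min(pos) < 0:
        raise ValueError("anchor must not exceed min(x)")
    counts = [0.0] * (math.floor(max(pos)) + 2)
    for p in pos:
        j = math.floor(p)
        w = p - j              # fraction of the way from g_j to g_{j+1}
        counts[j] += 1.0 - w   # the closer grid point gets the larger share
        counts[j + 1] += w
    return counts
```

With either method the counts add up to the sample size; linear binning simply moves mass smoothly between neighbouring cells as a point moves.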

Number of columns (bins) or bin width
• What is the goal? Description? Estimation of $f(x)$?
• Sturges' rule: ncols $= 1 + \log_2 n$.
  It is based on comparing a histogram with the graph of the binomial distribution, so it is suitable for normal data.
• To determine the width $h$ that (asymptotically) minimizes the MISE of the histogram $\hat f_{n,h}$ as an estimator of the density function $f(x)$, we would need to know $f$:
  $$h^* = \left\{\frac{6}{R(f')}\right\}^{1/3} n^{-1/3}, \quad \text{where } R(\phi) = \int \phi^2 \text{ and } \mathrm{MISE} = E\int (\hat f_{n,h} - f)^2.$$
  But we can plug the normal density into the formula and obtain the (asymptotically) optimal bin width with reference to the normal:
  $$\hat h = 3.5\, s\, n^{-1/3}.$$
  A more robust version:
  $$\hat h = 2\, \mathrm{IQ}\, n^{-1/3}.$$
  For $f$ smooth enough one has
  $$h^* \le h_{OS} = 2.603\, \mathrm{IQ}\, n^{-1/3}.$$
  This recommends ncols $\ge \sqrt[3]{2n}$.
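
The rules on this slide are easy to compare on a given sample. The sketch below assumes the standard constants for each rule (3.5 for the normal reference, 2 for the interquartile version, 2.603 for the oversmoothed bound); the function name and the dictionary keys are illustrative only.

```python
import math
import statistics

def histogram_bin_rules(x):
    """Return the bin-number and bin-width rules for the sample x."""
    n = len(x)
    s = statistics.stdev(x)                       # sample standard deviation
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iq = q3 - q1                                  # interquartile range
    scale = n ** (-1.0 / 3.0)
    return {
        "sturges_ncols": 1 + math.log2(n),        # Sturges' rule
        "h_normal": 3.5 * s * scale,              # normal reference width
        "h_robust": 2.0 * iq * scale,             # interquartile version
        "h_oversmoothed": 2.603 * iq * scale,     # upper bound on the width
        "min_ncols": (2 * n) ** (1.0 / 3.0),      # lower bound on the bins
    }
```

For example, `histogram_bin_rules(list(range(1, 65)))` gives 7 columns by Sturges' rule and a lower bound of about 5 columns from the oversmoothing argument.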

An example: how many modes?
Data are the number of visas issued by the U.S. Immigration and Naturalization Service in […] for the purpose of adoption by U.S. residents, for […] countries or regions of origin. Numbers are logged (base 10) because the data are very long-tailed. The bin width is the same, $h = […]$, for all three histograms, but the anchor position takes three different values.
[Figure: three histograms of the log-visas data, same bin width, three anchor positions; horizontal axis from 0 to 4.]

Changing the anchor position
Asymptotically only the bin width counts: the anchor position has no effect when $n$ grows.
But in practice we work with finite samples, and then the anchor position DOES count.
By changing the anchor, or moving the origin, we mean:
Take $b_0 = \min(x_i)$ and consider the histograms $H_t$, $t \in [0, 1)$, determined by $b_0 - th$, $h$ and the corresponding counts.
Then the problem can be formulated as: are all these histograms $H_t$ similar?
In Simonoff and Udina (1997) we defined an index $G$ to measure the similarity of those histograms, that is, the stability of a given histogram when the anchor position changes.
$G$ ranges from 0 (very unstable) to 1 (very stable).
We devised a parametric bootstrap procedure to assess the value of the index for a given (real data) histogram. In practice, a value of $G$ above […] means stable.
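
The family of shifted histograms $H_t$ can be generated as below. This is only a sketch of the shifting step: the function name and the finite grid of shifts are assumptions, and the stability index $G$ itself is not computed because its definition is not given in these slides.

```python
import math

def shifted_histograms(x, h, n_shifts=10):
    """Return [(t, counts)] for anchors b0 - t*h, t = 0, 1/n_shifts, ... < 1."""
    b0 = min(x)
    family = []
    for s in range(n_shifts):
        t = s / n_shifts
        anchor = b0 - t * h                   # move the origin to the left
        idx = [math.floor((xi - anchor) / h) for xi in x]
        counts = [0] * (max(idx) + 1)
        for j in idx:
            counts[j] += 1
        family.append((t, counts))
    return family
```

Even on four points the shape changes with the anchor: `shifted_histograms([0.0, 0.4, 0.6, 1.0], 0.5, n_shifts=2)` has its tallest bin first for one anchor and in the middle for the other.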

Some simulations
Gaussian distribution. Average of the stability index over […] samples. Sample sizes: 20, 100, 500.
From our simulations we conclude that the more structure a distribution has, the more unstable its histograms are.
The vertical line is $h^*$. The horizontal unit is the oversmoothed choice, approximately […] $N^{-1/3}$.
[Figure: stability index (vertical axis, 0.80 to 0.95) against bin width as a proportion of the oversmoothed choice (horizontal axis, 0.2 to 1.0), one curve for each of N = 20, N = 100 and N = 500.]

A real data example
For a data set of N = […] countries ($\log_{10}$ of the number of visas per country issued in the U.S.) we computed the stability index for a range of bin widths.
Horizontal axis: $h$. Vertical axis: $G$.
(Sturges: […]; ROT: […])
[Figure: "Log-visas data". Stability index G (0.664 to 0.948) against bin width h (0.100 to 0.900).]

Real data examples (II)
Data are the ages in years of the […] players in the NBA who played the guard position during the […] season.
The estimated optimal bin width would be about […].
Note that the data are integer valued, so a natural bin width would be 1 or 2, but most statistical packages do not care about problems like rounded (or even truncated) data.
[Figure: stability index (0.7 to 1.0) against bin width (0.5 to 2.5).]

Better histograms than the Histogram?
People use histograms because they are simple to build and easy to interpret. They can be used as (poor) data descriptors, but they should never be used as density estimators.
Frequency polygons are better density estimators but have the same anchor position problem. They join the bin frequencies at the mid-points of the bins. Using linear binning instead of counting improves stability.
Edge frequency polygons, as introduced by Jones et al. (1998), have even better asymptotic properties as density estimators and are more stable against anchor changes. Average frequency polygons join, over each bin edge, the average of the frequencies of the two adjacent bins.
[Figure: two panels, horizontal axis from 0.000 to 6.000.]
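
Both polygons are determined by the same bin counts, only the vertices differ. The sketch below follows the descriptions above (mid-point heights for the frequency polygon, averages of adjacent bins over each edge for the edge version); the function name is hypothetical and one empty bin is assumed on each side of the data.

```python
def polygon_vertices(counts, anchor, h):
    """Return (frequency polygon, edge frequency polygon) as lists of (x, y)."""
    n = sum(counts)
    # density heights, padded with one empty bin on each side
    dens = [0.0] + [c / (n * h) for c in counts] + [0.0]
    k = len(counts)
    # dens[j] belongs to bin j-1, whose mid-point is anchor + (j - 0.5)*h
    freq_poly = [(anchor + (j - 0.5) * h, dens[j]) for j in range(k + 2)]
    # edge j separates bins j-1 and j: average their two heights
    edge_poly = [(anchor + j * h, 0.5 * (dens[j] + dens[j + 1]))
                 for j in range(k + 1)]
    return freq_poly, edge_poly
```

For counts `[1, 3]` with anchor 0 and width 1, the frequency polygon passes through (0.5, 0.25) and (1.5, 0.75), while the edge polygon passes through (1, 0.5).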

Stability of histograms, frequency polygons and edge frequency polygons
[Figure: "Three frequency polygons compared, geyser durations". Gini-stability index (0.75 to 1) against bin width (0.1 to 0.7) for the regular FP, the linearly binned FP and the mean-frequency FP.]

Stability of histograms, frequency polygons and edge frequency polygons (II)
[Figure: average over 350 samples, normal distribution, N = 50. Stability index (0.800 to 0.980) against bin width (0.109 to 1.086) for the normal frequency polygon or histogram, the linearly binned frequency polygon and the mean-frequency polygon.]

Programming interactive graphics
The computation flow can be:
Program driven
  START
  Init
  MENU
    - Input data
    - Compute this
    - Graph that
    - Change params
    - Exit
  Ask for params ...
  do it ...
  END
User (GUI) driven
• The program reacts to events: mouse clicks, menu choices, showing or hiding a window, showing and reading a dialog box, ...
• The operating system sends events like "this window needs to be redrawn (it has been uncovered)".
• The user can do (almost) anything at any time.
• Output is multiple and very complex: text, user-customizable graphics, animated graphics, ...
• Multiple windows need to be up to date at any time.

Goals
• The flow of the computation is driven automatically, so the programmer need not write repetitive parts of the program to control which quantities must be computed or are up to date at a given moment.
• The user is free to modify the values under his or her control at any moment.
• Only the needed quantities are computed, and they are then stored to avoid recomputation until they must change.
Basic idea
• Describe the computing flow by a directed graph (no cycles are allowed).
• An arrow going from one quantity to another means that any change in the origin implies that the destination must be updated (i.e. the origin quantity is involved in the computation of the destination one).
• Circled: inputs or parameters, user modifiable.
• Squared: outputs, usually graphical.
[Diagram: example dependency graph with input nodes I, intermediate nodes A and output nodes O.]
• Changes go FORWARD: when any quantity is changed, those that follow it are marked as "obsolete".
• Computation goes BACKWARD: when any quantity is needed, it is recomputed by asking for its precedents, which will be recomputed if they are "obsolete".
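
The forward step can be sketched as a graph traversal. The node names below (I1, A1, ...) and the function name are hypothetical; the graph is stored as a mapping from each quantity to the quantities that depend directly on it.

```python
def mark_obsolete(dependents, changed):
    """Return every quantity that becomes obsolete when `changed` is modified.

    `dependents` maps a quantity to the quantities that directly depend on it.
    """
    obsolete, stack = set(), [changed]
    while stack:
        q = stack.pop()
        for d in dependents.get(q, ()):
            if d not in obsolete:     # visit each quantity only once
                obsolete.add(d)
                stack.append(d)
    return obsolete

# Hypothetical graph: two inputs, two intermediate quantities, two outputs.
graph = {"I1": ["A1"], "I2": ["A2"], "A1": ["A2", "O1"], "A2": ["O2"]}
```

Changing `I2` only invalidates `A2` and `O2`, so output `O1` can be redrawn without any recomputation.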

Implementation
Any language is suitable, but the ideal language must have:
• Statistical and graphical tools.
• Object-oriented programming. Objects have
    - slots: like local variables, they contain each one of the quantities (input, intermediate, output);
    - methods: like functions or procedures owned by the object. There will be a method for each quantity.
• Symbolic capabilities.
XLISP-STAT has it all. It is free and available for Unix, Windows and Mac.
XLISP-STAT web site: http://www.stat.umn.edu/~luke/xls/xlsinfo/

The rules to be followed in programming the computation are:
• Define the graph $G$ that drives the computation flow and translate it into a dependency-tree: a list formed by items of the form $(a, b_1, \dots, b_k)$, where the $b_i$ are all the quantities that directly depend on $a$. There is one and only one such item for every quantity $a \in G$, except for the leaves.
• Changes to a slot are always made through the corresponding method. This method should call :propagate-changes.
• This specific method, :propagate-changes, is used to mark all the slots that depend on the one being changed with the specific symbol obsolete.
• The same method, when called with no arguments, returns the value of the slot, unless it is obsolete, in which case it is recomputed, stored and returned.
This way all the variables of interest are contained in slots of an object, and they are always accessed by means of an accessor method.
These methods take care of the computation flow automatically. Constructing them can also be automated by macros in XLISP-STAT.
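
The rules above can be sketched in Python rather than XLISP-STAT. Everything here is an illustrative assumption: the class name, the single `slot` accessor standing in for one method per quantity, the `OBSOLETE` sentinel standing in for the obsolete symbol, and the two toy formulas.

```python
OBSOLETE = object()   # stands in for the `obsolete` symbol

class LazyObject:
    def __init__(self, dependency_tree, formulas, **inputs):
        self.tree = dependency_tree        # {a: [b1, ..., bk]}, the b's depend on a
        self.formulas = formulas           # {name: function(obj) -> value}
        self.slots = dict.fromkeys(formulas, OBSOLETE)
        self.slots.update(inputs)
        self.n_computed = 0                # recomputation counter, illustration only

    def propagate_changes(self, name):
        """Mark every slot that depends on `name` as obsolete (changes go forward)."""
        for dep in self.tree.get(name, ()):
            self.slots[dep] = OBSOLETE
            self.propagate_changes(dep)

    def slot(self, name, *new_value):
        """Accessor: with an argument, change the slot; without, read it lazily."""
        if new_value:
            self.slots[name] = new_value[0]
            self.propagate_changes(name)
        elif self.slots[name] is OBSOLETE:     # computation goes backward
            self.slots[name] = self.formulas[name](self)
            self.n_computed += 1
        return self.slots[name]

def _edges(obj):
    data, h = obj.slot("data"), obj.slot("bin_width")
    lo = min(data)
    k = int((max(data) - lo) // h) + 1
    return [lo + j * h for j in range(k + 1)]

def _counts(obj):
    edges, h = obj.slot("bin_edges"), obj.slot("bin_width")
    counts = [0] * (len(edges) - 1)
    for x in obj.slot("data"):
        counts[int((x - edges[0]) // h)] += 1
    return counts

hist = LazyObject(
    {"data": ["bin_edges", "bin_cnts"],
     "bin_width": ["bin_edges", "bin_cnts"],
     "bin_edges": ["bin_cnts"]},
    {"bin_edges": _edges, "bin_cnts": _counts},
    data=[0, 1, 2, 3], bin_width=2,
)
```

Reading `hist.slot("bin_cnts")` twice computes the edges and counts only once; calling `hist.slot("bin_width", 1)` marks both as obsolete, and they are rebuilt on the next read.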

A full example: piecewise linear density estimators
The aim is the interactive display of histograms and related (better) estimators. They are all based on binning the data and drawing segments.
• Histogram (or hollow histogram)
• Frequency polygon
• Edge frequency polygons
• Piecewise linear estimator
These are the outputs. The main inputs are:
• Data
• Bin width and anchor position or shift
• Which lines are to be shown, vertical scale, etc.
[Diagram: dependency graph with nodes data, data-summary, x-range, bin-edges, half-cnts, pieclin, bin-cnts, all-lines, long-cnts, stab-index, density, dens-lines, box-plot-lines, bw-ends, scale-estimate, bin-width, anchor-base, anchor-shift, what-to-show, y-scale and WuD.]
Note: WuD ("window is up to date") can be true or obsolete.

II. Kernel Density Estimation
Given a data set $\{X_i,\ i = 1, \dots, n\}$, we assume the observations are i.i.d. with common density function $f(x)$.
We choose
• a kernel function $K(x)$ (typically a symmetric density function)
• and a bandwidth $h$.
The estimate is defined by
$$\hat f(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - X_i),$$
where $K_h(\cdot) = K(\cdot/h)/h$ denotes a rescaling of $K$.
Akaike (1954); Rosenblatt (1956); Parzen (1962); Devroye and Györfi (1985); Wand and Jones (1995); Simonoff (1996).
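
The definition translates directly into code. A minimal sketch, assuming a Gaussian kernel as the default; the function names are illustrative.

```python
import math

def gaussian_kernel(t):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """f_hat(x) = (1 / (n h)) * sum_i K((x - X_i) / h)."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)
```

Because each term is a density rescaled by $h$ and weighted by $1/n$, the estimate is itself a density: it is non-negative and integrates to one.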
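
What does it mean?
Given a data set of size n = […], a Gaussian kernel function and some bandwidth (h = […]), the estimate is built by adding up probability masses.
[Figure: kernel density estimate built as a sum of kernels centred at the data points; horizontal axis from -1.400 to 1.400.]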

Choice of kernel
The choice of the kernel function has no great importance for the performance of the estimator.
It determines its properties: continuity, differentiability, etc.
Some popular choices:
  Name                     $K(t)$
  Uniform                  $\tfrac{1}{2}\,\mathbf{1}_{[-1,1]}(t)$
  Triangular               $(1 - |t|)_+$
  Bartlett-Epanechnikov    $\tfrac{3}{4}(1 - t^2)_+$
  Biweight                 $\tfrac{15}{16}(1 - t^2)_+^2$
  Triweight                $\tfrac{35}{32}(1 - t^2)_+^3$
  Gaussian                 $e^{-t^2/2}/\sqrt{2\pi}$
[Figure: three panels of kernel estimates; horizontal axis from -1.400 to 1.400.]
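
The kernels in the table can be written down and checked numerically. This sketch uses the standard normalizing constants; the dictionary keys and the helper names are illustrative.

```python
import math

def _pos(u):
    """Positive part, (u)_+."""
    return u if u > 0.0 else 0.0

KERNELS = {
    "uniform": lambda t: 0.5 if abs(t) <= 1.0 else 0.0,
    "triangular": lambda t: _pos(1.0 - abs(t)),
    "epanechnikov": lambda t: 0.75 * _pos(1.0 - t * t),
    "biweight": lambda t: 15.0 / 16.0 * _pos(1.0 - t * t) ** 2,
    "triweight": lambda t: 35.0 / 32.0 * _pos(1.0 - t * t) ** 3,
    "gaussian": lambda t: math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi),
}

def integrate(g, lo=-8.0, hi=8.0, m=16000):
    """Midpoint rule on m equal cells."""
    step = (hi - lo) / m
    return sum(g(lo + (i + 0.5) * step) for i in range(m)) * step
```

Each entry is a symmetric density, so `integrate(KERNELS[name])` is close to one for every name.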

Choice of bandwidth
What do we want? Description? At which resolution? To estimate $f$?
The choice of $h$ has a great effect on the performance of the estimation.
Large bandwidth $\Rightarrow$ small error variance, big bias.
Small bandwidth $\Rightarrow$ big error variance, small bias.
[Figure: kernel estimates of the same data with different bandwidths; horizontal axis from -1.183 to 8.083.]
Parzen's (1962) optimal bandwidth ($f$ smooth and $n \to \infty$),
$$h^* = \left[\frac{R(K)}{\mu_2(K)^2\, R(f'')\, n}\right]^{1/5},$$
minimizes the asymptotic MISE (i.e. $E\int (f - \hat f_{nh})^2$), where $R(g) = \int g^2$ and $\mu_2(g) = \int x^2 g(x)\,dx$.
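
When $f$ is known, as in a simulation study, the optimal bandwidth can be evaluated numerically. The sketch below assumes simple midpoint integration on a fixed interval and uses the standard normal both as kernel and as target density; all function names are illustrative.

```python
import math

def integrate(g, lo=-10.0, hi=10.0, m=20000):
    """Midpoint rule on m equal cells."""
    step = (hi - lo) / m
    return sum(g(lo + (i + 0.5) * step) for i in range(m)) * step

def amise_bandwidth(kernel, f_second, n):
    """h* = [R(K) / (mu_2(K)^2 R(f'') n)]^(1/5)."""
    r_k = integrate(lambda x: kernel(x) ** 2)       # R(K)
    mu2 = integrate(lambda x: x * x * kernel(x))    # mu_2(K)
    r_f2 = integrate(lambda x: f_second(x) ** 2)    # R(f'')
    return (r_k / (mu2 ** 2 * r_f2 * n)) ** 0.2

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def phi_second(x):
    """Second derivative of the standard normal density."""
    return (x * x - 1.0) * phi(x)
```

For a standard normal target and a Gaussian kernel, `amise_bandwidth(phi, phi_second, n)` agrees with the closed form $(4/3)^{1/5} n^{-1/5}$.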

Better kernel choice: canonical rescaling
Switching from a kernel $K$ to a rescaled version $K_\delta$ is the same as changing the bandwidth.
To avoid mixing the two effects, Marron and Nolan (1988) suggested canonical rescaling:
given a kernel, fix the rescaled version $K_\delta$ such that
$$\int K_\delta(x)^2\,dx = \left(\int x^2 K_\delta(x)\,dx\right)^2.$$
Working with canonically rescaled kernels makes the choice of kernel and the choice of bandwidth really independent. Otherwise they are not.
The same bandwidth gives the same amount of smoothing for any canonically rescaled kernel.
Here $h = […]$, with uniform, triangular and Gaussian kernels respectively, all canonically rescaled.
[Figure: three panels titled "KDE instance"; horizontal axis from 0.000 to 7.000, vertical axis from 0.000 to 0.400.]
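
With $K_\delta(x) = K(x/\delta)/\delta$, the canonical condition is solved by $\delta = \{R(K)/\mu_2(K)^2\}^{1/5}$, where $R(K) = \int K^2$ and $\mu_2(K) = \int x^2 K(x)\,dx$. A minimal numerical sketch, with illustrative function names and midpoint integration assumed:

```python
import math

def integrate(g, lo=-10.0, hi=10.0, m=20000):
    """Midpoint rule on m equal cells."""
    step = (hi - lo) / m
    return sum(g(lo + (i + 0.5) * step) for i in range(m)) * step

def canonical_scale(kernel):
    """delta such that K(x/delta)/delta has R(K_delta) = mu_2(K_delta)^2."""
    r_k = integrate(lambda x: kernel(x) ** 2)
    mu2 = integrate(lambda x: x * x * kernel(x))
    return (r_k / mu2 ** 2) ** 0.2

def rescale(kernel, delta):
    """Return the rescaled kernel K_delta(x) = K(x / delta) / delta."""
    return lambda x: kernel(x / delta) / delta

def gaussian(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def epanechnikov(t):
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0
```

The Gaussian kernel is shrunk (scale below one) and the Epanechnikov kernel is stretched (scale above one), which is why the same $h$ smooths very differently with the unscaled versions.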

Automatic bandwidth selection
There has been a lot of work in this area in the last two decades. The main issue is what requirements we impose on the unknown density $f$. To mention but a few:
• Normal-based rule of thumb: fast, simple
• Least squares cross-validation: old, obsolete
• Park-Marron plug-in method: good performance
• Sheather-Jones solve-the-equation method: the "best"
• Devroye-Lugosi universal selector: any $f$, any $R^d$
Normal-based rule of thumb
It is the simplest and the fastest to compute.
In Parzen's formula the unknown density $f$ only appears in $R(f'')$; the rule of thumb replaces it by a normal density. We need to estimate the scale. For example, using the interquartile range IQ, one has
$$h_{ROT} = c\,\mathrm{IQ}\,n^{-1/5},$$
where the constant $c$ depends on the kernel ($c \approx 0.79$ for the standard Gaussian kernel).
It does not give good estimates for asymmetric, kurtotic, multimodal, etc. underlying densities.
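
As a worked example of the rule of thumb, the constant can be derived for one particular case. The derivation below assumes the standard (not canonically rescaled) Gaussian kernel and a normal reference density with standard deviation $\sigma$, writing $R(g) = \int g^2$, $\mu_2(K) = \int x^2 K(x)\,dx$ and IQ for the interquartile range; other kernels give other constants.

```latex
\begin{align*}
R(K) &= \frac{1}{2\sqrt{\pi}}, \qquad \mu_2(K) = 1
  && \text{(standard Gaussian kernel)} \\
R(f'') &= \frac{3}{8\sqrt{\pi}\,\sigma^{5}}
  && \text{(normal density with standard deviation } \sigma\text{)} \\
h_{ROT} &= \left[\frac{R(K)}{\mu_2(K)^2\,R(f'')\,n}\right]^{1/5}
  = \left(\frac{4}{3}\right)^{1/5}\sigma\,n^{-1/5}
  \approx 1.06\,\sigma\,n^{-1/5} \\
\sigma &\approx \frac{\mathrm{IQ}}{1.349}
  \;\Longrightarrow\;
  h_{ROT} \approx 0.79\,\mathrm{IQ}\,n^{-1/5}
\end{align*}
```

The last step uses the fact that the interquartile range of a normal distribution is about $1.349\,\sigma$.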

Interactive choice of the bandwidth
[Figure: "British Income Data". Density (0 to 1) against scaled income (0 to 4).]

Local bandwidth
The bandwidth can be the same over the whole estimation range, or it can vary depending on
the estimation point $x$:
$$\hat f_{h_x}(x) = n^{-1}\sum_{i=1}^{n} K_{h(x)}(x - X_i),$$
or on the data points:
$$\hat f_{h_i}(x) = n^{-1}\sum_{i=1}^{n} K_{h(X_i)}(x - X_i).$$
In any case, the main problem is to determine a good bandwidth function $h : \mathbb{R} \to (0, \infty)$.
Abramson (1982); Devroye; Terrell and Scott (1992).
Devroye and Lugosi show that there is no way to choose it optimally and automatically, based only on the data, without knowing the target density function.
We propose the interactive determination of the function as a data analysis tool.
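
The two kinds of variable bandwidth differ only in where the bandwidth function is evaluated. A minimal sketch with a Gaussian kernel; the function names and the `vary_with` switch are illustrative assumptions.

```python
import math

def gaussian_kernel(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kde_local(x, data, h_func, vary_with="x"):
    """Variable-bandwidth estimate at x for a bandwidth function h_func.

    vary_with="x":    the bandwidth h(x) depends on the estimation point.
    vary_with="data": each observation X_i carries its own bandwidth h(X_i).
    """
    n = len(data)
    if vary_with == "x":
        h = h_func(x)
        return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)
    return sum(gaussian_kernel((x - xi) / h_func(xi)) / h_func(xi)
               for xi in data) / n
```

With a constant bandwidth function both versions reduce to the ordinary kernel estimate. The data-point version is a mixture of densities, so it always integrates to one; the estimation-point version need not.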

Interactive choice of a bandwidth function
We allow the analyst to draw the bandwidth function by dragging some points with the mouse (figure, lower part).
Each part of the estimate uses a different bandwidth value; the small kernel functions visualize it (figure, upper part).
Main problems to solve:
• How to interpolate the knots.
• How to help the analyst choose a good function.
• Computational difficulty: we have a binned updating algorithm for (both kinds of) variable bandwidth.

III. Local polynomial regression
The simplest example is the model
$$Y = m(x) + \varepsilon.$$
The local linear estimator is computed at every point $x$ via a weighted linear regression, using only the nearby points and weighting them according to some kernel function.
Stone (1977); Cleveland (1979); Fan (1992, 1993); Fan and Gijbels (book).
[Figure: "Local linear estimation". Data, $\hat m(x)$, $m(x)$, kernel weights and the local linear regression; horizontal axis from 0 to 1, vertical axis from 0 to 3.5.]
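
The local linear estimator at a single point can be sketched as a closed-form weighted least squares fit. The Gaussian weights and the function name are assumptions made for illustration.

```python
import math

def local_linear(x0, xs, ys, h):
    """Fit y ~ a + b (x - x0) by kernel-weighted least squares; return a = m_hat(x0)."""
    w = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]   # Gaussian weights
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * (x - x0) * y for wi, x, y in zip(w, xs, ys))
    # intercept of the 2x2 weighted normal equations
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 ** 2)
```

When the data lie exactly on a straight line the estimator reproduces it for any bandwidth, including at the boundary of the design.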

Residuals window example
[Figure: three panels, a), b) and c), showing the data and smoother, the bandwidth function and the residuals window; horizontal axes from 0.005 to 2.995.]

References (visit http://gauss.upf.es for papers and software)
Simonoff, J.S., Udina, F. (1997) "Measuring the stability of histogram appearance when the anchor position is changed". Computational Statistics and Data Analysis 23, 335–353.
Marron, J.S., Udina, F. (1999) "Interactive Local Bandwidth Choice". Statistics and Computing 9, 101–110.
Udina, F. (2000) "Implementing interactive computing in an object-oriented environment". Journal of Statistical Software 5(3). (http://www.jstatsoft.org/v05/i03/)
Devroye, L. (1997) "Universal smoothing factor selection in density estimation: theory and practice". Test 6, 223–320.
Scott, D.W. (1992) Multivariate Density Estimation: Theory, Practice and Visualization. John Wiley, New York.
Simonoff, J.S. (1996) Smoothing Methods in Statistics. Springer-Verlag, New York.
Fan, J., Gijbels, I. (1996) Local Polynomial Modelling and Its Application: Theory and Methodologies. Chapman and Hall, New York.