Different Kinds of Distance and Statistical Distance
Post on 09-Feb-2017
WELCOME TO MY PRESENTATION
ON STATISTICAL DISTANCE
Md. Menhazul Abedin
M.Sc. Student
Dept. of Statistics
Rajshahi University
Mob: 01751385142
Email: menhaz70@gmail.com
Objectives
• To understand the meaning of statistical distance, and how it relates to and differs from general (Euclidean) distance
Content
• Definition of Euclidean distance
• Concept & intuition of statistical distance
• Definition of statistical distance
• Necessity of statistical distance
• Concept of Mahalanobis distance (population & sample)
• Distribution of Mahalanobis distance
• Mahalanobis distance in R
• Acknowledgement
Euclidean Distance from origin
[Figure: point P(X, Y) in the plane, with origin O(0, 0) and axes X and Y]
By Pythagoras, $d(O,P) = \sqrt{X^2 + Y^2}$
Euclidean Distance
[Figure: two pictures, each with two specific points marked]
Each picture shows two specific points.
Our problem is to determine the distance between the two points.
But how?
Assume that the pictures are placed in two-dimensional space and that the points are joined by a straight line.
Let the first point be $(x_1, y_1)$ and the second point be $(x_2, y_2)$; then the distance is
$D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
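What happens when the dimension is three?
Distance in three dimensions
• Points are $(x_1, x_2, x_3)$ and $(y_1, y_2, y_3)$
Distance is given by
$d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$
For higher dimensions (p coordinates) it can be written as the following expression, named the Euclidean distance:
$P = (x_1, x_2, \dots, x_p), \quad Q = (y_1, y_2, \dots, y_p)$
$d(P,Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}$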
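The Euclidean formula can be checked numerically. The following is a minimal sketch in R; the helper name `euclid` and the example points are illustrative assumptions, not part of the original slides.

```r
# Euclidean distance between two points of equal dimension.
# euclid() is an illustrative helper, not part of the original slides.
euclid <- function(p, q) {
  stopifnot(length(p) == length(q))
  sqrt(sum((p - q)^2))
}

d2 <- euclid(c(0, 0), c(3, 4))        # two dimensions
d3 <- euclid(c(1, 2, 3), c(4, 6, 3))  # three dimensions

# Base R's dist() should agree when the points are rows of a matrix.
d3_base <- as.numeric(dist(rbind(c(1, 2, 3), c(4, 6, 3))))

print(c(d2, d3, d3_base))
```

The same function works unchanged for any number of coordinates, which mirrors the p-dimensional expression.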
05/01/2023 14
Properties of Euclidean Distance and Mathematical Distance
• The usual human concept of distance is Euclidean distance.
• Each coordinate contributes equally to the distance.
Mathematicians, generalizing its three properties,
1) d(P,Q) = d(Q,P),
2) d(P,Q) = 0 if and only if P = Q, and
3) d(P,Q) ≤ d(P,R) + d(R,Q) for all R,
define distance on any set.
[Figure: triangle with vertices P(X1,Y1), Q(X2,Y2) and R(Z1,Z2)]
Taxicab Distance: Notion
[Figure: city grid showing four routes between two corners]
• Red: Manhattan distance.
• Green: diagonal, straight-line distance.
• Blue, yellow: equivalent Manhattan distances.
• The Manhattan distance is the simple sum of the horizontal and vertical components, whereas the diagonal distance can be computed by applying the Pythagorean theorem.
• Manhattan distance: 12 units.
• Diagonal, straight-line or Euclidean distance: $\sqrt{6^2 + 6^2} = 6\sqrt{2} \approx 8.49$ units.
We observe that the Euclidean distance is less than the Manhattan distance.
Taxicab/Manhattan distance: Definition
[Figure: points (p1, p2) and (q1, q2), with horizontal leg │p1 − q1│ and vertical leg │p2 − q2│]
• The taxicab distance between (p1, p2) and (q1, q2) is │p1 − q1│ + │p2 − q2│
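As a small illustration of the definition, the sketch below computes both distances in R. The helper names `manhattan` and `euclid` are hypothetical, and the two points are an assumption: they are placed six blocks apart in each direction, so the taxicab distance is 12 units.

```r
# Taxicab (Manhattan) and Euclidean distance between two points.
# Both helpers are illustrative, not from the original slides.
manhattan <- function(p, q) sum(abs(p - q))
euclid    <- function(p, q) sqrt(sum((p - q)^2))

a <- c(0, 0)
b <- c(6, 6)   # assumed layout: 6 blocks east and 6 blocks north

m <- manhattan(a, b)   # |0 - 6| + |0 - 6|
e <- euclid(a, b)      # sqrt(6^2 + 6^2)
print(c(manhattan = m, euclidean = e))
```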
Relationship between Manhattan & Euclidean distance
[Figure: city grid with points A, B and C; one route is marked 7 blocks and the other 6 blocks]
• Measured in blocks, the distance from A to C is 7 blocks, while the distance from A to B is 6 blocks.
• Unless we choose to go off-road, B is now closer to A than C.
• Taxicab distance is sometimes equal to Euclidean distance; otherwise it is greater than Euclidean distance.
Euclidean distance ≤ Taxicab distance
Is this always true? Does it hold in n dimensions?
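Proof
For two points $(x_1, x_2)$ and $(y_1, y_2)$:
$(|x_1 - y_1| + |x_2 - y_2|)^2 = (x_1 - y_1)^2 + (x_2 - y_2)^2 + 2|x_1 - y_1||x_2 - y_2|$
Absolute values guarantee a non-negative value, so $2|x_1 - y_1||x_2 - y_2| \ge 0$.
By the addition property of inequality,
$(x_1 - y_1)^2 + (x_2 - y_2)^2 \le (|x_1 - y_1| + |x_2 - y_2|)^2$
Taking square roots of both sides,
$\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2} \le |x_1 - y_1| + |x_2 - y_2|$
For high dimension
• The same argument holds in the high-dimensional case:
• $\left(\sum_i |x_i - y_i|\right)^2 = \sum_i (x_i - y_i)^2 + 2\sum_{i<j} |x_i - y_i||x_j - y_j| \ge \sum_i (x_i - y_i)^2$
which implies $\sqrt{\sum_i (x_i - y_i)^2} \le \sum_i |x_i - y_i|$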
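The inequality can also be spot-checked numerically. This is only a sketch under assumed inputs (random normal points in a few arbitrary dimensions and a fixed seed); the function name `both_distances` is hypothetical, and a numerical check of this kind illustrates the result rather than proving it.

```r
# Compare Euclidean and taxicab distance for random pairs of points.
# both_distances() is an illustrative helper, not from the original slides.
both_distances <- function(x, y) {
  c(euclidean = sqrt(sum((x - y)^2)), taxicab = sum(abs(x - y)))
}

set.seed(1)                       # arbitrary seed, for reproducibility only
dims <- c(2, 5, 50)               # assumed dimensions to try
res <- t(sapply(dims, function(p) both_distances(rnorm(p), rnorm(p))))
rownames(res) <- paste0("p=", dims)
print(res)

# Equality case: the points differ in a single coordinate.
eq <- both_distances(c(1, 0), c(4, 0))
```

In every row the Euclidean value should not exceed the taxicab value, and the two coincide when only one coordinate differs.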
Statistical Distance
• Weight coordinates that are subject to a great deal of variability less heavily than those that are not highly variable.
[Figure: scatter plot of a data set with two points at the same distance from the origin; which one is nearer to the data set?]
• Here the variability along the x1 axis > the variability along the x2 axis.
• Is the same distance from the origin equally meaningful? Ans: No.
• But how do we take the different variability into account? Ans: Give different weights to the axes.
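Statistical Distance for Uncorrelated Data
$P = (x_1, x_2), \quad O = (0, 0)$
Standardization: $x_1^* = x_1/\sqrt{s_{11}}, \quad x_2^* = x_2/\sqrt{s_{22}}$
$d(O,P) = \sqrt{(x_1^*)^2 + (x_2^*)^2} = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}}$
The weights are $1/s_{11}$ and $1/s_{22}$.
All points that have coordinates $(x_1, x_2)$ and are a constant squared distance $c^2$ from the origin must satisfy
$\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2$
But how do we choose c? One choice is to pick c so that 95% of the observations fall inside this region.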
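To see the effect of the weights, the sketch below compares two points that are equally far from the origin in the Euclidean sense. The variances `s11 = 9` and `s22 = 1` and the helper name `stat_dist` are assumptions chosen purely for illustration.

```r
# Statistical distance from the origin for uncorrelated coordinates:
# each squared coordinate is divided by the variance of that coordinate.
# stat_dist() and the variances below are illustrative assumptions.
stat_dist <- function(x, s) sqrt(sum(x^2 / s))

s  <- c(9, 1)    # assumed variances s11, s22: x1 varies more than x2
p1 <- c(3, 0)    # 3 units along the high-variability axis
p2 <- c(0, 3)    # 3 units along the low-variability axis

eu1 <- sqrt(sum(p1^2)); eu2 <- sqrt(sum(p2^2))   # equal Euclidean distances
sd1 <- stat_dist(p1, s)                          # sqrt(9 / 9)
sd2 <- stat_dist(p2, s)                          # sqrt(9 / 1)
print(c(eu1 = eu1, eu2 = eu2, sd1 = sd1, sd2 = sd2))
```

Under these assumed variances the point on the high-variability axis is statistically closer to the origin, even though both points have the same Euclidean distance.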
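Ellipse of Constant Statistical Distance for Uncorrelated Data
[Figure: ellipse centred at the origin 0 in the (x1, x2) plane, crossing the x1 axis at $\pm c\sqrt{s_{11}}$ and the x2 axis at $\pm c\sqrt{s_{22}}$]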
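• This expression can be generalized to the statistical distance from an arbitrary point P = (x1, x2) to any fixed point Q = (y1, y2):
$d(P,Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}}}$
For p dimensions:
$d(P,Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}}$
Remark:
1) The distance of P to the origin O is obtained by setting all $y_i = 0$.
2) If all $s_{ii}$ are equal, the Euclidean distance formula is appropriate.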
Scatter Plot for Correlated Measurements
• How do you measure the statistical distance for the above data set?
• Ans: First make it uncorrelated.
• But why and how?
• Ans: Rotate the axes, keeping the origin fixed.
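Rotation of axes keeping origin fixed
[Figure: original axes x1, x2 and rotated axes $\tilde{x}_1$, $\tilde{x}_2$ at angle θ about the origin O, with the point P(x1, x2) and construction points M, N, Q, R]
$x_1 = OM = OR - MR = \tilde{x}_1\cos\theta - \tilde{x}_2\sin\theta$ …. (i)
$x_2 = MP = QR + NP = \tilde{x}_1\sin\theta + \tilde{x}_2\cos\theta$ …. (ii)
• The solution of the above equations is
$\tilde{x}_1 = x_1\cos\theta + x_2\sin\theta, \quad \tilde{x}_2 = -x_1\sin\theta + x_2\cos\theta$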
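Choice of θ
What will you choose? How will you do it?
Data matrix → Centred data matrix → Covariance of data matrix → Eigenvectors
θ = angle between the 1st eigenvector and [1,0], or the angle between the 2nd eigenvector and [0,1]
Why the angle between the 1st eigenvector and [1,0], or between the 2nd eigenvector and [0,1]?
Ans: Let B be a (p by p) positive definite matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p > 0$ and associated normalized eigenvectors $e_1, e_2, \dots, e_p$. Then
$\max_{x \ne 0} \frac{x'Bx}{x'x} = \lambda_1$, attained when $x = e_1$
$\min_{x \ne 0} \frac{x'Bx}{x'x} = \lambda_p$, attained when $x = e_p$
$\max_{x \perp e_1, \dots, e_k} \frac{x'Bx}{x'x} = \lambda_{k+1}$, attained when $x = e_{k+1}$, $k = 1, \dots, p-1$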
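The maximization result can be illustrated numerically. In this sketch the matrix `B` is a hypothetical 2 by 2 positive definite matrix and `rq` is an illustrative helper computing the ratio x'Bx / x'x; neither comes from the slides.

```r
# Ratio x'Bx / x'x evaluated at the eigenvectors and at random directions.
# B is an assumed example matrix; rq() is an illustrative helper.
B <- matrix(c(4, 1,
              1, 3), nrow = 2, byrow = TRUE)
e <- eigen(B)                      # eigenvalues are returned in decreasing order

rq <- function(x) as.numeric(t(x) %*% B %*% x) / sum(x^2)

at_e1 <- rq(e$vectors[, 1])        # should equal the largest eigenvalue
at_e2 <- rq(e$vectors[, 2])        # should equal the smallest eigenvalue

set.seed(1)                        # arbitrary seed
q <- replicate(1000, rq(rnorm(2))) # ratio for random directions
print(rbind(eigenvalues = e$values, at_eigenvectors = c(at_e1, at_e2)))
print(range(q))                    # stays between the two eigenvalues
```

This is why the first eigenvector of the covariance matrix gives the direction of greatest variability, which is the direction the rotated first axis is aligned with.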
Choice of θ in R
#### Exercise 16, page 309: heights in inches (x) and weights in pounds (y).
#### An Introduction to Statistics and Probability, M. Nurul Islam
x=c(60,60,60,60,62,62,62,64,64,64,66,66,66,66,68,68,68,70,70,70);x
y=c(115,120,130,125,130,140,120,135,130,145,135,170,140,155,150,160,175,180,160,175);y
plot(x,y)
data=data.frame(x,y);data
as.matrix(data)
colMeans(data)
xmv=c(rep(64.8,20));xmv ### x mean vector
ymv=c(rep(144.5,20));ymv ### y mean vector
meanmatrix=cbind(xmv,ymv);meanmatrix
cdata=data-meanmatrix;cdata ### mean centred data
plot(cdata)
abline(h=0,v=0)
V=eigen(cov(cdata))$vectors;V ### eigenvectors of the covariance matrix
as.matrix(cdata)%*%V ### data in the rotated coordinates
cor(cdata)
##################
cov(cdata)
eigen(cov( cdata))
xx1=c(1,0);xx1
xx2=c(0,1);xx2
vv1=eigen(cov(cdata))$vectors[,1];vv1
vv2=eigen(cov(cdata))$vectors[,2];vv2
################
theta = acos( sum(xx1*vv1) / ( sqrt(sum(xx1 * xx1)) * sqrt(sum(vv1 * vv1)) ) );theta
theta = acos( sum(xx2*vv2) / ( sqrt(sum(xx2 * xx2)) * sqrt(sum(vv2 * vv2)) ) );theta
###############
### rotation with theta hard-coded as 1.41784; eigenvector signs are arbitrary
xx=cdata[,1]*cos( 1.41784)+cdata[,2]*sin( 1.41784);xx
yy=-cdata[,1]*sin( 1.41784)+cdata[,2]*cos( 1.41784);yy
plot(xx,yy)
abline(h=0,v=0)
V=eigen(cov(cdata))$vectors;V
tdata=as.matrix(cdata)%*%V;tdata ### transformed data
cov(tdata)
round(cov(tdata),14)
cor(tdata)
plot(tdata)
abline(h=0,v=0)
round(cor(tdata),16)
################ comparison of both methods ############
comparison=tdata - as.matrix(cbind(xx,yy));comparison
round(comparison,4)
########### using package: md from original data #####
md=mahalanobis(data,colMeans(data),cov(data),inverted=FALSE);md ## md = mahalanobis distance
######## mahalanobis distance from transformed data ########
tmd=mahalanobis(tdata,colMeans(tdata),cov(tdata),inverted=FALSE);tmd
###### comparison ############
md-tmd
###### Mahalanobis distance: manually ######
mu=colMeans(tdata);mu
incov=solve(cov(tdata));incov
md1=t(tdata[1,]-mu)%*%incov%*%(tdata[1,]-mu);md1
md2=t(tdata[2,]-mu)%*%incov%*%(tdata[2,]-mu);md2
md3=t(tdata[3,]-mu)%*%incov%*%(tdata[3,]-mu);md3
### ............. ……………. …………..
md20=t(tdata[20,]-mu)%*%incov%*%(tdata[20,]-mu);md20
### md from the package and md computed manually are equal
tdata
s1=sd(tdata[,1]);s1
s2=sd(tdata[,2]);s2
xstar=c(tdata[,1])/s1;xstar
ystar=c(tdata[,2])/s2;ystar
md1=sqrt((-1.46787309)^2 + (0.1484462)^2);md1
md2=sqrt((-1.22516896 )^2 + ( 0.6020111 )^2);md2
### ………. ………… ……………..
### Not equal to the distances above. Why?
### mahalanobis() returns the squared distance, while sqrt() is applied here,
### so compare md1^2 and md2^2 with the values above (tdata is already mean-centred).
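Statistical Distance under Rotated Coordinate System
$O = (0, 0), \quad P = (\tilde{x}_1, \tilde{x}_2)$
$d(O,P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}}$
$\tilde{x}_1 = x_1\cos\theta + x_2\sin\theta$
$\tilde{x}_2 = -x_1\sin\theta + x_2\cos\theta$
$\tilde{s}_{11}$ and $\tilde{s}_{22}$ are the sample variances of $\tilde{x}_1$ and $\tilde{x}_2$.
• After some manipulation this can be written in terms of the original variables:
$d(O,P) = \sqrt{a_{11}x_1^2 + 2a_{12}x_1x_2 + a_{22}x_2^2}$
where
$a_{11} = \frac{\cos^2\theta}{\tilde{s}_{11}} + \frac{\sin^2\theta}{\tilde{s}_{22}}, \quad a_{22} = \frac{\sin^2\theta}{\tilde{s}_{11}} + \frac{\cos^2\theta}{\tilde{s}_{22}}, \quad a_{12} = \frac{\cos\theta\sin\theta}{\tilde{s}_{11}} - \frac{\sin\theta\cos\theta}{\tilde{s}_{22}}$
Proof
• $\tilde{s}_{11} = \cos^2\theta\, s_{11} + 2\sin\theta\cos\theta\, s_{12} + \sin^2\theta\, s_{22}$
• $\tilde{s}_{22} = \cos^2\theta\, s_{22} - 2\sin\theta\cos\theta\, s_{12} + \sin^2\theta\, s_{11}$
• Substituting $\tilde{x}_1$ and $\tilde{x}_2$ into $d^2(O,P)$ gives
$d^2(O,P) = \frac{(x_1\cos\theta + x_2\sin\theta)^2}{\tilde{s}_{11}} + \frac{(-x_1\sin\theta + x_2\cos\theta)^2}{\tilde{s}_{22}}$
• Expanding the squares and collecting the coefficients of $x_1^2$, $x_1x_2$ and $x_2^2$ gives $a_{11}$, $2a_{12}$ and $a_{22}$ as stated.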
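A numerical check of this algebra is sketched below. The data are simulated (an assumption, not the slides' data set), the test point `pt` is arbitrary, and θ is taken from the standard relation tan 2θ = 2 s12 / (s11 − s22), which makes the rotated coordinates uncorrelated.

```r
# Check that the rotated-coordinate distance equals the quadratic form
# a11*x1^2 + 2*a12*x1*x2 + a22*x2^2. Data and test point are assumed.
set.seed(1)
x1 <- rnorm(50, sd = 3)
x2 <- 0.8 * x1 + rnorm(50)
S  <- cov(cbind(x1, x2))
s11 <- S[1, 1]; s12 <- S[1, 2]; s22 <- S[2, 2]

theta <- 0.5 * atan2(2 * s12, s11 - s22)  # angle that removes the correlation
ct <- cos(theta); st <- sin(theta)

# Variances of the rotated coordinates
st11 <- ct^2 * s11 + 2 * st * ct * s12 + st^2 * s22
st22 <- ct^2 * s22 - 2 * st * ct * s12 + st^2 * s11

# Coefficients in terms of the original variables
a11 <- ct^2 / st11 + st^2 / st22
a22 <- st^2 / st11 + ct^2 / st22
a12 <- ct * st / st11 - st * ct / st22

pt <- c(2, -1)                            # arbitrary test point
d2_rot <- (pt[1] * ct + pt[2] * st)^2 / st11 +
          (-pt[1] * st + pt[2] * ct)^2 / st22
d2_a   <- a11 * pt[1]^2 + 2 * a12 * pt[1] * pt[2] + a22 * pt[2]^2
d2_S   <- as.numeric(t(pt) %*% solve(S) %*% pt)  # x' S^{-1} x
print(c(d2_rot, d2_a, d2_S))
```

With this choice of θ all three values should agree, which previews the link between the coefficient matrix and the inverse covariance matrix.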
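General Statistical Distance
$P = (x_1, x_2, \dots, x_p), \quad O = (0, 0, \dots, 0), \quad Q = (y_1, y_2, \dots, y_p)$
$d(O,P) = \sqrt{a_{11}x_1^2 + a_{22}x_2^2 + \cdots + a_{pp}x_p^2 + 2a_{12}x_1x_2 + 2a_{13}x_1x_3 + \cdots + 2a_{p-1,p}x_{p-1}x_p}$
$d(P,Q) = \sqrt{a_{11}(x_1 - y_1)^2 + a_{22}(x_2 - y_2)^2 + \cdots + a_{pp}(x_p - y_p)^2 + 2a_{12}(x_1 - y_1)(x_2 - y_2) + 2a_{13}(x_1 - y_1)(x_3 - y_3) + \cdots + 2a_{p-1,p}(x_{p-1} - y_{p-1})(x_p - y_p)}$
• The above distances are completely determined by the coefficients (weights) $a_{ik}$, $i, k = 1, 2, \dots, p$. These can be arranged in a rectangular array as
$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{12} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1p} & a_{2p} & \cdots & a_{pp} \end{pmatrix}$
This array (matrix) must be symmetric positive definite.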
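Why positive definite? Let A be a positive definite matrix. Then A can be factored as A = C'C, so that
X'AX = X'C'CX = (CX)'(CX) = Y'Y, where Y = CX.
This is a sum of squares that is positive for every X ≠ 0, so $\sqrt{X'AX}$ obeys all the distance properties. Different choices of A give different distances.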
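One way to obtain such a factor C in R is the Cholesky decomposition. The sketch below uses a hypothetical positive definite matrix `A` and an arbitrary vector `x`; `chol()` returns an upper triangular C with t(C) %*% C equal to A.

```r
# Factor a positive definite matrix as A = C'C and verify x'Ax = (Cx)'(Cx).
# A and x are assumed example values, not taken from the slides.
A <- matrix(c(2.0, 0.5,
              0.5, 1.0), nrow = 2, byrow = TRUE)
C <- chol(A)                         # upper triangular, t(C) %*% C == A

x  <- c(1, -2)
q1 <- as.numeric(t(x) %*% A %*% x)   # quadratic form x'Ax
y  <- C %*% x                        # transformed vector Y = CX
q2 <- sum(y^2)                       # Y'Y, a sum of squares
print(c(q1, q2))
```

Because the quadratic form equals a sum of squares of the transformed coordinates, it behaves like a squared Euclidean distance after the change of variables Y = CX.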
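• Why a positive definite matrix?
• Ans: Spectral decomposition: the spectral decomposition of a k×k symmetric matrix A is given by
$A = \sum_{i=1}^{k} \lambda_i e_i e_i'$
• where $(\lambda_i, e_i)$ are the eigenvalue-eigenvector pairs of A.
And $A^{-1} = \sum_{i=1}^{k} \frac{1}{\lambda_i} e_i e_i'$ if A is positive definite, and hence invertible.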
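The decomposition can be verified with `eigen()`. The matrix `A` below is an assumed example; `tcrossprod(v)` forms the outer product v v'.

```r
# Rebuild a symmetric matrix and its inverse from eigenvalues and eigenvectors.
# A is an assumed example matrix, not taken from the slides.
A <- matrix(c(2.0, 0.5,
              0.5, 1.0), nrow = 2, byrow = TRUE)
e  <- eigen(A)
e1 <- e$vectors[, 1]; e2 <- e$vectors[, 2]
l1 <- e$values[1];    l2 <- e$values[2]

A_rebuilt <- l1 * tcrossprod(e1) + l2 * tcrossprod(e2)   # sum of lambda_i e_i e_i'
A_inverse <- tcrossprod(e1) / l1 + tcrossprod(e2) / l2   # sum of (1/lambda_i) e_i e_i'

print(A_rebuilt)
print(A_inverse)
```

Both eigenvalues of this example are positive, which is what allows division by them in the inverse.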
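[Figure: scatter plot with the eigenvector directions $e_1$, $e_2$ and the eigenvalues $\lambda_1$, $\lambda_2$ marked]
• Suppose p = 2. The squared distance from the origin is
$d^2(O,P) = x'Ax = a_{11}x_1^2 + 2a_{12}x_1x_2 + a_{22}x_2^2$
By spectral decomposition, $A = \lambda_1 e_1e_1' + \lambda_2 e_2e_2'$, so
$x'Ax = \lambda_1 (x'e_1)^2 + \lambda_2 (x'e_2)^2$
Points at constant distance c satisfy $x'Ax = c^2$, an ellipse with half-axes of length $c/\sqrt{\lambda_1}$ and $c/\sqrt{\lambda_2}$ along $e_1$ and $e_2$.
[Figure: ellipse of constant distance c in the (X1, X2) plane, with axes along $e_1$ and $e_2$]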
Another property is
Thus
We use this property in Mahalanobis distance
Necessity of Statistical Distance
[Figure: cluster of points with its center of gravity Q, another point P, and the origin O]
• Consider the Euclidean distances from the point Q to the point P and to the origin O.
• Obviously d(Q,P) > d(Q,O).
But P appears to be more like the points in the cluster than does the origin.
If we take into account the variability of the points in the cluster and measure distance by statistical distance, then Q will be closer to P than to O.
Mahalanobis distance
• The Mahalanobis distance is a descriptive statistic that provides a relative measure of a data point's distance from a common point. It is a unitless measure introduced by P. C. Mahalanobis in 1936
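Intuition of Mahalanobis Distance
• Recall the equation
$d(O,P) = \sqrt{x'Ax} \;\Rightarrow\; d^2(O,P) = x'Ax$, where $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$, $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{pmatrix}$
In p dimensions, $d(O,P) = \sqrt{x'Ax}$, where $x = (x_1, x_2, \dots, x_p)'$ and $A = (a_{ik})$ is the p×p symmetric matrix of weights.
For two points, $d(P,Q) = \sqrt{(x - y)'A(x - y)}$, where A is the same matrix of weights.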
Mahalanobis Distance
• Mahalanobis used $\Sigma^{-1}$, the inverse of the covariance matrix, instead of A.
• Thus $d^2(P,Q) = (x - y)'\Sigma^{-1}(x - y)$ ……………..(1)
• And he used the mean vector $\mu$ instead of y: $d^2(P,\mu) = (x - \mu)'\Sigma^{-1}(x - \mu)$ ………..(2)
• The above equations are nothing but the Mahalanobis distance.
• For example, suppose we took a single observation from a bivariate population with Variable X and Variable Y, whose mean is at X = 500, Y = 500.
• For the single observation X = 410 and Y = 400, the Mahalanobis distance is 1.825.
• Therefore, our single observation would have a distance of 1.825 standardized units from the mean (the mean is at X = 500, Y = 500).
• If we took many such observations, graphed them and coloured them according to their Mahalanobis values, we would see the elliptical Mahalanobis regions emerge.
• The points are actually distributed along two primary axes.
[Figure: scatter plot with the two primary axes drawn]
• If we calculate Mahalanobis distances for each of these points and shade them according to their distance value, clear elliptical patterns emerge.
[Figure: scatter plot shaded by Mahalanobis distance]
• We can also draw actual ellipses at regions of constant Mahalanobis values.
[Figure: ellipses labelled 68% obs, 95% obs and 99.7% obs]
• Which ellipse do you choose? Ans: Use the 68-95-99.7 rule, if the data are normal:
1) about two-thirds (68%) of the points should be within 1 unit of the origin (along the axis)
2) about 95% should be within 2 units
3) about 99.7% should be within 3 units
Sample Mahalanobis Distance
• The sample Mahalanobis distance is obtained by replacing $\Sigma$ by S and $\mu$ by $\bar{X}$,
• i.e. $(X - \bar{X})' S^{-1} (X - \bar{X})$
Distribution of Mahalanobis distance
Let $X_1, X_2, \dots, X_n$ be independent observations from any population with mean $\mu$ and finite (nonsingular) covariance $\Sigma$. Then
$\sqrt{n}(\bar{X} - \mu)$ is approximately $N_p(0, \Sigma)$
and
$n(\bar{X} - \mu)' S^{-1} (\bar{X} - \mu)$ is approximately $\chi^2_p$
for n − p large. This is a consequence of the central limit theorem.
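A small simulation can illustrate the chi-square approximation. Everything here is an assumption made for illustration: the population is taken to be p = 3 independent Exp(1) variables (deliberately non-normal, with mean 1 in each coordinate), and n = 200, the number of replications and the seed are arbitrary.

```r
# Simulate n * (xbar - mu)' S^{-1} (xbar - mu) for a non-normal population
# and compare it with the chi-square(p) distribution. All settings are assumed.
set.seed(1)
n <- 200; p <- 3; reps <- 2000
mu <- rep(1, p)                       # mean of each Exp(1) coordinate

stat <- replicate(reps, {
  x    <- matrix(rexp(n * p), ncol = p)
  xbar <- colMeans(x)
  n * mahalanobis(xbar, mu, cov(x))   # mahalanobis() returns the squared distance
})

# Share of simulated statistics below the 95% chi-square quantile
coverage <- mean(stat <= qchisq(0.95, df = p))
print(coverage)
```

If the approximation is reasonable, the printed share should be close to 0.95; a Q-Q plot of `stat` against `qchisq(ppoints(reps), df = p)` gives a more detailed picture.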
Mahalanobis distance in R
########### Mahalanobis Distance ##########
x=rnorm(100);x
dm=matrix(x,nrow=20,ncol=5,byrow=FALSE);dm ## dm = data matrix
cm=colMeans(dm);cm ## cm = column means
cov=cov(dm);cov ## cov = covariance matrix
incov=solve(cov);incov ## incov = inverse of covariance matrix
####### MAHALANOBIS DISTANCE : MANUALLY ######
#### Mahalanobis distance of first observation ####
ob1=dm[1,];ob1 ## first observation
mv1=ob1-cm;mv1 ## deviation of first observation from center of gravity
md1=t(mv1)%*%incov%*%mv1;md1 ## (squared) mahalanobis distance of first observation from center of gravity
#### Mahalanobis distance of second observation ####
ob2=dm[2,];ob2 ## second observation
mv2=ob2-cm;mv2 ## deviation of second observation from center of gravity
md2=t(mv2)%*%incov%*%mv2;md2 ## (squared) mahalanobis distance of second observation from center of gravity
## ................ ……………… …..……………
#### Mahalanobis distance of 20th observation ####
ob20=dm[20,];ob20 ## 20th observation (row 20, not column 20)
mv20=ob20-cm;mv20 ## deviation of 20th observation from center of gravity
md20=t(mv20)%*%incov%*%mv20;md20 ## (squared) mahalanobis distance of 20th observation from center of gravity
####### MAHALANOBIS DISTANCE : PACKAGE ########
md=mahalanobis(dm,cm,cov,inverted=FALSE);md ## md = (squared) mahalanobis distance
md=mahalanobis(dm,cm,cov);md
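The row-by-row manual calculation can also be written in vectorized form. The sketch below is one possible way to do it, not the presenter's method: it regenerates a 20 by 5 random data matrix with an arbitrary seed and uses the names `S`, `dev` and `md_manual`, which are illustrative.

```r
# Squared Mahalanobis distances for all rows at once, compared with mahalanobis().
# Seed and variable names are assumptions made for this sketch.
set.seed(1)
dm <- matrix(rnorm(100), nrow = 20, ncol = 5)   # data matrix: 20 observations, 5 variables
cm <- colMeans(dm)                              # center of gravity
S  <- cov(dm)                                   # sample covariance matrix

dev       <- sweep(dm, 2, cm)                   # subtract the column means from each row
md_manual <- rowSums((dev %*% solve(S)) * dev)  # (x - cm)' S^{-1} (x - cm) for every row
md_pkg    <- mahalanobis(dm, cm, S)             # built-in, also returns squared distances

print(round(cbind(md_manual, md_pkg), 4))
```

Each entry of `md_manual` corresponds to one of the per-observation products computed by hand, so the two columns should match row for row.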
Another example
x <- matrix(rnorm(100*3), ncol = 3)
Sx <- cov(x)
D2 <- mahalanobis(x, colMeans(x), Sx)
plot(density(D2, bw = 0.5), main="Squared Mahalanobis distances, n=100, p=3")
qqplot(qchisq(ppoints(100), df = 3), D2, main = expression("Q-Q plot of Mahalanobis" * ~D^2 * " vs. quantiles of" * ~ chi[3]^2))
abline(0, 1, col = 'gray')
?mahalanobis
Acknowledgement
Prof. Mohammad Nasser, Richard A. Johnson & Dean W. Wichern, & others
THANK YOU ALL
Necessity of Statistical Distance
[Figure: a student in a mess. At home: mother. In the mess: female maid.]