the statistical analysis of compositional data: the aitchison … · 2011-08-09 · v....
V. Pawlowsky-Glahn and
J. J. Egozcue
CoDa historical remarks sample space Aitchison geometry final comments
The statistical analysis of compositional data:
The Aitchison geometry
Prof. Dr. Vera Pawlowsky-Glahn, Prof. Dr. Juan Jose Egozcue, Ass. Prof. Dr. Rene Meziat
Instituto Colombiano del Petroleo, Piedecuesta, Santander, Colombia
March 20–23, 2007
IAMG Distinguished Lecturer – 2007
Prof. Dr. Vera Pawlowsky-Glahn
Department of Computer Science and Applied Mathematics, University of Girona, Spain
recall
compositional data are parts of some whole which only carry relative information
usual units of measurement: parts per unit, percentages, ppm, ppb, concentrations, ...
historically: data subject to a constant sum constraint
examples: geochemical analysis; (sand, silt, clay) composition; proportions of minerals in a rock; ...
historical remarks: end of the XIXth century
Karl Pearson, 1897: “On a form of spurious correlation which may arise when indices are used in the measurement of organs”
he was the first to point out dangers that may befall the analyst who attempts to interpret correlations between ratios whose numerators and denominators contain common parts
the closure problem was stated within the framework of classical statistics, and thus within the framework of Euclidean geometry in real space
the problem: negative bias & spurious correlation
example: scientists A and B record the composition of aliquots of soil samples; A records (animal, vegetable, mineral, water) compositions, B records (animal, vegetable, mineral) after drying the sample; both are absolutely accurate (adapted from Aitchison, 2005)

sample A   x1     x2     x3     x4
1          0.1    0.2    0.1    0.6
2          0.2    0.1    0.2    0.5
3          0.3    0.3    0.1    0.3

sample B   x′1    x′2    x′3
1          0.25   0.50   0.25
2          0.40   0.20   0.40
3          0.43   0.43   0.14

corr A     x1     x2     x3     x4
x1         1.00   0.50   0.00   -0.98
x2                1.00   -0.87  -0.65
x3                       1.00   0.19
x4                              1.00

corr B     x′1    x′2    x′3
x′1        1.00   -0.57  -0.05
x′2               1.00   -0.79
x′3                      1.00
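The two correlation tables can be reproduced with a few lines of code. The sketch below is an illustration, not part of the original slides: the helper names `closure`, `pearson` and `corr_matrix` are hypothetical, and B's data are assumed to be the closure of the first three parts of A's data, which is how the example describes drying the sample.

```python
import math

def closure(z, kappa=1.0):
    """Rescale positive parts so that they sum to kappa."""
    total = sum(z)
    return [kappa * v / total for v in z]

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def corr_matrix(rows):
    """Correlation matrix between the columns (parts) of a data table."""
    cols = list(zip(*rows))
    return [[pearson(ci, cj) for cj in cols] for ci in cols]

# Scientist A: (animal, vegetable, mineral, water), values from the slide.
A = [
    [0.1, 0.2, 0.1, 0.6],
    [0.2, 0.1, 0.2, 0.5],
    [0.3, 0.3, 0.1, 0.3],
]
# Scientist B: the same samples after drying, i.e. the closed subcomposition
# of the first three parts (assumption stated in the lead-in).
B = [closure(row[:3]) for row in A]

corr_A = corr_matrix(A)
corr_B = corr_matrix(B)

# Same two parts, same samples, opposite sign of the correlation.
print("corr A (x1, x2):  ", round(corr_A[0][1], 2))
print("corr B (x'1, x'2):", round(corr_B[0][1], 2))
```

The correlation between the first two parts changes sign once water is removed, although neither scientist made a measurement error.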
historical remarks: from 1897 to 1980 (and beyond)
the fact that correlations between closed data are induced by numerical constraints caused Felix Chayes to attempt to separate the spurious part from the real correlation
(“On correlation between variables of constant sum”, 1960)
many studied the effects of closure on methods related to correlation and covariance analysis (principal component analysis, partial and canonical correlation analysis) or distances (cluster analysis)
an exhaustive search was initiated within the framework of classical (applied) statistics
historical remarks: end of the XXth century
John Aitchison, 1982, 1986: “The statistical analysis of compositional data”
key idea: compositional data represent parts of some whole; they only carry relative information
by analogy with the log-normal approach, Aitchison projected the sample space of compositional data, the D-part simplex $S^D$, to real space $\mathbb{R}^{D-1}$ or $\mathbb{R}^D$, using log-ratio transformations
the log-ratio approach was born ...
compositional data: definition
definition: parts of some whole which carry only relative information ⇐⇒ compositional data are equivalence classes
[figure, two panels: compositional data in $\mathbb{R}^2$ (axes X1, X2); compositional data in $\mathbb{R}^3$]
usual representation: subject to a constant sum constraint
compositional data: usual representation
definition: $x = [x_1, x_2, \ldots, x_D]$ is a D-part composition
⇐⇒
$x_i > 0$ for all $i = 1, \ldots, D$, and $\sum_{i=1}^{D} x_i = \kappa$ (constant)
κ = 1 ⇐⇒ measurements in parts per unit
κ = 100 ⇐⇒ measurements in percent
other frequent units: ppm, ppb, ...
a subcomposition $x_s$ with $s$ parts is obtained as the closure of a subvector $[x_{i_1}, x_{i_2}, \ldots, x_{i_s}]$ of $x$
the simplex as sample space
$S^D = \left\{ x = [x_1, x_2, \ldots, x_D] \;\middle|\; x_i > 0;\ \sum_{i=1}^{D} x_i = \kappa \right\}$
standard representation for D = 3: the ternary diagram
[figure: ternary diagram with vertices X1, X2, X3 and a point with parts x1, x2, x3]
example 1: genetic hypothesis
[figure: ternary diagram with vertices MM, MN, NN]
data: genotypes in the MN system of blood groups; code: Ab = Aborigines; Ch = Chinese; In = Indian; AmIn = American Indian; Es = Eskimo
question: despite the high variability which can be observed, is there any inherent stability in the data? do they follow any genetic law?
requirements for a proper analysis
scale invariance: the analysis should not depend on the closure constant κ
permutation invariance: the order of the parts should be irrelevant
subcompositional coherence: studies performed on subcompositions should not stand in contradiction with those performed on the full composition
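A small numerical check may help to see why ratios of parts are natural quantities under these three requirements. The sketch below is an illustration under stated assumptions: the helper names `closure` and `log_ratio` are hypothetical, and the composition used is sample 1 of scientist A from the earlier example.

```python
import math

def closure(z, kappa=1.0):
    """Rescale positive parts so that they sum to kappa."""
    total = sum(z)
    return [kappa * v / total for v in z]

def log_ratio(x, i, j):
    """Log-ratio ln(x_i / x_j) of two parts (0-based indices)."""
    return math.log(x[i] / x[j])

x = closure([0.1, 0.2, 0.1, 0.6])        # parts per unit (kappa = 1)

# Scale invariance: the same composition expressed in percent (kappa = 100).
x_percent = closure(x, kappa=100.0)

# Permutation invariance: reorder the parts; x1 moves to index 1, x2 to index 3.
x_permuted = [x[2], x[0], x[3], x[1]]

# Subcompositional coherence: keep only the first three parts and re-close.
x_sub = closure(x[:3])

lr_unit = log_ratio(x, 0, 1)
lr_percent = log_ratio(x_percent, 0, 1)
lr_permuted = log_ratio(x_permuted, 1, 3)
lr_sub = log_ratio(x_sub, 0, 1)

print(lr_unit, lr_percent, lr_permuted, lr_sub)
```

All four values agree up to rounding, whereas the raw value of the first part differs between `x`, `x_percent` and `x_sub`.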
why a new geometry on the simplex?
in real space we add vectors, we multiply them by a constant, we look for orthogonality between vectors, we look for distances between points, ...
possible because $\mathbb{R}^D$ is a linear vector space
BUT Euclidean geometry is not a proper geometry for compositional data because
results might not be in the simplex when we add compositional vectors, multiply them by a constant, or compute confidence regions
Euclidean differences are not always reasonable: from 0.05% to 0.10% the amount is doubled; from 50.05% to 50.10% the increase is negligible
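Both objections can be checked numerically. In the sketch below the compositions `x` and `y` and the helper `in_simplex` are hypothetical choices made for illustration; the percentages are the ones quoted in the slide.

```python
def in_simplex(x, kappa=1.0, tol=1e-12):
    """True if all parts are positive and they sum to kappa."""
    return all(v > 0 for v in x) and abs(sum(x) - kappa) < tol

x = [0.2, 0.3, 0.5]
y = [0.6, 0.3, 0.1]

# Ordinary vector addition and scalar multiplication leave the simplex.
euclid_sum = [a + b for a, b in zip(x, y)]   # parts sum to 2, not 1
scaled = [-1.0 * a for a in x]               # parts become negative

print(in_simplex(x), in_simplex(euclid_sum), in_simplex(scaled))

# Same absolute difference, very different relative change.
small_abs = 0.10 - 0.05
large_abs = 50.10 - 50.05
small_ratio = 0.10 / 0.05      # the amount is doubled
large_ratio = 50.10 / 50.05    # the amount barely changes

print(small_abs, large_abs)
print(small_ratio, large_ratio)
```

The Euclidean difference treats both changes as equal, while the ratios show that only the first one doubles the amount.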
basic operations
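closure of $z = [z_1, z_2, \ldots, z_D] \in \mathbb{R}^D_+$
$\mathcal{C}[z] = \left[ \frac{\kappa \cdot z_1}{\sum_{i=1}^{D} z_i}, \frac{\kappa \cdot z_2}{\sum_{i=1}^{D} z_i}, \cdots, \frac{\kappa \cdot z_D}{\sum_{i=1}^{D} z_i} \right]$
perturbation of $x \in S^D$ by $y \in S^D$
$x \oplus y = \mathcal{C}[x_1 y_1, x_2 y_2, \ldots, x_D y_D]$
powering of $x \in S^D$ by $\alpha \in \mathbb{R}$
$\alpha \odot x = \mathcal{C}[x_1^{\alpha}, x_2^{\alpha}, \ldots, x_D^{\alpha}]$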
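The three operations translate almost literally into code. The following is a minimal sketch, not a reference implementation: the function names `closure`, `perturb` and `power` are hypothetical, and the composition `x` is an arbitrary illustration. The perturbation vector and the exponent are the ones used in the figure on the next slide.

```python
def closure(z, kappa=1.0):
    """C[z]: rescale a vector of positive reals so that its parts sum to kappa."""
    total = sum(z)
    return [kappa * v / total for v in z]

def perturb(x, y, kappa=1.0):
    """x (+) y: closed component-wise product."""
    return closure([a * b for a, b in zip(x, y)], kappa)

def power(alpha, x, kappa=1.0):
    """alpha (.) x: closed component-wise power."""
    return closure([a ** alpha for a in x], kappa)

x = closure([1.0, 2.0, 7.0])     # hypothetical 3-part composition
p = [0.1, 0.1, 0.8]              # perturbation vector from the slides

shifted = perturb(x, p)          # moves x towards the third vertex
shrunk = power(0.2, x)           # pulls x towards the barycentre
centre = power(0.0, x)           # exponent 0 gives the barycentre itself

print(shifted)
print(shrunk)
print(centre)
```

Because the first two parts of `p` are equal, the ratio between the first two parts of `x` is unchanged by the perturbation; only their ratio to the third part moves.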
interpretation of perturbation and powering
[figure: two ternary diagrams with vertices A, B, C]
left: perturbation of initial compositions (◦) by p = [0.1, 0.1, 0.8] resulting in compositions (⋆)
right: powering of compositions (⋆) by α = 0.2 resulting in compositions (◦)
comments
closure = projection of a point in $\mathbb{R}^D_+$ on $S^D$
points on a ray are projected onto the same point
a ray in $\mathbb{R}^D_+$ is an equivalence class
the point on $S^D$ is a representative of the class
a generalization to other representatives is possible
for $z \in \mathbb{R}^D_+$ and $x \in S^D$, $x \oplus (\alpha \odot z) = x \oplus (\alpha \odot \mathcal{C}[z])$
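vector space structure of $(S^D, \oplus, \odot)$
commutative group structure of $(S^D, \oplus)$
1 commutativity: $x \oplus y = y \oplus x$
2 associativity: $(x \oplus y) \oplus z = x \oplus (y \oplus z)$
3 neutral element: $e = \mathcal{C}[1, 1, \ldots, 1]$ = barycentre of $S^D$
4 inverse of $x$: $x^{-1} = \mathcal{C}[x_1^{-1}, x_2^{-1}, \ldots, x_D^{-1}]$ ⇒ $x \oplus x^{-1} = e$ and $x \oplus y^{-1} = x \ominus y$
properties of powering
1 associativity: $\alpha \odot (\beta \odot x) = (\alpha \cdot \beta) \odot x$
2 distributivity 1: $\alpha \odot (x \oplus y) = (\alpha \odot x) \oplus (\alpha \odot y)$
3 distributivity 2: $(\alpha + \beta) \odot x = (\alpha \odot x) \oplus (\beta \odot x)$
4 neutral element: $1 \odot x = x$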
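The eight properties above can be spot-checked numerically. This is only a sanity check on one set of values, not a proof; the compositions, the scalars and the helper names (`closure`, `perturb`, `power`, `inverse`, `same`) are hypothetical choices for illustration, with κ = 1 assumed throughout.

```python
import math

def closure(z):
    """Closure to kappa = 1."""
    total = sum(z)
    return [v / total for v in z]

def perturb(x, y):
    return closure([a * b for a, b in zip(x, y)])

def power(alpha, x):
    return closure([a ** alpha for a in x])

def inverse(x):
    """Inverse for perturbation: closure of the reciprocal parts."""
    return closure([1.0 / v for v in x])

def same(x, y):
    """Component-wise equality up to floating-point rounding."""
    return all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(x, y))

x = closure([1.0, 2.0, 7.0])
y = closure([3.0, 3.0, 4.0])
z = closure([5.0, 1.0, 4.0])
alpha, beta = 0.7, -1.3
e = closure([1.0, 1.0, 1.0])     # neutral element (barycentre)

checks = {
    "commutativity": same(perturb(x, y), perturb(y, x)),
    "associativity": same(perturb(perturb(x, y), z), perturb(x, perturb(y, z))),
    "neutral element": same(perturb(x, e), x),
    "inverse": same(perturb(x, inverse(x)), e),
    "power associativity": same(power(alpha, power(beta, x)), power(alpha * beta, x)),
    "distributivity 1": same(power(alpha, perturb(x, y)),
                             perturb(power(alpha, x), power(alpha, y))),
    "distributivity 2": same(power(alpha + beta, x),
                             perturb(power(alpha, x), power(beta, x))),
    "power neutral": same(power(1.0, x), x),
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```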
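inner product space structure of $(S^D, \oplus, \odot)$
inner product: $\langle x, y \rangle_a = \frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \ln\frac{x_i}{x_j} \ln\frac{y_i}{y_j}$, $x, y \in S^D$
norm: $\|x\|_a = \sqrt{\frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \left( \ln\frac{x_i}{x_j} \right)^2}$, $x \in S^D$
distance: $d_a(x, y) = \sqrt{\frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \left( \ln\frac{x_i}{x_j} - \ln\frac{y_i}{y_j} \right)^2}$, $x, y \in S^D$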
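The double sums can be coded directly. The sketch below does so and, as a cross-check, compares the result with the ordinary dot product of centred log-ratio (clr) vectors, i.e. logarithms of the parts minus their mean. The clr form is not on the slide; it is a standard equivalent expression and is included here only as a check. Function names and test compositions are hypothetical.

```python
import math

def inner(x, y):
    """Aitchison inner product, written as the double sum over all pairs of parts."""
    D = len(x)
    s = 0.0
    for i in range(D):
        for j in range(D):
            s += math.log(x[i] / x[j]) * math.log(y[i] / y[j])
    return s / (2 * D)

def norm(x):
    """Aitchison norm: square root of the inner product of x with itself."""
    return math.sqrt(inner(x, x))

def distance(x, y):
    """Aitchison distance, written as the double sum of log-ratio differences."""
    D = len(x)
    s = 0.0
    for i in range(D):
        for j in range(D):
            s += (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
    return math.sqrt(s / (2 * D))

def clr(x):
    """Centred log-ratio: log parts minus their mean (used only as a cross-check)."""
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [l - mean for l in logs]

x = [0.1, 0.2, 0.7]
y = [0.3, 0.3, 0.4]

inner_clr = sum(a * b for a, b in zip(clr(x), clr(y)))
dist_clr = math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

print(inner(x, y), inner_clr)
print(distance(x, y), dist_clr)
```

Only ratios of parts enter the formulas, so multiplying `x` or `y` by any positive constant leaves all three quantities unchanged.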
Aitchison geometry on the simplex
properties of the Aitchison geometry
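distance and perturbation: $d_a(p \oplus x, p \oplus y) = d_a(x, y)$
distance and powering: $d_a(\alpha \odot x, \alpha \odot y) = |\alpha| \, d_a(x, y)$
compositional lines: $y = x_0 \oplus (\alpha \odot x)$ ($x_0$ = starting point, $x$ = leading vector)
orthogonal lines: $y_1 = x_0 \oplus (\alpha_1 \odot x_1)$, $y_2 = x_0 \oplus (\alpha_2 \odot x_2)$, $y_1 \perp y_2 \iff \langle x_1, x_2 \rangle_a = 0$ (the inner product of the leading vectors is zero)
parallel lines: $y_1 = x_0 \oplus (\alpha \odot x) \parallel y_2 = p \oplus x_0 \oplus (\alpha \odot x)$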
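The two distance properties, and their consequence for points on a compositional line, can be verified numerically. The sketch below is a spot check on hypothetical values (the compositions, `p`, `alpha` and all function names are illustrative assumptions, with κ = 1). The third check uses the fact, which follows from the first two properties, that the point $x_0 \oplus (\alpha \odot x)$ lies at distance $|\alpha| \, \|x\|_a$ from $x_0$.

```python
import math

def closure(z):
    total = sum(z)
    return [v / total for v in z]

def perturb(x, y):
    return closure([a * b for a, b in zip(x, y)])

def power(alpha, x):
    return closure([a ** alpha for a in x])

def distance(x, y):
    """Aitchison distance as the double sum of log-ratio differences."""
    D = len(x)
    s = 0.0
    for i in range(D):
        for j in range(D):
            s += (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
    return math.sqrt(s / (2 * D))

x = closure([1.0, 2.0, 7.0])
y = closure([3.0, 3.0, 4.0])
p = closure([0.1, 0.1, 0.8])
alpha = -2.5
e = closure([1.0, 1.0, 1.0])     # neutral element; norm(x) = distance(x, e)

# Perturbation acts like a translation: distances are preserved.
d_plain = distance(x, y)
d_shifted = distance(perturb(p, x), perturb(p, y))

# Powering acts like scalar multiplication: distances scale by |alpha|.
d_scaled = distance(power(alpha, x), power(alpha, y))

# A point on the line through x0 = y with leading vector x.
point = perturb(y, power(alpha, x))
d_line = distance(y, point)

print(d_plain, d_shifted)
print(d_scaled, abs(alpha) * d_plain)
print(d_line, abs(alpha) * distance(x, e))
```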
orthogonal compositional lines
[figure: two ternary diagrams with vertices x, y, z]
orthogonal grids in $S^3$, equally spaced, 1 unit in Aitchison distance; the right grid is rotated 45° with respect to the left grid
circles and other geometric figures
[figure: ternary diagram with vertices x1, x2, x3]
advantages of Euclidean spaces
orthonormal basis can be constructed: $\{e_1, \ldots, e_{D-1}\}$
coordinates obey the rules of real Euclidean space:
$x \in S^D \Rightarrow y = [y_1, \ldots, y_{D-1}] \in \mathbb{R}^{D-1}$, with $y_i = \langle x, e_i \rangle_a$
standard methods can be directly applied to coordinates
expressing results as compositions is easy:
if $h : S^D \to \mathbb{R}^{D-1}$ assigns to each $x \in S^D$ its coordinates, i.e. $h(x) = y$, then
$h^{-1}(y) = x = \bigoplus_{i=1}^{D-1} y_i \odot e_i$
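The slide does not say which orthonormal basis is used. The sketch below assumes one particular choice, whose i-th element has clr image $\sqrt{i/(i+1)} \, [1/i, \ldots, 1/i, -1, 0, \ldots, 0]$ with $i$ leading entries equal to $1/i$; any other orthonormal basis would serve equally well. The function names are hypothetical, κ = 1 is assumed, and the inner product is computed as the dot product of clr vectors, a standard form equivalent to the double-sum definition.

```python
import math

def closure(z):
    total = sum(z)
    return [v / total for v in z]

def perturb(x, y):
    return closure([a * b for a, b in zip(x, y)])

def power(alpha, x):
    return closure([a ** alpha for a in x])

def clr(x):
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [l - mean for l in logs]

def inner(x, y):
    """Aitchison inner product as the dot product of clr vectors."""
    return sum(a * b for a, b in zip(clr(x), clr(y)))

def basis(D):
    """One possible orthonormal basis e_1..e_{D-1} of the D-part simplex (assumed choice)."""
    es = []
    for i in range(1, D):
        scale = math.sqrt(i / (i + 1))
        v = [scale / i] * i + [-scale] + [0.0] * (D - i - 1)
        es.append(closure([math.exp(c) for c in v]))
    return es

def h(x):
    """Coordinates y_i = <x, e_i>_a of a composition."""
    return [inner(x, e) for e in basis(len(x))]

def h_inv(y):
    """Back-transform: perturb together the powered basis elements y_i (.) e_i."""
    D = len(y) + 1
    x = closure([1.0] * D)               # start from the neutral element
    for yi, ei in zip(y, basis(D)):
        x = perturb(x, power(yi, ei))
    return x

x = closure([1.0, 2.0, 7.0])
z = closure([3.0, 3.0, 4.0])

y = h(x)                 # D - 1 = 2 real coordinates
x_back = h_inv(y)        # recovers x up to rounding

# Ordinary Euclidean distance between coordinates vs. Aitchison distance.
d_coords = math.dist(h(x), h(z))
d_aitchison = math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(z))))

print(y)
print(x_back)
print(d_coords, d_aitchison)
```

Once data are expressed through `h`, any standard multivariate method can be run on the coordinates, and `h_inv` carries the results back to the simplex.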
conclusions
the Aitchison geometry of the simplex offers a new tool to analyse CoDa
the geometry is apparently complex, but it is completely equivalent to standard Euclidean geometry in real space
the key is to use a proper representation in coordinates