Interesting Statistical Problem
For HDLSS (High Dimension, Low Sample Size) data: when clusters seem to appear
(e.g. found by a clustering method),
how do we know they are really there?
(Question asked by Neil Hayes)
Define appropriate statistical significance?
Can we calculate it?
Statistical Significance of Clusters
Basis of SigClust Approach
What defines a single cluster? A Gaussian distribution (Sarle & Kou 1993)
So define SigClust test based on:
• 2-means cluster index (CI) as statistic
• Gaussian null distribution
• Currently compute by simulation
• Possible to do this analytically?
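The slides name the 2-means cluster index (CI) without writing it out. A minimal sketch, assuming the usual definition (within-class sum of squares divided by total sum of squares about the overall mean); the function name `cluster_index` and the toy data are illustrative only:

```python
import numpy as np

def cluster_index(X, labels):
    """Cluster index: within-class sum of squares / total sum of squares.

    X is an (n, d) array (n cases, d dimensions); labels has length n.
    Values near 0 suggest tight, well-separated classes; values near 1 do not.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = np.sum((X - X.mean(axis=0)) ** 2)
    within = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        within += np.sum((Xk - Xk.mean(axis=0)) ** 2)
    return within / total

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
one_blob = rng.normal(size=(100, 5))                    # a single Gaussian
two_blobs = np.vstack([rng.normal(-5, 1, size=(50, 5)),
                       rng.normal(5, 1, size=(50, 5))])  # two separated groups
ci_one = cluster_index(one_blob, labels)
ci_two = cluster_index(two_blobs, labels)
```

Smaller CI means stronger clustering, which is why significance later corresponds to small values of CI.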
SigClust Gaussian null distribution
Which Gaussian (for null)?
Standard (sphered) normal?
• No, not realistic
• Rejection not strong evidence for clustering
• Could also get that from an aspherical Gaussian
Need Gaussian more like data:
• Need full model
• Challenge: parameter estimation
• Recall HDLSS context
SigClust Gaussian null distribution
Estimated mean (of Gaussian distribution)?
1st Key Idea: Can ignore this
• By appealing to shift invariance of CI
• When data are (rigidly) shifted, CI remains the same
• So enough to simulate with mean 0
• Other uses of invariance ideas?
SigClust Gaussian null distribution
2nd Key Idea: Mod Out Rotations
• Replace full Cov by diagonal matrix
• As done in PCA eigen-analysis: $\Sigma = M D M^t$
• But then "not like data"?
• OK, since k-means clustering (i.e. CI) is rotation invariant
(assuming e.g. Euclidean distance)
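The two invariance ideas can be checked numerically. A small sketch, assuming the within/total sum-of-squares form of CI and fixed class labels (the shift vector and the random orthogonal matrix `Q` are illustrative):

```python
import numpy as np

def cluster_index(X, labels):
    """Within-class sum of squares divided by total sum of squares."""
    total = np.sum((X - X.mean(axis=0)) ** 2)
    within = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                 for k in np.unique(labels))
    return within / total

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6)) * np.array([5.0, 3.0, 1.0, 1.0, 1.0, 1.0])
labels = (X[:, 0] > 0).astype(int)

shift = 10.0 * rng.normal(size=6)               # rigid shift of all points
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))    # random orthogonal matrix

ci_orig = cluster_index(X, labels)
ci_shift = cluster_index(X + shift, labels)     # shift invariance
ci_rot = cluster_index(X @ Q, labels)           # rotation invariance
```

Because shifts and rotations preserve all Euclidean distances, the 2-means labels themselves are also unchanged, so it is enough to simulate from a mean-0 Gaussian with diagonal covariance.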
SigClust Gaussian null distribution
2nd Key Idea: Mod Out Rotations
• Only need to estimate diagonal matrix
• But still have HDLSS problems?
E.g. Perou 500 data:
• Dimension $d = 9674$
• Sample size $n = 533$
• Still need to estimate $d = 9674$ parameters
SigClust Gaussian null distribution
3rd Key Idea: Factor Analysis Model
Model covariance as Biology + Noise:
$\Sigma = \Sigma_B + \sigma_N^2 I$
where
• $\Sigma_B$ is "fairly low dimensional"
• $\sigma_N^2$ is estimated from background noise
SigClust Gaussian null distribution
Estimation of background noise $\sigma_N^2$:
Reasonable model (for each gene):
Expression = Signal + Noise
• "noise" is roughly Gaussian
• "noise" terms essentially independent (across genes)
SigClust Estimation of Background Noise
Hope: most entries are "pure noise" (Gaussian)
A few (<< 1/4) are biological signal – outliers
How to check?
Q-Q plots
Background: Graphical Goodness of Fit
Basis: Cumulative Distribution Function (CDF)
$F(x) = P\{X \le x\}$
Probability quantile notation:
$p = F(q)$, $q = F^{-1}(p)$
for "probability" $p$ and quantile $q$
Q-Q plots
Two types of CDF:
1. Theoretical: $p = F(q) = P\{X \le q\}$
2. Empirical, based on data $X_1, \dots, X_n$: $\hat{p} = \hat{F}(q) = \#\{i : X_i \le q\} / n$
Q-Q plots
Comparison visualizations
(compare a theoretical with an empirical)
3. P-P plot: plot $\hat{p}$ vs. $p$ for a grid of $q$ values
4. Q-Q plot: plot $\hat{q}$ vs. $q$ for a grid of $p$ values
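The two comparison plots can be computed directly from the definitions above. A minimal sketch against a standard normal reference, using `statistics.NormalDist` for the theoretical CDF and quantiles; the function names `pp_points` and `qq_points` and the grids are illustrative assumptions:

```python
import numpy as np
from statistics import NormalDist

def pp_points(data, q_grid, dist=NormalDist()):
    """P-P coordinates: theoretical p = F(q) and empirical p-hat = #{X_i <= q} / n."""
    data = np.sort(np.asarray(data, dtype=float))
    p_theory = np.array([dist.cdf(q) for q in q_grid])
    p_emp = np.searchsorted(data, q_grid, side="right") / data.size
    return p_theory, p_emp

def qq_points(data, p_grid, dist=NormalDist()):
    """Q-Q coordinates: theoretical q = F^{-1}(p) and empirical quantile q-hat."""
    q_theory = np.array([dist.inv_cdf(p) for p in p_grid])
    q_emp = np.quantile(np.asarray(data, dtype=float), p_grid)
    return q_theory, q_emp

rng = np.random.default_rng(2)
x = rng.normal(size=2000)                    # toy data, truly Gaussian
q_grid = np.linspace(-3.0, 3.0, 61)          # grid of q values for the P-P plot
p_grid = np.linspace(0.01, 0.99, 99)         # grid of p values for the Q-Q plot
p_t, p_e = pp_points(x, q_grid)
q_t, q_e = qq_points(x, p_grid)
```

Plotting `p_e` against `p_t` gives the P-P plot and `q_e` against `q_t` the Q-Q plot; for Gaussian data both lie near the 45° line.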
Q-Q plots
Illustrative graphic (toy data set)
Empirical Qs near theoretical Qs
when Q-Q curve is near 45° line
(general use of Q-Q plots)
Alternate Terminology
Q-Q plots = ROC curves
P-P plots = "Precision Recall" curves
Highlights different distributional aspects
Statistical folklore: Q-Q highlights tails,
so usually more useful
Q-Q plots
Gaussian? Departures from line?
• Looks much like?
• Wiggles all random variation?
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation
Q-Q plots
Need to understand sampling variation
• Approach: Q-Q envelope plot
– Simulate from theoretical distribution
– Samples of same size
– About 100 samples gives "good visual impression"
– Overlay resulting 100 Q-Q curves
– To visually convey natural sampling variation
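The envelope recipe above can be sketched as follows. This assumes a standard normal theoretical distribution and uses heavy-tailed toy data (a t distribution with 3 degrees of freedom) as a stand-in for real data; `qq_envelope` and all parameter values are illustrative:

```python
import numpy as np
from statistics import NormalDist

def qq_envelope(n, p_grid, n_sim=100, seed=0):
    """Empirical quantile curves for n_sim standard normal samples of size n."""
    rng = np.random.default_rng(seed)
    sims = rng.normal(size=(n_sim, n))
    # np.quantile returns shape (len(p_grid), n_sim); transpose to one row per sample.
    return np.quantile(sims, p_grid, axis=1).T

p_grid = np.linspace(0.01, 0.99, 99)
q_theory = np.array([NormalDist().inv_cdf(p) for p in p_grid])

curves = qq_envelope(n=1000, p_grid=p_grid)          # 100 simulated Q-Q curves
lower, upper = curves.min(axis=0), curves.max(axis=0)

data = np.random.default_rng(1).standard_t(df=3, size=1000)  # toy non-Gaussian data
q_data = np.quantile(data, p_grid)
outside = (q_data < lower) | (q_data > upper)        # where data leave the envelope
```

Overlaying the rows of `curves` on the Q-Q plot shows how much wiggle is pure sampling variation; quantiles flagged in `outside` are departures that the envelope does not explain.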
Q-Q plots
Gaussian? Departures from line?
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points…
(why bigger sample size was used)
• Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533, d = 9456
• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approximation is good there
• And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
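A sketch of the MAD scale estimate mentioned above, assuming the common convention of rescaling the median absolute deviation by the standard normal 0.75 quantile (about 0.6745) so that it estimates a Gaussian standard deviation. The toy matrix, with a small fraction of strong "signal" entries, is hypothetical:

```python
import numpy as np
from statistics import NormalDist

def mad_sigma(values):
    """Robust noise standard deviation from all matrix entries, via rescaled MAD."""
    values = np.asarray(values, dtype=float).ravel()
    mad = np.median(np.abs(values - np.median(values)))
    return mad / NormalDist().inv_cdf(0.75)   # MAD of N(0, 1) is about 0.6745

rng = np.random.default_rng(3)
noise = rng.normal(0.0, 2.0, size=(300, 200))            # background noise, sigma_N = 2
signal = np.zeros_like(noise)
signal[:, :10] = rng.normal(0.0, 20.0, size=(300, 10))   # 5% of entries carry signal
data = noise + signal

sigma_hat = mad_sigma(data)    # robust: stays close to 2
sd_hat = data.std()            # ordinary sd: inflated by the signal entries
```

The robust estimate is driven by the middle of the distribution, which is exactly where the Q-Q diagnostic suggests the Gaussian approximation is good.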
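SigClust Gaussian null distribution
Estimation of biological covariance $\Sigma_B$:
Keep only "large" eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_d$
Defined as $\lambda_j > \sigma_N^2$
So for null distribution use eigenvalues:
$\max(\lambda_1, \sigma_N^2), \dots, \max(\lambda_d, \sigma_N^2)$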
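The thresholding rule above, as a short sketch. It assumes the $\lambda_j$ are eigenvalues of the sample covariance matrix, computed here from singular values of the centered data; the name `null_eigenvalues` and the toy dimensions are illustrative:

```python
import numpy as np

def null_eigenvalues(X, sigma_n2):
    """Sample covariance eigenvalues, thresholded below at the noise variance."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    # Singular values give the nonzero covariance eigenvalues (cheap when d >> n).
    s = np.linalg.svd(Xc, compute_uv=False)
    lam = np.zeros(d)
    lam[: s.size] = s ** 2 / (n - 1)
    return np.maximum(lam, sigma_n2)          # max(lambda_j, sigma_N^2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))                 # HDLSS toy data: n = 20, d = 50
X[:, :3] *= 10.0                              # three strong "biology" directions
lam_null = null_eigenvalues(X, sigma_n2=1.0)
```

With n = 20 cases the sample covariance has rank at most 19, so without thresholding most of the 50 eigenvalues would be exactly zero.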
SigClust Estimation of Eigenvalues
All eigenvalues > $\sigma_N^2$!
• Suggests biology is very strong here!
• I.e. very strong signal to noise ratio
• Have more structure than can analyze (with only 533 data points)
• Data are very far from pure noise
• So don't actually use Factor Analysis Model
• Instead end up with estimated eigenvalues
SigClust Estimation of Eigenvalues
Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes
• First ~110 eigenvalues > $\sigma_N^2$
• Rest are negligible
• So threshold smaller ones at $\sigma_N^2$
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
$X_i = (X_{i,1}, \dots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep)
Again rotation invariance makes this work
(and location invariance)
SigClust Gaussian null distribution - Simulation
Then compare data CI
with simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
An example (details to follow):
P-val = 0.0045
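A sketch of the null simulation described above: draw independent $N(0, \lambda_j)$ coordinates, run 2-means on each simulated data set, and record its CI. The 2-means step here is a plain Lloyd's algorithm with a few random starts, an assumption made for illustration rather than the implementation behind the slides; all names and parameter values are hypothetical:

```python
import numpy as np

def two_means_ci(X, rng, n_starts=5, n_iter=50):
    """Smallest cluster index found by 2-means (Lloyd's algorithm, random starts)."""
    total = np.sum((X - X.mean(axis=0)) ** 2)
    best = 1.0
    for _start in range(n_starts):
        centers = X[rng.choice(X.shape[0], size=2, replace=False)]
        for _step in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            if labels.min() == labels.max():      # a cluster emptied: give up this start
                break
            new_centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        if labels.min() != labels.max():
            within = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                         for k in (0, 1))
            best = min(best, within / total)
    return best

def simulate_null_cis(eigenvalues, n, n_sim=100, seed=0):
    """CIs of 2-means applied to samples with X_ij ~ N(0, lambda_j), independent."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(np.asarray(eigenvalues, dtype=float))
    cis = np.empty(n_sim)
    for b in range(n_sim):
        X = rng.normal(size=(n, sd.size)) * sd
        cis[b] = two_means_ci(X, rng)
    return cis

lam = np.array([9.0, 4.0] + [1.0] * 18)       # toy null eigenvalues, d = 20
null_cis = simulate_null_cis(lam, n=60, n_sim=50)
```

The CI of the real data is then compared with `null_cis`; a data CI that is small relative to the simulated population is evidence of real clustering.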
SigClust Modalities
Two major applications:
I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)
II. Test if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal
Perou 500 PCA View – real clusters?
Perou 500 DWD Directions View – real clusters?
Perou 500 – Fundamental Question
Are Luminal A & Luminal B really distinct clusters?
Famous for far different survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
Get p-values from:
• Empirical quantile, from simulated sample CIs
• Fit Gaussian quantile
– Don't "believe these"
– But useful for comparison
– Especially when empirical quantile = 0
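The two kinds of p-value can be sketched as below, assuming the empirical version is the proportion of simulated null CIs at or below the data CI, and the Gaussian-fit version is the lower tail probability of a normal distribution matched to the mean and standard deviation of the simulated CIs. The simulated CIs here are a toy stand-in, not real results:

```python
import numpy as np
from statistics import NormalDist

def sigclust_pvalues(data_ci, null_cis):
    """Empirical-quantile and Gaussian-fit p-values (small CI is significant)."""
    null_cis = np.asarray(null_cis, dtype=float)
    p_emp = float(np.mean(null_cis <= data_ci))
    fit = NormalDist(mu=null_cis.mean(), sigma=null_cis.std(ddof=1))
    p_gauss = fit.cdf(data_ci)
    return p_emp, p_gauss

rng = np.random.default_rng(4)
null_cis = rng.normal(0.80, 0.02, size=1000)   # toy stand-in for simulated null CIs
p_emp, p_gauss = sigclust_pvalues(0.68, null_cis)
```

When the data CI lies below every simulated CI the empirical p-value is exactly 0, which is where the Gaussian fit becomes useful for comparing strengths of evidence.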
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
– Definitely 2 clusters
• Gaussian fit p-val = 0.0045
– same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
– definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
– Stronger evidence than above
– Such comparison is value of Gaussian fit
– Makes sense (since CI is min possible)
• Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results
Lum & Norm vs. Her 2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with other data sets: similar
• Smaller data sets: less power
• Gene filtering: more power
• Lung cancer: more distinct clusters
SigClust Real Data Results
Some personal observations:
• Experienced analysts impressively good
• SigClust can save them time
• SigClust can help them with skeptics
• SigClust essential for non-experts
SigClust Overview
• Works well when factor part not used
• Sample eigenvalues always valid, but can be too conservative
• Above factor threshold: anti-conservative
• Problem fixed by soft thresholding (Huang et al 2014)
SigClust Open Problems
• Improved eigenvalue estimation
• More attention to local minima in 2-means clustering
• Theoretical null distributions
• Inference for k > 2 means clustering
• Multiple comparison issues
HDLSS Asymptotics
Modern Mathematical Statistics: based on asymptotic analysis
I.e. uses limiting operations, almost always $\lim_{n \to \infty}$
Workhorse method for much insight:
• Laws of Large Numbers (consistency)
• Central Limit Theorems (quantify errors)
Occasional misconceptions:
• Indicates behavior for large samples
• Thus only makes sense for "large" samples
• Models phenomenon of "increasing data"
• So other flavors are useless?
HDLSS Asymptotics
Modern Mathematical Statistics: based on asymptotic analysis
Real reasons:
• Approximation provides insights
• Can find simple underlying structure in complex situations
Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, and limits as a quantity tends to 0
Even desirable! (find additional insights)
HDLSS Asymptotics
Personal observations: HDLSS world is…
• Surprising (many times)
[Think I've got it, and then …]
• Mathematically beautiful (?)
• Practically relevant
HDLSS Asymptotics: Simple Paradoxes
Study ideas from: Hall, Marron and Neeman (2005)
For $d$-dimensional standard normal distribution:
$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
HDLSS Asymptotics: Simple Paradoxes
For $d$-dimensional standard normal distribution: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are data?
Near peak of density?
(Thanks to psycnet.apa.org)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dimensional standard normal distribution: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Euclidean distance to origin (as $d \to \infty$)
(measure how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
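The distance-to-origin result can be illustrated by simulation. A small sketch (sample counts and dimensions are arbitrary choices): the mean norm tracks $\sqrt{d}$ while the spread of the norms stays bounded as $d$ grows.

```python
import numpy as np

rng = np.random.default_rng(5)
results = {}
for d in (10, 100, 1000, 4000):
    Z = rng.normal(size=(400, d))            # 400 draws from N_d(0, I_d)
    norms = np.linalg.norm(Z, axis=1)        # Euclidean distance to the origin
    # Store (mean norm minus sqrt(d), standard deviation of the norms).
    results[d] = (norms.mean() - np.sqrt(d), norms.std())
```

In every dimension the standard deviation stays near 0.7, so relative to the radius $\sqrt{d}$ the data concentrate on an increasingly thin shell.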
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density?
- Paradox resolved by: density w.r.t. Lebesgue measure
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue measure
• Consider volume of unit sphere in $\mathbb{R}^d$
• Find as integral in spherical coordinates
• Look at integrand w.r.t. the radius $r$ (proportional to $r^{d-1}$)
• Can show it puts ~ all weight near the boundary $r = 1$
Lebesgue measure pushes mass out
Density pulls data in
$\sqrt{d}$ is the balance point
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
Important philosophical consequence: "Average People"
Parents' lament: Why can't I have average children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dimensional standard normal distribution:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean distance between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
HDLSS Asymptotics: Simple Paradoxes
Distance tends to non-random constant:
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go?
(we can only perceive 3 dimensions)
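The pairwise-distance claim can be checked the same way. A sketch with arbitrary choices of $n$ and $d$: every pairwise distance among $n$ independent standard normal vectors is close to $\sqrt{2d}$.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 5000, 20
Z = rng.normal(size=(n, d))                   # Z_1, ..., Z_n ~ N_d(0, I_d)

# Squared pairwise distances from the Gram matrix: |a|^2 + |b|^2 - 2 a.b
G = Z @ Z.T
sq_norms = np.diag(G)
D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * G

iu = np.triu_indices(n, k=1)                  # each unordered pair once
dists = np.sqrt(D2[iu])
ratio = dists / np.sqrt(2 * d)                # should be close to 1 for every pair
```

All 190 ratios sit within a few percent of 1, so the points are nearly equidistant from one another, a configuration that cannot be drawn faithfully in 3 dimensions.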
HDLSS Asymptotics: Simple Paradoxes
Ever wonder why?
(we can only perceive 3 dimensions)
o Perceptual system from ancestors
o They needed to find food
o Food exists in 3-d world
HDLSS Asymptotics: Simple Paradoxes
For $d$-dimensional standard normal distribution:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dimensional angles (as $d \to \infty$), as vectors from origin:
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal?
- Where do they all go?
(again our perceptual limitations)
- Again 1st order structure is non-random
(Thanks to members.tripod.com)
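A sketch of the angle result (the helper `pairwise_angles_deg` and the chosen dimensions are illustrative): pairwise angles between independent standard normal vectors concentrate at 90°, with spread shrinking roughly like $d^{-1/2}$.

```python
import numpy as np

def pairwise_angles_deg(Z):
    """Angles in degrees between all pairs of rows of Z, as vectors from the origin."""
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    cos = np.clip(U @ U.T, -1.0, 1.0)         # guard against rounding outside [-1, 1]
    iu = np.triu_indices(Z.shape[0], k=1)
    return np.degrees(np.arccos(cos[iu]))

rng = np.random.default_rng(7)
angles = {d: pairwise_angles_deg(rng.normal(size=(20, d))) for d in (10, 1000, 10000)}
spread = {d: a.std() for d, a in angles.items()}
```

Going from d = 10 to d = 10000 the spread of the angles drops from roughly tens of degrees to well under one degree.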
HDLSS Asymptotics: Geometrical Representation
Hall, Marron & Neeman (2005)
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study subspace generated by data:
• Hyperplane through 0, of dimension $n$
• Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
• Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
• All Gaussian data sets are "near Unit Simplex Vertices" (modulo rotation)
• "Randomness" appears only in rotation of simplex
Study hyperplane generated by data:
• $n - 1$ dimensional hyperplane
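The simplex picture can be illustrated through the Gram matrix, which determines a point configuration up to rotation. A sketch under the assumption that "unit simplex vertices" means the coordinate vectors $e_1, \dots, e_n$, whose Gram matrix is the identity; the values of $n$ and $d$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 5, 20000
Z = rng.normal(size=(n, d))                   # Z_1, ..., Z_n ~ N_d(0, I_d)

# Inner products of the rescaled points Z_i / sqrt(d).
gram = Z @ Z.T / d
deviation = np.max(np.abs(gram - np.eye(n)))  # distance from the simplex Gram matrix
```

Since `gram` is close to the identity, the rescaled data are, modulo rotation, close to the vertices of the unit simplex; different random draws change only the rotation.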
Statistical Significance of Clusters
Basis of SigClust Approach
What defines A Single ClusterA Gaussian distribution (Sarle amp Kou 1993)
So define SigClust test based on2-means cluster index (measure) as statisticGaussian null distributionCurrently compute by simulationPossible to do this analytically
SigClust Gaussian null distributrsquon
Which Gaussian (for null)
Standard (sphered) normalNo not realisticRejection not strong evidence for clusteringCould also get that from a-spherical Gaussian
Need Gaussian more like dataNeed Full modelChallenge Parameter EstimationRecall HDLSS Context
SigClust Gaussian null distributrsquon
Estimated Mean (of Gaussian distrsquon)1st Key Idea Can ignore thisBy appealing to shift invariance of CI
When Data are (rigidly) shiftedCI remains the same
So enough to simulate with mean 0Other uses of invariance ideas
SigClust Gaussian null distributrsquon
2nd Key Idea Mod Out RotationsReplace full Cov by diagonal matrixAs done in PCA eigen-analysis
But then ldquonot like datardquoOK since k-means clustering (ie CI) is
rotation invariant
(assuming eg Euclidean Distance)
tMDM
SigClust Gaussian null distributrsquon
2nd Key Idea Mod Out Rotations
Only need to estimate diagonal matrix
But still have HDLSS problems
Eg Perou 500 data
Dimension
Sample Size
Still need to estimate paramrsquos9674d533n
9674d
SigClust Gaussian null distributrsquon
3rd Key Idea Factor Analysis Model
Model Covariance as Biology + Noise
Where
is ldquofairly low dimensionalrdquo
is estimated from background noise2NB
INB 2
SigClust Gaussian null distributrsquon
Estimation of Background Noise 2N
SigClust Gaussian null distributrsquon
Estimation of Background Noise
Reasonable model (for each gene)
Expression = Signal + Noise
ldquonoiserdquo is roughly Gaussian
ldquonoiserdquo terms essentially independent
(across genes)
2N
SigClust Estimation of Background Noise
Hope MostEntries areldquoPure Noise (Gaussian)rdquo
A Few (ltlt frac14)Are BiologicalSignal ndashOutliers
How to Check
Q-Q plots
Background Graphical Goodness of Fit
Basis
Cumulative Distribution Function (CDF)
Probability quantile notation
for probabilityrdquo and quantile
xXPxF
p q
qFp pFq 1
Q-Q plots
Two types of CDF
1 Theoretical
2 Empirical based on data nXX 1
qXPqFp
n
qXiqFp i
ˆˆ
Q-Q plots
Comparison Visualizations
(compare a theoretical with an empirical)
3P-P plot
plot vs
for a grid of values
4Q-Q plot
plot vs
for a grid of values
q
q
p p
p
q
Q-Q plotsIllustrative graphic (toy data set)
Q-Q plotsIllustrative graphic (toy data set)
Q-Q plotsIllustrative graphic (toy data set)
Empirical Qs near Theoretical Qs
when
Q-Q curve is near 450 line
(general use of Q-Q plots)
Alternate TerminologyQ-Q Plots = ROC curves
P-P Plots = ldquoPrecision Recallrdquo Curves
Highlights Different Distributional Aspects
Statistical Folklore Q-Q Highlights Tails
So Usually More Useful
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Looks much like
bull Wiggles all random variation
bull But there are n = 10000 data pointshellip
bull How to assess signal amp noise
bull Need to understand sampling variation
Q-Q plotsNeed to understand sampling variation
bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon
ndash Samples of same size
ndash About 100 samples gives
ldquogood visual impressionrdquo
ndash Overlay resulting 100 QQ-curves
ndash To visually convey natural sampling variation
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Harder to see
bull But clearly there
bull Conclude non-Gaussian
bull Really needed n = 10000 data pointshellip
(why bigger sample size was used)
bull Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533 d = 9456
SigClust Estimation of Background Noise
SigClust Estimation of Background Noise
bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there
(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
SigClust Gaussian null distributrsquon
Estimation of Biological Covariance
Keep only ldquolargerdquo eigenvalues
Defined as
So for null distribution use eigenvalues
B
d 21 2Nj
)max()max( 221 NdN
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze
(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues
2N
SigClust Estimation of Eigenvalrsquos
Do we need the factor model Explore this with another data set
(with fewer genes) This time
n = 315 cases d = 306 genes
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
Try another data set with fewer genesThis time
First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at
2N
2N
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
where (indep)
Again rotation invariance makes this work
(and location invariance)
jij NX 0~
id
i
X
X
1
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
bull Spirit similar to DiProPermbull But now significance happens for
smaller values of CI
An example (details to follow)
P-val = 00045
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
IITest if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal
Perou 500 PCA View ndash real clusters
Perou 500 DWD Dirrsquons View ndash real clusters
Perou 500 ndash Fundamental Question
Are Luminal A amp Luminal B really distinct clusters
Famous forFar Different Survivability
SigClust Results for Luminal A vs Luminal B
P-val = 00045
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0
SigClust Results for Luminal A vs Luminal B
I Test significance of given clusteringsbull Empirical p-val = 0
ndash Definitely 2 clusters
bull Gaussian fit p-val = 00045ndash same strong evidence
bull Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II Test if known cluster can be further split
bull Empirical p-val = 0ndash definitely 2 clusters
bull Gaussian fit p-val = 10-10
ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)
bull Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19
Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10
Split Luminal A p-val = 10-7
Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-
means Clustering
Theoretical Null Distributions
Inference for k gt 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insights
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
0limlimlimlim
dndn
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
Even desirable (find additional insights)
0limlimlimlim
dndn
Personal Observations
HDLSS world ishellip
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. radius $r$ (proportional to $r^{d-1}$)
Can Show: Puts ~ All Weight Near $r = 1$
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
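The balance point can be made explicit with a short worked calculation. This is a sketch of the standard argument, not a formula taken from the slides; $R$ and $f_R$ are notation introduced here for the radius $\|Z\|$ and its density.

```latex
% Radius R = \|Z\| for Z \sim N_d(0, I_d); constants not depending on r are dropped.
f_R(r) \;\propto\;
  \underbrace{r^{\,d-1}}_{\text{volume element: pushes mass out}}
  \times
  \underbrace{e^{-r^2/2}}_{\text{Gaussian density: pulls data in}},
  \qquad r > 0

% Setting the derivative of the log-density to zero locates the mode:
\frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0
  \quad\Longrightarrow\quad
  r = \sqrt{d-1} \approx \sqrt{d}
```

The density factor alone is largest at $r = 0$, but the $r^{d-1}$ factor vanishes there, so the product peaks near $\sqrt{d}$.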
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence?
"Average People"
Parents' Lament:
Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go?
(we can only perceive 3 dim'ns)
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why?
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
(we can only perceive 3 dim'ns)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$):
As Vectors From Origin
(Thanks to members.tripod.com)
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal?
- Where do they all go?
(again our perceptual limitations)
- Again 1st order structure is non-random
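Both the pairwise distance claim and the angle claim can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `gaussian_vector` and `distance_and_angle` and the dimensions tried are illustrative assumptions.

```python
import math
import random

def gaussian_vector(d, rng):
    """One draw of Z ~ N_d(0, I_d) as a plain list."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def distance_and_angle(z1, z2):
    """Euclidean distance between z1 and z2, and their angle (degrees) as vectors from the origin."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))
    dot = sum(a * b for a, b in zip(z1, z2))
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))  # guard against rounding
    return dist, math.degrees(math.acos(cosine))

rng = random.Random(1)
for d in (10, 1000, 100000):
    dist, angle = distance_and_angle(gaussian_vector(d, rng), gaussian_vector(d, rng))
    print(f"d={d:6d}  sqrt(2d)={math.sqrt(2 * d):7.2f}  "
          f"distance={dist:7.2f}  angle={angle:6.2f} deg")
```

As $d$ grows the distance stays within a bounded error of $\sqrt{2d}$ and the angle closes in on 90 degrees.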
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices"
(Modulo Rotation)
"Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
SigClust Gaussian null distribut'n
Which Gaussian (for null)?
Standard (sphered) normal?
No, not realistic
Rejection not strong evidence for clustering
Could also get that from a-spherical Gaussian
Need Gaussian more like data:
Need Full model
Challenge: Parameter Estimation
Recall HDLSS Context
SigClust Gaussian null distribut'n
Estimated Mean (of Gaussian dist'n)?
1st Key Idea: Can ignore this
By appealing to shift invariance of CI
When Data are (rigidly) shifted
CI remains the same
So enough to simulate with mean 0
Other uses of invariance ideas?
SigClust Gaussian null distribut'n
2nd Key Idea: Mod Out Rotations
Replace full Cov. by diagonal matrix
As done in PCA eigen-analysis: $\Sigma = M D M^t$
But then "not like data"?
OK, since k-means clustering (i.e. CI) is rotation invariant
(assuming e.g. Euclidean Distance)
SigClust Gaussian null distribut'n
2nd Key Idea: Mod Out Rotations
Only need to estimate diagonal matrix
But still have HDLSS problems?
E.g. Perou 500 data:
Dimension $d = 9674$
Sample Size $n = 533$
Still need to estimate $d = 9674$ param's
SigClust Gaussian null distribut'n
3rd Key Idea: Factor Analysis Model
Model Covariance as: Biology + Noise
$\Sigma = \Sigma_B + \sigma_N^2 I$
Where
$\Sigma_B$ is "fairly low dimensional"
$\sigma_N^2$ is estimated from background noise
SigClust Gaussian null distribut'n
Estimation of Background Noise $\sigma_N^2$:
Reasonable model (for each gene):
Expression = Signal + Noise
"noise" is roughly Gaussian
"noise" terms essentially independent
(across genes)
SigClust Estimation of Background Noise
Hope: Most Entries are "Pure Noise (Gaussian)"
A Few (<< 1/4) Are Biological Signal – Outliers
How to Check?
Q-Q plots
Background: Graphical Goodness of Fit
Basis:
Cumulative Distribution Function (CDF)
$F(x) = P\{X \le x\}$
Probability quantile notation:
for "probability" $p$ and "quantile" $q$:
$p = F(q)$, $q = F^{-1}(p)$
Q-Q plots
Two types of CDF:
1. Theoretical: $p = F(q) = P\{X \le q\}$
2. Empirical, based on data $X_1, \dots, X_n$: $\hat{p} = \hat{F}(q) = \#\{i : X_i \le q\} / n$
Q-Q plots
Comparison Visualizations:
(compare a theoretical with an empirical)
3. P-P plot: plot $\hat{p}$ vs. $p$ for a grid of $q$ values
4. Q-Q plot: plot $\hat{q}$ vs. $q$ for a grid of $p$ values
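The P-P and Q-Q coordinates defined above can be computed directly. The following is a minimal sketch against a standard normal theoretical distribution, using only the Python standard library; the function names and the plotting positions $(i + 0.5)/n$ for the $p$ grid are assumptions made here, since the slides do not specify them.

```python
from statistics import NormalDist

def empirical_cdf(data, q):
    """p-hat = #{i : X_i <= q} / n."""
    return sum(1 for x in data if x <= q) / len(data)

def pp_points(data, dist, q_grid):
    """(p, p-hat) pairs for a grid of q values."""
    return [(dist.cdf(q), empirical_cdf(data, q)) for q in q_grid]

def qq_points(data, dist):
    """(q, q-hat) pairs: theoretical quantile vs. order statistic, on the grid p_i = (i + 0.5) / n."""
    ordered = sorted(data)
    n = len(ordered)
    return [(dist.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]

data = [-1.3, -0.4, 0.1, 0.6, 1.9]
std_normal = NormalDist()
for q, q_hat in qq_points(data, std_normal):
    print(f"theoretical q={q:6.3f}   empirical q-hat={q_hat:6.3f}")
for p, p_hat in pp_points(data, std_normal, q_grid=[-1.0, 0.0, 1.0]):
    print(f"theoretical p={p:5.3f}   empirical p-hat={p_hat:5.3f}")
```

Points near the 45° line in either list indicate agreement between the empirical and theoretical distributions.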
Q-Q plots
Illustrative graphic (toy data set)
Empirical Qs near Theoretical Qs
when
Q-Q curve is near 45° line
(general use of Q-Q plots)
Alternate Terminology:
Q-Q Plots = ROC curves
P-P Plots = "Precision Recall" Curves
Highlights Different Distributional Aspects
Statistical Folklore: Q-Q Highlights Tails
So Usually More Useful
Q-Q plots
Gaussian? departures from line?
• Looks much like?
• Wiggles all random variation?
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation
Q-Q plots
Need to understand sampling variation
• Approach: Q-Q envelope plot
– Simulate from Theoretical Dist'n
– Samples of same size
– About 100 samples gives "good visual impression"
– Overlay resulting 100 QQ-curves
– To visually convey natural sampling variation
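The envelope steps above can be sketched without any plotting library by tracking the band that the overlaid curves would trace. This is a minimal illustration using only the Python standard library; summarizing the 100 curves by their pointwise minimum and maximum, and the helper names, are assumptions made here rather than the slides' procedure, which overlays the curves visually.

```python
import random

def qq_envelope(n, n_sim=100, seed=0):
    """Pointwise min/max of n_sim sorted N(0,1) samples of size n (band of overlaid Q-Q curves)."""
    rng = random.Random(seed)
    curves = [sorted(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(n_sim)]
    lower = [min(curve[i] for curve in curves) for i in range(n)]
    upper = [max(curve[i] for curve in curves) for i in range(n)]
    return lower, upper

def fraction_outside(data, lower, upper):
    """Share of the data's order statistics that fall outside the envelope."""
    ordered = sorted(data)
    outside = sum(1 for x, lo, hi in zip(ordered, lower, upper) if x < lo or x > hi)
    return outside / len(ordered)

n = 500
lower, upper = qq_envelope(n)
rng = random.Random(42)
gaussian_data = [rng.gauss(0.0, 1.0) for _ in range(n)]
skewed_data = [rng.expovariate(1.0) - 1.0 for _ in range(n)]  # mean 0, sd 1, but not Gaussian
print("Gaussian sample, fraction outside envelope:", fraction_outside(gaussian_data, lower, upper))
print("Skewed sample, fraction outside envelope:  ", fraction_outside(skewed_data, lower, upper))
```

Data whose Q-Q curve leaves the band over long stretches shows departures beyond natural sampling variation.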
Q-Q plots
Gaussian? departures from line?
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points…
(why bigger sample size was used)
• Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533, d = 9456
SigClust Estimation of Background Noise
• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
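A MAD-based noise estimate can be sketched as follows. The slides only name the "MAD scale estimate"; dividing the median absolute deviation by the standard normal 0.75 quantile (about 0.6745), so that it estimates a Gaussian standard deviation, is the usual convention and is assumed here. The function name and the toy data are illustrative.

```python
import random
from statistics import NormalDist, median, stdev

def mad_noise_sd(values):
    """Robust sd estimate: median absolute deviation, rescaled to be consistent for Gaussian data."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return mad / NormalDist().inv_cdf(0.75)

rng = random.Random(0)
noise = [rng.gauss(0.0, 0.5) for _ in range(5000)]   # "pure noise" entries, true sd 0.5
signal = [rng.gauss(4.0, 1.0) for _ in range(250)]   # a few large "biological signal" entries
entries = noise + signal
print("MAD-based sd estimate:", round(mad_noise_sd(entries), 3))
print("Ordinary sample sd:   ", round(stdev(entries), 3))
```

The MAD-based value stays close to the noise level because the few signal entries act as outliers, whereas the ordinary sample sd is inflated by them.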
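SigClust Gaussian null distribut'n
Estimation of Biological Covariance $\Sigma_B$:
Keep only "large" eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$
Defined as $\lambda_j > \sigma_N^2$
So for null distribution, use eigenvalues:
$\max(\lambda_1, \sigma_N^2), \dots, \max(\lambda_d, \sigma_N^2)$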
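The thresholding rule is a one-line computation. The following is a small worked example with made-up eigenvalues and noise variance; the function name `null_eigenvalues` is hypothetical.

```python
def null_eigenvalues(sample_eigenvalues, noise_variance):
    """Raise every eigenvalue below the noise variance up to it: max(lambda_j, sigma_N^2)."""
    return [max(lam, noise_variance) for lam in sample_eigenvalues]

sample_eigenvalues = [9.0, 4.0, 1.5, 0.4, 0.1, 0.0]  # illustrative values only
noise_variance = 1.0                                  # illustrative sigma_N^2
thresholded = null_eigenvalues(sample_eigenvalues, noise_variance)
n_large = sum(lam > noise_variance for lam in sample_eigenvalues)
print("eigenvalues used for null:", thresholded)
print("number of 'large' eigenvalues:", n_large)
```

In HDLSS settings with $d > n$ the trailing sample eigenvalues are exactly 0, so this step is what keeps the null Gaussian from being degenerate in those directions.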
SigClust Estimation of Eigenval's
All eigenvalues > $\sigma_N^2$
Suggests biology is very strong here
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues
SigClust Estimation of Eigenval's
Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes
SigClust Estimation of Eigenval's
Try another data set, with fewer genes
This time:
First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using:
$X_i = (X_{i,1}, \dots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)
Again rotation invariance makes this work
(and location invariance)
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
An example (details to follow)
P-val = 0.0045
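The simulation step can be sketched end to end on toy data. This is a rough illustration using only the Python standard library, not the SigClust implementation: the 2-means search is a plain Lloyd iteration with random restarts, all function names are hypothetical, and the toy data are built with a nearly diagonal covariance so that coordinate variances can stand in for eigenvalues (real data would need an eigen-decomposition and the noise thresholding step).

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(points):
    return [sum(column) / len(points) for column in zip(*points)]

def two_means_ci(points, rng, n_starts=5, n_iter=50):
    """2-means cluster index: within-cluster SS / total SS (smaller means more clustered)."""
    overall_mean = mean_vector(points)
    total_ss = sum(sq_dist(p, overall_mean) for p in points)
    best_within = total_ss
    for _ in range(n_starts):
        centers = rng.sample(points, 2)           # random restart
        for _ in range(n_iter):
            groups = ([], [])
            for p in points:
                nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                groups[nearer].append(p)
            if not groups[0] or not groups[1]:
                break                             # degenerate start, keep current centers
            new_centers = [mean_vector(g) for g in groups]
            if new_centers == centers:
                break                             # converged
            centers = new_centers
        within = sum(min(sq_dist(p, c) for c in centers) for p in points)
        best_within = min(best_within, within)
    return best_within / total_ss

def simulate_null_cis(eigenvalues, n, n_sim, rng):
    """CIs of n_sim mean-0 Gaussian samples with independent coordinates X_ij ~ N(0, lambda_j)."""
    sds = [lam ** 0.5 for lam in eigenvalues]
    return [
        two_means_ci([[rng.gauss(0.0, sd) for sd in sds] for _ in range(n)], rng)
        for _ in range(n_sim)
    ]

def coordinate_variances(points):
    center = mean_vector(points)
    return [sum((p[j] - center[j]) ** 2 for p in points) / (len(points) - 1)
            for j in range(len(center))]

rng = random.Random(0)
# Toy data: two groups separated along coordinate 0, the other 4 coordinates pure noise.
data = [[rng.gauss(shift, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(4)]
        for shift in [-4.0] * 20 + [4.0] * 20]
# Assumption: covariance is nearly diagonal here, so variances stand in for eigenvalues.
eigenvalues = coordinate_variances(data)
data_ci = two_means_ci(data, rng)
null_cis = simulate_null_cis(eigenvalues, n=len(data), n_sim=50, rng=rng)
p_empirical = sum(ci <= data_ci for ci in null_cis) / len(null_cis)
print("data CI:", round(data_ci, 3))
print("null CI range:", round(min(null_cis), 3), "to", round(max(null_cis), 3))
print("fraction of null CIs <= data CI:", p_empirical)
```

Because the null Gaussian is matched to the data's spread, a data CI well below the simulated CIs is evidence of clustering rather than of mere elongation.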
SigClust Modalities
Two major applications:
I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)
II. Test if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore: 5 classes
Luminal A, Luminal B, Normal, Her 2, Basal
Perou 500 PCA View – real clusters?
Perou 500 DWD Dir'ns View – real clusters?
Perou 500 – Fundamental Question
Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
SigClust Results for Luminal A vs. Luminal B
Get p-values from:
Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile
Don't "believe these"
But useful for comparison
Especially when Empirical Quantile = 0
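The two kinds of p-value can be computed from a list of simulated CIs as follows. This is a minimal sketch using only the Python standard library; the function name and the numbers in the example are made up, and the Gaussian fit simply uses the sample mean and sd of the simulated CIs. The lower tail is evaluated with `math.erfc` so that very small values (such as the $10^{-10}$ scale) do not round to 0.

```python
import math
import statistics

def sigclust_p_values(data_ci, null_cis):
    """Return (empirical p-value, Gaussian-fit p-value) for a data CI against simulated null CIs."""
    # Empirical quantile: share of simulated CIs at or below the data CI.
    empirical = sum(ci <= data_ci for ci in null_cis) / len(null_cis)
    # Gaussian fit: lower-tail probability under N(mean, sd) fitted to the simulated CIs.
    z = (data_ci - statistics.mean(null_cis)) / statistics.stdev(null_cis)
    gaussian_fit = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return empirical, gaussian_fit

null_cis = [0.50, 0.52, 0.48, 0.51, 0.49, 0.53, 0.47]  # illustrative simulated CIs
for data_ci in (0.49, 0.30):
    empirical, gaussian_fit = sigclust_p_values(data_ci, null_cis)
    print(f"data CI={data_ci:.2f}  empirical p={empirical:.3f}  Gaussian-fit p={gaussian_fit:.3g}")
```

When the data CI lies below every simulated CI the empirical p-value is exactly 0, and the Gaussian-fit value is what still allows the strength of evidence to be compared across tests.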
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
– Definitely 2 clusters
• Gaussian fit p-val = 0.0045
– same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
– definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
– Stronger evidence than above
– Such comparison is value of Gaussian fit
– Makes sense (since CI is min possible)
• Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal, p-val = $10^{-19}$
Luminal A vs. B, p-val = 0.0045
Her 2 vs. Basal, p-val = $10^{-10}$
Split Luminal A, p-val = $10^{-7}$
Split Luminal B, p-val = 0.058
Split Her 2, p-val = 0.10
Split Basal, p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al, 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations
Thus various flavors are fine:
$\lim_{d \to \infty}$, $\lim_{n \to \infty}$, $\lim_{d, n \to \infty}$, $\lim_{\sigma \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations:
HDLSS world is…
Surprising (many times)
[Think I've got it, and then …]
Mathematically Beautiful (?)
Practically Relevant
SigClust Gaussian null distributrsquon
Estimated Mean (of Gaussian distrsquon)1st Key Idea Can ignore thisBy appealing to shift invariance of CI
When Data are (rigidly) shiftedCI remains the same
So enough to simulate with mean 0Other uses of invariance ideas
SigClust Gaussian null distributrsquon
2nd Key Idea Mod Out RotationsReplace full Cov by diagonal matrixAs done in PCA eigen-analysis
But then ldquonot like datardquoOK since k-means clustering (ie CI) is
rotation invariant
(assuming eg Euclidean Distance)
tMDM
SigClust Gaussian null distributrsquon
2nd Key Idea Mod Out Rotations
Only need to estimate diagonal matrix
But still have HDLSS problems
Eg Perou 500 data
Dimension
Sample Size
Still need to estimate paramrsquos9674d533n
9674d
SigClust Gaussian null distributrsquon
3rd Key Idea Factor Analysis Model
Model Covariance as Biology + Noise
Where
is ldquofairly low dimensionalrdquo
is estimated from background noise2NB
INB 2
SigClust Gaussian null distributrsquon
Estimation of Background Noise 2N
SigClust Gaussian null distributrsquon
Estimation of Background Noise
Reasonable model (for each gene)
Expression = Signal + Noise
ldquonoiserdquo is roughly Gaussian
ldquonoiserdquo terms essentially independent
(across genes)
2N
SigClust Estimation of Background Noise
Hope MostEntries areldquoPure Noise (Gaussian)rdquo
A Few (ltlt frac14)Are BiologicalSignal ndashOutliers
How to Check
Q-Q plots
Background Graphical Goodness of Fit
Basis
Cumulative Distribution Function (CDF)
Probability quantile notation
for probabilityrdquo and quantile
xXPxF
p q
qFp pFq 1
Q-Q plots
Two types of CDF
1 Theoretical
2 Empirical based on data nXX 1
qXPqFp
n
qXiqFp i
ˆˆ
Q-Q plots
Comparison Visualizations
(compare a theoretical with an empirical)
3P-P plot
plot vs
for a grid of values
4Q-Q plot
plot vs
for a grid of values
q
q
p p
p
q
Q-Q plotsIllustrative graphic (toy data set)
Q-Q plotsIllustrative graphic (toy data set)
Q-Q plotsIllustrative graphic (toy data set)
Empirical Qs near Theoretical Qs
when
Q-Q curve is near 450 line
(general use of Q-Q plots)
Alternate TerminologyQ-Q Plots = ROC curves
P-P Plots = ldquoPrecision Recallrdquo Curves
Highlights Different Distributional Aspects
Statistical Folklore Q-Q Highlights Tails
So Usually More Useful
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Looks much like
bull Wiggles all random variation
bull But there are n = 10000 data pointshellip
bull How to assess signal amp noise
bull Need to understand sampling variation
Q-Q plotsNeed to understand sampling variation
bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon
ndash Samples of same size
ndash About 100 samples gives
ldquogood visual impressionrdquo
ndash Overlay resulting 100 QQ-curves
ndash To visually convey natural sampling variation
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Harder to see
bull But clearly there
bull Conclude non-Gaussian
bull Really needed n = 10000 data pointshellip
(why bigger sample size was used)
bull Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533 d = 9456
SigClust Estimation of Background Noise
SigClust Estimation of Background Noise
bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there
(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
SigClust Gaussian null distributrsquon
Estimation of Biological Covariance
Keep only ldquolargerdquo eigenvalues
Defined as
So for null distribution use eigenvalues
B
d 21 2Nj
)max()max( 221 NdN
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze
(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues
2N
SigClust Estimation of Eigenvalrsquos
Do we need the factor model Explore this with another data set
(with fewer genes) This time
n = 315 cases d = 306 genes
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
Try another data set with fewer genesThis time
First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at
2N
2N
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
where (indep)
Again rotation invariance makes this work
(and location invariance)
jij NX 0~
id
i
X
X
1
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
bull Spirit similar to DiProPermbull But now significance happens for
smaller values of CI
An example (details to follow)
P-val = 00045
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
IITest if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal
Perou 500 PCA View ndash real clusters
Perou 500 DWD Dirrsquons View ndash real clusters
Perou 500 ndash Fundamental Question
Are Luminal A amp Luminal B really distinct clusters
Famous forFar Different Survivability
SigClust Results for Luminal A vs Luminal B
P-val = 00045
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0
SigClust Results for Luminal A vs Luminal B
I Test significance of given clusteringsbull Empirical p-val = 0
ndash Definitely 2 clusters
bull Gaussian fit p-val = 00045ndash same strong evidence
bull Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II Test if known cluster can be further split
bull Empirical p-val = 0ndash definitely 2 clusters
bull Gaussian fit p-val = 10-10
ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)
bull Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19
Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10
Split Luminal A p-val = 10-7
Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-
means Clustering
Theoretical Null Distributions
Inference for k gt 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: based on asymptotic analysis
- I.e. uses limiting operations
- Almost always $\lim_{n \to \infty}$
- Workhorse method for much insight:
  - Laws of Large Numbers (consistency)
  - Central Limit Theorems (quantify errors)
- Occasional misconceptions:
  - Indicates behavior for large samples
  - Thus only makes sense for "large" samples
  - Models phenomenon of "increasing data"
  - So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics: based on asymptotic analysis
Real reasons:
- Approximation provides insights
- Can find simple underlying structure in complex situations
Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, and a limit in which a parameter tends to 0
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is...
- Surprising (many times) [Think I've got it, and then ...]
- Mathematically Beautiful
- Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study ideas from Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are the data? Near the peak of the density? (Image thanks to psycnet.apa.org)
Euclidean distance to origin (as $d \to \infty$), which measures how close to the peak:
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
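The concentration of $\|Z\|$ around $\sqrt{d}$ can be checked numerically. The following is a minimal simulation sketch, not part of the original slides; the function names, the number of draws and the seed are illustrative assumptions.

```python
import math
import random

def norm_of_standard_normal(d, rng):
    """Euclidean norm of one simulated draw Z ~ N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

def norm_summary(d, n_draws=200, seed=0):
    """Sample mean and standard deviation of ||Z|| over n_draws draws."""
    rng = random.Random(seed)
    norms = [norm_of_standard_normal(d, rng) for _ in range(n_draws)]
    mean = sum(norms) / n_draws
    sd = math.sqrt(sum((x - mean) ** 2 for x in norms) / (n_draws - 1))
    return mean, sd

if __name__ == "__main__":
    for d in (10, 100, 1000):
        mean, sd = norm_summary(d)
        print(f"d={d:5d}  sqrt(d)={math.sqrt(d):7.3f}  "
              f"mean ||Z||={mean:7.3f}  sd ||Z||={sd:5.3f}")
```

As $d$ grows, the mean should track $\sqrt{d}$ while the spread stays of constant order, which is what the $O_p(1)$ remainder expresses.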
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue measure
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue measure
- Consider volume of unit sphere in $\mathbb{R}^d$
- Find as integral in spherical coordinates
- Look at integrand w.r.t. the radius
- Can show it puts ~ all weight near the surface
- Lebesgue measure pushes mass out, density pulls data in; $\sqrt{d}$ is the balance point
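The balance point can be made explicit with a short worked calculation. This is a sketch added for illustration, using the standard density of the chi distribution with $d$ degrees of freedom rather than anything stated on the slides.

```latex
% Radial density of R = \|Z\| for Z \sim N_d(0, I_d)
f_R(r) = \frac{2^{1 - d/2}}{\Gamma(d/2)} \, r^{d-1} \, e^{-r^2/2}, \qquad r > 0
% r^{d-1}: Lebesgue volume factor (grows with r)
% e^{-r^2/2}: Gaussian density factor (decays with r)
\frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0
\quad \Longrightarrow \quad r = \sqrt{d-1} \approx \sqrt{d}
```

The mode of the radial density sits where the growth of the volume factor and the decay of the Gaussian factor cancel, which is at radius close to $\sqrt{d}$.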
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: "Average People"
Parents' Lament: Why can't I have average children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean distance between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
- Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
- Can extend to $Z_1, \dots, Z_n$
- Where do they all go? (we can only perceive 3 dim'ns)
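The claim that all pairwise distances are close to $\sqrt{2d}$ is easy to probe by simulation. The sketch below is illustrative only; the helper names and the choices of $n$, $d$ and seed are assumptions, not taken from the slides.

```python
import math
import random

def gaussian_vector(d, rng):
    """One draw from N_d(0, I_d) as a plain list."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise_distances(n, d, seed=0):
    """All n*(n-1)/2 pairwise distances among n independent N_d(0, I_d) draws."""
    rng = random.Random(seed)
    points = [gaussian_vector(d, rng) for _ in range(n)]
    return [euclidean_distance(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)]

if __name__ == "__main__":
    d = 2000
    dists = pairwise_distances(n=5, d=d)
    print(f"sqrt(2d) = {math.sqrt(2 * d):.2f}")
    print("distances:", [round(x, 2) for x in dists])
```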
HDLSS Asymptotics: Simple Paradoxes
Ever wonder why? (we can only perceive 3 dim'ns)
o Perceptual system from ancestors
o They needed to find food
o Food exists in 3-d world
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al angles (as $d \to \infty$), with $Z_1$ and $Z_2$ viewed as vectors from the origin (image thanks to members.tripod.com):
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
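Near-orthogonality of independent Gaussian vectors can be illustrated the same way. This is a small sketch under assumed names and parameter values (not from the slides) that computes the angle between simulated pairs.

```python
import math
import random

def angle_degrees(u, v):
    """Angle between vectors u and v, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

def simulated_angles(d, n_pairs=10, seed=0):
    """Angles between n_pairs independent pairs of N_d(0, I_d) draws."""
    rng = random.Random(seed)
    angles = []
    for _ in range(n_pairs):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_degrees(z1, z2))
    return angles

if __name__ == "__main__":
    for d in (3, 30, 3000):
        print(d, [round(a, 1) for a in simulated_angles(d, n_pairs=5)])
```

In low dimensions the angles scatter widely; as $d$ grows they tighten around 90 degrees at roughly the $d^{-1/2}$ rate stated above.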
HDLSS Asy's: Geometrical Represent'n
Hall, Marron & Neeman (2005)
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study subspace generated by data:
- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", & dist $\sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
- All Gaussian data sets are "near Unit Simplex Vertices" (modulo rotation)
- "Randomness" appears only in rotation of simplex
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study hyperplane generated by data: $(n-1)$-dimensional hyperplane
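One way to see the simplex structure numerically is through the scaled Gram matrix with entries $\langle Z_i, Z_j \rangle / d$. If it is close to the identity $I_n$, then the rescaled points $Z_i / \sqrt{d}$ are nearly orthonormal, i.e. close to a rotation of the unit simplex vertices $e_1, \dots, e_n$. The sketch below is illustrative; names and parameter values are assumptions.

```python
import random

def scaled_gram_matrix(n, d, seed=0):
    """Matrix with entries <Z_i, Z_j> / d for n independent N_d(0, I_d) draws."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[sum(a * b for a, b in zip(points[i], points[j])) / d
             for j in range(n)]
            for i in range(n)]

if __name__ == "__main__":
    # Rows should look like rows of the 4 x 4 identity, up to small noise.
    for row in scaled_gram_matrix(n=4, d=5000):
        print([round(x, 2) for x in row])
```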
SigClust Gaussian null distribut'n
2nd Key Idea: Mod Out Rotations
- Replace full Cov by diagonal matrix, as done in PCA eigen-analysis: $\Sigma = M D M^t$
- But then "not like data"? OK, since k-means clustering (i.e. CI) is rotation invariant (assuming e.g. Euclidean distance)
- Only need to estimate diagonal matrix
- But still have HDLSS problems
- E.g. Perou 500 data: dimension $d = 9674$, sample size $n = 533$
- Still need to estimate $d = 9674$ param's
SigClust Gaussian null distribut'n
3rd Key Idea: Factor Analysis Model
Model covariance as Biology + Noise: $\Sigma = \Sigma_B + \sigma_N^2 I$
Where
- $\Sigma_B$ is "fairly low dimensional"
- $\sigma_N^2$ is estimated from background noise
SigClust Gaussian null distribut'n
Estimation of background noise $\sigma_N^2$
- Reasonable model (for each gene): Expression = Signal + Noise
- "noise" is roughly Gaussian
- "noise" terms essentially independent (across genes)
SigClust Estimation of Background Noise
Hope: most entries are "pure noise (Gaussian)"
A few (<< 1/4) are biological signal - outliers
How to check? Q-Q plots
Q-Q plots
Background: graphical goodness of fit
Basis: Cumulative Distribution Function (CDF) $F(x) = P\{X \le x\}$
Probability quantile notation: $p = F(q)$, $q = F^{-1}(p)$, for "probability" $p$ and quantile $q$
Q-Q plots
Two types of CDF:
1. Theoretical: $p = F(q) = P\{X \le q\}$
2. Empirical, based on data $X_1, \dots, X_n$: $\hat{p} = \hat{F}(q) = \#\{i : X_i \le q\} / n$
Q-Q plots
Comparison visualizations (compare a theoretical with an empirical):
3. P-P plot: plot $\hat{p}$ vs. $p$ for a grid of $q$ values
4. Q-Q plot: plot $\hat{q}$ vs. $q$ for a grid of $p$ values
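The two comparison plots can be computed directly from the definitions above. The following sketch returns the plotted coordinate pairs for a standard normal reference; the function names and the plotting positions $(i + 0.5)/n$ are assumptions (a common convention, not specified in the slides), and ties in the data are ignored.

```python
from statistics import NormalDist

def qq_points(data, dist=NormalDist()):
    """Pairs (theoretical quantile q, empirical quantile q-hat).

    Uses the probability grid p_i = (i + 0.5) / n, i = 0..n-1, so the
    empirical quantile at p_i is simply the i-th order statistic.
    """
    xs = sorted(data)
    n = len(xs)
    return [(dist.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

def pp_points(data, dist=NormalDist()):
    """Pairs (theoretical p, empirical p-hat), evaluated at each data value q.

    p-hat = #{X_i <= q} / n, which equals (rank of q) / n when there are no ties.
    """
    xs = sorted(data)
    n = len(xs)
    return [(dist.cdf(x), (i + 1) / n) for i, x in enumerate(xs)]

if __name__ == "__main__":
    sample = [0.3, -1.2, 2.0, 0.1, -0.4]
    print(qq_points(sample))
    print(pp_points(sample))
```

If the data really come from the reference distribution, both sets of points should lie near the 45 degree line.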
Q-Q plots: Illustrative graphic (toy data set)
Empirical Qs near Theoretical Qs when Q-Q curve is near 45° line (general use of Q-Q plots)
Alternate terminology: Q-Q plots = ROC curves; P-P plots = "Precision Recall" curves
Highlights different distributional aspects
Statistical folklore: Q-Q highlights tails, so usually more useful
Q-Q plots: Gaussian departures from line
- Looks much like
- Wiggles all random variation?
- But there are n = 10000 data points...
- How to assess signal & noise?
- Need to understand sampling variation
Q-Q plots: Need to understand sampling variation
- Approach: Q-Q envelope plot
  - Simulate from theoretical dist'n
  - Samples of same size
  - About 100 samples gives "good visual impression"
  - Overlay resulting 100 Q-Q curves
  - To visually convey natural sampling variation
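The envelope idea above can be sketched without any plotting library by summarizing the 100 simulated Q-Q curves with their pointwise minimum and maximum. This is an illustrative simplification (the slides overlay the curves themselves); the function names, the seed and the pointwise min/max summary are assumptions.

```python
import random
from statistics import NormalDist

def qq_envelope(n, n_sim=100, seed=0, dist=NormalDist()):
    """Pointwise min/max of n_sim simulated Q-Q curves for samples of size n.

    Returns (theoretical quantiles, lower band, upper band), each of length n.
    """
    rng = random.Random(seed)
    sims = [sorted(rng.gauss(dist.mean, dist.stdev) for _ in range(n))
            for _ in range(n_sim)]
    lower = [min(s[i] for s in sims) for i in range(n)]
    upper = [max(s[i] for s in sims) for i in range(n)]
    theoretical = [dist.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, lower, upper

def fraction_outside(data, lower, upper):
    """Share of the data's order statistics that fall outside the envelope."""
    xs = sorted(data)
    outside = sum(1 for x, lo, hi in zip(xs, lower, upper) if x < lo or x > hi)
    return outside / len(xs)

if __name__ == "__main__":
    theo, lo, hi = qq_envelope(n=200)
    rng = random.Random(1)
    gaussian = [rng.gauss(0.0, 1.0) for _ in range(200)]
    stretched = [10.0 * x for x in gaussian]  # wrong scale, so not N(0, 1)
    print("N(0,1) data outside envelope:   ", fraction_outside(gaussian, lo, hi))
    print("stretched data outside envelope:", fraction_outside(stretched, lo, hi))
```

Data whose Q-Q curve leaves the envelope over a substantial range show a departure that is larger than natural sampling variation.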
Q-Q plots: Gaussian departures from line
- Harder to see
- But clearly there
- Conclude non-Gaussian
- Really needed n = 10000 data points... (why bigger sample size was used)
- Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533, d = 9456
- Distribution clearly not Gaussian
- Except near the middle
- Q-Q curve is very linear there (closely follows 45° line)
- Suggests Gaussian approx. is good there
- And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
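A MAD-based scale estimate of the kind mentioned above can be sketched as follows. The slides do not spell out the exact recipe, so pooling the values into one list and rescaling by the Gaussian quantile $\Phi^{-1}(0.75)$ are assumptions (the latter is the usual consistency constant for Gaussian data), and the function name is hypothetical.

```python
from statistics import NormalDist, median

def mad_sigma(values):
    """Robust estimate of the noise standard deviation sigma_N.

    Median absolute deviation from the median, rescaled so that it
    estimates the standard deviation when the bulk of the data is Gaussian.
    """
    values = list(values)
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return mad / NormalDist().inv_cdf(0.75)

if __name__ == "__main__":
    # Idealized "noise": evenly spaced quantiles of N(0, 2^2), plus a few outliers.
    noise = [NormalDist(0.0, 2.0).inv_cdf((i + 0.5) / 1001) for i in range(1001)]
    signal = [50.0] * 20
    print(mad_sigma(noise), mad_sigma(noise + signal))
```

Because it depends only on the middle of the distribution, the estimate is barely moved by a small fraction of large "signal" entries, which matches the hope that most entries are pure noise.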
SigClust Gaussian null distribut'n
Estimation of biological covariance $\Sigma_B$
Keep only "large" eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$
Defined as $\lambda_j > \sigma_N^2$
So for null distribution use eigenvalues $\max(\lambda_1, \sigma_N^2), \dots, \max(\lambda_d, \sigma_N^2)$
SigClust Estimation of Eigenval's
- All eigenvalues > $\sigma_N^2$
- Suggests biology is very strong here, i.e. very strong signal to noise ratio
- Have more structure than can analyze (with only 533 data points)
- Data are very far from pure noise
- So don't actually use Factor Anal. Model
- Instead end up with estim'd eigenvalues
SigClust Estimation of Eigenval's
Do we need the factor model? Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes
- First ~110 eigenvalues > $\sigma_N^2$
- Rest are negligible
- So threshold smaller ones at $\sigma_N^2$
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using $X_i = (X_{i,1}, \dots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)
Again rotation invariance makes this work (and location invariance)
SigClust Gaussian null distribution - Simulation
Then compare data CI with simulated null population CIs
- Spirit similar to DiProPerm
- But now significance happens for smaller values of CI
An example (details to follow): P-val = 0.0045
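The simulation step can be sketched end to end: threshold the eigenvalues at $\sigma_N^2$, draw diagonal-covariance Gaussian samples, compute the 2-means cluster index (within-cluster sum of squares over total sum of squares) for each, and report how often the null CI is as small as the data CI. This is a simplified illustration, not the SigClust software; all function names are hypothetical, and the plain Lloyd-style 2-means with a few random starts is an assumed stand-in for whatever clustering routine is actually used.

```python
import math
import random

def null_eigenvalues(sample_eigenvalues, sigma2_noise):
    """Eigenvalues for the null: max(lambda_j, sigma_N^2)."""
    return [max(lam, sigma2_noise) for lam in sample_eigenvalues]

def total_ss(points):
    """Sum of squared distances of the points to their own mean."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return sum((p[j] - mean[j]) ** 2 for p in points for j in range(d))

def two_means_ci(points, rng, n_starts=5, n_iter=20):
    """2-means cluster index, best (smallest) over several random starts."""
    d = len(points[0])
    tss = total_ss(points)
    best = 1.0
    for _ in range(n_starts):
        centers = rng.sample(points, 2)
        for _ in range(n_iter):
            groups = ([], [])
            for p in points:
                d0 = sum((a - b) ** 2 for a, b in zip(p, centers[0]))
                d1 = sum((a - b) ** 2 for a, b in zip(p, centers[1]))
                groups[0 if d0 <= d1 else 1].append(p)
            if not groups[0] or not groups[1]:
                break  # degenerate split, abandon this start
            centers = [[sum(p[j] for p in g) / len(g) for j in range(d)]
                       for g in groups]
        if groups[0] and groups[1]:
            best = min(best, (total_ss(groups[0]) + total_ss(groups[1])) / tss)
    return best

def simulate_null_cis(eigenvalues, n, n_sim, seed=0):
    """Cluster indices of n_sim null samples with X_ij ~ N(0, lambda_j), indep."""
    rng = random.Random(seed)
    sds = [math.sqrt(lam) for lam in eigenvalues]
    cis = []
    for _ in range(n_sim):
        sample = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        cis.append(two_means_ci(sample, rng))
    return cis

def empirical_p_value(data_ci, null_cis):
    """Fraction of null CIs at least as small as the data CI."""
    return sum(1 for ci in null_cis if ci <= data_ci) / len(null_cis)
```

A typical use would compute `two_means_ci` on the data, call `simulate_null_cis` with the thresholded eigenvalues and the data's sample size, and pass both to `empirical_p_value`; small values of the data CI relative to the null CIs indicate significant clustering.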
SigClust Modalities
Two major applications:
I. Test significance of given clusterings (e.g. for those found in heat map) (use given class labels)
II. Test if known cluster can be further split (use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data (large cross study combined data set)
Current folklore: 5 classes (Luminal A, Luminal B, Normal, Her 2, Basal)
Perou 500 PCA view - real clusters?
Perou 500 DWD dir'ns view - real clusters?
Perou 500 - Fundamental Question: Are Luminal A & Luminal B really distinct clusters?
Famous for far different survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
Get p-values from:
- Empirical quantile, from simulated sample CIs
- Fit Gaussian quantile: don't "believe these", but useful for comparison, especially when empirical quantile = 0
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
- Empirical p-val = 0 - definitely 2 clusters
- Gaussian fit p-val = 0.0045 - same strong evidence
- Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
- Empirical p-val = 0 - definitely 2 clusters
- Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
- Conclude these really are two clusters
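The Gaussian-fit p-value mentioned above can be sketched by fitting a normal distribution to the simulated null CIs and evaluating its lower tail at the data CI. The function name is hypothetical, and the use of `math.erfc` is an implementation choice made here so that very small tail probabilities (such as $10^{-10}$) do not round to zero.

```python
import math
from statistics import mean, stdev

def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability of data_ci under a Gaussian fitted to null_cis.

    Phi(z) is computed as 0.5 * erfc(-z / sqrt(2)), which stays accurate
    far out in the lower tail.
    """
    z = (data_ci - mean(null_cis)) / stdev(null_cis)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

if __name__ == "__main__":
    null_cis = [0.59, 0.60, 0.61, 0.58, 0.62]  # made-up values for illustration
    print(gaussian_fit_p_value(0.55, null_cis))
```

Unlike the empirical quantile, which is exactly 0 whenever the data CI is below every simulated CI, this value lets two strongly significant splits be compared with each other.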
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
- Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
- Luminal A vs. B: p-val = 0.0045
- Her 2 vs. Basal: p-val = $10^{-10}$
- Split Luminal A: p-val = $10^{-7}$
- Split Luminal B: p-val = 0.058
- Split Her 2: p-val = 0.10
- Split Basal: p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
- All previous splits were real
- Most not able to split further
- Exception is Basal, already known
- Chuck Perou has good intuition (insight about signal vs. noise)
- How good are others?
SigClust Real Data Results
Experience with other data sets: similar
- Smaller data sets: less power
- Gene filtering: more power
- Lung cancer: more distinct clusters
SigClust Real Data Results
Some personal observations:
- Experienced analysts impressively good
- SigClust can save them time
- SigClust can help them with skeptics
- SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-
means Clustering
Theoretical Null Distributions
Inference for k gt 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insights
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
0limlimlimlim
dndn
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
Even desirable (find additional insights)
0limlimlimlim
dndn
Personal Observations
HDLSS world ishellip
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
- Paradox resolved by
density w r t Lebesgue Measure
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt Can Show Puts ~ All Weight Near
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
Theorem Impossible (over many factors)
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
d dd INZ 0~21Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
(as )
Distance tends to non-random constant
d
d
dd INZ 0~2
)1(221 pOdZZ
1Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
)1(221 pOdZZ
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to Where do they all go
(we can only perceive 3 dimrsquons)
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
As Vectors From Origin
Thanks to memberstripodcom
d
d
dd INZ 0~21Z
1198851
119885 2
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
- Again 1st order structure is non-random
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension n
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
n
d ddn INZZ 0~1
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
(Modulo Rotation)
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
ldquoRandomnessrdquo appears
only in rotation of simplex
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Hyperplane Generated by Data
dimensional hyperplane1n
d ddn INZZ 0~1
SigClust Gaussian null distributrsquon
2nd Key Idea Mod Out Rotations
Only need to estimate diagonal matrix
But still have HDLSS problems
Eg Perou 500 data
Dimension
Sample Size
Still need to estimate paramrsquos9674d533n
9674d
SigClust Gaussian null distributrsquon
3rd Key Idea Factor Analysis Model
Model Covariance as Biology + Noise
Where
is ldquofairly low dimensionalrdquo
is estimated from background noise2NB
INB 2
SigClust Gaussian null distributrsquon
Estimation of Background Noise 2N
SigClust Gaussian null distributrsquon
Estimation of Background Noise
Reasonable model (for each gene)
Expression = Signal + Noise
ldquonoiserdquo is roughly Gaussian
ldquonoiserdquo terms essentially independent
(across genes)
2N
SigClust Estimation of Background Noise
Hope MostEntries areldquoPure Noise (Gaussian)rdquo
A Few (ltlt frac14)Are BiologicalSignal ndashOutliers
How to Check
Q-Q plots
Background Graphical Goodness of Fit
Basis
Cumulative Distribution Function (CDF)
Probability quantile notation
for probabilityrdquo and quantile
xXPxF
p q
qFp pFq 1
Q-Q plots
Two types of CDF
1 Theoretical
2 Empirical based on data nXX 1
qXPqFp
n
qXiqFp i
ˆˆ
Q-Q plots
Comparison Visualizations
(compare a theoretical with an empirical)
3P-P plot
plot vs
for a grid of values
4Q-Q plot
plot vs
for a grid of values
q
q
p p
p
q
Q-Q plotsIllustrative graphic (toy data set)
Q-Q plotsIllustrative graphic (toy data set)
Q-Q plotsIllustrative graphic (toy data set)
Empirical Qs near Theoretical Qs
when
Q-Q curve is near 450 line
(general use of Q-Q plots)
Alternate TerminologyQ-Q Plots = ROC curves
P-P Plots = ldquoPrecision Recallrdquo Curves
Highlights Different Distributional Aspects
Statistical Folklore Q-Q Highlights Tails
So Usually More Useful
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Looks much like
bull Wiggles all random variation
bull But there are n = 10000 data pointshellip
bull How to assess signal amp noise
bull Need to understand sampling variation
Q-Q plotsNeed to understand sampling variation
bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon
ndash Samples of same size
ndash About 100 samples gives
ldquogood visual impressionrdquo
ndash Overlay resulting 100 QQ-curves
ndash To visually convey natural sampling variation
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Harder to see
bull But clearly there
bull Conclude non-Gaussian
bull Really needed n = 10000 data pointshellip
(why bigger sample size was used)
bull Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533 d = 9456
SigClust Estimation of Background Noise
SigClust Estimation of Background Noise
bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there
(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
SigClust Gaussian null distributrsquon
Estimation of Biological Covariance
Keep only ldquolargerdquo eigenvalues
Defined as
So for null distribution use eigenvalues
B
d 21 2Nj
)max()max( 221 NdN
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze
(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues
2N
SigClust Estimation of Eigenvalrsquos
Do we need the factor model Explore this with another data set
(with fewer genes) This time
n = 315 cases d = 306 genes
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
Try another data set with fewer genesThis time
First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at
2N
2N
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
where (indep)
Again rotation invariance makes this work
(and location invariance)
jij NX 0~
id
i
X
X
1
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
bull Spirit similar to DiProPermbull But now significance happens for
smaller values of CI
An example (details to follow)
P-val = 00045
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
IITest if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal
Perou 500 PCA View ndash real clusters
Perou 500 DWD Dirrsquons View ndash real clusters
Perou 500 ndash Fundamental Question
Are Luminal A amp Luminal B really distinct clusters
Famous forFar Different Survivability
SigClust Results for Luminal A vs Luminal B
P-val = 00045
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0
SigClust Results for Luminal A vs Luminal B
I Test significance of given clusteringsbull Empirical p-val = 0
ndash Definitely 2 clusters
bull Gaussian fit p-val = 00045ndash same strong evidence
bull Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II Test if known cluster can be further split
bull Empirical p-val = 0ndash definitely 2 clusters
bull Gaussian fit p-val = 10-10
ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)
bull Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19
Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10
Split Luminal A p-val = 10-7
Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations:
Experienced analysts are impressively good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
I.e. uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse method for much insight:
• Laws of Large Numbers (consistency)
• Central Limit Theorems (quantify errors)
Occasional misconceptions:
• Indicates behavior for large samples
• Thus only makes sense for "large" samples
• Models phenomenon of "increasing data"
• So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
• Approximation provides insights
• Can find simple underlying structure in complex situations
Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{(\cdot) \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is...
• Surprising (many times) [Think I've got it, and then ...]
• Mathematically Beautiful (?)
• Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study ideas from Hall, Marron and Neeman (2005)
For $d$-dimensional Standard Normal distribution:
$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$
Where are the data? Near peak of density?
(Image thanks to psycnet.apa.org)
Euclidean distance to origin (as $d \to \infty$), a measure of how close to peak:
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
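The concentration of the norm around $\sqrt{d}$ is easy to check by simulation. The sketch below is only an illustration: the dimensions, the number of draws and the seed are arbitrary choices, and `gaussian_norm` is a name introduced here.

```python
import math
import random

def gaussian_norm(d, rng):
    """Euclidean norm of one draw from the d-dimensional standard normal."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
for d in (10, 1000, 20000):
    norms = [gaussian_norm(d, rng) for _ in range(20)]
    worst = max(abs(r - math.sqrt(d)) for r in norms)
    # sqrt(d) grows with d, while the deviation from it stays of order 1
    print(f"d={d:6d}  sqrt(d)={math.sqrt(d):8.2f}  max |norm - sqrt(d)| = {worst:.2f}")
```

The point to look for is that the deviation column does not grow with $d$, even though the radius does.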
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider volume of unit sphere in $\mathbb{R}^d$
Find as integral in spherical coordinates
Look at integrand w.r.t. the radius $r$
Can show it puts ~ all weight near $r = 1$
Lebesgue Measure pushes mass out
Density pulls data in
$\sqrt{d}$ is the balance point
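A small numerical sketch of the balance, under the standard fact that in spherical coordinates the radial weight of the standard normal is proportional to $r^{d-1} e^{-r^2/2}$: the factor $r^{d-1}$ comes from Lebesgue measure and the exponential from the density. The function names and the grid search are illustrative choices, not part of the original slides.

```python
import math

def log_radial_weight(r, d):
    """Log of r^(d-1) * exp(-r^2 / 2), up to an additive constant.

    r^(d-1) is the surface-area (Lebesgue measure) factor that pushes mass out;
    exp(-r^2 / 2) is the Gaussian density factor that pulls data in.
    """
    return (d - 1) * math.log(r) - 0.5 * r * r

def balance_radius(d, step=0.01):
    """Grid search for the radius where the product above is largest."""
    n_grid = int(3.0 * math.sqrt(d) / step)
    grid = [step * k for k in range(1, n_grid + 1)]
    return max(grid, key=lambda r: log_radial_weight(r, d))

def outer_shell_fraction(d, eps):
    """Fraction of the unit ball's volume lying within eps of its surface."""
    return 1.0 - (1.0 - eps) ** d

for d in (2, 100, 10000):
    print(d, round(balance_radius(d), 2), round(math.sqrt(d), 2),
          outer_shell_fraction(d, 0.01))
```

Setting the derivative of the log weight to zero gives the maximizer $\sqrt{d-1}$, which is what the grid search approximates and which is essentially $\sqrt{d}$ for large $d$.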
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: "Average People"
Parents' Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dimensional Standard Normal distribution:
$Z_1 \sim N_d(0, I_d)$ independent of $Z_2 \sim N_d(0, I_d)$
Euclidean distance between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go? (we can only perceive 3 dimensions)
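The pairwise-distance statement can be checked the same way. In this sketch (dimension, sample size and seed are arbitrary, and the helper names are introduced here), every pairwise distance among $n = 5$ Gaussian points is divided by $\sqrt{2d}$.

```python
import math
import random

def standard_normal_vector(d, rng):
    """One draw from the d-dimensional standard normal."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
d, n = 20000, 5
points = [standard_normal_vector(d, rng) for _ in range(n)]
ratios = [euclidean_distance(points[i], points[j]) / math.sqrt(2 * d)
          for i in range(n) for j in range(i + 1, n)]
print(min(ratios), max(ratios))  # all pairwise ratios should be close to 1
```

All ten ratios landing near 1 is the "nearly equidistant" picture: no pair of points is noticeably closer together than any other.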
HDLSS Asymptotics: Simple Paradoxes
Ever wonder why we can only perceive 3 dimensions?
o Perceptual system from ancestors
o They needed to find food
o Food exists in 3-d world
HDLSS Asymptotics: Simple Paradoxes
For $d$-dimensional Standard Normal distribution:
$Z_1 \sim N_d(0, I_d)$ independent of $Z_2 \sim N_d(0, I_d)$
High-dimensional angles (as $d \to \infty$), with $Z_1$ and $Z_2$ as vectors from origin
(Image thanks to members.tripod.com)
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
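A simulation sketch of the angle result, with arbitrary dimensions and seed and a helper name introduced here: in low dimension the angle between two independent Gaussian vectors can be anything, while in high dimension it settles near 90 degrees.

```python
import math
import random

def angle_degrees(a, b):
    """Angle at the origin between vectors a and b, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard against rounding
    return math.degrees(math.acos(cosine))

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
for d in (2, 100, 10000):
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    print(d, round(angle_degrees(z1, z2), 1))
```

The spread around 90 degrees shrinks at the rate $d^{-1/2}$ (in radians), so at $d = 10000$ the angle is typically within about a degree of a right angle.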
HDLSS Asymptotics: Geometrical Representation
Hall, Marron & Neeman (2005)
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study subspace generated by data:
- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", & dist $\sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
- All Gaussian data sets are "near Unit Simplex Vertices" (modulo rotation)
- "Randomness" appears only in rotation of simplex
Study hyperplane generated by data:
- $n - 1$ dimensional hyperplane
SigClust Gaussian null distribution
3rd Key Idea: Factor Analysis Model
Model Covariance as Biology + Noise:
$\Sigma = \Sigma_B + \sigma_N^2 I$
where
$\Sigma_B$ is "fairly low dimensional"
$\sigma_N^2$ is estimated from background noise
SigClust Gaussian null distribution
Estimation of Background Noise $\sigma_N^2$:
Reasonable model (for each gene): Expression = Signal + Noise
"Noise" is roughly Gaussian
"Noise" terms essentially independent (across genes)
SigClust Estimation of Background Noise
Hope: Most entries are "Pure Noise (Gaussian)"
A few (<< 1/4) are biological signal - outliers
How to check?
Q-Q plots
Background: Graphical Goodness of Fit
Basis: Cumulative Distribution Function (CDF)
$F(x) = P\{X \le x\}$
Probability quantile notation: "$p$ for probability" and "$q$ for quantile"
$p = F(q)$, $q = F^{-1}(p)$
Q-Q plots
Two types of CDF:
1. Theoretical: $p = F(q) = P\{X \le q\}$
2. Empirical, based on data $X_1, \ldots, X_n$: $\hat{p} = \hat{F}(q) = \#\{i : X_i \le q\} / n$
Q-Q plots
Comparison Visualizations (compare a theoretical with an empirical):
3. P-P plot: plot $\hat{p}$ vs. $p$ for a grid of $q$ values
4. Q-Q plot: plot $\hat{q}$ vs. $q$ for a grid of $p$ values
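The two comparison plots reduce to lists of coordinate pairs. The sketch below computes them against a Gaussian theoretical distribution using Python's `statistics.NormalDist`. The probability grid $(i - 1/2)/n$ for the Q-Q plot is one common convention and is an assumption here, as are the function names.

```python
import random
from statistics import NormalDist

def qq_points(data, dist=NormalDist()):
    """(theoretical q, empirical q-hat) pairs on the grid p_i = (i - 1/2) / n, i = 1..n."""
    xs = sorted(data)
    n = len(xs)
    return [(dist.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

def pp_points(data, q_grid, dist=NormalDist()):
    """(theoretical p, empirical p-hat) pairs on a grid of q values."""
    n = len(data)
    return [(dist.cdf(q), sum(1 for x in data if x <= q) / n) for q in q_grid]

rng = random.Random(0)  # toy Gaussian sample, for illustration only
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
for theo, emp in qq_points(sample)[::40]:
    print(round(theo, 2), round(emp, 2))  # near-equal pairs lie close to the 45 degree line
```

Plotting the second coordinate against the first gives the Q-Q curve; the same is true of `pp_points` for the P-P curve.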
Q-Q plots: Illustrative graphic (toy data set)
Empirical Qs near Theoretical Qs when Q-Q curve is near 45° line
(general use of Q-Q plots)
Alternate Terminology:
Q-Q Plots = ROC curves
P-P Plots = "Precision Recall" Curves
Highlights different distributional aspects
Statistical Folklore: Q-Q highlights tails, so usually more useful
Q-Q plots: Gaussian departures from line
• Looks much like Gaussian?
• Wiggles all random variation?
• But there are n = 10000 data points...
• How to assess signal & noise?
• Need to understand sampling variation
Q-Q plots: Need to understand sampling variation
• Approach: Q-Q envelope plot
  - Simulate from theoretical distribution
  - Samples of same size
  - About 100 samples gives "good visual impression"
  - Overlay resulting 100 QQ-curves
  - To visually convey natural sampling variation
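The envelope idea can be sketched without any plotting: simulate 100 samples of the same size from the theoretical distribution, sort each one, and record the lowest and highest simulated value at each order statistic. Summarizing the overlay by its pointwise min and max is a simplification made here, and the function names and toy samples are assumptions for illustration.

```python
import random
from statistics import NormalDist

def qq_envelope(n, n_sim=100, dist=NormalDist(), seed=0):
    """Pointwise min and max of n_sim sorted samples of size n drawn from dist."""
    rng = random.Random(seed)
    curves = [sorted(rng.gauss(dist.mean, dist.stdev) for _ in range(n))
              for _ in range(n_sim)]
    lower = [min(col) for col in zip(*curves)]
    upper = [max(col) for col in zip(*curves)]
    return lower, upper

def fraction_outside(data, lower, upper):
    """Share of the data's order statistics falling outside the envelope."""
    xs = sorted(data)
    return sum(1 for x, lo, hi in zip(xs, lower, upper) if x < lo or x > hi) / len(xs)

rng = random.Random(1)
gaussian_sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) - 1.0 for _ in range(500)]  # mean 0, sd 1, not Gaussian
lower, upper = qq_envelope(500)
print(fraction_outside(gaussian_sample, lower, upper))
print(fraction_outside(skewed_sample, lower, upper))
```

A Gaussian sample should stay almost entirely inside the envelope, while the skewed sample leaves it over long stretches, which is the visual signal the overlay is meant to convey.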
Q-Q plots: Gaussian departures from line
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points... (why bigger sample size was used)
• Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533, d = 9456
• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
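A sketch of a MAD-based scale estimate, assuming the usual Gaussian consistency rescaling (divide by the 0.75 quantile of the standard normal, about 0.6745). The slides do not spell out the constant, so treat that detail, the function name and the toy contamination model as assumptions.

```python
import random
import statistics
from statistics import NormalDist

def mad_sigma(values):
    """Median absolute deviation from the median, rescaled to estimate a Gaussian sd."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return mad / NormalDist().inv_cdf(0.75)

# Toy data: mostly N(0, 1) background noise plus a few large "signal" entries.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4750)]
signal = [rng.gauss(0.0, 10.0) for _ in range(250)]
entries = noise + signal

print(round(mad_sigma(entries), 2))         # stays near the noise sd of 1
print(round(statistics.stdev(entries), 2))  # inflated by the signal entries
```

The MAD depends only on the middle of the distribution, which is exactly the part the Q-Q curve shows to be close to Gaussian.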
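SigClust Gaussian null distribution
Estimation of Biological Covariance $\Sigma_B$:
Keep only "large" eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$
Defined as $\lambda_j > \sigma_N^2$
So for null distribution use eigenvalues:
$\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$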
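The thresholding rule on this slide is a one-liner. The sketch below applies it to made-up eigenvalues; the function name and the numbers are hypothetical.

```python
def null_eigenvalues(sample_eigenvalues, sigma2_noise):
    """Raise every eigenvalue below the background noise level up to that level."""
    return [max(lam, sigma2_noise) for lam in sample_eigenvalues]

# Hypothetical sample eigenvalues and noise variance, for illustration only.
print(null_eigenvalues([9.0, 4.0, 1.5, 0.4, 0.1], sigma2_noise=1.0))
# [9.0, 4.0, 1.5, 1.0, 1.0]
```

Eigenvalues above the noise level pass through unchanged, so when every sample eigenvalue exceeds $\sigma_N^2$ the rule returns the sample eigenvalues themselves.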
SigClust Estimation of Eigenvalues
All eigenvalues > $\sigma_N^2$!
Suggests biology is very strong here
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Analysis Model
Instead end up with estimated eigenvalues
SigClust Estimation of Eigenvalues
Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes
First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using:
$X_i = (X_{i1}, \ldots, X_{id})^t$, where $X_{ij} \sim N(0, \lambda_j)$ (indep.)
Again rotation invariance makes this work (and location invariance)
SigClust Gaussian null distribution - Simulation
Then compare data CI with simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
An example (details to follow): P-val = 0.0045
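The whole simulation loop can be sketched in plain Python. This is a toy illustration under stated assumptions, not the SigClust implementation: the 2-means cluster index is taken as within-class sum of squares over total sum of squares, 2-means is a basic Lloyd iteration with a few random restarts, and the null eigenvalues are hypothetical values treated as if already estimated and thresholded.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def cluster_index(points, labels):
    """2-means CI: within-class sum of squares divided by total sum of squares."""
    overall = centroid(points)
    total = sum(sq_dist(p, overall) for p in points)
    within = 0.0
    for k in (0, 1):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            c = centroid(members)
            within += sum(sq_dist(p, c) for p in members)
    return within / total

def two_means_labels(points, rng, n_starts=3, n_iter=20):
    """Lloyd's algorithm for k = 2, restarted a few times since local minima are possible."""
    best_labels, best_ci = None, float("inf")
    for _ in range(n_starts):
        centers = rng.sample(points, 2)
        labels = None
        for _ in range(n_iter):
            new_labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                          for p in points]
            if new_labels == labels:
                break
            labels = new_labels
            for k in (0, 1):
                members = [p for p, lab in zip(points, labels) if lab == k]
                if members:
                    centers[k] = centroid(members)
        ci = cluster_index(points, labels)
        if ci < best_ci:
            best_labels, best_ci = labels, ci
    return best_labels

def simulate_null_cis(eigenvalues, n, n_sim, rng):
    """CIs of 2-means applied to samples with independent X_ij ~ N(0, lambda_j)."""
    sds = [math.sqrt(lam) for lam in eigenvalues]
    cis = []
    for _ in range(n_sim):
        sample = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        cis.append(cluster_index(sample, two_means_labels(sample, rng)))
    return cis

rng = random.Random(0)
null_eigs = [37.0] + [1.0] * 9  # hypothetical: matches the toy data's covariance
null_cis = simulate_null_cis(null_eigs, n=40, n_sim=50, rng=rng)
# Toy data: two well separated classes along the first coordinate.
data = [[rng.gauss(6.0 if i < 20 else -6.0, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(9)]
        for i in range(40)]
data_ci = cluster_index(data, two_means_labels(data, rng))
p_emp = sum(1 for ci in null_cis if ci <= data_ci) / len(null_cis)
print(round(data_ci, 3), round(min(null_cis), 3), p_emp)
```

The null here is a single Gaussian with the same overall spread as the toy data, so a data CI well below the simulated CIs reflects the two-class structure rather than sheer variance in the first coordinate.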
SigClust Modalities
Two major applications:
I. Test significance of given clusterings (e.g. for those found in heat map)
   (Use given class labels)
II. Test if known cluster can be further split
   (Use 2-means class labels)
SigClust Gaussian null distributrsquon
Estimation of Background Noise 2N
SigClust Gaussian null distributrsquon
Estimation of Background Noise
Reasonable model (for each gene)
Expression = Signal + Noise
ldquonoiserdquo is roughly Gaussian
ldquonoiserdquo terms essentially independent
(across genes)
2N
SigClust Estimation of Background Noise
Hope MostEntries areldquoPure Noise (Gaussian)rdquo
A Few (ltlt frac14)Are BiologicalSignal ndashOutliers
How to Check
Q-Q plots
Background Graphical Goodness of Fit
Basis
Cumulative Distribution Function (CDF)
Probability quantile notation
for probabilityrdquo and quantile
xXPxF
p q
qFp pFq 1
Q-Q plots
Two types of CDF
1 Theoretical
2 Empirical based on data nXX 1
qXPqFp
n
qXiqFp i
ˆˆ
Q-Q plots
Comparison Visualizations
(compare a theoretical with an empirical)
3P-P plot
plot vs
for a grid of values
4Q-Q plot
plot vs
for a grid of values
q
q
p p
p
q
Q-Q plotsIllustrative graphic (toy data set)
Q-Q plotsIllustrative graphic (toy data set)
Q-Q plotsIllustrative graphic (toy data set)
Empirical Qs near Theoretical Qs
when
Q-Q curve is near 450 line
(general use of Q-Q plots)
Alternate TerminologyQ-Q Plots = ROC curves
P-P Plots = ldquoPrecision Recallrdquo Curves
Highlights Different Distributional Aspects
Statistical Folklore Q-Q Highlights Tails
So Usually More Useful
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Looks much like
bull Wiggles all random variation
bull But there are n = 10000 data pointshellip
bull How to assess signal amp noise
bull Need to understand sampling variation
Q-Q plotsNeed to understand sampling variation
bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon
ndash Samples of same size
ndash About 100 samples gives
ldquogood visual impressionrdquo
ndash Overlay resulting 100 QQ-curves
ndash To visually convey natural sampling variation
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Harder to see
bull But clearly there
bull Conclude non-Gaussian
bull Really needed n = 10000 data pointshellip
(why bigger sample size was used)
bull Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533 d = 9456
SigClust Estimation of Background Noise
SigClust Estimation of Background Noise
bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there
(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
SigClust Gaussian null distributrsquon
Estimation of Biological Covariance
Keep only ldquolargerdquo eigenvalues
Defined as
So for null distribution use eigenvalues
B
d 21 2Nj
)max()max( 221 NdN
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze
(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues
2N
SigClust Estimation of Eigenvalrsquos
Do we need the factor model Explore this with another data set
(with fewer genes) This time
n = 315 cases d = 306 genes
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
Try another data set with fewer genesThis time
First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at
2N
2N
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
where (indep)
Again rotation invariance makes this work
(and location invariance)
jij NX 0~
id
i
X
X
1
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
bull Spirit similar to DiProPermbull But now significance happens for
smaller values of CI
An example (details to follow)
P-val = 00045
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
IITest if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal
Perou 500 PCA View ndash real clusters
Perou 500 DWD Dirrsquons View ndash real clusters
Perou 500 ndash Fundamental Question
Are Luminal A amp Luminal B really distinct clusters
Famous forFar Different Survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
SigClust Results for Luminal A vs. Luminal B
Get p-values from:
Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile
Don't "believe these"
But useful for comparison
Especially when Empirical Quantile = 0
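The two kinds of p-value can be sketched as below. The null CI values are made up for illustration and the function names are hypothetical. Small CI is evidence of clustering, so both p-values are lower-tail quantities.

```python
from statistics import NormalDist, mean, stdev

def empirical_p(data_ci, null_cis):
    """Fraction of simulated null CIs at or below the data CI."""
    return sum(ci <= data_ci for ci in null_cis) / len(null_cis)

def gaussian_fit_p(data_ci, null_cis):
    """Lower-tail probability of the data CI under a Gaussian fit to the null CIs."""
    return NormalDist(mean(null_cis), stdev(null_cis)).cdf(data_ci)

null_cis = [0.80, 0.82, 0.79, 0.81, 0.83, 0.78, 0.80, 0.82]  # made-up values
print(empirical_p(0.70, null_cis))     # 0.0: no simulated CI is this small
print(gaussian_fit_p(0.70, null_cis))  # tiny but positive, so still comparable
```

When the data CI lies below every simulated CI, the empirical p-value is exactly 0 for any such data set, while the Gaussian fit still ranks how far out in the tail each one is.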
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
– Definitely 2 clusters
• Gaussian fit p-val = 0.0045
– same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
– definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
– Stronger evidence than above
– Such comparison is value of Gaussian fit
– Makes sense (since CI is min possible)
• Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results
Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations
Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations:
HDLSS world is…
Surprising (many times)
[Think I've got it, and then …]
Mathematically Beautiful (?)
Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From:
Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are Data?
Near Peak of Density?
(Thanks to psycnet.apa.org)
Euclidean Distance to Origin (as $d \to \infty$):
(measure how close to peak)
$\|Z\|^2 = \sum_{j=1}^d Z_j^2 = d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
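A quick numerical check of the distance-to-origin statement, as a sketch only (the dimensions and replication count are arbitrary choices): the norm of a standard normal vector grows like $\sqrt{d}$ while its deviation from $\sqrt{d}$ stays of constant order.

```python
import math
import random

rng = random.Random(1)

def norm_of_standard_normal(d):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

for d in (10, 1000, 10000):
    deviations = [norm_of_standard_normal(d) - math.sqrt(d) for _ in range(20)]
    # The radius sqrt(d) grows, but the spread around it does not.
    print(d, round(math.sqrt(d), 1),
          round(min(deviations), 2), round(max(deviations), 2))
```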
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density???
- Paradox resolved by: density w.r.t. Lebesgue Measure
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. $r$
Can Show Puts ~ All Weight Near $r = 1$
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
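The balance between Lebesgue measure and density can be written out as a short derivation. This is a sketch using the standard radial form of the Gaussian in spherical coordinates; the symbols $R$, $r$ and $\varepsilon$ are introduced here for illustration.

```latex
% Radial density of R = \|Z\| for Z \sim N_d(0, I_d), in spherical coordinates
f_R(r) \;\propto\; \underbrace{r^{d-1}}_{\text{Lebesgue measure (surface area)}}
       \cdot \underbrace{e^{-r^2/2}}_{\text{Gaussian density}}, \qquad r > 0

% Balance point: set the derivative of the log to zero
\frac{d}{dr}\left[(d-1)\log r - \tfrac{r^2}{2}\right] = \frac{d-1}{r} - r = 0
\quad\Longrightarrow\quad r = \sqrt{d-1} \approx \sqrt{d}

% Unit ball: share of the volume inside radius 1 - \varepsilon
\frac{\int_0^{1-\varepsilon} r^{d-1}\,dr}{\int_0^{1} r^{d-1}\,dr}
  = (1-\varepsilon)^d \longrightarrow 0 \quad (d \to \infty)
```

The factor $r^{d-1}$ pushes mass outward, the factor $e^{-r^2/2}$ pulls it inward, and the product peaks near $\sqrt{d}$.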
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence
"Average People"
Parents Lament:
Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go???
(we can only perceive 3 dim'ns)
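The pairwise distance claim is easy to check by simulation. In this sketch the dimension $d = 5000$ is chosen only so that $\sqrt{2d} = 100$, and $n = 5$ points are used to show that all pairs behave the same way.

```python
import math
import random
from itertools import combinations

rng = random.Random(2)
d, n = 5000, 5  # sqrt(2 * d) = 100

# n independent draws from N_d(0, I_d)
Z = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

# All n * (n - 1) / 2 pairwise Euclidean distances
dists = [math.dist(a, b) for a, b in combinations(Z, 2)]
print(round(math.sqrt(2 * d), 1), [round(x, 1) for x in dists])
```

Every pair sits at nearly the same distance, which is what a regular simplex looks like.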
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why?
(we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin
(Thanks to members.tripod.com)
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal???
- Where do they all go???
(again our perceptual limitations)
- Again 1st order structure is non-random
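The angle statement can be checked the same way. A minimal sketch, with an arbitrary dimension and a hypothetical helper name:

```python
import math
import random

def angle_degrees(u, v):
    """Angle between vectors u and v, viewed as vectors from the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.degrees(math.acos(cos))

rng = random.Random(3)
d = 10000
z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
angle = angle_degrees(z1, z2)
print(round(angle, 2))  # close to 90, with error of order d ** -0.5 radians
```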
HDLSS Asy's: Geometrical Represent'n
Hall, Marron & Neeman (2005)
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sqrt{d}$
Within plane, can "rotate towards $\sqrt{d}$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices"!!!
(Modulo Rotation)
"Randomness" appears only in rotation of simplex
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
SigClust Gaussian null distribut'n
Estimation of Background Noise $\sigma_N^2$:
Reasonable model (for each gene):
Expression = Signal + Noise
"noise" is roughly Gaussian
"noise" terms essentially independent (across genes)
SigClust Estimation of Background Noise
Hope: Most Entries are "Pure Noise (Gaussian)"
A Few (<< ¼) Are Biological Signal – Outliers
How to Check?
Q-Q plots
Background: Graphical Goodness of Fit
Basis: Cumulative Distribution Function (CDF)
$F(x) = P\{X \le x\}$
Probability quantile notation:
for "probability" $p$ and "quantile" $q$
$p = F(q)$, $q = F^{-1}(p)$
Q-Q plots
Two types of CDF:
1. Theoretical: $p = F(q) = P\{X \le q\}$
2. Empirical, based on data $X_1, \dots, X_n$: $\hat{p} = \hat{F}(q) = \#\{i : X_i \le q\} / n$
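The empirical CDF is just a counting operation, as this small sketch shows (the data values are arbitrary and the function name is hypothetical):

```python
from bisect import bisect_right

def ecdf(data):
    """Return F_hat, where F_hat(q) = #{i : X_i <= q} / n."""
    xs = sorted(data)
    n = len(xs)

    def F_hat(q):
        # bisect_right counts how many sorted values are <= q
        return bisect_right(xs, q) / n

    return F_hat

F_hat = ecdf([3.1, 0.5, 2.2, 0.5, 4.0])
print(F_hat(0.4), F_hat(0.5), F_hat(2.5), F_hat(4.0))  # 0.0 0.4 0.6 1.0
```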
Q-Q plots
Comparison Visualizations:
(compare a theoretical with an empirical)
3. P-P plot: plot $\hat{p}$ vs. $p$, for a grid of $q$ values
4. Q-Q plot: plot $\hat{q}$ vs. $q$, for a grid of $p$ values
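Both visualizations reduce to lists of coordinate pairs. The sketch below compares data with a standard normal. The probability grid $(i - 0.5)/n$ is one common convention and is an assumption here, not something stated on the slides. The function names are hypothetical.

```python
from statistics import NormalDist

def qq_points(data, dist=NormalDist()):
    """(theoretical quantile q, empirical quantile q_hat) on a grid of p values."""
    xs = sorted(data)
    n = len(xs)
    return [(dist.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

def pp_points(data, dist=NormalDist()):
    """(theoretical probability p, empirical probability p_hat) at the data values q."""
    xs = sorted(data)
    n = len(xs)
    return [(dist.cdf(x), (i - 0.5) / n) for i, x in enumerate(xs, start=1)]

sample = [0.3, -1.2, 0.9, -0.1, 1.7]  # arbitrary toy data
for q, q_hat in qq_points(sample):
    print(round(q, 3), q_hat)
```

Plotting the pairs and comparing with the 45 degree line gives the usual display. The Q-Q version stretches out the tails, while the P-P version is confined to the unit square.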
Q-Q plots
Illustrative graphic (toy data set)
Empirical Qs near Theoretical Qs
when
Q-Q curve is near 45° line
(general use of Q-Q plots)
Alternate Terminology
Q-Q Plots = ROC curves
P-P Plots = "Precision Recall" Curves
Highlights Different Distributional Aspects
Statistical Folklore: Q-Q Highlights Tails
So Usually More Useful
Q-Q plots
Gaussian? departures from line?
• Looks much like?
• Wiggles all random variation?
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation
Q-Q plots
Need to understand sampling variation
• Approach: Q-Q envelope plot
– Simulate from Theoretical Dist'n
– Samples of same size
– About 100 samples gives "good visual impression"
– Overlay resulting 100 QQ-curves
– To visually convey natural sampling variation
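The envelope recipe above can be sketched as follows. The slides overlay the 100 simulated Q-Q curves. As a stand-in for plotting, this sketch summarizes them by their pointwise minimum and maximum, which is an assumption of this example. The theoretical distribution is taken to be standard normal and the function name is hypothetical.

```python
import random
from statistics import NormalDist

def qq_envelope(n, n_sim=100, seed=0):
    """Simulate n_sim samples of size n and return the band their Q-Q curves span."""
    rng = random.Random(seed)
    # Each sorted sample is the vertical coordinate of one simulated Q-Q curve
    curves = [sorted(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(n_sim)]
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    lower = [min(c[i] for c in curves) for i in range(n)]
    upper = [max(c[i] for c in curves) for i in range(n)]
    return theoretical, lower, upper

theoretical, lower, upper = qq_envelope(50)
mid = 25
print(round(lower[mid], 2), round(theoretical[mid], 2), round(upper[mid], 2))
```

A data Q-Q curve that leaves this band is departing from the line by more than sampling variation alone would explain.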
Q-Q plots
Gaussian? departures from line?
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533, d = 9456
• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
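A sketch of a MAD-based scale estimate on contaminated data. The mixture below (95% Gaussian noise with standard deviation 2, 5% large "signal" entries) is made up for illustration. The rescaling by the normal 0.75 quantile is the usual consistency constant and is assumed here, since the slides do not spell it out.

```python
import random
from statistics import NormalDist, median, stdev

def mad_sigma(values):
    """Median absolute deviation, rescaled to estimate a Gaussian sd."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    return mad / NormalDist().inv_cdf(0.75)

rng = random.Random(4)
noise = [rng.gauss(0.0, 2.0) for _ in range(1900)]   # "pure noise" entries
signal = [50.0 + rng.gauss(0.0, 2.0) for _ in range(100)]  # a few outliers
entries = noise + signal

print(round(mad_sigma(entries), 2))  # stays near the noise sd of 2
print(round(stdev(entries), 2))      # inflated by the outliers
```

The MAD looks only at the middle of the distribution, which is where the Q-Q curve is linear.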
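SigClust Gaussian null distribut'n
Estimation of Biological Covariance $\Sigma_B$:
Keep only "large" eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$
Defined as $\lambda_j > \sigma_N^2$
So for null distribution, use eigenvalues:
$\max(\lambda_1, \sigma_N^2), \dots, \max(\lambda_d, \sigma_N^2)$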
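The thresholding rule is a one-liner. The eigenvalues and noise level below are made-up numbers and the function name is hypothetical. This is the hard threshold described on this slide, not the soft thresholding of Huang et al. (2014).

```python
def null_eigenvalues(sample_eigenvalues, sigma2_noise):
    """Replace every eigenvalue below the noise level by the noise level."""
    return [max(lam, sigma2_noise) for lam in sample_eigenvalues]

print(null_eigenvalues([9.0, 4.0, 0.5, 0.1], sigma2_noise=1.0))
# [9.0, 4.0, 1.0, 1.0]
```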
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze
(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues
2N
SigClust Estimation of Eigenvalrsquos
Do we need the factor model Explore this with another data set
(with fewer genes) This time
n = 315 cases d = 306 genes
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
Try another data set with fewer genesThis time
First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at
2N
2N
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
where (indep)
Again rotation invariance makes this work
(and location invariance)
jij NX 0~
id
i
X
X
1
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
bull Spirit similar to DiProPermbull But now significance happens for
smaller values of CI
An example (details to follow)
P-val = 00045
SigClust Modalities
Two major applications:
I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)
II. Test if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal
Perou 500 PCA View – real clusters
Perou 500 DWD Dir’ns View – real clusters
Perou 500 – Fundamental Question
Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
Get p-values from:
Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile
Don’t “believe these”
But useful for comparison
Especially when Empirical Quantile = 0
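The two kinds of p-value can be contrasted in a short sketch. The null CIs and the data CI below are made-up numbers, and the function names are hypothetical; the Gaussian fit simply uses the sample mean and standard deviation of the simulated CIs.

```python
from statistics import NormalDist, mean, stdev

def empirical_p_value(data_ci, null_cis):
    """Fraction of simulated null CIs at or below the data CI."""
    return sum(ci <= data_ci for ci in null_cis) / len(null_cis)

def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability of the data CI under a Gaussian fitted to the null CIs."""
    fitted = NormalDist(mu=mean(null_cis), sigma=stdev(null_cis))
    return fitted.cdf(data_ci)

# Made-up simulated null CIs and data CI, for illustration only.
null_cis = [0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.78, 0.82]
data_ci = 0.70

p_emp = empirical_p_value(data_ci, null_cis)
p_gauss = gaussian_fit_p_value(data_ci, null_cis)
print(p_emp, p_gauss)
```

When the data CI lies below every simulated CI, the empirical p-value is exactly 0 and cannot rank results, while the Gaussian fit still gives a small positive number that can be compared across tests.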
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
– Definitely 2 clusters
• Gaussian fit p-val = 0.0045
– same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
– definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
– Stronger evidence than above
– Such comparison is value of Gaussian fit
– Makes sense (since CI is min possible)
• Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results
Lum & Norm vs. Her2 & Basal, p-val = $10^{-19}$
Luminal A vs. B, p-val = 0.0045
Her 2 vs. Basal, p-val = $10^{-10}$
Split Luminal A, p-val = $10^{-7}$
Split Luminal B, p-val = 0.058
Split Her 2, p-val = 0.10
Split Basal, p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition
(insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for “large” samples
Models phenomenon of “increasing data”
So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations
Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n,d \to \infty}$, $\lim_{\sigma \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations:
HDLSS world is…
Surprising (many times)
[Think I’ve got it, and then …]
Mathematically Beautiful (?)
Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From:
Hall, Marron and Neeman (2005)
For $d$ dim’al Standard Normal dist’n:
$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$
Where are Data?
Near Peak of Density?
(Thanks to psycnet.apa.org)
Euclidean Distance to Origin (as $d \to \infty$):
(measure how close to peak)
$\|Z\|^2 = \sum_{j=1}^d Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
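A quick numerical check of the distance-to-origin result, as a sketch only (the helper name `gaussian_norm` and the chosen dimensions are arbitrary): the norm grows like $\sqrt{d}$ while its deviation from $\sqrt{d}$ stays of order 1.

```python
import math
import random

def gaussian_norm(d, rng):
    """Euclidean norm of one draw from the d-dimensional standard normal."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

rng = random.Random(0)
for d in (10, 1000, 100000):
    r = gaussian_norm(d, rng)
    # Columns: dimension, sqrt(d), and the deviation of the norm from sqrt(d).
    print(d, round(math.sqrt(d), 2), round(r - math.sqrt(d), 2))
```

The last column should stay bounded as `d` grows, even though the norm itself increases without limit.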
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by:
density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph’l Coordinates
Look At Integrand w.r.t. the radius $r$
Can Show: Puts ~ All Weight Near $r = 1$
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
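One way to make the balance point concrete, as a worked sketch that is not on the slides: write $R = \|Z\|$ and pass to spherical coordinates, so that the Gaussian density $e^{-r^2/2}$ is multiplied by the surface-area factor $r^{d-1}$ coming from Lebesgue measure.

```latex
f_R(r) \;\propto\; r^{d-1} \, e^{-r^2/2}, \qquad r > 0,
\qquad
\frac{d}{dr} \log f_R(r) \;=\; \frac{d-1}{r} - r \;=\; 0
\;\iff\; r = \sqrt{d-1} \;\approx\; \sqrt{d}.
```

The factor $r^{d-1}$ pushes mass outward, the factor $e^{-r^2/2}$ pulls it inward, and the two balance at a radius of about $\sqrt{d}$.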
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence
“Average People”
Parents’ Lament:
Why Can’t I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim’al Standard Normal dist’n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go?
(we can only perceive 3 dim’ns)
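The pairwise-distance claim can be checked numerically in the same spirit. This is a sketch with hypothetical helper names; `math.dist` requires Python 3.8 or later.

```python
import math
import random

def gaussian_vector(d, rng):
    """One draw from the d-dimensional standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def distance_deviation(d, rng):
    """Distance between two independent draws, minus the limiting value sqrt(2d)."""
    z1 = gaussian_vector(d, rng)
    z2 = gaussian_vector(d, rng)
    return math.dist(z1, z2) - math.sqrt(2 * d)

rng = random.Random(0)
for d in (10, 1000, 100000):
    # The deviation should stay of order 1 while sqrt(2d) keeps growing.
    print(d, round(math.sqrt(2 * d), 2), round(distance_deviation(d, rng), 2))
```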
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why?
(we can only perceive 3 dim’ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim’al Standard Normal dist’n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim’al Angles (as $d \to \infty$)
As Vectors From Origin
(Thanks to members.tripod.com)
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go?
(again our perceptual limitations)
- Again 1st order structure is non-random
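The angle result can also be checked by simulation. The sketch below uses hypothetical helper names and computes the angle from the usual inner-product formula; the deviation from 90 degrees should shrink roughly like $d^{-1/2}$.

```python
import math
import random

def angle_degrees(a, b):
    """Angle at the origin between vectors a and b, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard against rounding
    return math.degrees(math.acos(cosine))

def random_angle(d, rng):
    """Angle between two independent d-dimensional standard normal draws."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return angle_degrees(z1, z2)

rng = random.Random(0)
for d in (3, 100, 10000):
    print(d, round(random_angle(d, rng), 2))
```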
HDLSS Asy’s: Geometrical Represent’n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are “nearly equidistant to 0”, & dist ≈ $\sqrt{d}$
Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
All Gaussian data sets are “near Unit Simplex Vertices”
(Modulo Rotation)
“Randomness” appears only in rotation of simplex
Hall, Marron & Neeman (2005)
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
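The simplex picture can be illustrated numerically. In this sketch (hypothetical names, arbitrary sizes), each Gaussian point is divided by $\sqrt{d}$; the rescaled points should then have norms near 1 and pairwise distances near $\sqrt{2}$, which are exactly the norms and distances of the unit simplex vertices $e_1, \ldots, e_n$.

```python
import math
import random
from itertools import combinations

def scaled_gaussian_sample(n, d, rng):
    """n independent d-dimensional standard normal points, each divided by sqrt(d)."""
    scale = math.sqrt(d)
    return [[rng.gauss(0.0, 1.0) / scale for _ in range(d)] for _ in range(n)]

rng = random.Random(0)
points = scaled_gaussian_sample(n=4, d=20000, rng=rng)

# Distances to the origin: all should be close to 1.
norms = [math.sqrt(sum(x * x for x in p)) for p in points]
# Pairwise distances: all should be close to sqrt(2).
pair_dists = [math.dist(p, q) for p, q in combinations(points, 2)]

print([round(r, 3) for r in norms])
print([round(s, 3) for s in pair_dists])
```

Rerunning with a different seed changes the points themselves but, to first order, not these distances, in line with randomness appearing only in the rotation.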
Q-Q plots
Background: Graphical Goodness of Fit
Basis:
Cumulative Distribution Function (CDF)
$F(x) = P\{X \le x\}$
Probability quantile notation:
$p = F(q)$, $q = F^{-1}(p)$
for “probability” $p$ and quantile $q$
Q-Q plots
Two types of CDF:
1. Theoretical: $p = F(q) = P\{X \le q\}$
2. Empirical, based on data $X_1, \ldots, X_n$: $\hat{p} = \hat{F}(q) = \#\{i : X_i \le q\} / n$
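The empirical CDF is simple to compute directly from its definition. A minimal sketch, with a hypothetical function name and toy data:

```python
from bisect import bisect_right

def empirical_cdf(data):
    """Return F_hat, where F_hat(q) is the fraction of data values <= q."""
    xs = sorted(data)
    n = len(xs)

    def F_hat(q):
        # bisect_right counts how many sorted values are <= q.
        return bisect_right(xs, q) / n

    return F_hat

# Toy data, for illustration only.
F_hat = empirical_cdf([3.1, 0.2, 1.7, 1.7, 2.4])
print(F_hat(0.0), F_hat(1.7), F_hat(10.0))
```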
Q-Q plots
Comparison Visualizations
(compare a theoretical with an empirical)
3. P-P plot:
plot $\hat{p}$ vs. $p$ for a grid of $q$ values
4. Q-Q plot:
plot $\hat{q}$ vs. $q$ for a grid of $p$ values
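The coordinates of a Q-Q plot can be sketched as below. The probability grid $p_i = (i - 1/2)/n$ is one common choice and is an assumption here, as are the function name and the toy data; the theoretical distribution is taken to be Gaussian via `statistics.NormalDist`.

```python
from statistics import NormalDist

def qq_pairs(data, dist=NormalDist()):
    """Pairs (theoretical quantile, empirical quantile) on the grid p_i = (i - 1/2) / n."""
    xs = sorted(data)  # empirical quantiles are the order statistics
    n = len(xs)
    probs = [(i + 0.5) / n for i in range(n)]
    theoretical = [dist.inv_cdf(p) for p in probs]
    return list(zip(theoretical, xs))

# Toy data, for illustration only.
for q_theory, q_data in qq_pairs([0.3, -1.2, 0.9, -0.1, 1.8]):
    print(round(q_theory, 3), q_data)
```

Plotting the second coordinate against the first gives the Q-Q curve; points near the 45° line indicate agreement between empirical and theoretical quantiles.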
Q-Q plots: Illustrative graphic (toy data set)
Empirical Qs near Theoretical Qs
when
Q-Q curve is near 45° line
(general use of Q-Q plots)
Alternate Terminology:
Q-Q Plots = ROC curves
P-P Plots = “Precision Recall” Curves
Highlights Different Distributional Aspects
Statistical Folklore: Q-Q Highlights Tails
So Usually More Useful
Q-Q plots: Gaussian departures from line
• Looks much like
• Wiggles all random variation
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation
Q-Q plots: Need to understand sampling variation
• Approach: Q-Q envelope plot
– Simulate from Theoretical Dist’n
– Samples of same size
– About 100 samples gives “good visual impression”
– Overlay resulting 100 QQ-curves
– To visually convey natural sampling variation
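The envelope idea can be sketched without any plotting. Assuming a Gaussian theoretical distribution, the code below simulates 100 samples of the same size, sorts each to get its Q-Q curve, and summarizes the overlay by the pointwise minimum and maximum. The function names and the summary by min and max are illustrative choices, not the method used for the slides.

```python
import random

def simulate_qq_curves(n, mu=0.0, sigma=1.0, n_sim=100, seed=0):
    """Sorted Gaussian samples of size n; each sorted sample is one simulated Q-Q curve."""
    rng = random.Random(seed)
    return [sorted(rng.gauss(mu, sigma) for _ in range(n)) for _ in range(n_sim)]

def envelope(curves):
    """Pointwise lower and upper bounds over all simulated curves."""
    lower = [min(col) for col in zip(*curves)]
    upper = [max(col) for col in zip(*curves)]
    return lower, upper

def fraction_outside(data, lower, upper):
    """Share of the data's order statistics that fall outside the envelope."""
    xs = sorted(data)
    outside = sum(not (lo <= x <= hi) for x, lo, hi in zip(xs, lower, upper))
    return outside / len(xs)

curves = simulate_qq_curves(n=50)
lower, upper = envelope(curves)
# A simulated curve lies inside the envelope by construction.
print(fraction_outside(curves[0], lower, upper))
```

Data whose sorted values leave the envelope over a sizeable stretch show a departure that sampling variation alone does not explain.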
Q-Q plots
Two types of CDF
1 Theoretical
2 Empirical based on data nXX 1
qXPqFp
n
qXiqFp i
ˆˆ
Q-Q plots
Comparison Visualizations
(compare a theoretical with an empirical)
3P-P plot
plot vs
for a grid of values
4Q-Q plot
plot vs
for a grid of values
q
q
p p
p
q
Q-Q plotsIllustrative graphic (toy data set)
Q-Q plotsIllustrative graphic (toy data set)
Q-Q plotsIllustrative graphic (toy data set)
Empirical Qs near Theoretical Qs
when
Q-Q curve is near 450 line
(general use of Q-Q plots)
Alternate TerminologyQ-Q Plots = ROC curves
P-P Plots = ldquoPrecision Recallrdquo Curves
Highlights Different Distributional Aspects
Statistical Folklore Q-Q Highlights Tails
So Usually More Useful
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Looks much like
bull Wiggles all random variation
bull But there are n = 10000 data pointshellip
bull How to assess signal amp noise
bull Need to understand sampling variation
Q-Q plotsNeed to understand sampling variation
bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon
ndash Samples of same size
ndash About 100 samples gives
ldquogood visual impressionrdquo
ndash Overlay resulting 100 QQ-curves
ndash To visually convey natural sampling variation
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Harder to see
bull But clearly there
bull Conclude non-Gaussian
bull Really needed n = 10000 data pointshellip
(why bigger sample size was used)
bull Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533 d = 9456
SigClust Estimation of Background Noise
SigClust Estimation of Background Noise
bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there
(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
SigClust Gaussian null distributrsquon
Estimation of Biological Covariance
Keep only ldquolargerdquo eigenvalues
Defined as
So for null distribution use eigenvalues
B
d 21 2Nj
)max()max( 221 NdN
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze
(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues
2N
SigClust Estimation of Eigenvalrsquos
Do we need the factor model Explore this with another data set
(with fewer genes) This time
n = 315 cases d = 306 genes
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
Try another data set with fewer genesThis time
First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at
2N
2N
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
where (indep)
Again rotation invariance makes this work
(and location invariance)
jij NX 0~
id
i
X
X
1
SigClust Gaussian null distribution - Simulation
Then compare data CI with simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
An example (details to follow): P-val = 0.0045
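The simulation step can be sketched as below. This is a rough sketch under assumptions, not the SigClust software: the cluster index is taken to be the within-class sum of squares divided by the total sum of squares, 2-means is a plain Lloyd iteration with a few random starts, and all function names are hypothetical.

```python
import numpy as np

def cluster_index(x, labels):
    """Within-class sum of squares over total sum of squares (assumed CI)."""
    total = np.sum((x - x.mean(axis=0)) ** 2)
    within = sum(
        np.sum((x[labels == k] - x[labels == k].mean(axis=0)) ** 2)
        for k in np.unique(labels)
    )
    return within / total

def two_means_labels(x, rng, n_starts=5, n_iter=50):
    """Plain Lloyd iterations from random starts (can hit local minima)."""
    best_labels, best_ci = None, np.inf
    for _ in range(n_starts):
        centers = x[rng.choice(len(x), size=2, replace=False)]
        for _ in range(n_iter):
            dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            if labels.min() == labels.max():
                break  # one class emptied; give up on this start
            new_centers = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        if labels.min() != labels.max():
            ci = cluster_index(x, labels)
            if ci < best_ci:
                best_labels, best_ci = labels, ci
    return best_labels

def simulate_null_cis(eigenvalues, n, n_sim, rng):
    """CIs of 2-means applied to samples with X_ij ~ N(0, lambda_j), indep."""
    sd = np.sqrt(np.asarray(eigenvalues, dtype=float))
    cis = np.empty(n_sim)
    for s in range(n_sim):
        x = rng.normal(size=(n, sd.size)) * sd
        cis[s] = cluster_index(x, two_means_labels(x, rng))
    return cis

rng = np.random.default_rng(2)
null_cis = simulate_null_cis(np.array([9.0, 1.0, 1.0, 1.0]), n=40, n_sim=30, rng=rng)
```

The data CI is then compared with `null_cis`; small data CI relative to the null population is the evidence for clustering.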
SigClust Modalities
Two major applications:
I. Test significance of given clusterings (e.g. for those found in heat map) (Use given class labels)
II. Test if known cluster can be further split (Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data (large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal
Perou 500 PCA View – real clusters???
Perou 500 DWD Dir'ns View – real clusters???
Perou 500 – Fundamental Question
Are Luminal A & Luminal B really distinct clusters?
Famous for: Far Different Survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
Get p-values from:
• Empirical Quantile (from simulated sample CIs)
• Fit Gaussian Quantile: don't “believe these”, but useful for comparison, especially when Empirical Quantile = 0
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
– Definitely 2 clusters
• Gaussian fit p-val = 0.0045
– Same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
– Definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
– Stronger evidence than above
– Such comparison is value of Gaussian fit
– Makes sense (since CI is min possible)
• Conclude these really are two clusters
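The two kinds of p-value can be sketched as follows, assuming a vector of simulated null CIs is already in hand. The function names and the toy null values are hypothetical; the Gaussian fit here simply matches the mean and standard deviation of the simulated CIs.

```python
import math
import numpy as np

def empirical_p_value(observed_ci, null_cis):
    """Fraction of simulated null CIs at or below the data CI."""
    return float(np.mean(np.asarray(null_cis) <= observed_ci))

def gaussian_fit_p_value(observed_ci, null_cis):
    """Lower-tail probability under a Gaussian fitted to the null CIs."""
    null_cis = np.asarray(null_cis, dtype=float)
    z = (observed_ci - null_cis.mean()) / null_cis.std(ddof=1)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal CDF at z

# Toy null population of CIs, and a data CI well below all of them
toy_null = np.linspace(0.70, 0.80, 101)
p_emp = empirical_p_value(0.60, toy_null)
p_gauss = gaussian_fit_p_value(0.60, toy_null)
```

When the data CI falls below every simulated CI, the empirical value is exactly 0, while the Gaussian fit still gives a positive number that can be compared across tests.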
SigClust Real Data Results
Summary of Perou 500 SigClust Results
Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold: Anti-Conservative
Problem Fixed by Soft Thresholding (Huang et al. 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
I.e. uses limiting operations, almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for “large” samples
Models phenomenon of “increasing data”
So other flavors are useless???
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure in complex situations
Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, and limits in which a quantity tends to 0
Even desirable! (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is…
Surprising (many times) [Think I've got it, and then …]
Mathematically Beautiful (?)
Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From: Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are Data? Near Peak of Density?
(Thanks to psycnet.apa.org)
Euclidean Distance to Origin (as $d \to \infty$) (measures how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
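A quick simulation makes the distance-to-origin statement concrete. This is only an illustrative sketch; the dimensions, sample size, seed and the name `norm_summary` are arbitrary choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
norm_summary = {}
for d in (10, 100, 1000, 10000):
    z = rng.normal(size=(200, d))       # 200 draws from N_d(0, I_d)
    norms = np.linalg.norm(z, axis=1)   # Euclidean distance to the origin
    # Store (mean norm relative to sqrt(d), spread of the norms)
    norm_summary[d] = (norms.mean() / np.sqrt(d), norms.std())
```

As d grows, the first entry approaches 1 while the second stays bounded, which is what $\|Z\| = \sqrt{d} + O_p(1)$ says: the radius grows but the fluctuation around it does not.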
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density???
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. the radius $r$
Can Show Puts ~ All Weight Near $r = 1$
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
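The volume argument can be written out as a short calculation. This is a sketch only: the symbols $B_d$ (unit ball in $\mathbb{R}^d$), $\omega_{d-1}$ (surface area of the unit sphere), $r$ (radius) and $\varepsilon$ are notation introduced here, not taken from the slides.

```latex
\mathrm{Vol}(B_d) = \int_0^1 \omega_{d-1}\, r^{d-1}\, dr = \frac{\omega_{d-1}}{d},
\qquad
\frac{\mathrm{Vol}\{x : \|x\| \le 1 - \varepsilon\}}{\mathrm{Vol}(B_d)}
  = (1 - \varepsilon)^d \longrightarrow 0 \quad (d \to \infty).

% Same radial factor applied to the standard normal density:
f_{\|Z\|}(r) \;\propto\; r^{d-1} e^{-r^2/2},
\qquad
\frac{d}{dr}\Big[(d-1)\log r - \tfrac{r^2}{2}\Big] = 0
\;\Longrightarrow\; r = \sqrt{d-1} \approx \sqrt{d}.
```

The factor $r^{d-1}$ is the Lebesgue measure pushing mass outward, $e^{-r^2/2}$ is the density pulling it in, and the two balance at a radius of about $\sqrt{d}$.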
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: “Average People”
Parents' Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go??? (we can only perceive 3 dim'ns)
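The pairwise-distance claim is easy to check numerically. This is an illustrative sketch with arbitrarily chosen dimension, sample size and seed; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 20000, 6
z = rng.normal(size=(n, d))                    # Z_1, ..., Z_n ~ N_d(0, I_d)

# All pairwise Euclidean distances
diffs = z[:, None, :] - z[None, :, :]
dist = np.linalg.norm(diffs, axis=2)
pair_dists = dist[np.triu_indices(n, k=1)]     # the n(n-1)/2 distinct pairs

# Compare with the non-random constant sqrt(2d)
ratios = pair_dists / np.sqrt(2 * d)
```

Every entry of `ratios` sits very close to 1, so all n points are nearly equidistant from one another.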
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin
(Thanks to members.tripod.com)
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal???
- Where do they all go??? (again our perceptual limitations)
- Again 1st order structure is non-random
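The angle statement can be illustrated the same way. This is a small sketch with arbitrary dimensions, pair counts and seed; `angle_summary` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(5)
angle_summary = {}
for d in (10, 1000, 10000):
    z1 = rng.normal(size=(100, d))
    z2 = rng.normal(size=(100, d))   # independent of z1
    cosines = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    # Store (mean angle, spread of the angles) in degrees
    angle_summary[d] = (angles.mean(), angles.std())
```

The mean stays near 90 degrees for every d, while the spread shrinks as d grows, matching the $O_p(d^{-1/2})$ term.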
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are “nearly equidistant to 0”, & dist $\sqrt{d}$
Within plane, can “rotate towards $\sqrt{d}$ × Unit Simplex”
All Gaussian data sets are “near Unit Simplex Vertices” (Modulo Rotation)
“Randomness” appears only in rotation of simplex
Hall, Marron & Neeman (2005)
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data: $(n - 1)$-dimensional hyperplane
Q-Q plots
Comparison Visualizations (compare a theoretical with an empirical)
3. P-P plot: plot $\hat{p}$ vs. $p$ for a grid of $q$ values
4. Q-Q plot: plot $\hat{q}$ vs. $q$ for a grid of $p$ values
Q-Q plots: Illustrative graphic (toy data set)
Empirical Qs near Theoretical Qs when Q-Q curve is near 45° line (general use of Q-Q plots)
Alternate Terminology:
Q-Q Plots = ROC curves
P-P Plots = “Precision Recall” Curves
Highlights Different Distributional Aspects
Statistical Folklore: Q-Q Highlights Tails, So Usually More Useful
Q-Q plots: Gaussian? Departures from line?
• Looks much like?
• Wiggles all random variation?
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation
Q-Q plots: Need to understand sampling variation
• Approach: Q-Q envelope plot
– Simulate from Theoretical Dist'n
– Samples of same size
– About 100 samples gives a “good visual impression”
– Overlay the resulting 100 Q-Q curves
– To visually convey natural sampling variation
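The envelope recipe above can be sketched as follows, assuming a standard Gaussian as the theoretical distribution. The function name `qq_envelope`, the plotting positions and the heavy-tailed toy data are all hypothetical choices made for this illustration; plotting is left out so the sketch needs only NumPy and the standard library.

```python
import numpy as np
from statistics import NormalDist

def qq_envelope(n, n_sim=100, seed=0):
    """Theoretical N(0, 1) quantiles and n_sim simulated Q-Q curves."""
    rng = np.random.default_rng(seed)
    probs = (np.arange(1, n + 1) - 0.5) / n              # plotting positions
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    # Each sorted row is one simulated Q-Q curve of the same sample size
    sims = np.sort(rng.normal(size=(n_sim, n)), axis=1)
    return theo, sims

theo, sims = qq_envelope(n=1000)
lower, upper = sims.min(axis=0), sims.max(axis=0)        # pointwise envelope

# Heavy-tailed toy data: fraction of its Q-Q curve outside the envelope
data = np.sort(np.random.default_rng(1).standard_t(df=3, size=1000))
outside = np.mean((data < lower) | (data > upper))
```

Overlaying each row of `sims` against `theo` gives the envelope plot; a data curve that leaves the band, as the heavy-tailed sample does in its tails, is a departure beyond natural sampling variation.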
Q-Q plots: Gaussian? Departures from line?
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why the bigger sample size was used)
• Envelope plot reflects sampling variation
Q-Q plotsIllustrative graphic (toy data set)
Q-Q plotsIllustrative graphic (toy data set)
Q-Q plotsIllustrative graphic (toy data set)
Empirical Qs near Theoretical Qs
when
Q-Q curve is near 450 line
(general use of Q-Q plots)
Alternate TerminologyQ-Q Plots = ROC curves
P-P Plots = ldquoPrecision Recallrdquo Curves
Highlights Different Distributional Aspects
Statistical Folklore Q-Q Highlights Tails
So Usually More Useful
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Looks much like
bull Wiggles all random variation
bull But there are n = 10000 data pointshellip
bull How to assess signal amp noise
bull Need to understand sampling variation
Q-Q plotsNeed to understand sampling variation
bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon
ndash Samples of same size
ndash About 100 samples gives
ldquogood visual impressionrdquo
ndash Overlay resulting 100 QQ-curves
ndash To visually convey natural sampling variation
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Harder to see
bull But clearly there
bull Conclude non-Gaussian
bull Really needed n = 10000 data pointshellip
(why bigger sample size was used)
bull Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533 d = 9456
SigClust Estimation of Background Noise
SigClust Estimation of Background Noise
bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there
(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
SigClust Gaussian null distributrsquon
Estimation of Biological Covariance
Keep only ldquolargerdquo eigenvalues
Defined as
So for null distribution use eigenvalues
B
d 21 2Nj
)max()max( 221 NdN
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze
(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues
2N
SigClust Estimation of Eigenvalrsquos
Do we need the factor model Explore this with another data set
(with fewer genes) This time
n = 315 cases d = 306 genes
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
Try another data set with fewer genesThis time
First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at
2N
2N
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
where (indep)
Again rotation invariance makes this work
(and location invariance)
jij NX 0~
id
i
X
X
1
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
bull Spirit similar to DiProPermbull But now significance happens for
smaller values of CI
An example (details to follow)
P-val = 00045
SigClust Modalities
Two major applications:
I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)
II. Test if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal
Perou 500 PCA View - real clusters
Perou 500 DWD Dir'ns View - real clusters
Perou 500 - Fundamental Question
Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
Get p-values from:
Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile
Don't "believe these"
But useful for comparison
Especially when Empirical Quantile = 0
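A small sketch of the two ways of reading a p-value from the simulated CIs. It assumes the Gaussian fit uses the sample mean and standard deviation of the simulated values. The numbers below are made up for illustration and are not the Perou 500 results.

```python
import statistics

def empirical_p_value(data_ci, null_cis):
    """Proportion of simulated CIs at or below the data CI."""
    return sum(ci <= data_ci for ci in null_cis) / len(null_cis)

def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability of the data CI under a normal fitted to the null CIs."""
    fitted = statistics.NormalDist(statistics.mean(null_cis),
                                   statistics.stdev(null_cis))
    return fitted.cdf(data_ci)

# Made-up simulated CIs and data CI
null_cis = [0.80, 0.82, 0.79, 0.81, 0.83, 0.78, 0.80, 0.82, 0.81, 0.79]
data_ci = 0.70
p_emp = empirical_p_value(data_ci, null_cis)      # no simulated CI is this small
p_fit = gaussian_fit_p_value(data_ci, null_cis)   # tiny but positive
```

When the empirical value is exactly 0, the Gaussian-fit value still gives a number that can be compared across tests, which is the use described above.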
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal, p-val = $10^{-19}$
Luminal A vs. B, p-val = 0.0045
Her 2 vs. Basal, p-val = $10^{-10}$
Split Luminal A, p-val = $10^{-7}$
Split Luminal B, p-val = 0.058
Split Her 2, p-val = 0.10
Split Basal, p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations
Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{(\cdot) \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations:
HDLSS world is…
Surprising (many times)
[Think I've got it, and then …]
Mathematically Beautiful (?)
Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From: Hall, Marron and Neeman (2005)
For $d$ dim'al Standard Normal dist'n:
$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$
Where are Data?
Near Peak of Density?
(Thanks to: psycnet.apa.org)
Euclidean Distance to Origin (as $d \to \infty$):
(measure how close to peak)
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
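A quick simulation sketch of the distance-to-origin statement: the norm of a d-dimensional standard normal vector is centred near the square root of d, while its spread stays of order 1. The dimensions, sample sizes and seed are arbitrary choices for illustration.

```python
import math
import random
import statistics

def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

rng = random.Random(2)
summary = {}
for d in (10, 100, 2000):
    norms = [norm_of_standard_normal(d, rng) for _ in range(200)]
    # (mean norm relative to sqrt(d), standard deviation of the norm)
    summary[d] = (statistics.mean(norms) / math.sqrt(d), statistics.stdev(norms))
```

The first entry of each pair approaches 1 as d grows, and the second stays roughly constant, which is the simulated counterpart of the $O_p(1)$ remainder term.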
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w. r. t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand wrt radius $r$
Can Show Puts ~ All Weight Near $r = 1$
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
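One way to make the balance explicit is the following standard calculation, which is not taken from the slides. In spherical coordinates the distribution of the radius $r = \|Z\|$ picks up a surface-area factor from Lebesgue measure, which competes with the decaying Gaussian density:

```latex
\[
  P\bigl(\|Z\| \in dr\bigr) \;\propto\;
  \underbrace{r^{\,d-1}}_{\text{surface area (pushes mass out)}}
  \;\underbrace{e^{-r^{2}/2}}_{\text{Gaussian density (pulls data in)}}\, dr ,
\]
\[
  \frac{d}{dr}\Bigl[(d-1)\log r - \tfrac{r^{2}}{2}\Bigr]
  = \frac{d-1}{r} - r = 0
  \quad\Longrightarrow\quad
  r = \sqrt{d-1} \approx \sqrt{d}.
\]
```

So the radial density peaks at roughly the square root of $d$, even though the density with respect to Lebesgue measure on $\mathbb{R}^d$ is largest at the origin.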
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence:
"Average People"
Parents Lament:
Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
HDLSS Asymptotics: Simple Paradoxes
Distance tends to non-random constant:
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go?
(we can only perceive 3 dim'ns)
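A simulation sketch of the pairwise-distance statement, extended to a handful of independent points. The choices d = 5000, five points and the seed are arbitrary.

```python
import math
import random

def standard_normal_vector(d, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

rng = random.Random(3)
d = 5000
points = [standard_normal_vector(d, rng) for _ in range(5)]

# All 10 pairwise Euclidean distances, relative to sqrt(2 d)
pairwise = [math.dist(points[i], points[j])
            for i in range(5) for j in range(i + 1, 5)]
ratios = [dist / math.sqrt(2 * d) for dist in pairwise]
```

Every ratio is close to 1, so all points are nearly the same distance from each other, a configuration that cannot be drawn faithfully in 3 dimensions once there are more than four points.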
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin:
(Thanks to: members.tripod.com)
$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go?
(again our perceptual limitations)
- Again 1st order structure is non-random
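A simulation sketch of the angle statement: the root mean square deviation of the angle from 90 degrees shrinks as the dimension grows. The dimensions, the number of pairs and the seed are arbitrary illustrative choices.

```python
import math
import random

def angle_degrees(u, v):
    """Angle at the origin between vectors u and v, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

rng = random.Random(4)
spread = {}
for d in (10, 100, 2500):
    deviations = []
    for _ in range(40):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        deviations.append(angle_degrees(u, v) - 90.0)
    # root mean square deviation from 90 degrees
    spread[d] = math.sqrt(sum(x * x for x in deviations) / len(deviations))
```

The spread falls roughly like one over the square root of d (in radians), consistent with the $O_p(d^{-1/2})$ term.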
HDLSS Asy's: Geometrical Represent'n
Hall, Marron & Neeman (2005)
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\approx \sqrt{d}$
Within plane, can "rotate towards $\sqrt{d}$ × Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
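One way to check the simplex picture numerically is through the Gram matrix of the data rescaled by one over the square root of d. This is a sketch under that framing, not a construction given in the slides, and n = 4, d = 5000 are arbitrary.

```python
import random

rng = random.Random(5)
n, d = 4, 5000
z = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

# Gram matrix of the rescaled data: entry (i, j) is <Z_i, Z_j> / d
gram = [[sum(a * b for a, b in zip(z[i], z[j])) / d for j in range(n)]
        for i in range(n)]

# Largest entrywise departure from the n x n identity matrix
max_dev = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
              for i in range(n) for j in range(n))
```

The identity matrix is exactly the Gram matrix of the n unit coordinate vectors, which are the vertices of the unit simplex. A Gram matrix close to the identity therefore says the rescaled data are, up to small error, a rotation of those vertices.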
Q-Q plots: Illustrative graphic (toy data set)
Empirical Qs near Theoretical Qs when Q-Q curve is near 45° line
(general use of Q-Q plots)
Alternate Terminology:
Q-Q Plots = ROC curves
P-P Plots = "Precision Recall" Curves
Highlights Different Distributional Aspects
Statistical Folklore: Q-Q Highlights Tails
So Usually More Useful
Q-Q plots: Gaussian departures from line
• Looks much like
• Wiggles all random variation
• But there are n = 10000 data points…
• How to assess signal & noise
• Need to understand sampling variation
Q-Q plots: Need to understand sampling variation
• Approach: Q-Q envelope plot
  - Simulate from Theoretical Dist'n
  - Samples of same size
  - About 100 samples gives "good visual impression"
  - Overlay resulting 100 QQ-curves
  - To visually convey natural sampling variation
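The envelope idea can be sketched without any plotting. The sketch assumes the theoretical distribution is the standard normal and summarises the 100 overlaid curves by their pointwise minimum and maximum. The function names and the skewed toy data are illustrative only.

```python
import random
import statistics

def gaussian_qq_points(sample):
    """Pairs (theoretical N(0,1) quantile, sorted data value)."""
    n = len(sample)
    std_normal = statistics.NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(sample)))

def qq_envelope(n, n_sim, rng):
    """Pointwise min and max over n_sim simulated Q-Q curves of size n."""
    curves = [sorted(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(n_sim)]
    lower = [min(col) for col in zip(*curves)]
    upper = [max(col) for col in zip(*curves)]
    return lower, upper

rng = random.Random(6)
n = 200
data = [rng.expovariate(1.0) - 1.0 for _ in range(n)]   # skewed toy data, mean 0
qq = gaussian_qq_points(data)
lower, upper = qq_envelope(n, n_sim=100, rng=rng)

# Number of order statistics that leave the simulated envelope
outside = sum(not (lo <= y <= hi) for (_, y), lo, hi in zip(qq, lower, upper))
```

Data values that fall outside the band traced by the simulated curves indicate departures larger than natural sampling variation would explain. In a plot, each simulated curve would be drawn against the same theoretical quantiles as the data.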
Q-Q plots: Gaussian departures from line
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533 d = 9456
SigClust Estimation of Background Noise
SigClust Estimation of Background Noise
bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there
(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
SigClust Gaussian null distributrsquon
Estimation of Biological Covariance
Keep only ldquolargerdquo eigenvalues
Defined as
So for null distribution use eigenvalues
B
d 21 2Nj
)max()max( 221 NdN
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze
(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues
2N
SigClust Estimation of Eigenvalrsquos
Do we need the factor model Explore this with another data set
(with fewer genes) This time
n = 315 cases d = 306 genes
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
Try another data set with fewer genesThis time
First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at
2N
2N
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
where (indep)
Again rotation invariance makes this work
(and location invariance)
jij NX 0~
id
i
X
X
1
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
bull Spirit similar to DiProPermbull But now significance happens for
smaller values of CI
An example (details to follow)
P-val = 00045
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
IITest if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal
Perou 500 PCA View ndash real clusters
Perou 500 DWD Dirrsquons View ndash real clusters
Perou 500 ndash Fundamental Question
Are Luminal A amp Luminal B really distinct clusters
Famous forFar Different Survivability
SigClust Results for Luminal A vs Luminal B
P-val = 00045
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0
SigClust Results for Luminal A vs Luminal B
I Test significance of given clusteringsbull Empirical p-val = 0
ndash Definitely 2 clusters
bull Gaussian fit p-val = 00045ndash same strong evidence
bull Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II Test if known cluster can be further split
bull Empirical p-val = 0ndash definitely 2 clusters
bull Gaussian fit p-val = 10-10
ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)
bull Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19
Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10
Split Luminal A p-val = 10-7
Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-
means Clustering
Theoretical Null Distributions
Inference for k gt 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insights
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
0limlimlimlim
dndn
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
Even desirable (find additional insights)
0limlimlimlim
dndn
Personal Observations
HDLSS world ishellip
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
- Paradox resolved by
density w r t Lebesgue Measure
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt Can Show Puts ~ All Weight Near
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
Theorem Impossible (over many factors)
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
d dd INZ 0~21Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
(as )
Distance tends to non-random constant
d
d
dd INZ 0~2
)1(221 pOdZZ
1Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
)1(221 pOdZZ
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to Where do they all go
(we can only perceive 3 dimrsquons)
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
As Vectors From Origin
Thanks to memberstripodcom
d
d
dd INZ 0~21Z
1198851
119885 2
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
- Again 1st order structure is non-random
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension n
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
n
d ddn INZZ 0~1
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
(Modulo Rotation)
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
ldquoRandomnessrdquo appears
only in rotation of simplex
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Hyperplane Generated by Data
dimensional hyperplane1n
d ddn INZZ 0~1
Q-Q plotsIllustrative graphic (toy data set)
Empirical Qs near Theoretical Qs
when
Q-Q curve is near 450 line
(general use of Q-Q plots)
Alternate TerminologyQ-Q Plots = ROC curves
P-P Plots = ldquoPrecision Recallrdquo Curves
Highlights Different Distributional Aspects
Statistical Folklore Q-Q Highlights Tails
So Usually More Useful
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Looks much like
bull Wiggles all random variation
bull But there are n = 10000 data pointshellip
bull How to assess signal amp noise
bull Need to understand sampling variation
Q-Q plotsNeed to understand sampling variation
bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon
ndash Samples of same size
ndash About 100 samples gives
ldquogood visual impressionrdquo
ndash Overlay resulting 100 QQ-curves
ndash To visually convey natural sampling variation
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Harder to see
bull But clearly there
bull Conclude non-Gaussian
bull Really needed n = 10000 data pointshellip
(why bigger sample size was used)
bull Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533 d = 9456
SigClust Estimation of Background Noise
SigClust Estimation of Background Noise
bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there
(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
SigClust Gaussian null distributrsquon
Estimation of Biological Covariance
Keep only ldquolargerdquo eigenvalues
Defined as
So for null distribution use eigenvalues
B
d 21 2Nj
)max()max( 221 NdN
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze
(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues
2N
SigClust Estimation of Eigenvalrsquos
Do we need the factor model Explore this with another data set
(with fewer genes) This time
n = 315 cases d = 306 genes
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
Try another data set with fewer genesThis time
First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at
2N
2N
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
where (indep)
Again rotation invariance makes this work
(and location invariance)
jij NX 0~
id
i
X
X
1
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
bull Spirit similar to DiProPermbull But now significance happens for
smaller values of CI
An example (details to follow)
P-val = 00045
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
IITest if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal
Perou 500 PCA View ndash real clusters
Perou 500 DWD Dirrsquons View ndash real clusters
Perou 500 ndash Fundamental Question
Are Luminal A amp Luminal B really distinct clusters
Famous forFar Different Survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
Get p-values from:
• Empirical Quantile, from simulated sample CIs
• Fit Gaussian Quantile: don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
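The two kinds of p-value can be illustrated with the standard library alone. This is a hedged sketch: the function names and the five toy null CIs are invented for the example, and fitting the Gaussian by sample mean and standard deviation is an assumption about how the fit is done.

```python
from statistics import NormalDist, mean, stdev

def empirical_pvalue(ci_data, null_cis):
    """Fraction of simulated null CIs at or below the data CI."""
    return sum(c <= ci_data for c in null_cis) / len(null_cis)

def gaussian_fit_pvalue(ci_data, null_cis):
    """Lower-tail probability of the data CI under a Gaussian fitted to the null CIs."""
    return NormalDist(mean(null_cis), stdev(null_cis)).cdf(ci_data)

null_cis = [0.60, 0.62, 0.64, 0.66, 0.68]     # made-up simulated CIs
p_emp = empirical_pvalue(0.55, null_cis)      # data CI below every simulated CI
p_fit = gaussian_fit_pvalue(0.55, null_cis)   # still positive, so still comparable
```

When the data CI falls below every simulated CI the empirical quantile is exactly 0, while the Gaussian fit still returns a positive number that can be compared across analyses.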
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
– Definitely 2 clusters
• Gaussian fit p-val = 0.0045
– same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
– definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
– Stronger evidence than above
– Such comparison is value of Gaussian fit
– Makes sense (since CI is min possible)
• Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding (Huang et al., 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless?
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations
Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is…
Surprising (many times) [Think I've got it, and then …]
Mathematically Beautiful (?)
Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From: Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$
Where are Data?
Near Peak of Density?
(Figure thanks to psycnet.apa.org)
Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
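A quick simulation makes the $\sqrt{d} + O_p(1)$ statement concrete. This is only an illustrative sketch; the helper name `norm_summary`, the sample size and the seed are arbitrary choices, not part of the source.

```python
import numpy as np

def norm_summary(d, n=200, seed=0):
    """Mean and standard deviation of ||Z|| over n draws of Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    r = np.linalg.norm(rng.standard_normal((n, d)), axis=1)
    return r.mean(), r.std()

# The mean grows like sqrt(d) while the spread stays of order 1.
summaries = {d: norm_summary(d) for d in (10, 100, 1000, 10000)}
for d, (m, s) in summaries.items():
    print(f"d={d:6d}  sqrt(d)={np.sqrt(d):8.2f}  mean={m:8.2f}  sd={s:5.2f}")
```

The point to look for in the output is that the standard deviation does not grow with $d$, so relative to the radius the data sit on an increasingly thin shell.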
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density?
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. radius $r$
Can Show Puts ~ All Weight Near $r = 1$
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
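The balance point can be checked with a short calculation. The following is a standard derivation added for illustration, with $R = \|Z\|$ and $f_R$ its density (notation introduced here, not in the slides): the factor $r^{d-1}$ comes from Lebesgue measure in spherical coordinates and pushes mass out, while the Gaussian factor $e^{-r^2/2}$ pulls it in.

```latex
f_R(r) \;\propto\; r^{d-1} \, e^{-r^2/2}, \qquad r > 0,
\qquad\text{so}\qquad
\frac{d}{dr} \log f_R(r) \;=\; \frac{d-1}{r} - r \;=\; 0
\;\iff\; r = \sqrt{d-1} \approx \sqrt{d}.
```

So the radial density peaks at essentially $\sqrt{d}$, even though the density of $Z$ itself is largest at the origin.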
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: "Average People"
Parents Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go? (we can only perceive 3 dim'ns)
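The same kind of check works for the distance between two independent points. Again a sketch with arbitrary, assumed settings (`pairwise_distance_summary`, 200 pairs, seed 0):

```python
import numpy as np

def pairwise_distance_summary(d, n_pairs=200, seed=0):
    """Mean and sd of ||Z1 - Z2|| for independent Z1, Z2 ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    dist = np.linalg.norm(z1 - z2, axis=1)
    return dist.mean(), dist.std()

d = 5000
dist_mean, dist_sd = pairwise_distance_summary(d)
print(f"sqrt(2d) = {np.sqrt(2 * d):.2f}, mean = {dist_mean:.2f}, sd = {dist_sd:.2f}")
```

With $d = 5000$ the target constant is $\sqrt{2d} = 100$, and the spread around it stays of order 1.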
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin
(Figure: vectors $Z_1$ and $Z_2$ from the origin; thanks to members.tripod.com)
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal?
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
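The near-orthogonality can also be seen numerically. A small sketch, with the helper name `angle_summary` and its settings chosen only for illustration:

```python
import numpy as np

def angle_summary(d, n_pairs=200, seed=0):
    """Mean and sd (in degrees) of the angle between independent N_d(0, I_d) vectors."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cos = (z1 * z2).sum(axis=1) / (np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1))
    angles = np.degrees(np.arccos(cos))
    return angles.mean(), angles.std()

angle_low = angle_summary(100)      # spread of several degrees
angle_high = angle_summary(10000)   # spread well under one degree
```

Going from $d = 100$ to $d = 10000$ should shrink the spread by roughly a factor of 10, in line with the $O_p(d^{-1/2})$ rate.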
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
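One way to see the simplex structure is through the Gram matrix of the data, which carries all lengths and angles and is unchanged by rotation. The sketch below is an illustration under assumed settings (`scaled_gram`, $n = 5$, $d = 20000$), not part of the cited paper.

```python
import numpy as np

def scaled_gram(n, d, seed=0):
    """Gram matrix Z Z^T / d for n independent rows Z_i ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))
    return Z @ Z.T / d

G = scaled_gram(n=5, d=20000)
# Diagonal near 1: each point is about sqrt(d) from the origin.
# Off-diagonal near 0: the points are nearly orthogonal to each other.
print(np.round(G, 2))
```

A Gram matrix close to the identity means that, after dividing by $\sqrt{d}$, the $n$ points look like the vertices of a unit simplex up to rotation.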
Alternate Terminology
Q-Q Plots = ROC curves
P-P Plots = "Precision Recall" Curves
Highlights Different Distributional Aspects
Statistical Folklore: Q-Q Highlights Tails, So Usually More Useful
Q-Q plots: Gaussian? Departures from line?
• Looks much like
• Wiggles all random variation?
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation
Q-Q plots: Need to understand sampling variation
• Approach: Q-Q envelope plot
– Simulate from Theoretical Dist'n
– Samples of same size
– About 100 samples gives "good visual impression"
– Overlay resulting 100 QQ-curves
– To visually convey natural sampling variation
Q-Q plots: Gaussian? Departures from line?
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation
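The envelope idea can be sketched without any plotting. Assumptions in this example: the theoretical distribution is a Gaussian with the data's own sample mean and standard deviation, the envelope is summarized by the pointwise minimum and maximum of the simulated curves rather than overlaid, and the name `qq_envelope` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def qq_envelope(data, n_sim=100, seed=0):
    """Theoretical quantiles, sorted data, and pointwise min/max of simulated Q-Q curves."""
    rng = np.random.default_rng(seed)
    data = np.sort(np.asarray(data, dtype=float))
    n = data.size
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    mu, sigma = data.mean(), data.std(ddof=1)
    # n_sim Gaussian samples of the same size as the data, each sorted into a Q-Q curve.
    sims = np.sort(rng.normal(mu, sigma, size=(n_sim, n)), axis=1)
    return theo, data, sims.min(axis=0), sims.max(axis=0)

def fraction_inside(data):
    """Share of the data's Q-Q curve lying inside the simulated envelope."""
    _, q, lo, hi = qq_envelope(data)
    return np.mean((q >= lo) & (q <= hi))

rng = np.random.default_rng(42)
frac_gauss = fraction_inside(rng.standard_normal(500))   # mostly inside
frac_expo = fraction_inside(rng.exponential(size=500))   # leaves the envelope in the tails
```

Plotting `theo` against the sorted data, with the `lo` and `hi` curves (or all 100 simulated curves) in the background, gives the envelope plot described in the bullets.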
SigClust Estimation of Background Noise
n = 533, d = 9456
• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
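A MAD-based scale estimate is easy to write down. The sketch assumes the usual rescaling by the Gaussian 0.75 quantile so that the estimate targets the standard deviation of the Gaussian middle of the distribution; the function name `mad_sigma` and the contaminated toy data are made up for illustration.

```python
import numpy as np
from statistics import NormalDist

def mad_sigma(values):
    """Median absolute deviation from the median, rescaled to estimate a Gaussian sd."""
    values = np.asarray(values, dtype=float).ravel()
    mad = np.median(np.abs(values - np.median(values)))
    return mad / NormalDist().inv_cdf(0.75)

# Toy "data matrix entries": mostly noise with sd 2, plus 5% large signal values.
rng = np.random.default_rng(0)
entries = np.concatenate([rng.normal(0, 2, 19000), rng.normal(0, 50, 1000)])
sigma_mad = mad_sigma(entries)   # stays near the noise level
sigma_raw = entries.std()        # dominated by the large entries
```

The contrast between the two estimates is the reason a robust scale estimate is attractive here: the bulk of the entries is treated as background noise, and the heavy tails do not inflate $\sigma_N$.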
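SigClust Gaussian null distribut'n
Estimation of Biological Covariance $\Sigma_B$
Keep only "large" eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$
Defined as $\lambda_j > \sigma_N^2$
So for null distribution, use eigenvalues:
$\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$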
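The thresholding rule translates directly into code. In this sketch the sample eigenvalues come from the singular values of the centered data, which avoids forming a $d \times d$ covariance matrix when $d$ is much larger than $n$; the function name `null_eigenvalues` and the toy data are assumptions for illustration.

```python
import numpy as np

def null_eigenvalues(X, sigma2_noise):
    """Sample covariance eigenvalues of X (n x d), raised to at least sigma2_noise."""
    n, d = X.shape
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)  # sorted, largest first
    lam = np.zeros(d)
    lam[:s.size] = s ** 2 / (n - 1)        # at most min(n, d) nonzero eigenvalues
    return np.maximum(lam, sigma2_noise)   # max(lambda_j, sigma_N^2) for every j

# Toy HDLSS data: n = 20 cases, d = 50 variables, one strong direction.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
X[:, 0] *= 10.0
lam_null = null_eigenvalues(X, sigma2_noise=1.0)
```

In the HDLSS case at least $d - n + 1$ sample eigenvalues are zero, so without the floor at $\sigma_N^2$ the simulated null data would live in a much lower-dimensional space than the real data.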
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze
(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues
2N
SigClust Estimation of Eigenvalrsquos
Do we need the factor model Explore this with another data set
(with fewer genes) This time
n = 315 cases d = 306 genes
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
Try another data set with fewer genesThis time
First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at
2N
2N
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
where (indep)
Again rotation invariance makes this work
(and location invariance)
jij NX 0~
id
i
X
X
1
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
bull Spirit similar to DiProPermbull But now significance happens for
smaller values of CI
An example (details to follow)
P-val = 00045
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
IITest if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal
Perou 500 PCA View ndash real clusters
Perou 500 DWD Dirrsquons View ndash real clusters
Perou 500 ndash Fundamental Question
Are Luminal A amp Luminal B really distinct clusters
Famous forFar Different Survivability
SigClust Results for Luminal A vs Luminal B
P-val = 00045
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0
SigClust Results for Luminal A vs Luminal B
I Test significance of given clusteringsbull Empirical p-val = 0
ndash Definitely 2 clusters
bull Gaussian fit p-val = 00045ndash same strong evidence
bull Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II Test if known cluster can be further split
bull Empirical p-val = 0ndash definitely 2 clusters
bull Gaussian fit p-val = 10-10
ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)
bull Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19
Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10
Split Luminal A p-val = 10-7
Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-
means Clustering
Theoretical Null Distributions
Inference for k gt 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insights
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
0limlimlimlim
dndn
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
Even desirable (find additional insights)
0limlimlimlim
dndn
Personal Observations
HDLSS world ishellip
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
- Paradox resolved by
density w r t Lebesgue Measure
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt Can Show Puts ~ All Weight Near
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
Theorem Impossible (over many factors)
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
d dd INZ 0~21Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
(as )
Distance tends to non-random constant
d
d
dd INZ 0~2
)1(221 pOdZZ
1Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
)1(221 pOdZZ
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to Where do they all go
(we can only perceive 3 dimrsquons)
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
As Vectors From Origin
Thanks to memberstripodcom
d
d
dd INZ 0~21Z
1198851
119885 2
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
- Again 1st order structure is non-random
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension n
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
n
d ddn INZZ 0~1
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
(Modulo Rotation)
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
ldquoRandomnessrdquo appears
only in rotation of simplex
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Hyperplane Generated by Data
dimensional hyperplane1n
d ddn INZZ 0~1
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Looks much like
bull Wiggles all random variation
bull But there are n = 10000 data pointshellip
bull How to assess signal amp noise
bull Need to understand sampling variation
Q-Q plotsNeed to understand sampling variation
bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon
ndash Samples of same size
ndash About 100 samples gives
ldquogood visual impressionrdquo
ndash Overlay resulting 100 QQ-curves
ndash To visually convey natural sampling variation
Q-Q plotsGaussian departures from line
Q-Q plotsGaussian departures from line
bull Harder to see
bull But clearly there
bull Conclude non-Gaussian
bull Really needed n = 10000 data pointshellip
(why bigger sample size was used)
bull Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533 d = 9456
SigClust Estimation of Background Noise
SigClust Estimation of Background Noise
bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there
(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
SigClust Gaussian null distributrsquon
Estimation of Biological Covariance
Keep only ldquolargerdquo eigenvalues
Defined as
So for null distribution use eigenvalues
B
d 21 2Nj
)max()max( 221 NdN
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze
(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues
2N
SigClust Estimation of Eigenvalrsquos
Do we need the factor model Explore this with another data set
(with fewer genes) This time
n = 315 cases d = 306 genes
SigClust Estimation of Eigenvalrsquos
SigClust Estimation of Eigenvalrsquos
Try another data set with fewer genesThis time
First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at
2N
2N
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
where (indep)
Again rotation invariance makes this work
(and location invariance)
jij NX 0~
id
i
X
X
1
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
bull Spirit similar to DiProPermbull But now significance happens for
smaller values of CI
An example (details to follow)
P-val = 00045
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
IITest if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal
Perou 500 PCA View ndash real clusters
Perou 500 DWD Dirrsquons View ndash real clusters
Perou 500 ndash Fundamental Question
Are Luminal A amp Luminal B really distinct clusters
Famous forFar Different Survivability
SigClust Results for Luminal A vs Luminal B
P-val = 00045
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0
SigClust Results for Luminal A vs Luminal B
I Test significance of given clusteringsbull Empirical p-val = 0
ndash Definitely 2 clusters
bull Gaussian fit p-val = 00045ndash same strong evidence
bull Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II Test if known cluster can be further split
bull Empirical p-val = 0ndash definitely 2 clusters
bull Gaussian fit p-val = 10-10
ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)
bull Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results
Lum & Norm vs. Her2 & Basal: p-val = 10^-19
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = 10^-10
Split Luminal A: p-val = 10^-7
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations:
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al. 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
I.e., Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations
Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations:
HDLSS world is…
Surprising (many times)
[Think I've got it, and then …]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From:
Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For d-dim'al Standard Normal dist'n:
$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are Data?
Near Peak of Density?
Thanks to psycnet.apa.org
Euclidean Distance to Origin (as $d \to \infty$):
(measure how close to peak)
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
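The distance-to-origin statement can be checked by a small simulation. This is only an illustrative sketch; the dimensions, number of draws and seed are arbitrary choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 2000
summary = {}
for d in (10, 100, 1000):
    Z = rng.standard_normal((n_draws, d))      # rows are draws from N_d(0, I_d)
    norms = np.linalg.norm(Z, axis=1)          # Euclidean distance to the origin
    summary[d] = (norms.mean() / np.sqrt(d), norms.std())
    print(f"d={d:5d}  mean(|Z|)/sqrt(d)={summary[d][0]:.3f}  sd(|Z|)={summary[d][1]:.3f}")
```

The ratio of the mean norm to sqrt(d) approaches 1, while the spread of the norm stays bounded (near 1/sqrt(2)) as d grows, which is the O_p(1) term.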
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. radius $r$
Can Show Puts ~ All Weight Near $r = 1$
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
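The balance point can be made explicit with a short calculation. It is a sketch using the standard density of the radius $R = \|Z\|$ (the chi distribution), written in notation that is not on the slides: the factor $r^{d-1}$ comes from the volume element in spherical coordinates, and the exponential factor is the Gaussian density.

```latex
f_R(r) \;\propto\; \underbrace{r^{d-1}}_{\text{Lebesgue volume element}}
          \cdot \underbrace{e^{-r^2/2}}_{\text{Gaussian density}}, \qquad r > 0,
\qquad
\frac{\partial}{\partial r} \log f_R(r) = \frac{d-1}{r} - r = 0
\;\Longrightarrow\; r^{*} = \sqrt{d-1} \approx \sqrt{d}.
```

The first factor increases in $r$ (pushes mass out), the second decreases (pulls data in), and their product peaks at radius close to $\sqrt{d}$.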
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence:
"Average People"
Parents' Lament:
Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For d-dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go?
(we can only perceive 3 dim'ns)
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For d-dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin:
(Figure: vectors $Z_1$ and $Z_2$ from the origin; thanks to members.tripod.com)
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go?
(again our perceptual limitations)
- Again 1st order structure is non-random
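A quick numerical check of the angle statement, as a sketch with arbitrary dimensions, pair count and seed (none of these come from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs = 2000
angle_mean, angle_sd = {}, {}
for d in (10, 100, 1000):
    Z1 = rng.standard_normal((n_pairs, d))
    Z2 = rng.standard_normal((n_pairs, d))     # independent of Z1
    cosines = np.sum(Z1 * Z2, axis=1) / (
        np.linalg.norm(Z1, axis=1) * np.linalg.norm(Z2, axis=1))
    angles = np.degrees(np.arccos(cosines))    # angle between the two vectors
    angle_mean[d], angle_sd[d] = angles.mean(), angles.std()
    print(f"d={d:5d}  mean angle={angle_mean[d]:.2f}  sd={angle_sd[d]:.2f} degrees")
```

The mean angle sits at 90 degrees and the spread shrinks roughly like (180/pi)/sqrt(d) degrees, matching the $O_p(d^{-1/2})$ rate.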
HDLSS Asy's: Geometrical Represent'n
Hall, Marron & Neeman (2005)
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sqrt{d}$
Within plane, can "rotate towards $\sqrt{d}$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data:
$n - 1$ dimensional hyperplane
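One compact way to see the simplex structure is through the scaled Gram matrix of the data, which should be close to the identity when d is large. The sketch below uses arbitrary n, d and seed; it illustrates the geometric representation and is not code from Hall, Marron & Neeman (2005).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 200_000
Z = rng.standard_normal((n, d))                # n points in d dimensions
gram = Z @ Z.T / d                             # scaled inner products, n x n
# Diagonal ~ 1: every point is at distance ~ sqrt(d) from the origin.
# Off-diagonal ~ 0: every pair of points is nearly orthogonal.
diag = np.diag(gram)
sq_dists = diag[:, None] + diag[None, :] - 2 * gram
scaled_dists = np.sqrt(np.clip(sq_dists, 0, None))   # pairwise distances / sqrt(d)
print(np.round(gram, 3))
print(np.round(scaled_dists, 3))
```

All pairwise distances come out near sqrt(2) after scaling by sqrt(d), so the n points look like the vertices of a regular simplex, and only its orientation differs from one sample to the next.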
Q-Q plots: Gaussian departures from line
• Looks much like?
• Wiggles all random variation?
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation
Q-Q plots: Need to understand sampling variation
• Approach: Q-Q envelope plot
  - Simulate from Theoretical Dist'n
  - Samples of same size
  - About 100 samples gives "good visual impression"
  - Overlay resulting 100 QQ-curves
  - To visually convey natural sampling variation
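The envelope recipe above can be sketched numerically. The function name `qq_envelope`, the plotting positions (i - 0.5)/n, and the heavy-tailed example data are assumptions made for illustration; plotting is omitted, but each simulated row would be one overlaid Q-Q curve.

```python
import numpy as np
from statistics import NormalDist

def qq_envelope(n, n_sim=100, seed=0):
    """Theoretical N(0,1) quantiles plus pointwise min/max of n_sim simulated Q-Q curves."""
    rng = np.random.default_rng(seed)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    # Each row: sorted sample of the same size n from the theoretical distribution.
    sims = np.sort(rng.standard_normal((n_sim, n)), axis=1)
    return theo, sims.min(axis=0), sims.max(axis=0)

theo, lower, upper = qq_envelope(n=1000)

# Hypothetical data set to assess: heavy-tailed, so it should leave the envelope.
data = np.sort(np.random.default_rng(7).standard_t(df=3, size=1000))
outside = float(np.mean((data < lower) | (data > upper)))
print(f"fraction of data quantiles outside the envelope: {outside:.3f}")
```

Departures of the data curve that stay inside the envelope are consistent with sampling variation; departures outside it suggest a real difference from the theoretical distribution.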
Q-Q plots: Gaussian departures from line
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533, d = 9456
• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx is good there
• And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
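SigClust Gaussian null distribut'n
Estimation of Biological Covariance $\Sigma_B$:
Keep only "large" eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_d$
Defined as $\lambda_j > \sigma_N^2$
So for null distribution use eigenvalues:
$\max(\lambda_1, \sigma_N^2), \dots, \max(\lambda_d, \sigma_N^2)$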
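The thresholding rule can be sketched as below. This is an illustration under stated assumptions rather than the SigClust implementation: the helper names are invented, the background noise level is taken to be the MAD of all data matrix entries rescaled by the MAD of a standard normal, and the eigenvalues are those of the sample covariance matrix.

```python
import numpy as np

def mad_sigma(X):
    """Robust sd estimate: MAD of all matrix entries, rescaled to match N(0,1)."""
    x = np.asarray(X, dtype=float).ravel()
    mad = np.median(np.abs(x - np.median(x)))
    return mad / 0.6744897501960817            # MAD of a standard normal

def null_eigenvalues(X):
    """Hard-thresholded eigenvalues max(lambda_j, sigma_N^2) for an n x d data matrix."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    sigma2 = mad_sigma(X) ** 2
    Xc = X - X.mean(axis=0)
    # Sample covariance eigenvalues via singular values; at most n - 1 are nonzero.
    sv = np.linalg.svd(Xc, compute_uv=False)
    lam = np.zeros(d)
    lam[: sv.size] = sv ** 2 / (n - 1)
    return np.maximum(lam, sigma2), sigma2

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 50))              # hypothetical pure-noise data, n < d
lam_null, sigma2 = null_eigenvalues(X)
print(sigma2, lam_null[:5])
```

Using singular values avoids forming the d x d covariance matrix, which matters when d is in the thousands.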
SigClust Estimation of Eigenval's
All eigenvalues > $\sigma_N^2$
Suggests biology is very strong here
I.e., very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues
SigClust Estimation of Eigenval's
Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes
First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using:
$X_i = (X_{i,1}, \dots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)
Again rotation invariance makes this work
(and location invariance)
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
An example (details to follow):
P-val = 0.0045
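The whole simulation step can be sketched end to end. This is a toy version, not the SigClust package: the function names are invented, the 2-means fit is a plain Lloyd iteration with a few random restarts, the cluster index is assumed to be within-cluster sum of squares over total sum of squares, and the null eigenvalues here are the raw sample eigenvalues with no thresholding.

```python
import numpy as np

def two_means_ci(X, rng, n_starts=5, n_iter=50):
    """2-means cluster index: within-cluster SS / total SS (smaller = tighter split)."""
    X = np.asarray(X, dtype=float)
    total_ss = np.sum((X - X.mean(axis=0)) ** 2)
    best = np.inf
    for _ in range(n_starts):                      # restarts guard against local minima
        centers = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(n_iter):                    # plain Lloyd iterations
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            if labels.min() == labels.max():       # one cluster emptied; stop this start
                break
            new_centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())     # within SS at the final centers
    return best / total_ss

def simulate_null_cis(eigenvalues, n, rng, n_sim=30):
    """CIs of n_sim mean-zero Gaussian samples with X_ij ~ N(0, lambda_j), independent."""
    sd = np.sqrt(np.asarray(eigenvalues, dtype=float))
    return np.array([two_means_ci(rng.standard_normal((n, sd.size)) * sd, rng)
                     for _ in range(n_sim)])

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 50))
X[:20] += 4.0                                      # hypothetical data: two separated groups
lam = np.linalg.eigvalsh(np.cov(X, rowvar=False)).clip(min=0)
data_ci = two_means_ci(X, rng)
null_cis = simulate_null_cis(lam, n=len(X), rng=rng)
p_emp = float(np.mean(null_cis <= data_ci))
print(f"data CI={data_ci:.3f}  min null CI={null_cis.min():.3f}  empirical p={p_emp:.3f}")
```

Simulating with mean 0 and a diagonal covariance is enough because the CI is unchanged by shifts and rotations of the data.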
SigClust Modalities
Two major applications:
I. Test significance of given clusterings
(e.g., for those found in heat map)
(Use given class labels)
II. Test if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal
Perou 500 PCA View ndash real clusters
Perou 500 DWD Dirrsquons View ndash real clusters
Perou 500 ndash Fundamental Question
Are Luminal A amp Luminal B really distinct clusters
Famous forFar Different Survivability
SigClust Results for Luminal A vs Luminal B
P-val = 00045
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0
SigClust Results for Luminal A vs Luminal B
I Test significance of given clusteringsbull Empirical p-val = 0
ndash Definitely 2 clusters
bull Gaussian fit p-val = 00045ndash same strong evidence
bull Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II Test if known cluster can be further split
bull Empirical p-val = 0ndash definitely 2 clusters
bull Gaussian fit p-val = 10-10
ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)
bull Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19
Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10
Split Luminal A p-val = 10-7
Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-
means Clustering
Theoretical Null Distributions
Inference for k gt 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
• Laws of Large Numbers (Consistency)
• Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
• Indicates behavior for large samples
• Thus only makes sense for "large" samples
• Models phenomenon of "increasing data"
• So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
• Approximation provides insights
• Can find simple underlying structure
• In complex situations
Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\cdot \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is...
• Surprising (many times) [Think I've got it, and then ...]
• Mathematically Beautiful (?)
• Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From:
Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For d-dim'al Standard Normal dist'n:
$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$
Where are Data?
Near Peak of Density?
(Figure: thanks to psycnet.apa.org)
HDLSS Asymptotics: Simple Paradoxes
For d-dim'al Standard Normal dist'n: $Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$)
(measure how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
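The distance-to-origin claim can be checked by simulation. The sketch below is only an illustration, not part of the original slides; the function name `norm_summary`, the number of draws and the seed are assumptions chosen for the example. It draws standard normal vectors and compares the average length with $\sqrt{d}$.

```python
import numpy as np

def norm_summary(d, n_draws=200, seed=0):
    """Return (mean, sd) of ||Z|| over n_draws draws of Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))   # one draw of Z per row
    norms = np.linalg.norm(z, axis=1)       # Euclidean distance to the origin
    return norms.mean(), norms.std()

# The mean should track sqrt(d) while the spread stays bounded as d grows.
for d in (10, 1000, 10000):
    mean, sd = norm_summary(d)
    print(f"d={d:6d}  sqrt(d)={np.sqrt(d):8.2f}  mean ||Z||={mean:8.2f}  sd={sd:.2f}")
```

The spread of the lengths does not grow with d, which is what the $O_p(1)$ remainder says.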
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. radius $r$
Can Show: Puts ~ All Weight Near $r = 1$
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
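The balance between the two effects can be written out as a short worked derivation. This is a sketch added for illustration, using the standard radial density of a d-dimensional standard normal vector; the symbols $R$, $f_R$ and $r^{*}$ are notation introduced here, not taken from the slides.

```latex
% Radial density of R = \|Z\| for Z \sim N_d(0, I_d):
% Gaussian density factor times the surface area of the sphere of radius r.
f_R(r) \;\propto\; \underbrace{e^{-r^2/2}}_{\text{density pulls data in}}
        \cdot \underbrace{r^{d-1}}_{\text{Lebesgue measure pushes mass out}},
\qquad r > 0.

% Maximize \log f_R(r) = (d-1)\log r - r^2/2 + \text{const}:
\frac{d}{dr}\log f_R(r) = \frac{d-1}{r} - r = 0
\quad\Longrightarrow\quad
r^{*} = \sqrt{d-1} \approx \sqrt{d}.
```

So the most likely radius sits at about $\sqrt{d}$, even though the density itself is largest at the origin.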
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence:
"Average People"
Parents' Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For d-dim'al Standard Normal dist'n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
HDLSS Asymptotics: Simple Paradoxes
Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go?
(we can only perceive 3 dim'ns)
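The claim that all pairwise distances settle near the constant $\sqrt{2d}$ can be checked numerically. The following is a minimal sketch under assumed settings; `pairwise_distances`, the sample size and the seed are illustrative names and values, not from the slides.

```python
import numpy as np

def pairwise_distances(n, d, seed=0):
    """Distances between all pairs among n independent draws from N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    i, j = np.triu_indices(n, k=1)          # index pairs with i < j
    return np.linalg.norm(z[i] - z[j], axis=1)

dists = pairwise_distances(n=20, d=20000)
print("target sqrt(2d):", np.sqrt(2 * 20000))
print("min, max distance:", dists.min(), dists.max())
```

With 20 points there are 190 pairs, and in high dimension they should all land within a few units of the same value.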
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For d-dim'al Standard Normal dist'n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin
(Figure: vectors $Z_1$ and $Z_2$ from the origin; thanks to members.tripod.com)
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go?
(again our perceptual limitations)
- Again 1st order structure is non-random
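A quick simulation shows the angle between independent Gaussian vectors concentrating at 90 degrees. This is an illustrative sketch only; `angle_degrees`, `sample_angles` and the chosen sizes are assumptions made for the example.

```python
import numpy as np

def angle_degrees(u, v):
    """Angle between vectors u and v (from the origin), in degrees."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def sample_angles(d, n_pairs=200, seed=0):
    """Angles between n_pairs independent pairs Z_1, Z_2 ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    return np.array([angle_degrees(a, b) for a, b in zip(z1, z2)])

# The spread around 90 degrees should shrink roughly like d**(-1/2).
for d in (10, 1000, 10000):
    ang = sample_angles(d)
    print(f"d={d:6d}  mean angle={ang.mean():6.2f}  sd={ang.std():.3f}")
```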
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
• Hyperplane through 0, of dimension $n$
• Points are "nearly equidistant to 0", & dist $\sqrt{d}$
• Within plane, can "rotate towards $\sqrt{d}$ × Unit Simplex"
• All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
• "Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
• $n - 1$ dimensional hyperplane
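One way to see the simplex picture numerically is through the Gram matrix of the data scaled by $1/\sqrt{d}$. If it is close to the identity, the scaled points are nearly orthonormal, so up to rotation they sit near the unit simplex vertices. The sketch below is an illustration under that reading; `scaled_gram` and the sizes used are assumptions, not taken from Hall, Marron & Neeman (2005).

```python
import numpy as np

def scaled_gram(n, d, seed=0):
    """Gram matrix of n draws from N_d(0, I_d), each scaled by 1/sqrt(d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    return z @ z.T / d      # entry (i, j) is <Z_i, Z_j> / d

g = scaled_gram(n=5, d=100000)
# Diagonal near 1 (equal lengths), off-diagonal near 0 (orthogonality).
print(np.round(g, 3))
```

The Gram matrix is unchanged by rotations of the data, so any randomness left once it has converged lives only in the rotation.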
Q-Q plots
Gaussian departures from line
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points... (why bigger sample size was used)
• Envelope plot reflects sampling variation
SigClust Estimation of Background Noise
n = 533, d = 9456
SigClust Estimation of Background Noise
• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
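A MAD-based scale estimate of the kind mentioned above can be sketched as follows. This is a minimal illustration, assuming the usual rescaling of the median absolute deviation by the 0.75 quantile of the standard normal; the function name `mad_sigma` is hypothetical and the exact preprocessing used by SigClust is not shown in the slides.

```python
import numpy as np
from statistics import NormalDist

def mad_sigma(x):
    """Robust estimate of the noise standard deviation from all entries of x.

    Uses MAD / Phi^{-1}(0.75), which is consistent for sigma when the
    central part of the distribution is Gaussian.
    """
    x = np.asarray(x, dtype=float).ravel()
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return mad / NormalDist().inv_cdf(0.75)
```

Because it only depends on the middle of the distribution, the estimate is barely moved by heavy tails, which matches the Q-Q diagnostic: linear near the middle, departing in the tails.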
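SigClust Gaussian null distribut'n
Estimation of Biological Covariance $\Sigma_B$:
Keep only "large" eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$
Defined as $\lambda_j > \sigma_N^2$
So for null distribution, use eigenvalues:
$\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$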
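The thresholding rule can be sketched in a few lines. This is an illustration under assumptions: `null_eigenvalues` is a hypothetical helper, the data matrix is taken to have cases in rows, and the sample covariance eigenvalues are computed from singular values of the centered data.

```python
import numpy as np

def null_eigenvalues(x, sigma2_noise):
    """Eigenvalues for the Gaussian null: max(lambda_j, sigma_N^2), j = 1..d.

    x is an n-by-d data matrix (rows = cases); sigma2_noise is the
    background noise variance sigma_N^2.
    """
    n, d = x.shape
    xc = x - x.mean(axis=0)                  # center each variable
    s = np.linalg.svd(xc, compute_uv=False)  # singular values, descending
    lam = np.zeros(d)
    lam[: s.size] = s ** 2 / (n - 1)         # sample covariance eigenvalues
    return np.maximum(lam, sigma2_noise)     # raise small ones to the noise level
```

In the HDLSS case (d > n) at most n - 1 sample eigenvalues are nonzero, so every remaining direction is set to the noise level.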
SigClust Estimation of Eigenval's
All eigenvalues > $\sigma_N^2$
Suggests biology is very strong here
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues
SigClust Estimation of Eigenval's
Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes
SigClust Estimation of Eigenval's
Try another data set, with fewer genes
This time:
First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using:
$X_i = (X_{i,1}, \ldots, X_{i,d})^T$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)
Again rotation invariance makes this work (and location invariance)
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
An example (details to follow): P-val = 0.0045
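The simulation step can be sketched end to end. The code below is only an illustration under assumptions, not the SigClust implementation: the 2-means cluster index is taken as within-cluster sum of squares over total sum of squares, 2-means is a plain Lloyd iteration with random restarts, and all function names are hypothetical.

```python
import numpy as np

def cluster_index(x, labels):
    """2-means cluster index: within-cluster SS divided by total SS."""
    total = ((x - x.mean(axis=0)) ** 2).sum()
    within = 0.0
    for k in np.unique(labels):
        xk = x[labels == k]
        within += ((xk - xk.mean(axis=0)) ** 2).sum()
    return within / total

def two_means(x, rng, n_starts=5, n_iter=50):
    """Lloyd's algorithm with k = 2 and random restarts; returns (CI, labels)."""
    best_ci, best_labels = np.inf, None
    for _start in range(n_starts):
        centers = x[rng.choice(x.shape[0], size=2, replace=False)]
        labels = None
        for _step in range(n_iter):
            dist2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_labels = dist2.argmin(axis=1)
            if np.unique(new_labels).size < 2:
                break                      # a cluster emptied; keep previous labels
            if labels is not None and np.array_equal(new_labels, labels):
                break                      # converged
            labels = new_labels
            centers = np.stack([x[labels == k].mean(axis=0) for k in (0, 1)])
        if labels is None:
            continue                       # degenerate start, try the next one
        ci = cluster_index(x, labels)
        if ci < best_ci:
            best_ci, best_labels = ci, labels
    return best_ci, best_labels

def simulate_null_cis(eigenvalues, n, n_sim, rng):
    """CIs of n_sim data sets with independent X_ij ~ N(0, lambda_j)."""
    scale = np.sqrt(np.asarray(eigenvalues, dtype=float))
    cis = np.empty(n_sim)
    for s in range(n_sim):
        x = rng.standard_normal((n, scale.size)) * scale
        cis[s], _ = two_means(x, rng)
    return cis

def empirical_pvalue(data_ci, null_cis):
    """Fraction of null CIs at or below the data CI (small CI = strong clusters)."""
    return float(np.mean(np.asarray(null_cis) <= data_ci))
```

A typical use would pass the thresholded eigenvalues as `eigenvalues`, the data sample size as `n`, and compare the CI of the observed split with the returned null CIs.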
SigClust Modalities
Two major applications:
I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)
II. Test if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal
Perou 500 PCA View - real clusters
Perou 500 DWD Dir'ns View - real clusters
Perou 500 - Fundamental Question
Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
SigClust Results for Luminal A vs. Luminal B
Get p-values from:
• Empirical Quantile (From simulated sample CIs)
• Fit Gaussian Quantile
- Don't "believe these"
- But useful for comparison
- Especially when Empirical Quantile = 0
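The two kinds of p-value can be computed from the same set of simulated CIs. The sketch below is an illustration with a hypothetical function name; the Gaussian fit is assumed to use the sample mean and standard deviation of the null CIs.

```python
import numpy as np
from statistics import NormalDist

def sigclust_pvalues(data_ci, null_cis):
    """Return (empirical p-value, Gaussian-fit p-value) for a data CI.

    Small CI means strong clustering, so both are lower-tail probabilities.
    Assumes the null CIs are not all identical (the fitted sd must be > 0).
    """
    null_cis = np.asarray(null_cis, dtype=float)
    empirical = float(np.mean(null_cis <= data_ci))
    fit = NormalDist(mu=float(null_cis.mean()), sigma=float(null_cis.std(ddof=1)))
    return empirical, fit.cdf(data_ci)
```

When the data CI lies below every simulated CI, the empirical value is exactly 0, while the Gaussian fit still gives a small positive number that can be compared across tests.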
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
- Definitely 2 clusters
• Gaussian fit p-val = 0.0045
- same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
- definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
- Stronger evidence than above
- Such comparison is value of Gaussian fit
- Makes sense (since CI is min possible)
• Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results
Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19
Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10
Split Luminal A p-val = 10-7
Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-
means Clustering
Theoretical Null Distributions
Inference for k gt 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insights
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
0limlimlimlim
dndn
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
Even desirable (find additional insights)
0limlimlimlim
dndn
Personal Observations
HDLSS world ishellip
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are Data? Near Peak of Density?
(Figure: density plot, thanks to psycnet.apa.org)
Euclidean Distance to Origin (as $d \to \infty$)
(measure how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
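The concentration of the norm around $\sqrt{d}$ can be checked by simulation. The following is a minimal sketch, not part of the original slides; the function name `norm_concentration` and the choice of 2000 draws are illustrative assumptions.

```python
import numpy as np

def norm_concentration(d, n_draws=2000, seed=0):
    """Return (mean, sd) of ||Z|| over n_draws samples of Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))   # each row is one draw of Z
    norms = np.linalg.norm(z, axis=1)       # Euclidean distance to the origin
    return norms.mean(), norms.std()

if __name__ == "__main__":
    for d in (10, 100, 1000):
        mean, sd = norm_concentration(d)
        # The mean tracks sqrt(d), while the spread stays bounded as d grows.
        print(f"d={d:5d}  sqrt(d)={np.sqrt(d):7.3f}  mean={mean:7.3f}  sd={sd:.3f}")
```

The spread staying of order 1 while the mean grows like $\sqrt{d}$ is what the $O_p(1)$ remainder expresses.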
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Spherical Coordinates
Look At Integrand w.r.t. radius $r$
Can Show: Puts ~ All Weight Near $r = 1$
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
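The balance between volume growth and density decay can be written out in a few lines. This is a sketch of the standard calculation, added for illustration; the symbols $R$, $f_R$ and $B_d(\rho)$ (ball of radius $\rho$) are notation introduced here, not taken from the slides.

```latex
% Volume of the unit ball concentrates near its surface:
\[
  \frac{\mathrm{Vol}\, B_d(1-\varepsilon)}{\mathrm{Vol}\, B_d(1)}
  = (1-\varepsilon)^d \longrightarrow 0
  \quad (d \to \infty,\ \varepsilon > 0 \text{ fixed}).
\]
% Radial density of R = \|Z\| for Z ~ N_d(0, I_d):
% surface-area factor r^{d-1} (Lebesgue measure) times Gaussian factor e^{-r^2/2}.
\[
  f_R(r) \propto r^{d-1} e^{-r^2/2}, \qquad r > 0,
\]
\[
  \frac{\mathrm{d}}{\mathrm{d}r} \log f_R(r) = \frac{d-1}{r} - r = 0
  \iff r = \sqrt{d-1} \approx \sqrt{d}.
\]
```

The factor $r^{d-1}$ pushes mass outward, the factor $e^{-r^2/2}$ pulls it inward, and the mode of the product sits at $\sqrt{d-1}$.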
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: "Average People"
Parents' Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go? (we can only perceive 3 dim'ns)
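A quick numerical look at the pairwise distances among $Z_1, \dots, Z_n$, as a hedged sketch (the helper name `pairwise_distances` and the sizes used below are assumptions for illustration):

```python
import numpy as np

def pairwise_distances(n, d, seed=0):
    """All n*(n-1)/2 Euclidean distances among n draws from N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    sq = np.sum(z ** 2, axis=1)
    # ||zi - zj||^2 = ||zi||^2 + ||zj||^2 - 2 <zi, zj>
    d2 = sq[:, None] + sq[None, :] - 2.0 * (z @ z.T)
    upper = np.triu_indices(n, k=1)          # each unordered pair once
    return np.sqrt(np.maximum(d2[upper], 0.0))

if __name__ == "__main__":
    for d in (10, 1000, 100000):
        dist = pairwise_distances(20, d)
        # Ratios to sqrt(2d) cluster ever more tightly around 1.
        ratio = dist / np.sqrt(2 * d)
        print(f"d={d:6d}  min ratio={ratio.min():.3f}  max ratio={ratio.max():.3f}")
```

All pairs being nearly the same distance apart is what makes the simplex picture possible.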
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin
(Figure: angle between vectors $Z_1$ and $Z_2$, thanks to members.tripod.com)
$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
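The near-orthogonality can also be seen numerically. A minimal sketch, with the helper name `angles_deg` and the sample sizes chosen only for illustration:

```python
import numpy as np

def angles_deg(n, d, seed=0):
    """Angles (in degrees) between all pairs of n draws from N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)   # project to unit sphere
    cos = np.clip(unit @ unit.T, -1.0, 1.0)               # guard against rounding
    upper = np.triu_indices(n, k=1)
    return np.degrees(np.arccos(cos[upper]))

if __name__ == "__main__":
    for d in (10, 1000, 100000):
        a = angles_deg(20, d)
        # Spread around 90 degrees shrinks roughly like d^(-1/2).
        print(f"d={d:6d}  mean angle={a.mean():6.2f}  sd={a.std():.3f}")
```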
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)
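One way to see the simplex structure: if the points are nearly equidistant from 0 and nearly orthogonal, then the Gram matrix of the data, divided by $d$, should be close to the identity $I_n$, which is the Gram matrix of an orthonormal set of simplex vertices. A small sketch under that reading (the function name `scaled_gram` is an assumption made here):

```python
import numpy as np

def scaled_gram(n, d, seed=0):
    """Gram matrix Z Z^t / d for n draws Z_1..Z_n from N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    return z @ z.T / d

if __name__ == "__main__":
    for d in (10, 1000, 100000):
        g = scaled_gram(5, d)
        # Largest entrywise deviation from the identity shrinks as d grows.
        print(f"d={d:6d}  max|G - I| = {np.abs(g - np.eye(5)).max():.4f}")
```

Only the Gram matrix is examined, so any rotation of the data gives the same result, consistent with randomness appearing only in the rotation.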
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
SigClust Estimation of Background Noise
n = 533, d = 9456
• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
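A MAD-based scale estimate can be sketched as follows. This assumes the noise level is estimated from the pooled entries of the data matrix and uses the usual Gaussian consistency constant 1.4826; the function name `mad_sigma` is illustrative and this is not claimed to be the exact SigClust implementation.

```python
import numpy as np

def mad_sigma(x):
    """Robust sd estimate: median absolute deviation, rescaled for Gaussian data."""
    x = np.asarray(x, dtype=float).ravel()   # pool all entries (assumption)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 1.4826 * mad                      # 1 / Phi^{-1}(3/4), approximately

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.normal(0.0, 2.0, size=100_000)
    signal = rng.normal(0.0, 20.0, size=5_000)     # a few large "biology" entries
    print("sd (non-robust):", np.concatenate([noise, signal]).std())
    print("MAD estimate   :", mad_sigma(np.concatenate([noise, signal])))
```

The MAD is driven by the middle of the distribution, which is where the Q-Q plot indicated the Gaussian approximation holds.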
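SigClust Gaussian null distribut'n
Estimation of Biological Covariance $\Sigma_B$
Keep only "large" eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_d$
Defined as $\lambda_j > \sigma_N^2$
So for null distribution use eigenvalues:
$\max(\lambda_1, \sigma_N^2), \dots, \max(\lambda_d, \sigma_N^2)$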
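The thresholding rule $\max(\lambda_j, \sigma_N^2)$ can be sketched in a few lines. The function name `null_eigenvalues`, the rows-as-cases layout, and the use of an SVD to obtain sample covariance eigenvalues are choices made for this illustration.

```python
import numpy as np

def null_eigenvalues(data, sigma_n2):
    """Sample covariance eigenvalues (descending), raised to at least sigma_n2.

    data: n x d array, rows are cases. Eigenvalues beyond rank are treated as 0
    before thresholding, so in HDLSS settings they end up equal to sigma_n2.
    """
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    centered = data - data.mean(axis=0)
    # Singular values avoid forming the d x d covariance matrix when d >> n.
    s = np.linalg.svd(centered, compute_uv=False)
    lam = np.zeros(d)
    lam[: s.size] = s ** 2 / (n - 1)
    return np.maximum(lam, sigma_n2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((5, 20)) * 3.0
    print(np.round(null_eigenvalues(x, sigma_n2=1.0), 2))
```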
SigClust Estimation of Eigenval's
All eigenvalues > $\sigma_N^2$
Suggests biology is very strong here
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues
SigClust Estimation of Eigenval's
Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes
First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using:
$X_i = (X_{i,1}, \dots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)
Again rotation invariance makes this work (and location invariance)
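Drawing one null data set with independent coordinates $X_{i,j} \sim N(0, \lambda_j)$ might look as follows; a sketch in which the name `simulate_null` and the example eigenvalues are assumptions.

```python
import numpy as np

def simulate_null(eigenvalues, n, seed=0):
    """Draw n cases from N_d(0, diag(eigenvalues)); returns an n x d array."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(eigenvalues, dtype=float)
    # Mean 0 and diagonal covariance suffice because the cluster index is
    # invariant to shifts and rotations of the data.
    return rng.standard_normal((n, lam.size)) * np.sqrt(lam)

if __name__ == "__main__":
    sim = simulate_null([9.0, 1.0, 0.25], n=50_000)
    print("sample variances:", np.round(sim.var(axis=0), 3))
```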
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
An example (details to follow): P-val = 0.0045
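The comparison of the data CI with simulated null CIs can be sketched end to end. This is an illustration under stated assumptions, not the authors' code: the cluster index is taken as within-cluster sum of squares over total sum of squares, 2-means is a plain Lloyd iteration with random restarts, and all function names and defaults are hypothetical.

```python
import numpy as np

def cluster_index(x, labels):
    """Within-cluster sum of squares divided by total sum of squares."""
    total = np.sum((x - x.mean(axis=0)) ** 2)
    within = sum(np.sum((x[labels == k] - x[labels == k].mean(axis=0)) ** 2)
                 for k in np.unique(labels))
    return within / total

def two_means_labels(x, n_starts=5, n_iter=50, seed=0):
    """2-means via Lloyd iterations; keeps the restart with the smallest CI."""
    rng = np.random.default_rng(seed)
    best_labels, best_ci = None, np.inf
    for _ in range(n_starts):
        centers = x[rng.choice(len(x), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():      # degenerate split, stop
                break
            new_centers = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        if labels.min() != labels.max():
            ci = cluster_index(x, labels)
            if ci < best_ci:
                best_labels, best_ci = labels, ci
    return best_labels

def sigclust_pvalue(x, eigenvalues, n_sim=100, seed=0):
    """Return (data CI, null CIs, empirical p-value); small CI is significant."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(eigenvalues, dtype=float)
    data_ci = cluster_index(x, two_means_labels(x))
    null_ci = np.empty(n_sim)
    for b in range(n_sim):
        sim = rng.standard_normal((x.shape[0], lam.size)) * np.sqrt(lam)
        null_ci[b] = cluster_index(sim, two_means_labels(sim, seed=b))
    return data_ci, null_ci, float(np.mean(null_ci <= data_ci))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    shift = np.zeros(10)
    shift[0] = 5.0
    x = np.vstack([rng.standard_normal((30, 10)) + shift,
                   rng.standard_normal((30, 10)) - shift])
    lam = np.linalg.eigvalsh(np.cov(x, rowvar=False))
    data_ci, null_ci, p = sigclust_pvalue(x, lam, n_sim=50)
    print(f"data CI={data_ci:.3f}  null CI min={null_ci.min():.3f}  p={p}")
```

Here the sample eigenvalues are used directly for the null, which corresponds to the case where the factor part is not used.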
SigClust Modalities
Two major applications:
I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)
II. Test if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal
Perou 500 PCA View – real clusters
Perou 500 DWD Dir'ns View – real clusters
Perou 500 – Fundamental Question
Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
Get p-values from:
Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile
Don't "believe these"
But useful for comparison
Especially when Empirical Quantile = 0
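The two kinds of p-value can be computed from a vector of simulated null CIs as below. A hedged sketch: the function names are illustrative, and the Gaussian fit simply uses the sample mean and standard deviation of the null CIs.

```python
from statistics import NormalDist

import numpy as np

def empirical_pvalue(data_ci, null_ci):
    """Fraction of simulated null CIs that are at least as small as the data CI."""
    null_ci = np.asarray(null_ci, dtype=float)
    return float(np.mean(null_ci <= data_ci))

def gaussian_fit_pvalue(data_ci, null_ci):
    """Lower-tail probability of the data CI under a Gaussian fitted to null CIs."""
    null_ci = np.asarray(null_ci, dtype=float)
    fit = NormalDist(mu=float(null_ci.mean()), sigma=float(null_ci.std(ddof=1)))
    return fit.cdf(data_ci)

if __name__ == "__main__":
    null_ci = np.linspace(0.50, 0.60, 101)     # stand-in for simulated CIs
    print("empirical:", empirical_pvalue(0.45, null_ci))
    print("Gaussian :", gaussian_fit_pvalue(0.45, null_ci))
```

When the data CI falls below every simulated CI, the empirical value is exactly 0, while the Gaussian fit still gives a positive number that can be compared across tests.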
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
– Definitely 2 clusters
• Gaussian fit p-val = 0.0045
– same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
– definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
– Stronger evidence than above
– Such comparison is value of Gaussian fit
– Makes sense (since CI is min possible)
• Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal, p-val = $10^{-19}$
Luminal A vs. B, p-val = 0.0045
Her 2 vs. Basal, p-val = $10^{-10}$
Split Luminal A, p-val = $10^{-7}$
Split Luminal B, p-val = 0.058
Split Her 2, p-val = 0.10
Split Basal, p-val = 0.005
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations:
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding (Huang et al. 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
I.e. uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insights
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
0limlimlimlim
dndn
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
Even desirable (find additional insights)
0limlimlimlim
dndn
Personal Observations
HDLSS world ishellip
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
- Paradox resolved by
density w r t Lebesgue Measure
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt Can Show Puts ~ All Weight Near
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
Theorem Impossible (over many factors)
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
d dd INZ 0~21Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
(as )
Distance tends to non-random constant
d
d
dd INZ 0~2
)1(221 pOdZZ
1Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant:
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
• Factor $\sqrt{2}$, since (for independent $X_1$, $X_2$)
$\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go?
(we can only perceive 3 dim'ns)
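The distance result can be checked by simulation. The following is a minimal sketch, not part of the original slides: it draws independent standard normal pairs and summarizes their Euclidean distance. The helper names and the choices of `reps` and `seed` are assumptions made for illustration.

```python
import math
import random


def pairwise_distance(d: int, rng: random.Random) -> float:
    """Euclidean distance between two independent N_d(0, I_d) draws."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return math.dist(z1, z2)


def distance_summary(d: int, reps: int = 100, seed: int = 0):
    """Sample mean and standard deviation of the distance over `reps` pairs."""
    rng = random.Random(seed)
    dists = [pairwise_distance(d, rng) for _ in range(reps)]
    mean = sum(dists) / reps
    sd = math.sqrt(sum((x - mean) ** 2 for x in dists) / (reps - 1))
    return mean, sd


# The mean should track sqrt(2d) while the spread stays of order 1.
for d in (10, 100, 1000):
    mean, sd = distance_summary(d)
    print(d, math.sqrt(2 * d), mean, sd)
```

The point of the comparison is that the spread does not grow with $d$, so relative to $\sqrt{2d}$ the distance becomes effectively deterministic.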
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why?
(we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin:
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
[Figure: vectors $Z_1$ and $Z_2$ from the origin. Thanks to members.tripod.com]
- Everything is orthogonal???
- Where do they all go???
(again our perceptual limitations)
- Again 1st order structure is non-random
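A simulation sketch of the angle result, under the same standard normal assumption as the slides. The function names and the summary statistic (mean absolute deviation from 90 degrees) are illustrative choices, not taken from the source.

```python
import math
import random


def angle_degrees(d: int, rng: random.Random) -> float:
    """Angle at the origin between two independent N_d(0, I_d) vectors."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    dot = sum(a * b for a, b in zip(z1, z2))
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cosine))


def mean_abs_deviation_from_90(d: int, reps: int = 50, seed: int = 0) -> float:
    """Average of |angle - 90 degrees| over `reps` simulated pairs."""
    rng = random.Random(seed)
    return sum(abs(angle_degrees(d, rng) - 90.0) for _ in range(reps)) / reps


# The deviation from 90 degrees should shrink roughly like d ** -0.5.
for d in (20, 200, 2000):
    print(d, mean_abs_deviation_from_90(d))
```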
HDLSS Asy's Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices"!!!
(Modulo Rotation)
"Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)
HDLSS Asy's Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
SigClust Estimation of Background Noise
• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
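To illustrate why a MAD scale estimate suits data that are Gaussian only "near the middle", here is a minimal sketch. It assumes the noise level is estimated from the median absolute deviation of the pooled entries, rescaled to be consistent for a Gaussian standard deviation. The function name `mad_noise_sd` and the toy data are hypothetical.

```python
import random
import statistics

# Rescaling constant that makes the MAD consistent for a Gaussian sd (~1.4826).
MAD_TO_SD = 1.0 / statistics.NormalDist().inv_cdf(0.75)


def mad_noise_sd(values):
    """Robust estimate of the background-noise sd from pooled entries."""
    values = list(values)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return MAD_TO_SD * mad


# Toy data: Gaussian noise with sd 2, plus a minority of large "signal" entries.
rng = random.Random(0)
entries = [rng.gauss(0.0, 2.0) for _ in range(5000)] + [50.0] * 250
print("MAD-based sd:", mad_noise_sd(entries))
print("ordinary sd: ", statistics.stdev(entries))
```

The ordinary standard deviation is inflated by the non-Gaussian tail, while the MAD-based estimate stays close to the noise level in the middle of the distribution.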
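SigClust Gaussian null distribut'n
Estimation of Biological Covariance $\Sigma_B$
Keep only "large" eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_d$
Defined as $\lambda_j > \sigma_N^2$
So for null distribution use eigenvalues:
$\max(\lambda_1, \sigma_N^2), \dots, \max(\lambda_d, \sigma_N^2)$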
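The thresholding rule above amounts to flooring each sample eigenvalue at the noise variance. A minimal sketch follows; the function name and the example numbers are hypothetical.

```python
def null_eigenvalues(sample_eigenvalues, noise_var):
    """Hard-threshold eigenvalues: anything below the noise variance is raised to it."""
    return [max(lam, noise_var) for lam in sample_eigenvalues]


# Eigenvalues above the noise variance are kept, the rest are set to it.
print(null_eigenvalues([9.0, 4.0, 0.5, 0.01], noise_var=1.0))
```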
SigClust Estimation of Eigenval's
All eigenvalues > $\sigma_N^2$!
Suggests biology is very strong here!
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues
SigClust Estimation of Eigenval's
Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes
SigClust Estimation of Eigenval's
Try another data set, with fewer genes
This time:
First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using:
$X_i = (X_{i,1}, \dots, X_{i,d})^T$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)
Again rotation invariance makes this work
(and location invariance)
SigClust Gaussian null distribution - Simulation
Then compare data CI
With simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
An example (details to follow)
P-val = 0.0045
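The simulation step can be sketched in a few lines. This is an illustration under stated assumptions, not the SigClust implementation: the 2-means fit uses plain Lloyd iterations from a handful of random starts, the cluster index is taken as within-cluster sum of squares over total sum of squares, and all function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vec(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def two_means_ci(points, rng, n_starts=5, n_iter=20):
    """Smallest 2-means cluster index found over a few random starts.

    CI = (within-cluster sum of squares) / (total sum of squares about the
    overall mean); small values indicate strong clustering.
    """
    overall = mean_vec(points)
    total_ss = sum(sq_dist(p, overall) for p in points)
    best = 1.0
    for _ in range(n_starts):
        centers = rng.sample(points, 2)
        for _ in range(n_iter):
            groups = [[], []]
            for p in points:
                nearer_second = sq_dist(p, centers[1]) < sq_dist(p, centers[0])
                groups[int(nearer_second)].append(p)
            if not groups[0] or not groups[1]:
                break  # degenerate start: skip it
            centers = [mean_vec(g) for g in groups]
        else:
            within_ss = sum(
                sq_dist(p, centers[k]) for k in (0, 1) for p in groups[k]
            )
            best = min(best, within_ss / total_ss)
    return best


def simulate_null_cis(null_eigenvalues, n, n_sim, rng):
    """CIs of n_sim mean-zero Gaussian samples with diagonal covariance."""
    sds = [lam ** 0.5 for lam in null_eigenvalues]
    cis = []
    for _ in range(n_sim):
        sample = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        cis.append(two_means_ci(sample, rng))
    return cis


def empirical_p_value(data_ci, null_cis):
    """Share of null CIs at least as small as the data CI."""
    return sum(ci <= data_ci for ci in null_cis) / len(null_cis)
```

Mean zero and a diagonal covariance suffice here because the cluster index is unchanged by shifts and rotations of the data.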
SigClust Modalities
Two major applications:
I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)
II. Test if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal
Perou 500 PCA View - real clusters?
Perou 500 DWD Dir'ns View - real clusters?
Perou 500 - Fundamental Question
Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability
SigClust Results for Luminal A vs Luminal B
P-val = 0.0045
Get p-values from:
Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile
Don't "believe these"
But useful for comparison
Especially when Empirical Quantile = 0
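The two kinds of p-value can be contrasted in a short sketch. It assumes the Gaussian fit uses the sample mean and standard deviation of the simulated CIs; the function name and the toy CI values are made up for illustration.

```python
import statistics


def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability of data_ci under a Gaussian fitted to the null CIs."""
    fit = statistics.NormalDist(
        statistics.fmean(null_cis), statistics.stdev(null_cis)
    )
    return fit.cdf(data_ci)


# Toy example: the data CI lies below every simulated null CI.
toy_null_cis = [0.60, 0.62, 0.64, 0.66, 0.68]
toy_data_ci = 0.55
empirical = sum(ci <= toy_data_ci for ci in toy_null_cis) / len(toy_null_cis)
print("empirical p-value:   ", empirical)
print("Gaussian fit p-value:", gaussian_fit_p_value(toy_data_ci, toy_null_cis))
```

When the empirical quantile is exactly 0, the fitted value still gives a number that can be compared across tests, which is the use the slide describes.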
SigClust Results for Luminal A vs Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
Lum & Norm vs Her2 & Basal, p-val = $10^{-19}$
Luminal A vs B, p-val = 0.0045
Her 2 vs Basal, p-val = $10^{-10}$
Split Luminal A, p-val = $10^{-7}$
Split Luminal B, p-val = 0.058
Split Her 2, p-val = 0.10
Split Basal, p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations:
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless?
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations
Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$
Even desirable! (find additional insights)
HDLSS Asymptotics
Personal Observations:
HDLSS world is…
Surprising (many times)
[Think I've got it, and then …]
Mathematically Beautiful (?)
Practically Relevant
HDLSS Asymptotics Simple Paradoxes
Study Ideas From:
Hall, Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$
Where are Data?
Near Peak of Density?
[Figure. Thanks to psycnet.apa.org]
Euclidean Distance to Origin (as $d \to \infty$):
(measure how close to peak)
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim \chi^2_d = d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
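The step from the squared norm to the norm itself is worth spelling out. A short derivation, using only that $\|Z\|^2 \sim \chi^2_d$ has mean $d$ and variance $2d$, and a first-order expansion of the square root:

```latex
\begin{align*}
\|Z\|^2 &= d + O_p(d^{1/2})
  && \text{since } \mathrm{E}\,\chi^2_d = d,\ \mathrm{Var}\,\chi^2_d = 2d \\
\|Z\| &= \sqrt{d}\,\sqrt{1 + O_p(d^{-1/2})} \\
      &= \sqrt{d}\,\bigl(1 + O_p(d^{-1/2})\bigr)
  && \text{since } \sqrt{1 + x} = 1 + O(x) \text{ as } x \to 0 \\
      &= \sqrt{d} + O_p(1)
\end{align*}
```

So the norm grows like $\sqrt{d}$ while its fluctuations stay bounded in probability.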
HDLSS Asymptotics Simple Paradoxes
As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density???
- Paradox resolved by:
density w.r.t. Lebesgue Measure
SigClust Estimation of Eigenval's
All eigenvalues > $\sigma_N^2$
Suggests biology is very strong here
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues
SigClust Estimation of Eigenval's
Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes
First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$
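The thresholding step can be sketched as follows. This is a minimal illustration of the hard-threshold rule stated on the slide (eigenvalues below $\sigma_N^2$ are raised to $\sigma_N^2$), not the SigClust implementation. The function name `null_eigenvalues`, the $n - 1$ divisor in the sample covariance, and the assumption that the noise variance `sigma2_noise` has already been estimated are all choices made for this example.

```python
import numpy as np

def null_eigenvalues(X, sigma2_noise):
    """Sample covariance eigenvalues of X (n cases x d genes),
    with every value below sigma2_noise raised to sigma2_noise."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)                 # remove the mean (shift invariance)
    s = np.linalg.svd(Xc, compute_uv=False) # singular values, descending
    eigs = np.zeros(d)
    eigs[: s.size] = s ** 2 / (n - 1)       # at most min(n, d) nonzero eigenvalues
    return np.maximum(eigs, sigma2_noise)   # hard threshold at the noise level

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 30))           # toy HDLSS data: n = 10 < d = 30
print(null_eigenvalues(X, sigma2_noise=0.5)[:12])
```

In the HDLSS case at least $d - n + 1$ sample eigenvalues are zero, so the threshold is what keeps the null covariance from being degenerate.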
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
$X_i = (X_{i,1}, \dots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.), with $\lambda_j$ the estimated eigenvalues
Again rotation invariance makes this work (and location invariance)
SigClust Gaussian null distribution - Simulation
Then compare data CI with simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
An example (details to follow): P-val = 0.0045
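The simulation step above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the SigClust software: the cluster index is taken to be the within-cluster sum of squares divided by the total sum of squares, 2-means is a plain single-start Lloyd iteration (so it can stop in a local minimum), and the names `two_means_labels`, `cluster_index` and `simulate_null_cis` are hypothetical.

```python
import numpy as np

def two_means_labels(X, n_iter=50, seed=0):
    """Single-start Lloyd iterations for k = 2; returns a 0/1 label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        if labels.min() == labels.max():   # one cluster emptied; stop
            break
        centers = np.stack([X[labels == k].mean(axis=0) for k in (0, 1)])
    return labels

def cluster_index(X, labels):
    """CI = within-cluster sum of squares / total sum of squares."""
    total = ((X - X.mean(axis=0)) ** 2).sum()
    within = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                 for k in np.unique(labels))
    return within / total

def simulate_null_cis(eigenvalues, n, n_sim=100, seed=0):
    """Draw n_sim data sets with X[i, j] ~ N(0, lambda_j) and return their CIs."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(np.asarray(eigenvalues, dtype=float))
    cis = np.empty(n_sim)
    for b in range(n_sim):
        X = rng.standard_normal((n, sd.size)) * sd
        cis[b] = cluster_index(X, two_means_labels(X, seed=b))
    return cis

null_cis = simulate_null_cis(eigenvalues=[4.0, 1.0, 1.0], n=40, n_sim=20)
print(null_cis.round(3))
```

Only a mean-zero, diagonal-covariance Gaussian is simulated, which is what the shift and rotation invariance of the CI permits. A data CI well below the simulated values is evidence against a single Gaussian cluster.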
SigClust Modalities
Two major applications:
I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)
II. Test if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal
Perou 500 PCA View – real clusters
Perou 500 DWD Dir'ns View – real clusters
Perou 500 – Fundamental Question
Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
Get p-values from:
• Empirical Quantile (from simulated sample CIs)
• Fit Gaussian Quantile: don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
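The two kinds of p-value can be computed from the simulated CIs as in the sketch below. It assumes the Gaussian fit uses the sample mean and standard deviation of the simulated CIs, and that small CI is the significant direction. The function name `sigclust_pvalues` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def sigclust_pvalues(data_ci, null_cis):
    """Return (empirical p-value, Gaussian-fit p-value) for a lower-tail test."""
    null_cis = np.asarray(null_cis, dtype=float)
    # Empirical quantile: fraction of simulated CIs at or below the data CI.
    p_empirical = float(np.mean(null_cis <= data_ci))
    # Gaussian fit: lower-tail probability under N(mean, sd) of the simulated CIs.
    fit = NormalDist(mu=float(null_cis.mean()), sigma=float(null_cis.std(ddof=1)))
    p_gaussian = fit.cdf(data_ci)
    return p_empirical, p_gaussian

null_cis = np.linspace(0.80, 0.90, 101)   # stand-in for simulated CIs
print(sigclust_pvalues(0.70, null_cis))
```

When the data CI lies below every simulated CI, the empirical p-value is exactly 0 and carries no information about how extreme the result is. The Gaussian-fit value still does, which is why it is useful for comparing one split against another.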
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
– Definitely 2 clusters
• Gaussian fit p-val = 0.0045
– same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
– definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
– Stronger evidence than above
– Such comparison is value of Gaussian fit
– Makes sense (since CI is min possible)
• Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal, p-val = $10^{-19}$
Luminal A vs. B, p-val = 0.0045
Her 2 vs. Basal, p-val = $10^{-10}$
Split Luminal A, p-val = $10^{-7}$
Split Luminal B, p-val = 0.058
Split Her 2, p-val = 0.10
Split Basal, p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations:
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding (Huang et al., 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure in complex situations
Thus various flavors are fine (limits over $n$, over $d$, or over both)
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is…
Surprising (many times)
[Think I've got it, and then …]
Mathematically Beautiful (?)
Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From: Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are Data? Near Peak of Density?
Thanks to: psycnet.apa.org
Euclidean Distance to Origin (as $d \to \infty$)
(measure how close to peak)
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
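A small simulation makes the $\sqrt{d} + O_p(1)$ statement concrete. This is a sketch with illustrative sizes. The helper name `norm_summary` is an assumption made for the example.

```python
import numpy as np

def norm_summary(d, n_draws=2000, seed=0):
    """Mean and standard deviation of ||Z|| over n_draws draws of Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(rng.standard_normal((n_draws, d)), axis=1)
    return norms.mean(), norms.std()

for d in (10, 100, 1000):
    mean, sd = norm_summary(d)
    print(f"d={d:5d}  sqrt(d)={np.sqrt(d):7.3f}  mean ||Z||={mean:7.3f}  sd ||Z||={sd:.3f}")
```

The mean tracks $\sqrt{d}$ while the spread stays bounded as $d$ grows, so relative to the radius the data sit on a thinner and thinner shell.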
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. the radius
Can Show Puts ~ All Weight Near the surface (radius 1)
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
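The balance point can be made explicit with a short worked calculation. It is not taken from the slides, but it is the standard chi distribution result: the radial density of $R = \|Z\|$ is the product of a volume factor $r^{d-1}$, which pushes mass out, and the Gaussian factor $e^{-r^2/2}$, which pulls it in.

```latex
% Density of R = ||Z|| for Z ~ N_d(0, I_d): chi distribution with d degrees of freedom
\[
  f_R(r) \;=\; \frac{r^{\,d-1}\, e^{-r^2/2}}{2^{\,d/2-1}\,\Gamma(d/2)}, \qquad r > 0 .
\]
% Maximize the log-density to locate the mode
\[
  \frac{d}{dr}\,\log f_R(r) \;=\; \frac{d-1}{r} - r \;=\; 0
  \quad\Longrightarrow\quad
  r \;=\; \sqrt{d-1} \;\approx\; \sqrt{d} .
\]
```

So the density of $Z$ itself peaks at the origin, while the density of the distance $\|Z\|$ peaks near $\sqrt{d}$.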
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: "Average People"
Parents' Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$)
Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
• Factor $\sqrt{2}$, since for independent $X_1, X_2$: $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go? (we can only perceive 3 dim'ns)
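The pairwise-distance claim, and its extension to $Z_1, \dots, Z_n$, can be checked numerically. This is a minimal sketch. The function name `pairwise_distance_ratio` and the sizes are illustrative assumptions.

```python
import numpy as np

def pairwise_distance_ratio(n, d, seed=0):
    """All n(n-1)/2 pairwise distances among n draws from N_d(0, I_d),
    each divided by sqrt(2d)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))
    sq = (Z ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T   # squared distances
    iu = np.triu_indices(n, k=1)                   # each pair once
    dist = np.sqrt(np.maximum(D2[iu], 0.0))        # guard against tiny negatives
    return dist / np.sqrt(2 * d)

print(pairwise_distance_ratio(n=5, d=20_000).round(4))
```

Every ratio is close to 1, so all pairs are nearly equidistant, which cannot be drawn faithfully in three dimensions once $n > 4$.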
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin
Thanks to: members.tripod.com
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
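The angle result can be illustrated the same way. The sketch below draws independent pairs and reports the mean and spread of the angle in degrees. The helper name `angle_summary` and the sizes are assumptions made for the example.

```python
import numpy as np

def angle_summary(d, n_pairs=1000, seed=0):
    """Mean and sd (in degrees) of the angle between independent N_d(0, I_d) vectors."""
    rng = np.random.default_rng(seed)
    Z1 = rng.standard_normal((n_pairs, d))
    Z2 = rng.standard_normal((n_pairs, d))
    cos = (Z1 * Z2).sum(axis=1) / (
        np.linalg.norm(Z1, axis=1) * np.linalg.norm(Z2, axis=1)
    )
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angles.mean(), angles.std()

for d in (10, 100, 1000):
    mean, sd = angle_summary(d)
    print(f"d={d:5d}  mean angle={mean:6.2f}  sd={sd:5.2f}  (57.3/sqrt(d)={57.3/np.sqrt(d):5.2f})")
```

The spread shrinks at roughly the $d^{-1/2}$ rate (about $57.3^\circ / \sqrt{d}$ after converting radians to degrees), consistent with the $O_p(d^{-1/2})$ term.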
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)
nlim
HDLSS Asymptotics
Modern Mathematical Statistics:
- Based on asymptotic analysis
- I.e. uses limiting operations
- Almost always $\lim_{n \to \infty}$
- Occasional misconceptions:
  - Indicates behavior for large samples
  - Thus only makes sense for "large" samples
  - Models phenomenon of "increasing data"
  - So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics:
- Based on asymptotic analysis
- Real Reasons:
  - Approximation provides insights
  - Can find simple underlying structure
  - In complex situations
- Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\ldots \to 0}$
- Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is...
- Surprising (many times) [Think I've got it, and then ...]
- Mathematically Beautiful
- Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n:
$$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$$
Where are Data? Near Peak of Density?
[Figure; image credit: psycnet.apa.org]
HDLSS Asymptotics: Simple Paradoxes
Euclidean Distance to Origin (as $d \to \infty$) (measures how close to peak):
$$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$$
$$\|Z\| = \sqrt{d} + O_p(1)$$
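The concentration of $\|Z\|$ around $\sqrt{d}$ is easy to check numerically. The following is a minimal sketch, assuming NumPy is available; the function name `norm_deviations`, the dimensions tried and the random seed are illustrative choices, not taken from the slides.

```python
import numpy as np

def norm_deviations(d, n_draws=200, seed=0):
    """Return ||Z|| - sqrt(d) for n_draws independent Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))
    return np.linalg.norm(z, axis=1) - np.sqrt(d)

if __name__ == "__main__":
    for d in (10, 1000, 10000):
        dev = norm_deviations(d)
        # The radius sqrt(d) grows with d, but the spread around it stays O_p(1).
        print(d, round(float(np.sqrt(d)), 1), round(float(dev.std()), 3))
```

The printed standard deviation should stay roughly constant while the radius $\sqrt{d}$ grows, which is the "data lie near the surface of a sphere" picture.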
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
  - Consider Volume of Unit Sphere in $\mathbb{R}^d$
  - Find As Integral In Sph'l Coordinates
  - Look At Integrand w.r.t. [...]
  - Can Show Puts ~All Weight Near [...]
- Lebesgue Measure Pushes Mass Out
- Density Pulls Data In
- $\sqrt{d}$ Is The Balance Point
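One standard way to make the "balance point" precise is through the density of the radius $R = \|Z\|$. This is a sketch under the standard normal assumption and is not necessarily the argument used on the original slide. The factor $r^{d-1}$ comes from the surface area of a sphere of radius $r$ (Lebesgue measure pushing mass out), and the factor $e^{-r^2/2}$ comes from the Gaussian density (pulling data in):

```latex
f_R(r) \;\propto\; r^{d-1} e^{-r^2/2}, \qquad r > 0,
\\
\frac{\partial}{\partial r} \log f_R(r) \;=\; \frac{d-1}{r} - r \;=\; 0
\quad\Longleftrightarrow\quad r = \sqrt{d-1} \approx \sqrt{d}.
```

So the radial density peaks near $\sqrt{d}$ even though the density of $Z$ itself peaks at the origin.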
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: "Average People"
- Parents' Lament: Why Can't I Have Average Children?
- Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$
Distance tends to non-random constant
- Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
- Can extend to $Z_1, \ldots, Z_n$
- Where do they all go? (we can only perceive 3 dim'ns)
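The $\sqrt{2d}$ rate follows from the distance-to-origin result by rescaling. A short worked step, assuming only the independence of $Z_1$ and $Z_2$ stated above:

```latex
Z_{1,j} - Z_{2,j} \sim N(0, 2) \ \text{independently for } j = 1, \ldots, d
\;\Longrightarrow\;
\tfrac{1}{\sqrt{2}} (Z_1 - Z_2) \sim N_d(0, I_d),
\\
\|Z_1 - Z_2\| \;=\; \sqrt{2}\,\bigl(\sqrt{d} + O_p(1)\bigr) \;=\; \sqrt{2d} + O_p(1).
```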
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
- Perceptual System from Ancestors
- They Needed to Find Food
- Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin:
$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$
[Figure: vectors $Z_1$ and $Z_2$ from the origin; image credit: members.tripod.com]
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
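The near-orthogonality of independent Gaussian vectors can also be seen by simulation. A minimal sketch, assuming NumPy; `pairwise_angle_degrees` is a hypothetical helper name and the dimensions are arbitrary illustrative values.

```python
import numpy as np

def pairwise_angle_degrees(d, n_pairs=200, seed=0):
    """Angles (in degrees) between independent pairs Z1, Z2 ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cos = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # Clip guards against round-off pushing the cosine outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

if __name__ == "__main__":
    for d in (3, 100, 10000):
        ang = pairwise_angle_degrees(d)
        # Mean stays near 90 degrees; the spread shrinks as d grows.
        print(d, round(float(ang.mean()), 2), round(float(ang.std()), 2))
```

In low dimension the angles are widely spread; as $d$ grows they collapse onto 90 degrees, consistent with the $O_p(d^{-1/2})$ rate.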
HDLSS Asy's: Geometrical Represent'n
Hall, Marron & Neeman (2005)
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", & dist $\sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
- All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
- "Randomness" appears only in rotation of simplex
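The simplex picture can be checked through the Gram matrix of the rescaled data: if $Z_i / \sqrt{d}$ are nearly orthonormal, then up to rotation they sit near the unit simplex vertices $e_1, \ldots, e_n$. A minimal sketch, assuming NumPy; `rescaled_gram` is a hypothetical name and the values of `n` and `d` are illustrative.

```python
import numpy as np

def rescaled_gram(n=5, d=100000, seed=0):
    """Gram matrix of n standard normal vectors in R^d, each rescaled by 1/sqrt(d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d)) / np.sqrt(d)
    return z @ z.T

if __name__ == "__main__":
    gram = rescaled_gram()
    # Diagonal near 1 (equal lengths), off-diagonal near 0 (orthogonality).
    print(np.round(gram, 3))
```

A Gram matrix close to the identity means equal lengths and right angles, which together give pairwise distances close to $\sqrt{2}$ after rescaling (that is, $\sqrt{2d}$ before).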
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data:
- $n - 1$ dimensional hyperplane
SigClust Estimation of Eigenval's
Try another data set, with fewer genes. This time:
- First ~110 eigenvalues > $\sigma_N^2$
- Rest are negligible
- So threshold smaller ones at $\sigma_N^2$
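The thresholding step on this slide can be sketched in a few lines. This is a minimal illustration, assuming NumPy; `hard_threshold_eigenvalues` is a hypothetical helper name, and `sigma2_noise` stands for an already-estimated background noise variance $\sigma_N^2$ (its estimation is not shown here).

```python
import numpy as np

def hard_threshold_eigenvalues(sample_eigenvalues, sigma2_noise):
    """Raise every sample eigenvalue below the noise level up to sigma2_noise."""
    lam = np.asarray(sample_eigenvalues, dtype=float)
    return np.maximum(lam, sigma2_noise)

if __name__ == "__main__":
    # Made-up eigenvalues, for illustration only.
    print(hard_threshold_eigenvalues([5.0, 2.0, 0.3, 0.0], sigma2_noise=1.0))
```

Eigenvalues above the noise level are left unchanged; only the small ones are replaced by $\sigma_N^2$.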
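SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using:
$$X_i = (X_{i,1}, \ldots, X_{i,d})^T, \quad \text{where } X_{i,j} \sim N(0, \lambda_j) \text{ (indep.)}$$
with $\lambda_j$ the (thresholded) eigenvalues
- Again rotation invariance makes this work (and location invariance)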
SigClust Gaussian null distribution - Simulation
Then compare data CI with simulated null population CIs
- Spirit similar to DiProPerm
- But now significance happens for smaller values of CI
An example (details to follow): P-val = 0.0045
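The simulation step can be sketched as below. This is a minimal illustration, not the SigClust software: it assumes NumPy, takes the cluster index (CI) to be the within-class sum of squares divided by the total sum of squares, and uses plain Lloyd iterations with a few random restarts as a stand-in 2-means routine. All function names are hypothetical.

```python
import numpy as np

def cluster_index(x, labels):
    """Cluster index: within-class sum of squares / total sum of squares."""
    total = np.sum((x - x.mean(axis=0)) ** 2)
    within = sum(
        np.sum((x[labels == k] - x[labels == k].mean(axis=0)) ** 2)
        for k in np.unique(labels)
    )
    return within / total

def two_means_labels(x, rng, n_starts=5, n_iter=50):
    """Lloyd iterations for k = 2, keeping the best of several random starts."""
    best_labels, best_ci = None, np.inf
    for _ in range(n_starts):
        centers = x[rng.choice(len(x), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if len(np.unique(labels)) < 2:
                break  # degenerate split, abandon this start
            new_centers = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        if len(np.unique(labels)) == 2:
            ci = cluster_index(x, labels)
            if ci < best_ci:
                best_labels, best_ci = labels, ci
    return best_labels, best_ci

def simulate_null_cis(eigenvalues, n, n_sim, seed=0):
    """2-means CIs of n_sim samples of size n from N(0, diag(eigenvalues))."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(np.asarray(eigenvalues, dtype=float))
    cis = []
    for _ in range(n_sim):
        x = rng.standard_normal((n, len(sd))) * sd  # mean 0, diagonal covariance
        cis.append(two_means_labels(x, rng)[1])
    return np.array(cis)
```

Simulating with mean 0 and a diagonal covariance is enough because the CI is unchanged by shifts and rotations of the data, as noted on the slides.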
SigClust Modalities
Two major applications:
I. Test significance of given clusterings (e.g. for those found in heat map) (Use given class labels)
II. Test if known cluster can be further split (Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data (large cross study combined data set)
Current folklore: 5 classes
- Luminal A
- Luminal B
- Normal
- Her 2
- Basal
Perou 500 PCA View: real clusters?
Perou 500 DWD Dir'ns View: real clusters?
Perou 500 Fundamental Question: Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
- Get p-values from Empirical Quantile (from simulated sample CIs)
- Fit Gaussian Quantile:
  - Don't "believe these"
  - But useful for comparison
  - Especially when Empirical Quantile = 0
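The two kinds of p-value described here can be sketched as follows. This is a minimal illustration under stated assumptions: `null_cis` is an array of simulated null cluster indices, smaller CI means stronger clustering, and the Gaussian fit uses the sample mean and standard deviation of the simulated CIs. The function name `sigclust_pvalues` is hypothetical.

```python
import math
import numpy as np

def sigclust_pvalues(data_ci, null_cis):
    """Return (empirical p-value, Gaussian-fit p-value) for a data cluster index."""
    null_cis = np.asarray(null_cis, dtype=float)
    # Empirical quantile: share of simulated CIs at or below the data CI.
    p_empirical = float(np.mean(null_cis <= data_ci))
    # Gaussian fit: lower-tail normal probability with the simulated mean and sd.
    z = (data_ci - null_cis.mean()) / null_cis.std(ddof=1)
    p_gaussian = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return p_empirical, p_gaussian

if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(sigclust_pvalues(0.70, [0.80, 0.82, 0.84, 0.86, 0.88]))
```

When the data CI lies below every simulated CI, the empirical p-value is exactly 0, while the Gaussian fit still gives a small positive number that can be compared across tests.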
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
- Empirical p-val = 0: definitely 2 clusters
- Gaussian fit p-val = 0.0045: same strong evidence
- Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
- Empirical p-val = 0: definitely 2 clusters
- Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
- Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results
- Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
- Luminal A vs. B: p-val = 0.0045
- Her 2 vs. Basal: p-val = $10^{-10}$
- Split Luminal A: p-val = $10^{-7}$
- Split Luminal B: p-val = 0.058
- Split Her 2: p-val = 0.10
- Split Basal: p-val = 0.005
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension n
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
n
d ddn INZZ 0~1
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
(Modulo Rotation)
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
ldquoRandomnessrdquo appears
only in rotation of simplex
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Hyperplane Generated by Data
dimensional hyperplane1n
d ddn INZZ 0~1
SigClust Gaussian null distribution - Simulation
Now simulate from null distribution using
$X_i = (X_{i,1}, \dots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep), with $\lambda_j$ the estimated diagonal (eigenvalue) entries
Again rotation invariance makes this work
(and location invariance)
SigClust Gaussian null distribution - Simulation
Then compare data CI
with simulated null population CIs
- Spirit similar to DiProPerm
- But now significance happens for smaller values of CI
An example (details to follow):
P-val = 0.0045
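The simulation step can be sketched in a few lines. This is only an illustration under stated assumptions, not the SigClust implementation: the CI is taken to be the usual within-class sum of squares divided by the total sum of squares, 2-means is a plain Lloyd iteration with random restarts, and the eigenvalues, sample size and function names are all hypothetical.

```python
import math
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def cluster_index(data, labels):
    """Assumed CI: within-class sum of squares / total sum of squares."""
    overall = centroid(data)
    total = sum(sq_dist(x, overall) for x in data)
    within = 0.0
    for k in (0, 1):
        members = [x for x, lab in zip(data, labels) if lab == k]
        if members:
            center = centroid(members)
            within += sum(sq_dist(x, center) for x in members)
    return within / total


def two_means_labels(data, rng, n_starts=5, n_iter=50):
    """Lloyd iterations from random starts; keep the labels with the smallest CI."""
    best_labels, best_ci = None, float("inf")
    for _ in range(n_starts):
        centers = rng.sample(data, 2)
        labels = None
        for _ in range(n_iter):
            new_labels = [
                0 if sq_dist(x, centers[0]) <= sq_dist(x, centers[1]) else 1
                for x in data
            ]
            if new_labels == labels:
                break
            labels = new_labels
            for k in (0, 1):
                members = [x for x, lab in zip(data, labels) if lab == k]
                if members:
                    centers[k] = centroid(members)
        ci = cluster_index(data, labels)
        if ci < best_ci:
            best_labels, best_ci = labels, ci
    return best_labels


def simulate_null_cis(eigenvalues, n, n_sim, rng):
    """Draw n points with independent N(0, lambda_j) coordinates, n_sim times."""
    sds = [math.sqrt(lam) for lam in eigenvalues]
    cis = []
    for _ in range(n_sim):
        sample = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        cis.append(cluster_index(sample, two_means_labels(sample, rng)))
    return cis


rng = random.Random(0)
eigenvalues = [4.0, 2.0] + [1.0] * 8  # hypothetical null eigenvalues
null_cis = simulate_null_cis(eigenvalues, n=30, n_sim=20, rng=rng)
print(min(null_cis), max(null_cis))
```

Mean 0 and a diagonal covariance are enough here because of the location and rotation invariance of the CI noted above. The data CI is then compared with the lower tail of `null_cis`.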
SigClust Modalities
Two major applications:
I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)
II. Test if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal
Perou 500 PCA View - real clusters
Perou 500 DWD Dir'ns View - real clusters
Perou 500 - Fundamental Question
Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability
SigClust Results for Luminal A vs Luminal B
P-val = 0.0045
SigClust Results for Luminal A vs Luminal B
Get p-values from:
- Empirical Quantile
  (from simulated sample CIs)
- Fit Gaussian Quantile
  Don't "believe these"
  But useful for comparison
  Especially when Empirical Quantile = 0
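The two kinds of p-value can be sketched as follows. This is a minimal illustration with made-up numbers: the list of null CIs, the data CI and the function names are hypothetical, and the Gaussian fit simply uses the sample mean and standard deviation of the simulated CIs.

```python
from statistics import NormalDist, mean, stdev


def empirical_pvalue(ci_data, null_cis):
    """Fraction of simulated null CIs at or below the data CI (small CI = strong clustering)."""
    return sum(ci <= ci_data for ci in null_cis) / len(null_cis)


def gaussian_fit_pvalue(ci_data, null_cis):
    """Lower-tail probability of the data CI under a Gaussian fitted to the null CIs."""
    return NormalDist(mean(null_cis), stdev(null_cis)).cdf(ci_data)


# Hypothetical simulated null CIs and data CI, for illustration only
null_cis = [0.80, 0.82, 0.79, 0.81, 0.83, 0.78, 0.80, 0.82]
ci_data = 0.70

print(empirical_pvalue(ci_data, null_cis))
print(gaussian_fit_pvalue(ci_data, null_cis))
```

When the data CI lies below every simulated CI, the empirical p-value is exactly 0 and cannot rank results, while the Gaussian-fit value is still a small positive number that can be compared across tests.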
SigClust Results for Luminal A vs Luminal B
I. Test significance of given clusterings
- Empirical p-val = 0
  - Definitely 2 clusters
- Gaussian fit p-val = 0.0045
  - same strong evidence
- Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II. Test if known cluster can be further split
- Empirical p-val = 0
  - definitely 2 clusters
- Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
- Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results
- Lum & Norm vs Her 2 & Basal: p-val = $10^{-19}$
- Luminal A vs B: p-val = 0.0045
- Her 2 vs Basal: p-val = $10^{-10}$
- Split Luminal A: p-val = $10^{-7}$
- Split Luminal B: p-val = 0.058
- Split Her 2: p-val = 0.10
- Split Basal: p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results
- All previous splits were real
- Most not able to split further
- Exception is Basal, already known
- Chuck Perou has good intuition
  (insight about signal vs noise)
- How good are others?
SigClust Real Data Results
Experience with Other Data Sets: Similar
- Smaller data sets: less power
- Gene filtering: more power
- Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations
- Experienced Analysts Impressively Good
- SigClust can save them time
- SigClust can help them with skeptics
- SigClust essential for non-experts
SigClust Overview
- Works Well When Factor Part Not Used
- Sample Eigenvalues Always Valid
  But Can be Too Conservative
- Above Factor Threshold Anti-Conservative
- Problem Fixed by Soft Thresholding
  (Huang et al., 2014)
SigClust Open Problems
- Improved Eigenvalue Estimation
- More attention to Local Minima in 2-means Clustering
- Theoretical Null Distributions
- Inference for k > 2 means Clustering
- Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics:
- Based on asymptotic analysis
- I.e. uses limiting operations
- Almost always $\lim_{n \to \infty}$
- Workhorse Method for Much Insight:
  - Laws of Large Numbers (Consistency)
  - Central Limit Theorems (Quantify Errors)
- Occasional misconceptions:
  - Indicates behavior for large samples
  - Thus only makes sense for "large" samples
  - Models phenomenon of "increasing data"
  - So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics:
- Based on asymptotic analysis
- Real Reasons:
  - Approximation provides insights
  - Can find simple underlying structure
  - In complex situations
- Thus various flavors are fine:
  $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\cdot \to 0}$
- Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations:
HDLSS world is...
- Surprising (many times)
  [Think I've got it, and then ...]
- Mathematically Beautiful (?)
- Practically Relevant
HDLSS Asymptotics Simple Paradoxes
Study Ideas From:
Hall, Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are Data?
Near Peak of Density?
(Image thanks to psycnet.apa.org)
HDLSS Asymptotics Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$):
(measure how close to peak)
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
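A quick numerical check of the distance-to-origin statement, as a sketch only: the dimensions, number of draws, seed and function name below are arbitrary choices made for illustration.

```python
import math
import random
from statistics import mean, stdev


def gaussian_norms(d, n_draws, rng):
    """Euclidean norms of n_draws independent N_d(0, I_d) vectors."""
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
        for _ in range(n_draws)
    ]


rng = random.Random(0)
summary = {}
for d in (10, 100, 1000):
    norms = gaussian_norms(d, n_draws=200, rng=rng)
    summary[d] = (mean(norms), stdev(norms))
    # columns: d, sqrt(d), average norm, standard deviation of the norm
    print(d, round(math.sqrt(d), 2), round(summary[d][0], 2), round(summary[d][1], 2))
```

The average norm should track $\sqrt{d}$ while the spread stays of order 1 as $d$ grows, which is what the $O_p(1)$ remainder expresses.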
HDLSS Asymptotics Simple Paradoxes
As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
  - Consider Volume of Unit Sphere in $\mathbb{R}^d$
  - Find As Integral In Sph'l Coordinates
  - Look At Integrand w.r.t. the radius $r$
  - Can Show Puts ~ All Weight Near the surface ($r = 1$)
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
  - Lebesgue Measure Pushes Mass Out
  - Density Pulls Data In
  - $\sqrt{d}$ Is The Balance Point
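One way to see the balance point, as a worked sketch (the radial variable $r$ is notation introduced here): in spherical coordinates the volume element contributes a factor $r^{d-1}$, which pushes mass out, while the Gaussian density contributes $e^{-r^2/2}$, which pulls data in. The density of $\|Z\|$ is proportional to their product, and its peak follows by setting the log-derivative to zero.

```latex
\[
f_{\|Z\|}(r) \;\propto\; r^{d-1} e^{-r^2/2},
\qquad
\frac{d}{dr} \log f_{\|Z\|}(r) = \frac{d-1}{r} - r = 0
\;\Longrightarrow\;
r = \sqrt{d-1} \approx \sqrt{d}.
\]
```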
HDLSS Asymptotics Simple Paradoxes
As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence:
"Average People"
Parents' Lament:
Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant:
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
- Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
- Can extend to $Z_1, \dots, Z_n$
- Where do they all go?
  (we can only perceive 3 dim'ns)
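The factor $\sqrt{2}$ can be traced coordinate by coordinate. A short worked derivation, where the double subscript $Z_{i,j}$ (coordinate $j$ of $Z_i$) is notation introduced here:

```latex
\begin{align*}
\|Z_1 - Z_2\|^2 &= \sum_{j=1}^{d} (Z_{1,j} - Z_{2,j})^2,
  \qquad Z_{1,j} - Z_{2,j} \sim N(0, 2) \text{ (indep.)}, \\
\|Z_1 - Z_2\|^2 &= 2d + O_p(d^{1/2})
  \;\Longrightarrow\;
  \|Z_1 - Z_2\| = \sqrt{2d} + O_p(1).
\end{align*}
```

This is the same argument as for the distance to the origin, applied to coordinates with variance 2 instead of 1.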
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why?
(we can only perceive 3 dim'ns)
- Perceptual System from Ancestors
- They Needed to Find Food
- Food Exists in 3-d World
HDLSS Asymptotics Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin:
(Image thanks to members.tripod.com)
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go?
  (again our perceptual limitations)
- Again 1st order structure is non-random
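The angle statement can be checked numerically in the same spirit. A minimal sketch, with arbitrary dimensions, pair counts and seed, and hypothetical function names:

```python
import math
import random


def angle_degrees(u, v):
    """Angle between two vectors from the origin, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.degrees(math.acos(cosine))


def random_angles(d, n_pairs, rng):
    """Angles between n_pairs independent pairs of N_d(0, I_d) vectors."""
    angles = []
    for _ in range(n_pairs):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_degrees(z1, z2))
    return angles


rng = random.Random(0)
spread = {}
for d in (10, 100, 1000):
    angles = random_angles(d, 100, rng)
    spread[d] = max(abs(a - 90.0) for a in angles)  # worst deviation from 90 degrees
    print(d, round(spread[d], 1))
```

The worst deviation from 90 degrees should shrink roughly like $d^{-1/2}$ as the dimension grows.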
HDLSS Asy's Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", & dist $\sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
- All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
- "Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)
HDLSS Asy's Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
- $n - 1$ dimensional hyperplane
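The simplex picture can be illustrated numerically. The vertices of the unit simplex (the coordinate vectors $e_1, \dots, e_n$) each have norm 1 and are at pairwise distance $\sqrt{2}$, so after dividing by $\sqrt{d}$ a Gaussian sample should show nearly the same numbers. The sketch below checks only these distances, not the rotation itself; `n`, `d`, the seed and the function name are arbitrary choices for illustration.

```python
import math
import random
from itertools import combinations


def scaled_geometry(n, d, rng):
    """Norms and pairwise distances of n N_d(0, I_d) points, all divided by sqrt(d)."""
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    scale = math.sqrt(d)
    norms = [math.sqrt(sum(x * x for x in p)) / scale for p in points]
    dists = [math.dist(p, q) / scale for p, q in combinations(points, 2)]
    return norms, dists


rng = random.Random(0)
norms, dists = scaled_geometry(n=5, d=5000, rng=rng)
print([round(v, 3) for v in norms])  # each should be close to 1
print([round(v, 3) for v in dists])  # each should be close to sqrt(2)
```

With the lengths and pairwise distances essentially fixed, the only remaining randomness is how the simplex is rotated, as the slide states.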
An example (details to follow)
P-val = 00045
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
IITest if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal
Perou 500 PCA View ndash real clusters
Perou 500 DWD Dirrsquons View ndash real clusters
Perou 500 ndash Fundamental Question
Are Luminal A amp Luminal B really distinct clusters
Famous forFar Different Survivability
SigClust Results for Luminal A vs Luminal B
P-val = 00045
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0
SigClust Results for Luminal A vs Luminal B
I Test significance of given clusteringsbull Empirical p-val = 0
ndash Definitely 2 clusters
bull Gaussian fit p-val = 00045ndash same strong evidence
bull Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II Test if known cluster can be further split
bull Empirical p-val = 0ndash definitely 2 clusters
bull Gaussian fit p-val = 10-10
ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)
bull Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19
Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10
Split Luminal A p-val = 10-7
Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al. 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
• I.e. Uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Workhorse Method for Much Insight:
  – Laws of Large Numbers (Consistency)
  – Central Limit Theorems (Quantify Errors)
• Occasional misconceptions:
  – Indicates behavior for large samples
  – Thus only makes sense for "large" samples
  – Models phenomenon of "increasing data"
  – So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
• Approximation provides insights
• Can find simple underlying structure
• In complex situations
Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{d, n \to \infty}$, $\lim_{\,\cdot\, \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is…
• Surprising (many times) [Think I've got it, and then …]
• Mathematically Beautiful (?)
• Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From: Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dimensional Standard Normal distribution:
$$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$$
Where are Data? Near Peak of Density?
(Image thanks to: psycnet.apa.org)
Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):
$$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$$
$$\|Z\| = \sqrt{d} + O_p(1)$$
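A short simulation makes the distance-to-origin result concrete. This is a sketch under stated assumptions: the helper name `norm_summary`, the dimensions tried, and the number of draws are all arbitrary choices for illustration.

```python
import numpy as np

def norm_summary(d, n_draws, rng):
    """Mean and standard deviation of ||Z|| over draws of Z ~ N_d(0, I_d)."""
    z = rng.standard_normal((n_draws, d))
    norms = np.linalg.norm(z, axis=1)
    return norms.mean(), norms.std()

rng = np.random.default_rng(1)
for d in (10, 100, 1000, 4000):
    mean_norm, sd_norm = norm_summary(d, 500, rng)
    # The mean tracks sqrt(d), while the spread stays of order 1.
    print(d, round(np.sqrt(d), 2), round(mean_norm, 2), round(sd_norm, 2))
```

The mean length grows like the square root of the dimension, but the spread around it does not grow, which is what the $O_p(1)$ remainder term says.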
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
  Consider Volume of Unit Sphere in $\mathbb{R}^d$
  Find As Integral In Spherical Coordinates
  Look At Integrand w.r.t. radius $r$
  Can Show Puts ~ All Weight Near $r = 1$
  Lebesgue Measure Pushes Mass Out
  Density Pulls Data In
  $\sqrt{d}$ Is The Balance Point
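The balance point can be made explicit with a short calculation. This is a worked sketch added for illustration, using the standard density of the radius $R = \|Z\|$ (the chi distribution with $d$ degrees of freedom); the symbols $R$, $f_R$ and $c_d$ are notation introduced here, not taken from the slides.

```latex
\begin{align*}
  f_R(r) &= c_d \, r^{d-1} e^{-r^2/2}, \qquad r > 0,
     \qquad c_d = \frac{2^{1 - d/2}}{\Gamma(d/2)}, \\
  \frac{d}{dr} \log f_R(r) &= \frac{d-1}{r} - r = 0
     \quad\Longrightarrow\quad r = \sqrt{d-1} \approx \sqrt{d}.
\end{align*}
```

The factor $r^{d-1}$ is the surface-area growth coming from Lebesgue measure (it pushes mass out), and $e^{-r^2/2}$ is the Gaussian density (it pulls data in). Their product peaks at $r = \sqrt{d-1}$, even though the density itself is largest at the origin.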
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:
Important Philosophical Consequence: there are no "Average People"
Parents' Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dimensional Standard Normal distribution:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
• Can extend to $Z_1, \dots, Z_n$
• Where do they all go? (we can only perceive 3 dim'ns)
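The same kind of check works for the distance between two independent standard normal vectors. A minimal sketch follows; the helper name `pairwise_distance_summary` and the simulation sizes are assumptions made for the example.

```python
import numpy as np

def pairwise_distance_summary(d, n_pairs, rng):
    """Mean and sd of ||Z1 - Z2|| for independent Z1, Z2 ~ N_d(0, I_d)."""
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    dist = np.linalg.norm(z1 - z2, axis=1)
    return dist.mean(), dist.std()

rng = np.random.default_rng(2)
for d in (10, 100, 1000, 4000):
    mean_dist, sd_dist = pairwise_distance_summary(d, 500, rng)
    # Compare the average distance with sqrt(2d); the spread stays bounded.
    print(d, round(np.sqrt(2 * d), 2), round(mean_dist, 2), round(sd_dist, 2))
```

Each coordinate of $Z_1 - Z_2$ has variance 2, which is where the factor $\sqrt{2}$ in $\sqrt{2d}$ comes from.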
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$-dimensional Standard Normal distribution:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin:
(Image thanks to: members.tripod.com)
$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
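The near-orthogonality of independent Gaussian vectors can also be checked numerically. This is an illustrative sketch; the helper name `angle_summary` and the simulation sizes are hypothetical choices.

```python
import numpy as np

def angle_summary(d, n_pairs, rng):
    """Mean and sd (in degrees) of the angle between independent N_d(0, I_d) vectors."""
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cosines = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # Clip guards against round-off pushing a cosine just outside [-1, 1].
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return angles.mean(), angles.std()

rng = np.random.default_rng(3)
for d in (10, 100, 1000, 4000):
    mean_angle, sd_angle = angle_summary(d, 500, rng)
    # The spread shrinks roughly like 1 / sqrt(d) (about 57.3 / sqrt(d) degrees).
    print(d, round(mean_angle, 2), round(sd_angle, 2))
```

The average angle sits at 90 degrees in every dimension. What changes with $d$ is the spread, which shrinks at the $d^{-1/2}$ rate in the display above.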
HDLSS Asy's: Geometrical Representation
Hall, Marron & Neeman (2005)
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
• Hyperplane through 0, of dimension $n$
• Points are "nearly equidistant to 0", & dist $\sqrt{d}$
• Within plane, can "rotate towards $\sqrt{d}$ × Unit Simplex"
• All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
• "Randomness" appears only in rotation of simplex
HDLSS Asy's: Geometrical Representation
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data:
• $n - 1$ dimensional hyperplane
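One way to see the simplex structure numerically is through the Gram matrix of the data, scaled by $d$. This is a sketch with a hypothetical helper `scaled_gram`; it illustrates the geometry and is not a construction taken from Hall, Marron and Neeman (2005).

```python
import numpy as np

def scaled_gram(n, d, rng):
    """Gram matrix Z Z^t / d for n independent rows Z_i ~ N_d(0, I_d)."""
    z = rng.standard_normal((n, d))
    return z @ z.T / d

rng = np.random.default_rng(4)
gram = scaled_gram(5, 20000, rng)
# Diagonal entries near 1: every point is about sqrt(d) from the origin.
# Off-diagonal entries near 0: every pair of points is nearly orthogonal.
print(np.round(gram, 2))
print("max deviation from identity:", np.abs(gram - np.eye(5)).max())
```

A Gram matrix close to $d$ times the identity means squared lengths near $d$ and squared pairwise distances near $2d$. Up to rotation, the $n$ points therefore sit near the vertices of a regular simplex scaled by $\sqrt{d}$.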
SigClust Modalities
Two major applications
I Test significance of given clusterings
(eg for those found in heat map)
(Use given class labels)
IITest if known cluster can be further split
(Use 2-means class labels)
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal
Perou 500 PCA View ndash real clusters
Perou 500 DWD Dirrsquons View ndash real clusters
Perou 500 ndash Fundamental Question
Are Luminal A amp Luminal B really distinct clusters
Famous forFar Different Survivability
SigClust Results for Luminal A vs Luminal B
P-val = 00045
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0
SigClust Results for Luminal A vs Luminal B
I Test significance of given clusteringsbull Empirical p-val = 0
ndash Definitely 2 clusters
bull Gaussian fit p-val = 00045ndash same strong evidence
bull Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II Test if known cluster can be further split
bull Empirical p-val = 0ndash definitely 2 clusters
bull Gaussian fit p-val = 10-10
ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)
bull Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19
Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10
Split Luminal A p-val = 10-7
Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-
means Clustering
Theoretical Null Distributions
Inference for k gt 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insights
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
0limlimlimlim
dndn
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
Even desirable (find additional insights)
0limlimlimlim
dndn
Personal Observations
HDLSS world ishellip
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
- Paradox resolved by
density w r t Lebesgue Measure
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt Can Show Puts ~ All Weight Near
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
Theorem Impossible (over many factors)
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go?
(we can only perceive 3 dim'ns)
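A quick simulation makes the sqrt(2d) distance concrete. This is a minimal sketch using NumPy; the function name, the number of pairs and the seed are illustrative choices, not part of the slides.

```python
import numpy as np

def pair_distance_summary(d, n_pairs=200, seed=0):
    """Mean and sd of ||Z1 - Z2|| over independent standard normal pairs in R^d."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    dist = np.linalg.norm(z1 - z2, axis=1)
    return dist.mean(), dist.std()

# The mean tracks sqrt(2d) while the spread stays of order 1.
for d in (10, 100, 5000):
    mean, sd = pair_distance_summary(d)
    print(d, round(mean, 2), round(float(np.sqrt(2 * d)), 2), round(sd, 2))
```

Relative to its size, the distance becomes essentially deterministic as d grows: the spread does not grow with d, while the mean does.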
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin:
$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
(Figure: vectors $Z_1$ and $Z_2$ from the origin; thanks to members.tripod.com)
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
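The near-orthogonality can be seen in a small simulation. This is a sketch under the slide's standard normal assumption; the function name and sample sizes are illustrative.

```python
import numpy as np

def angle_summary(d, n_pairs=200, seed=0):
    """Mean and sd (in degrees) of the angle between independent N_d(0, I) vectors."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cos = (z1 * z2).sum(axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # Clip guards against rounding slightly outside [-1, 1].
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angles.mean(), angles.std()

# Angles concentrate at 90 degrees, with spread shrinking like d**(-1/2).
for d in (3, 100, 10000):
    print(d, angle_summary(d))
```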
HDLSS Asy's Geometrical Represent'n
Hall, Marron & Neeman (2005)
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex
HDLSS Asy's Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
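One way to see the simplex structure numerically is through the scaled Gram matrix. The sketch below is illustrative and the function name is hypothetical. If Z Z^t / d is close to the identity, then after dividing by sqrt(d) the n data vectors have length near 1 and are nearly mutually orthogonal, which is the configuration of the unit simplex vertices up to rotation.

```python
import numpy as np

def scaled_gram(n, d, seed=0):
    """Gram matrix Z Z^t / d for n independent N_d(0, I) rows."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    return z @ z.T / d

# Diagonal entries near 1 mean ||Z_i|| is near sqrt(d); off-diagonal
# entries near 0 mean the vectors are nearly orthogonal.
g = scaled_gram(n=5, d=100_000)
print(np.round(g, 3))
print("max deviation from identity:", float(np.abs(g - np.eye(5)).max()))
```

Only the rotation that carries the coordinate axes to the data vectors is random, and the Gram matrix does not depend on that rotation.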
SigClust Real Data Results
Analyze Perou 500 breast cancer data
(large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal
Perou 500 PCA View – real clusters
Perou 500 DWD Dir'ns View – real clusters
Perou 500 – Fundamental Question
Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
Get p-values from:
Empirical Quantile, From simulated sample CIs
Fit Gaussian Quantile: Don't "believe these", But useful for comparison, Especially when Empirical Quantile = 0
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters
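The two kinds of p-value can be sketched in code. This is a simplified illustration, not the SigClust implementation: it assumes a plain Lloyd's-algorithm 2-means with random restarts, uses the raw sample eigenvalues for the diagonal null covariance (no factor model or thresholding), and needs dimension d >= 2. All function names are hypothetical.

```python
import math
import numpy as np

def two_means_ci(x, rng, n_starts=5, n_iter=50):
    """2-means cluster index: within-cluster SS over total SS (smaller = more clustered)."""
    total_ss = ((x - x.mean(axis=0)) ** 2).sum()
    best = total_ss
    for _ in range(n_starts):
        centers = x[rng.choice(len(x), size=2, replace=False)]
        labels = None
        for _ in range(n_iter):
            dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_labels = dist.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break  # assignments stable: converged
            labels = new_labels
            if labels.min() == labels.max():
                break  # degenerate start: all points in one cluster
            centers = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
        wss = sum(((x[labels == k] - x[labels == k].mean(axis=0)) ** 2).sum()
                  for k in np.unique(labels))
        best = min(best, wss)
    return best / total_ss

def sigclust_sketch(x, n_sim=100, seed=0):
    """Return (CI of data, empirical p-value, Gaussian-fit p-value)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    ci_data = two_means_ci(x, rng)
    # Null: mean-0 Gaussian with diagonal covariance (shift and rotation
    # invariance of the CI); variances are the sample eigenvalues here.
    eigvals = np.clip(np.linalg.eigvalsh(np.cov(x, rowvar=False)), 0.0, None)
    sd = np.sqrt(eigvals)
    ci_null = np.array([two_means_ci(rng.standard_normal((n, d)) * sd, rng)
                        for _ in range(n_sim)])
    # Empirical quantile: share of simulated CIs at or below the data CI.
    p_empirical = float(np.mean(ci_null <= ci_data))
    # Gaussian fit: normal tail probability at the data CI.
    z = (ci_data - ci_null.mean()) / ci_null.std(ddof=1)
    p_gaussian = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return ci_data, p_empirical, p_gaussian
```

When the data CI lies below every simulated CI, the empirical p-value is exactly 0 and cannot rank the strength of evidence. The Gaussian-fit value still can, which is the comparison the slides describe.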
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al., 2014)
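To make the thresholding idea concrete, here is a generic sketch of hard versus soft thresholding of sample eigenvalues at a background noise level. It is an assumption-laden illustration, not the estimator from Huang et al.: the function names are hypothetical, and the shift `tau` is simply passed in by the caller rather than chosen by any rule from the paper.

```python
import numpy as np

def hard_threshold(eigvals, sigma2_n):
    """Keep eigenvalues above the noise level; raise the rest to the noise level."""
    return np.maximum(np.asarray(eigvals, dtype=float), sigma2_n)

def soft_threshold(eigvals, sigma2_n, tau):
    """Shrink every eigenvalue by tau, but never below the noise level.

    tau is a user-supplied illustrative parameter (an assumption here).
    """
    return np.maximum(np.asarray(eigvals, dtype=float) - tau, sigma2_n)

sample_eigvals = [10.0, 3.0, 0.5]
print(hard_threshold(sample_eigvals, sigma2_n=1.0))           # large values untouched
print(soft_threshold(sample_eigvals, sigma2_n=1.0, tau=1.0))  # large values shrunk
```

Hard thresholding leaves the large sample eigenvalues untouched, while soft thresholding shrinks them as well.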
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations, Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure In complex situations
Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is...
Surprising (many times)
[Think I've got it, and then ...]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics Simple Paradoxes
Study Ideas From: Hall, Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$
Where are Data? Near Peak of Density?
(Figure: Gaussian density; thanks to psycnet.apa.org)
Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
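The concentration of the norm around sqrt(d) is easy to check by simulation. This is a minimal sketch; the function name, sample size and seed are illustrative.

```python
import numpy as np

def norm_summary(d, n=200, seed=0):
    """Mean and sd of ||Z|| over n draws of Z ~ N_d(0, I)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    norms = np.linalg.norm(z, axis=1)
    return norms.mean(), norms.std()

# The mean grows like sqrt(d); the spread stays bounded (the O_p(1) term),
# so the data sit far from the density peak at the origin.
for d in (1, 10, 100, 10000):
    mean, sd = norm_summary(d)
    print(d, round(mean, 3), round(float(np.sqrt(d)), 3), round(sd, 3))
```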
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
- Paradox resolved by
density w r t Lebesgue Measure
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt Can Show Puts ~ All Weight Near
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
Theorem Impossible (over many factors)
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
d dd INZ 0~21Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
(as )
Distance tends to non-random constant
d
d
dd INZ 0~2
)1(221 pOdZZ
1Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
)1(221 pOdZZ
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to Where do they all go
(we can only perceive 3 dimrsquons)
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
As Vectors From Origin
Thanks to memberstripodcom
d
d
dd INZ 0~21Z
1198851
119885 2
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
- Again 1st order structure is non-random
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension n
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
n
d ddn INZZ 0~1
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
(Modulo Rotation)
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
ldquoRandomnessrdquo appears
only in rotation of simplex
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Hyperplane Generated by Data
dimensional hyperplane1n
d ddn INZZ 0~1
Perou 500 DWD Dirrsquons View ndash real clusters
Perou 500 ndash Fundamental Question
Are Luminal A amp Luminal B really distinct clusters
Famous forFar Different Survivability
SigClust Results for Luminal A vs Luminal B
P-val = 00045
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
SigClust Results for Luminal A vs Luminal B
Get p-values from Empirical Quantile
From simulated sample CIs
Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0
SigClust Results for Luminal A vs Luminal B
I Test significance of given clusteringsbull Empirical p-val = 0
ndash Definitely 2 clusters
bull Gaussian fit p-val = 00045ndash same strong evidence
bull Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II Test if known cluster can be further split
bull Empirical p-val = 0ndash definitely 2 clusters
bull Gaussian fit p-val = 10-10
ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)
bull Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19
Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10
Split Luminal A p-val = 10-7
Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-
means Clustering
Theoretical Null Distributions
Inference for k gt 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for “large” samples
Models phenomenon of “increasing data”
So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure in complex situations
Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\,\cdot\, \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is…
Surprising (many times)
[Think I’ve got it, and then …]
Mathematically Beautiful (?)
Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim’al Standard Normal dist’n:
$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$
Where are Data? Near Peak of Density?
(Figure thanks to psycnet.apa.org)
Euclidean Distance to Origin (as $d \to \infty$), a measure of how close to peak:
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
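The distance-to-origin statement can be checked with a quick simulation. This is only a sketch: the function names, the number of draws, and the dimensions tried are illustrative choices. It draws standard normal vectors and compares the average norm with the square root of the dimension, while the spread of the norms stays of order 1.

```python
import math
import random


def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw from the d-dimensional standard normal."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


def norm_summary(d, n_draws=200, seed=0):
    """Sample mean and standard deviation of the norm over n_draws vectors."""
    rng = random.Random(seed)
    norms = [norm_of_standard_normal(d, rng) for _ in range(n_draws)]
    mean = sum(norms) / n_draws
    sd = math.sqrt(sum((x - mean) ** 2 for x in norms) / (n_draws - 1))
    return mean, sd


if __name__ == "__main__":
    for d in (10, 100, 1000):
        mean, sd = norm_summary(d)
        print(f"d={d:5d}  sqrt(d)={math.sqrt(d):7.3f}  "
              f"mean norm={mean:7.3f}  sd={sd:5.3f}")
```

The mean tracks the square root of d as d grows, while the standard deviation does not grow, which is the sense in which the data sit near a sphere rather than near the origin.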
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph’l Coordinates
Look At Integrand w.r.t. radius $r$
Can Show Puts ~ All Weight Near $r = 1$
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
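The balance between Lebesgue measure and density can be written out as a short worked calculation. This is a sketch using standard facts about the standard normal; the symbol $R = \|Z\|$ for the radius and the numerical example are introduced here for illustration.

```latex
% Radial density of R = ||Z|| for Z ~ N_d(0, I_d):
% surface area of the radius-r sphere (grows like r^{d-1})
% times the Gaussian density (decays like e^{-r^2/2}).
\[
  f_R(r) \;\propto\; r^{\,d-1}\, e^{-r^2/2}, \qquad r > 0 .
\]
% Setting the derivative of the log to zero locates the peak:
\[
  \frac{d}{dr}\Bigl[(d-1)\log r - \tfrac{r^2}{2}\Bigr]
  = \frac{d-1}{r} - r = 0
  \quad\Longrightarrow\quad
  r = \sqrt{d-1} \approx \sqrt{d} .
\]
% Same volume effect for the unit ball: the fraction of its volume
% that lies within radius r of the center is
\[
  \frac{\mathrm{Vol}\{x : \|x\| \le r\}}{\mathrm{Vol}\{x : \|x\| \le 1\}} = r^{d},
  \qquad 0 \le r \le 1 ,
\]
% e.g. r = 0.9 and d = 100 give 0.9^{100} \approx 2.7 \times 10^{-5}.
```

The factor $r^{d-1}$ is the part contributed by Lebesgue measure (it pushes mass out) and $e^{-r^2/2}$ is the part contributed by the density (it pulls data in).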
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: “Average People”
Parents’ Lament: Why Can’t I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim’al Standard Normal dist’n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go? (we can only perceive 3 dim’ns)
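A small simulation can illustrate the pairwise-distance statement for several points at once. This is a sketch only; the function names, sample size, and dimension are illustrative choices. Each pairwise distance is divided by the square root of 2d, so values near 1 indicate that all points are nearly equidistant.

```python
import math
import random


def gaussian_vector(d, rng):
    """One draw from the d-dimensional standard normal."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def distance(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def scaled_pairwise_distances(n, d, seed=0):
    """All pairwise distances among n draws, divided by sqrt(2 d)."""
    rng = random.Random(seed)
    points = [gaussian_vector(d, rng) for _ in range(n)]
    scale = math.sqrt(2 * d)
    return [distance(points[i], points[j]) / scale
            for i in range(n) for j in range(i + 1, n)]


if __name__ == "__main__":
    for d in (10, 2000):
        ratios = scaled_pairwise_distances(5, d)
        print(f"d={d:5d}  min ratio={min(ratios):.3f}  max ratio={max(ratios):.3f}")
```

For small d the ratios scatter widely; for large d they cluster tightly around 1.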
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim’ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim’al Standard Normal dist’n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2$
High dim’al Angles (as $d \to \infty$), As Vectors From Origin:
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
(Figure of $Z_1$ and $Z_2$ thanks to members.tripod.com)
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
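The angle statement can be illustrated the same way. The sketch below (function names, number of pairs, and dimensions are illustrative assumptions) measures how far the angle between two independent standard normal vectors typically is from 90 degrees.

```python
import math
import random


def angle_degrees(x, y):
    """Angle between vectors x and y (as vectors from the origin), in degrees."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))  # guard rounding
    return math.degrees(math.acos(cosine))


def rms_deviation_from_90(d, n_pairs=100, seed=0):
    """Root mean squared deviation from 90 degrees over n_pairs random pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        total += (angle_degrees(z1, z2) - 90.0) ** 2
    return math.sqrt(total / n_pairs)


if __name__ == "__main__":
    for d in (10, 100, 1000):
        print(f"d={d:5d}  rms deviation from 90 degrees = "
              f"{rms_deviation_from_90(d):.2f}")
```

The deviation shrinks roughly like one over the square root of d, in line with the $O_p(d^{-1/2})$ term.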
HDLSS Asy’s: Geometrical Represent’n
Hall, Marron & Neeman (2005)
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
Hyperplane through 0, of dimension $n$
Points are “nearly equidistant to 0”, & dist $\sqrt{d}$
Within plane, can “rotate towards $\sqrt{d}$ × Unit Simplex”
All Gaussian data sets are “near Unit Simplex Vertices” (Modulo Rotation)
“Randomness” appears only in rotation of simplex
Study Hyperplane Generated by Data:
$n - 1$ dimensional hyperplane
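One way to see the simplex picture numerically is through the Gram matrix of the data divided by d. This is a sketch; the function name and the choice of n and d are illustrative. If that scaled matrix is close to the n by n identity, the points have nearly equal length (about the square root of d) and are nearly orthogonal, so up to rotation they sit near the square root of d times the coordinate unit vectors, which are vertices of a regular simplex.

```python
import random


def scaled_gram(n, d, seed=0):
    """Inner products of n standard normal vectors in dimension d, divided by d."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[sum(a * b for a, b in zip(points[i], points[j])) / d
             for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    for row in scaled_gram(4, 4000):
        print("  ".join(f"{entry:6.3f}" for entry in row))
```

The Gram matrix does not change when the data are rotated, so it captures exactly the structure that remains once rotation is ignored.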
Perou 500 – Fundamental Question
Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability
SigClust Results for Luminal A vs. Luminal B
P-val = 0.0045
Get p-values from:
Empirical Quantile: From simulated sample CIs
Fit Gaussian Quantile: Don’t “believe these”, But useful for comparison, Especially when Empirical Quantile = 0
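The two kinds of p-value can be sketched as follows. Assumptions: the cluster indices simulated under the Gaussian null are already available as a list, smaller CI means stronger clustering (so the p-value is a lower-tail probability), and the function names and toy numbers are hypothetical. The Gaussian fit uses `math.erfc` so that very small tail probabilities do not round to 0.

```python
import math
from statistics import mean, stdev


def empirical_p_value(observed_ci, null_cis):
    """Fraction of simulated null cluster indices at or below the observed one."""
    return sum(ci <= observed_ci for ci in null_cis) / len(null_cis)


def gaussian_fit_p_value(observed_ci, null_cis):
    """Lower-tail probability of the observed CI under a normal distribution
    fitted (by mean and standard deviation) to the simulated null CIs."""
    z = (observed_ci - mean(null_cis)) / stdev(null_cis)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical simulated null CIs and a hypothetical observed CI.
    null_cis = [0.80, 0.82, 0.78, 0.81, 0.79, 0.83, 0.77, 0.80]
    observed = 0.60
    print("empirical p-value:   ", empirical_p_value(observed, null_cis))
    print("Gaussian fit p-value:", gaussian_fit_p_value(observed, null_cis))
```

When the observed CI lies below every simulated value, the empirical p-value is 0 and carries no information about how extreme the result is; the Gaussian fit still gives a positive number that can be compared across tests.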
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters
II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results
Lum & Norm vs. Her2 & Basal    p-val = $10^{-19}$
Luminal A vs. B                p-val = 0.0045
Her 2 vs. Basal                p-val = $10^{-10}$
Split Luminal A                p-val = $10^{-7}$
Split Luminal B                p-val = 0.058
Split Her 2                    p-val = 0.10
Split Basal                    p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-
means Clustering
Theoretical Null Distributions
Inference for k gt 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insights
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
0limlimlimlim
dndn
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
Even desirable (find additional insights)
0limlimlimlim
dndn
Personal Observations
HDLSS world ishellip
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
- Paradox resolved by
density w r t Lebesgue Measure
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt Can Show Puts ~ All Weight Near
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
Theorem Impossible (over many factors)
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
d dd INZ 0~21Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
(as )
Distance tends to non-random constant
d
d
dd INZ 0~2
)1(221 pOdZZ
1Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
)1(221 pOdZZ
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to Where do they all go
(we can only perceive 3 dimrsquons)
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
As Vectors From Origin
Thanks to memberstripodcom
d
d
dd INZ 0~21Z
1198851
119885 2
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
- Again 1st order structure is non-random
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\approx \sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
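One way to see the simplex structure numerically is through the scaled Gram matrix $Z Z^t / d$: if it is close to $I_n$, the $n$ points all have length about $\sqrt{d}$ and are nearly orthogonal, which is what the vertices of a $\sqrt{d}$-scaled unit simplex look like after rotation. The sketch below is an illustration added here, not the authors' code; the helper names and parameter values are hypothetical.

```python
import numpy as np

def scaled_gram(n, d, seed=0):
    """Gram matrix Z Z^t / d for n independent N_d(0, I_d) rows."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))  # rows are Z_1, ..., Z_n
    return Z @ Z.T / d

def max_deviation_from_identity(n, d, seed=0):
    """Largest entrywise gap between the scaled Gram matrix and I_n."""
    G = scaled_gram(n, d, seed)
    return float(np.max(np.abs(G - np.eye(n))))

if __name__ == "__main__":
    for d in (10, 1000, 100000):
        print(f"d={d:6d}  max |G - I| = {max_deviation_from_identity(5, d):.4f}")
```

As $d$ grows with $n$ fixed, the deviation should shrink at rate $d^{-1/2}$, so only the rotation of the simplex remains random.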
SigClust Results for Luminal A vs. Luminal B
Get p-values from:
Empirical Quantile: from simulated sample CIs
Fit Gaussian Quantile: don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
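The two kinds of p-value can be sketched as follows. This is a minimal illustration under stated assumptions, not the SigClust implementation: the function name `cluster_index_pvalues` is hypothetical, the simulated CIs are taken as given, and small CI is treated as evidence for clustering, so both p-values are lower-tail probabilities.

```python
import math
import numpy as np

def cluster_index_pvalues(ci_observed, ci_simulated):
    """Return (empirical p-value, Gaussian-fit p-value) for an observed cluster index."""
    ci_simulated = np.asarray(ci_simulated, dtype=float)
    # Empirical quantile: fraction of null CIs at least as small as the observed CI
    p_empirical = float(np.mean(ci_simulated <= ci_observed))
    # Gaussian fit: match mean and sd of the null CIs, then take the lower tail
    mu = ci_simulated.mean()
    sigma = ci_simulated.std(ddof=1)
    z = (ci_observed - mu) / sigma
    p_gaussian = 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal CDF at z
    return p_empirical, p_gaussian

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    null_cis = rng.normal(0.6, 0.02, size=1000)  # stand-in for simulated null CIs
    print(cluster_index_pvalues(0.5, null_cis))
```

When the observed CI lies below every simulated CI, the empirical p-value is exactly 0 and cannot rank results; the Gaussian-fit value stays positive, which is why it is useful for comparison even if not taken literally.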
SigClust Results for Luminal A vs. Luminal B
I. Test significance of given clusterings
• Empirical p-val = 0
– Definitely 2 clusters
• Gaussian fit p-val = 0.0045
– same strong evidence
• Conclude these really are two clusters
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
– definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
– Stronger evidence than above
– Such comparison is value of Gaussian fit
– Makes sense (since CI is min possible)
• Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations:
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al. 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations
Thus various flavors are fine
(slide lists several limits over $n$ and $d$, e.g. $\lim_{d \to \infty}$ and $\lim_{n, d \to \infty}$, plus a limit as a parameter tends to 0)
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is…
Surprising (many times)
[Think I've got it, and then …]
Mathematically Beautiful
Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From: Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are Data?
Near Peak of Density?
(Image thanks to psycnet.apa.org)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$)
(measure how close to peak)
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
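A quick simulation makes the $\sqrt{d} + O_p(1)$ statement concrete. The sketch below is an added illustration with a hypothetical helper name (`norm_summary`) and arbitrary sample sizes: it reports the mean and spread of $\|Z\|$ for several dimensions.

```python
import numpy as np

def norm_summary(d, n_samples=500, seed=0):
    """Mean and standard deviation of ||Z|| over draws Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(rng.standard_normal((n_samples, d)), axis=1)
    return float(norms.mean()), float(norms.std())

if __name__ == "__main__":
    for d in (1, 10, 100, 1000, 10000):
        mean_norm, sd_norm = norm_summary(d)
        print(f"d={d:6d}  sqrt(d)={np.sqrt(d):8.2f}  mean ||Z||={mean_norm:8.2f}  sd={sd_norm:5.2f}")
```

The mean tracks $\sqrt{d}$ while the spread stays bounded (it does not grow with $d$), so relative to the radius the data concentrate on a thin shell.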
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. radius $r$
Can Show: Puts ~ All Weight Near the boundary ($r = 1$)
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
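The balance-point idea can be written out for the Gaussian itself. The slides argue via the volume of the unit sphere; the calculation below is the analogous standard computation for the radius of $Z \sim N_d(0, I_d)$, added here as a worked sketch rather than taken from the slides.

```latex
% Density of the radius R = \|Z\| for Z \sim N_d(0, I_d), in spherical coordinates:
f_R(r) \;\propto\;
  \underbrace{r^{\,d-1}}_{\text{Lebesgue measure: pushes mass out}}
  \times
  \underbrace{e^{-r^2/2}}_{\text{density: pulls data in}},
  \qquad r > 0 .

% Balance point: maximize \log f_R(r) = (d-1)\log r - r^2/2 + \text{const}
\frac{\partial}{\partial r} \log f_R(r) = \frac{d-1}{r} - r = 0
\quad\Longrightarrow\quad
r^{*} = \sqrt{d-1} \;\approx\; \sqrt{d} .
```

The factor $r^{d-1}$ is the surface area of the sphere of radius $r$ (up to a constant), so even though the density is highest at the origin, there is almost no volume there when $d$ is large.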
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: "Average People"
Parents' Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$)
Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
HDLSS Asymptotics: Simple Paradoxes
Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go?
(we can only perceive 3 dim'ns)
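The extension to $Z_1, \dots, Z_n$ can be checked directly: all pairwise distances among $n$ independent Gaussian points should be close to $\sqrt{2d}$. The sketch below is an added illustration; the function name `pairwise_distance_ratios` and the sizes used are assumptions made here.

```python
from itertools import combinations

import numpy as np

def pairwise_distance_ratios(n, d, seed=0):
    """All pairwise distances among n draws from N_d(0, I_d), divided by sqrt(2d)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))
    dists = np.array(
        [np.linalg.norm(Z[i] - Z[j]) for i, j in combinations(range(n), 2)]
    )
    return dists / np.sqrt(2.0 * d)

if __name__ == "__main__":
    for d in (10, 1000, 50000):
        ratios = pairwise_distance_ratios(10, d)
        print(f"d={d:6d}  min ratio={ratios.min():.3f}  max ratio={ratios.max():.3f}")
```

Ratios near 1 for every pair mean the points are nearly equidistant from one another, something that cannot be drawn faithfully in 3 dimensions once $n$ exceeds 4.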
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
SigClust Results for Luminal A vs Luminal B
I Test significance of given clusteringsbull Empirical p-val = 0
ndash Definitely 2 clusters
bull Gaussian fit p-val = 00045ndash same strong evidence
bull Conclude these really are two clusters
SigClust Results for Luminal A vs Luminal B
II Test if known cluster can be further split
bull Empirical p-val = 0ndash definitely 2 clusters
bull Gaussian fit p-val = 10-10
ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)
bull Conclude these really are two clusters
SigClust Real Data Results
Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19
Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10
Split Luminal A p-val = 10-7
Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-
means Clustering
Theoretical Null Distributions
Inference for k gt 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insights
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
0limlimlimlim
dndn
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
Even desirable (find additional insights)
0limlimlimlim
dndn
Personal Observations
HDLSS world ishellip
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
- Paradox resolved by
density w r t Lebesgue Measure
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt Can Show Puts ~ All Weight Near
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence:
"Average People"
Parents' Lament:
Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
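A small calculation illustrates the "impossible over many factors" claim. It is a sketch under the assumption that the factors are independent standard normal (matching the $N_d(0, I_d)$ model), and that "average" means within `c` standard deviations of the mean on every factor; the function name and the choice `c = 1` are hypothetical.

```python
import math

def prob_all_within(c, d):
    """P(|Z_j| <= c for every j = 1..d) for d independent standard normal factors."""
    p_one = math.erf(c / math.sqrt(2.0))  # P(|Z_j| <= c) for a single factor
    return p_one ** d

for d in (1, 10, 100):
    print(f"d={d:4d}  P(average on all factors) = {prob_all_within(1.0, d):.3e}")
```

The probability decays geometrically in the number of factors, so being close to average on all of them at once becomes vanishingly unlikely.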
HDLSS Asymptotics Simple Paradoxes
For d-dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant:
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
• Factor $\sqrt{2}$, since (for independent $X_1$, $X_2$) $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go?
(we can only perceive 3 dim'ns)
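The claim that all pairwise distances among $Z_1, \dots, Z_n$ are close to $\sqrt{2d}$ can be checked numerically. This is a minimal standard-library sketch; `n = 5`, `d = 5000`, the seed, and the helper names are illustrative assumptions.

```python
import math
import random

def draw_standard_normal(d, rng):
    """One draw from N_d(0, I_d) as a list of d coordinates."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise_distance_ratios(n, d, rng):
    """All pairwise distances among n draws, each divided by sqrt(2 d)."""
    points = [draw_standard_normal(d, rng) for _ in range(n)]
    scale = math.sqrt(2.0 * d)
    return [euclidean_distance(points[i], points[j]) / scale
            for i in range(n) for j in range(i + 1, n)]

rng = random.Random(0)  # fixed seed, only so that reruns match
ratios = pairwise_distance_ratios(n=5, d=5000, rng=rng)
print(f"{len(ratios)} pairwise ratios, "
      f"min={min(ratios):.4f}, max={max(ratios):.4f}")
```

Every ratio should be close to 1, so the points are nearly equidistant from one another, which cannot be drawn faithfully in 3 dimensions once there are more than 4 points.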
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why?
(we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics Simple Paradoxes
For d-dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin:
[Figure: $Z_1$ and $Z_2$ drawn as vectors from the origin. Thanks to members.tripod.com]
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal?
- Where do they all go?
  (again our perceptual limitations)
- Again 1st order structure is non-random
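The near-orthogonality of independent high-dimensional Gaussian vectors can also be seen by simulation. The sketch below uses only the standard library; the dimensions, seed, number of pairs, and function names are illustrative and not taken from the slides.

```python
import math
import random

def angle_degrees(x, y):
    """Angle between vectors x and y (both taken from the origin), in degrees."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))  # clamp rounding error
    return math.degrees(math.acos(cosine))

def random_angle(d, rng):
    """Angle between two independent draws from N_d(0, I_d)."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return angle_degrees(z1, z2)

rng = random.Random(0)  # fixed seed, only so that reruns match
for d in (2, 100, 10000):
    angles = [random_angle(d, rng) for _ in range(20)]
    worst = max(abs(a - 90.0) for a in angles)
    print(f"d={d:6d}  largest |angle - 90| over 20 pairs: {worst:.2f} degrees")
```

The largest deviation from 90 degrees shrinks as $d$ grows, in line with the $O_p(d^{-1/2})$ rate.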
HDLSS Asy's Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)
HDLSS Asy's Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data:
$n - 1$ dimensional hyperplane
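One way to see the simplex structure numerically is through the scaled Gram matrix of the data: if $\langle Z_i, Z_j \rangle / d$ is close to the identity, then the vectors $Z_i / \sqrt{d}$ are nearly orthonormal, so after a rotation they sit near the unit simplex vertices $e_1, \dots, e_n$. The sketch below is an illustration under that reading, with hypothetical names and with `n = 4`, `d = 5000` chosen arbitrarily.

```python
import math
import random

def scaled_gram_matrix(n, d, rng):
    """Return G with G[i][j] = <Z_i, Z_j> / d for n draws Z_i ~ N_d(0, I_d)."""
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[sum(a * b for a, b in zip(points[i], points[j])) / d
             for j in range(n)]
            for i in range(n)]

rng = random.Random(0)  # fixed seed, only so that reruns match
gram = scaled_gram_matrix(n=4, d=5000, rng=rng)
for row in gram:
    # Diagonal entries near 1: each point is at distance about sqrt(d) from 0.
    # Off-diagonal entries near 0: the points are nearly orthogonal.
    print("  ".join(f"{value:+.3f}" for value in row))
```

Since the Gram matrix determines the data up to rotation, a matrix this close to the identity means the only remaining randomness is the rotation itself.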
SigClust Results for Luminal A vs. Luminal B
II. Test if known cluster can be further split
• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters
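The difference between the two p-values can be illustrated with a short sketch. It assumes that a small cluster index (CI) indicates strong clustering, so both p-values are lower-tail quantities, and that the Gaussian fit uses the mean and standard deviation of the simulated null CIs. The null CIs and the observed CI below are made-up numbers, not values from the Perou data, and the function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

def empirical_p_value(null_cis, observed_ci):
    """Fraction of simulated null cluster indices at or below the observed one."""
    return sum(ci <= observed_ci for ci in null_cis) / len(null_cis)

def gaussian_fit_p_value(null_cis, observed_ci):
    """Lower-tail probability of the observed CI under a Gaussian fitted to the null CIs."""
    fitted = NormalDist(mean(null_cis), stdev(null_cis))
    return fitted.cdf(observed_ci)

rng = random.Random(0)  # fixed seed, only so that reruns match
# Made-up stand-ins for simulated null CIs (NOT SigClust output).
null_cis = [rng.gauss(0.80, 0.01) for _ in range(1000)]
observed_ci = 0.74  # hypothetical observed CI, far below every simulated value

print("empirical p-value:   ", empirical_p_value(null_cis, observed_ci))
print("Gaussian fit p-value:", gaussian_fit_p_value(null_cis, observed_ci))
```

When the observed CI falls below every simulated value, the empirical p-value is exactly 0 and cannot rank how extreme the result is, while the Gaussian fit still returns a small positive number that can be compared across splits.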
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal, p-val = $10^{-19}$
Luminal A vs. B, p-val = 0.0045
Her 2 vs. Basal, p-val = $10^{-10}$
Split Luminal A, p-val = $10^{-7}$
Split Luminal B, p-val = 0.058
Split Her 2, p-val = 0.10
Split Basal, p-val = 0.005
SigClust Real Data Results
Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al. 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless?
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations
Thus various flavors are fine:
$\lim_{d \to \infty}$, $\lim_{n \to \infty}$, $\lim_{d, n \to \infty}$, $\lim_{\sigma \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations:
HDLSS world is…
Surprising (many times)
[Think I've got it, and then …]
Mathematically Beautiful (?)
Practically Relevant
HDLSS Asymptotics Simple Paradoxes
Study Ideas From:
Hall, Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For d-dim'al Standard Normal dist'n:
$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are Data?
Near Peak of Density?
[Figure. Thanks to psycnet.apa.org]
HDLSS Asymptotics Simple Paradoxes
For d-dim'al Standard Normal dist'n:
$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$):
(measure how close to peak)
SigClust Real Data Results
Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition
(insight about signal vs noise)bull How good are others
SigClust Real Data Results
Experience with Other Data Sets Similar
Smaller data sets less power
Gene filtering more power
Lung Cancer more distinct clusters
SigClust Real Data Results
Some Personal Observations
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-
means Clustering
Theoretical Null Distributions
Inference for k gt 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insights
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
0limlimlimlim
dndn
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
Even desirable (find additional insights)
0limlimlimlim
dndn
Personal Observations
HDLSS world ishellip
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph’l Coordinates
Look At Integrand w.r.t. Radius $r$
Can Show: Puts ~ All Weight Near $r = 1$
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
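The balance between the two effects can be made explicit with a short calculation. This is a sketch under the standard normal assumption of these slides; the symbols $R$, $f_R$ and $r^\ast$ are introduced here for illustration.

```latex
% Radial density of R = \|Z\| for Z \sim N_d(0, I_d), in spherical coordinates:
\begin{align*}
f_R(r) &\;\propto\;
  \underbrace{r^{\,d-1}}_{\text{Lebesgue (volume) factor: pushes mass out}}
  \;\cdot\;
  \underbrace{e^{-r^2/2}}_{\text{Gaussian density: pulls data in}},
  \qquad r > 0, \\[4pt]
\frac{\partial}{\partial r} \log f_R(r) &= \frac{d-1}{r} - r = 0
  \quad\Longrightarrow\quad
  r^\ast = \sqrt{d-1} \approx \sqrt{d}.
\end{align*}
```

So the density of $Z$ is highest at the origin, but the radial density of $\|Z\|$ peaks near $\sqrt{d}$, because the volume of thin shells grows like $r^{d-1}$.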
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence:
“Average People”
Parents’ Lament:
Why Can’t I Have Average Children?
Theorem: Impossible (over many factors)
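One way to make the “impossible over many factors” remark concrete is sketched below. The assumptions are mine, not the slides’: the $d$ factors are independent standard normals, and “average” means within $c$ standard deviations of the mean on every factor. The function name `prob_all_average` is illustrative.

```python
import math


def prob_all_average(d, c=1.0):
    """P(|Z_j| <= c for all j = 1..d) for independent standard normal Z_j."""
    # For one factor: P(|Z_j| <= c) = erf(c / sqrt(2)).
    prob_one_factor = math.erf(c / math.sqrt(2.0))
    # Independence across factors turns the joint probability into a product.
    return prob_one_factor ** d


if __name__ == "__main__":
    for d in (1, 5, 10, 50, 100):
        print(f"d={d:4d}  P(average on every factor) = {prob_all_average(d):.3e}")
```

Under these assumptions the probability decays geometrically in $d$, so being near the mean on all factors at once becomes vanishingly rare.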
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim’al Standard Normal dist’n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go? (we can only perceive 3 dim’ns)
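The claim that all pairwise distances settle near $\sqrt{2d}$ can be checked numerically. A minimal sketch, assuming independent standard normal vectors as on the slide; `pairwise_distances`, the sample size and the seed are illustrative choices.

```python
import itertools
import math
import random


def pairwise_distances(d, n=5, seed=0):
    """Draw n independent N_d(0, I_d) vectors; return all pairwise distances."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [math.dist(a, b) for a, b in itertools.combinations(data, 2)]


if __name__ == "__main__":
    for d in (10, 100, 2000):
        dists = pairwise_distances(d)
        print(f"d={d:5d}  sqrt(2d)={math.sqrt(2 * d):6.2f}  "
              f"min={min(dists):6.2f}  max={max(dists):6.2f}")
```

The spread between the smallest and largest distance should stay of order 1 while $\sqrt{2d}$ grows, which is the sense in which the points become nearly equidistant.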
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim’ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim’al Standard Normal dist’n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim’al Angles (as $d \to \infty$), As Vectors From Origin
(Figure: vectors $Z_1$ and $Z_2$ from the origin. Thanks to members.tripod.com)
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
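The near-orthogonality of independent Gaussian vectors is also easy to see in simulation. A minimal sketch, with illustrative names `angle_degrees` and `sample_angles`; the number of pairs and the seed are arbitrary.

```python
import math
import random


def angle_degrees(u, v):
    """Angle between vectors u and v (as vectors from the origin), in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding before calling acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def sample_angles(d, n_pairs=20, seed=0):
    """Angles between n_pairs independent pairs of N_d(0, I_d) vectors."""
    rng = random.Random(seed)
    angles = []
    for _ in range(n_pairs):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_degrees(z1, z2))
    return angles


if __name__ == "__main__":
    for d in (3, 100, 2000):
        angles = sample_angles(d)
        print(f"d={d:5d}  min angle={min(angles):6.1f}  max angle={max(angles):6.1f}")
```

For small $d$ the angles scatter widely; as $d$ grows they should tighten around 90 degrees at roughly the $d^{-1/2}$ rate stated on the slide.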
HDLSS Asy’s: Geometrical Represent’n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
Hyperplane through 0, of dimension $n$
Points are “nearly equidistant to 0”, & dist $\sqrt{d}$
Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
All Gaussian data sets are “near Unit Simplex Vertices” (Modulo Rotation)
“Randomness” appears only in rotation of simplex
Hall, Marron & Neeman (2005)
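The simplex picture can be checked through the scaled Gram matrix $G = Z Z^T / d$ of the $n$ data vectors: if $G \approx I_n$, then the rescaled points $Z_i / \sqrt{d}$ are nearly orthonormal, i.e. close to a rotation of the unit simplex vertices $e_1, \dots, e_n$. The sketch below assumes standard normal data; the function names are illustrative and not from Hall, Marron & Neeman (2005).

```python
import random


def scaled_gram(d, n=4, seed=0):
    """Gram matrix of n independent N_d(0, I_d) vectors, divided by d."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[sum(a * b for a, b in zip(zi, zj)) / d for zj in data]
            for zi in data]


def max_deviation_from_identity(gram):
    """Largest absolute entry of (gram - identity)."""
    n = len(gram)
    return max(abs(gram[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    for d in (10, 100, 2000):
        deviation = max_deviation_from_identity(scaled_gram(d))
        print(f"d={d:5d}  max |G - I| = {deviation:.3f}")
```

A deviation that shrinks as $d$ grows is consistent with the statement that the only remaining randomness is in the rotation of the simplex.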
HDLSS Asy’s: Geometrical Represent’n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data:
$n - 1$ dimensional hyperplane
SigClust Real Data Results
Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters
SigClust Real Data Results
Some Personal Observations:
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al., 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for $k > 2$ means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for “large” samples
Models phenomenon of “increasing data”
So other flavors are useless
Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations
Thus various flavors are fine:
$\lim_{d \to \infty}$, $\lim_{n \to \infty}$, $\lim_{d, n \to \infty}$, $\lim_{\,\cdot\, \to 0}$
Even desirable (find additional insights)
Personal Observations
HDLSS world ishellip
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
- Paradox resolved by
density w r t Lebesgue Measure
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt Can Show Puts ~ All Weight Near
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence:
"Average People"
Parents' Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
• Can extend to $Z_1, \dots, Z_n$
• Where do they all go? (we can only perceive 3 dim'ns)
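The factor $\sqrt{2}$ can be traced back to the distance-to-origin result in one short step. This is a sketch of the argument using only the independence and standard normality assumed on the slide.

```latex
\[
  Z_1 - Z_2 \;\sim\; N_d(0,\, 2 I_d)
  \quad\Longrightarrow\quad
  W \;=\; \frac{Z_1 - Z_2}{\sqrt{2}} \;\sim\; N_d(0,\, I_d),
\]
\[
  \|Z_1 - Z_2\| \;=\; \sqrt{2}\,\|W\|
  \;=\; \sqrt{2}\,\bigl(\sqrt{d} + O_p(1)\bigr)
  \;=\; \sqrt{2d} + O_p(1).
\]
```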
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$):
[Figure: $Z_1$ and $Z_2$ As Vectors From Origin. Thanks to members.tripod.com]
$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
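The near-orthogonality statement can also be checked numerically. The sketch below is an illustration under assumed settings (the helper names `angle_degrees` and `angle_summary`, the number of pairs, and the seed are not from the slides). It draws independent pairs $Z_1, Z_2$ and reports the mean and spread of the angle between them.

```python
import numpy as np


def angle_degrees(u, v):
    """Angle between vectors u and v, in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against round-off pushing the cosine outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


def angle_summary(d, n_pairs=1000, seed=0):
    """Mean and sd of Angle(Z1, Z2) for independent Z1, Z2 ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    angles = np.array([angle_degrees(a, b) for a, b in zip(z1, z2)])
    return angles.mean(), angles.std()


for d in (10, 100, 1000):
    mean_angle, sd_angle = angle_summary(d)
    print(f"d={d:5d}  mean angle={mean_angle:7.2f}  sd={sd_angle:6.2f}")
```

The spread shrinks roughly like $d^{-1/2}$, in line with the $O_p(d^{-1/2})$ term.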
HDLSS Asy's: Geometrical Represent'n
Hall, Marron & Neeman (2005)
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", & dist $\sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
- All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
- "Randomness" appears only in rotation of simplex
Study Hyperplane Generated by Data:
- $n - 1$ dimensional hyperplane
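The simplex picture can be illustrated numerically: for fixed $n$ and large $d$, all distances to the origin are close to $\sqrt{d}$ and all pairwise distances are close to $\sqrt{2d}$, so the points look like vertices of a regular simplex. The sketch below is a hedged illustration; the function name `simplex_check` and the chosen values of $n$, $d$, and the seed are assumptions, not part of Hall, Marron & Neeman (2005).

```python
import numpy as np
from itertools import combinations


def simplex_check(n, d, seed=0):
    """Rescaled distances for n draws from N_d(0, I_d).

    Returns distances to the origin divided by sqrt(d) and
    pairwise distances divided by sqrt(2 d); both should be near 1.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    to_origin = np.linalg.norm(z, axis=1) / np.sqrt(d)
    pairwise = np.array(
        [np.linalg.norm(z[i] - z[j]) for i, j in combinations(range(n), 2)]
    ) / np.sqrt(2 * d)
    return to_origin, pairwise


for d in (10, 1000, 100000):
    to_origin, pairwise = simplex_check(n=5, d=d)
    print(f"d={d:6d}  to-origin range=({to_origin.min():.3f}, {to_origin.max():.3f})"
          f"  pairwise range=({pairwise.min():.3f}, {pairwise.max():.3f})")
```

As $d$ grows both ranges tighten around 1, leaving only the orientation of the simplex as random.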
SigClust Overview
Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding
(Huang et al., 2014)
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
- Laws of Large Numbers (Consistency)
- Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
- Indicates behavior for large samples
- Thus only makes sense for "large" samples
- Models phenomenon of "increasing data"
- So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
- Approximation provides insights
- Can find simple underlying structure
- In complex situations
Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, and a small-parameter limit ($\to 0$)
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is…
Surprising (many times)
[Think I've got it, and then …]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From: Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are Data? Near Peak of Density?
[Figure; thanks to psycnet.apa.org]
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$)
(measure how close to peak)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
- Paradox resolved by
density w r t Lebesgue Measure
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt Can Show Puts ~ All Weight Near
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
Theorem Impossible (over many factors)
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
d dd INZ 0~21Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
(as )
Distance tends to non-random constant
d
d
dd INZ 0~2
)1(221 pOdZZ
1Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
)1(221 pOdZZ
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to Where do they all go
(we can only perceive 3 dimrsquons)
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin:
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go?
(again our perceptual limitations)
- Again 1st order structure is non-random
(Figure of vectors $Z_1$ and $Z_2$: thanks to members.tripod.com)
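The near-orthogonality claim can be checked with a short simulation. This is a minimal sketch using only the Python standard library; the function name `angle_degrees` and the chosen dimensions are illustrative assumptions.

```python
import math
import random

def angle_degrees(d, rng):
    """Angle at the origin between two independent N_d(0, I_d) draws."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    dot = sum(a * b for a, b in zip(z1, z2))
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cosine))

if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 20, 200, 2000, 20000):
        print(d, round(angle_degrees(d, rng), 2))
```

In low dimension the angle can be anywhere in [0, 180]; as d grows the printed values should settle close to 90, with deviations shrinking at roughly the $d^{-1/2}$ rate.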
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)
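A rough numerical check of the simplex picture, assuming i.i.d. standard normal coordinates. Vertices of a $\sqrt{d} \times$ unit simplex sit at distance $\sqrt{d}$ from 0 and $\sqrt{2d}$ from each other, so both rescaled quantities below should be close to 1. The helper name `scaled_geometry` and the values of n and d are illustrative assumptions.

```python
import math
import random

def scaled_geometry(n, d, rng):
    """Draw n points from N_d(0, I_d).

    Returns (norms / sqrt(d), pairwise distances / sqrt(2 d)).
    """
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    norms = [math.sqrt(sum(x * x for x in p)) / math.sqrt(d) for p in pts]
    dists = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
        / math.sqrt(2 * d)
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return norms, dists

if __name__ == "__main__":
    norms, dists = scaled_geometry(n=4, d=20000, rng=random.Random(0))
    print("norms / sqrt(d):  ", [round(v, 3) for v in norms])
    print("dists / sqrt(2 d):", [round(v, 3) for v in dists])
```

Changing the seed moves the points around, but the rescaled norms and distances stay near 1, which matches the statement that the randomness shows up only in the rotation.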
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data:
$n - 1$ dimensional hyperplane
SigClust Open Problems
Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations, Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless?
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure In complex situations
Thus various flavors are fine:
$\lim_{d \to \infty}$, $\lim_{n \to \infty}$, $\lim_{d, n \to \infty}$, $\lim_{\sigma \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is…
Surprising (many times)
[Think I've got it, and then …]
Mathematically Beautiful (?)
Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From:
Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n:
$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$
Where are Data?
Near Peak of Density?
(Figure: thanks to psycnet.apa.org)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n:
$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$)
(measure how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
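The $O_p(1)$ statement says the spread of $\|Z\|$ does not grow with $d$ even though its center does. A minimal simulation sketch follows, standard library only; the name `norm_summary` and the choice of dimensions and replications are illustrative assumptions.

```python
import math
import random
import statistics

def norm_summary(d, reps, rng):
    """Mean and standard deviation of ||Z|| over reps draws of Z ~ N_d(0, I_d)."""
    norms = [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
        for _ in range(reps)
    ]
    return statistics.mean(norms), statistics.stdev(norms)

if __name__ == "__main__":
    rng = random.Random(0)
    for d in (1, 10, 100, 1000):
        mean, sd = norm_summary(d, reps=200, rng=rng)
        print(f"d={d:5d}  sqrt(d)={math.sqrt(d):7.3f}  mean={mean:7.3f}  sd={sd:.3f}")
```

The mean column should track $\sqrt{d}$ while the standard deviation column stays bounded, so relative to the radius the data look more and more like a thin shell.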
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w. r. t. Lebesgue Measure
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w. r. t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w. r. t. $r$ (the radius)
Can Show Puts ~ All Weight Near $r = 1$
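To see how strongly the volume piles up near the surface: in spherical coordinates the radial integrand is proportional to $r^{d-1}$, so the ball of radius $r \le 1$ holds a fraction $r^d$ of the unit ball's volume. The sketch below evaluates the complementary shell fraction; the name `shell_fraction` and the shell width are illustrative assumptions.

```python
def shell_fraction(d, eps):
    """Fraction of the unit ball's volume in R^d lying within eps of its surface.

    Volume scales like r**d, so the inner ball of radius 1 - eps holds
    (1 - eps)**d of the total.
    """
    return 1.0 - (1.0 - eps) ** d

if __name__ == "__main__":
    for d in (1, 3, 10, 100, 1000):
        print(d, shell_fraction(d, eps=0.01))
```

For small d a 1% shell holds only a few percent of the volume; for d in the hundreds or thousands it holds most or nearly all of it.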
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w. r. t. Lebesgue Measure
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
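The balance point can be made explicit with a short calculation. This is a sketch in which $R = \|Z\|$ and $f_R$ denotes its density; both symbols are introduced here for illustration.

```latex
\begin{align*}
f_R(r) &\propto \underbrace{r^{d-1}}_{\text{Lebesgue measure pushes out}}
  \cdot \underbrace{e^{-r^2/2}}_{\text{density pulls in}},
  \qquad r > 0 \\
\frac{\mathrm{d}}{\mathrm{d}r} \log f_R(r) &= \frac{d-1}{r} - r = 0
  \;\Longrightarrow\; r = \sqrt{d-1} \approx \sqrt{d}
\end{align*}
```

The surface-area factor grows with $r$, the Gaussian factor decays, and the two trade off at radius about $\sqrt{d}$.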
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always
Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless
nlim
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
- Approximation provides insights
- Can find simple underlying structure in complex situations
Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{d, n \to \infty}$, and a limit in which a parameter tends to 0
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is…
- Surprising (many times) [Think I’ve got it, and then …]
- Mathematically Beautiful (?)
- Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From: Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim’al Standard Normal dist’n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are Data? Near Peak of Density?
(Picture thanks to psycnet.apa.org)
Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
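A quick simulation can make the $\sqrt{d}$ scaling concrete. This is only an illustrative sketch, not part of the slides: the function name `norm_summary`, the number of draws, and the seed are assumptions chosen for the example. It draws standard normal vectors and compares their average length with $\sqrt{d}$; the spread of the lengths should stay of order 1 as $d$ grows.

```python
import numpy as np

def norm_summary(d, n_draws=200, seed=0):
    """Mean and sd of ||Z|| over n_draws draws of Z ~ N_d(0, I_d).

    n_draws and seed are illustrative choices, not values from the slides.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))      # each row is one draw of Z
    norms = np.linalg.norm(z, axis=1)          # Euclidean distance to origin
    return norms.mean(), norms.std()

for d in (10, 100, 1000, 10000):
    mean_norm, sd_norm = norm_summary(d)
    print(f"d={d:6d}  sqrt(d)={np.sqrt(d):8.3f}  "
          f"mean ||Z||={mean_norm:8.3f}  sd ||Z||={sd_norm:.3f}")
```

The mean length should track $\sqrt{d}$ while the standard deviation stays roughly constant, which is the $O_p(1)$ term in the display above.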
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
  Consider Volume of Unit Sphere in $\mathbb{R}^d$
  Find As Integral In Spherical Coordinates
  Look At Integrand w.r.t. the radius $r$
  Can Show: Puts ~ All Weight Near $r = 1$ (the surface)
  Lebesgue Measure Pushes Mass Out, Density Pulls Data In, $\sqrt{d}$ Is The Balance Point
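The balance point can be made explicit with a short calculation. This is a sketch that relies on the standard fact that the length $R = \|Z\|$ of a $d$-dimensional standard normal vector follows a chi distribution with $d$ degrees of freedom; the symbols $R$, $f_R$ and $c_d$ are notation introduced here, not taken from the slides.

```latex
% Radial density of R = \|Z\| for Z \sim N_d(0, I_d):
f_R(r) = c_d \, r^{d-1} e^{-r^2/2}, \quad r > 0,
\qquad c_d = \frac{2^{1 - d/2}}{\Gamma(d/2)}
% r^{d-1}    : surface-area factor from Lebesgue measure (pushes mass out)
% e^{-r^2/2} : Gaussian density (pulls data in)
\frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0
\quad \Longrightarrow \quad r = \sqrt{d-1} \approx \sqrt{d}
```

So the mode of the radial density sits at $\sqrt{d-1}$, even though the $d$-dimensional density itself peaks at the origin.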
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: “Average People”
Parents’ Lament: Why Can’t I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim’al Standard Normal dist’n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant (after dividing by $\sqrt{d}$, the limit is $\sqrt{2}$)
- Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go? (we can only perceive 3 dim’ns)
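The pairwise distance claim can be checked the same way. The sketch below is an illustration only: `pairwise_distance_summary`, the number of pairs, and the seed are assumed names and values, not from the slides. It draws independent pairs of standard normal vectors and compares the average distance with $\sqrt{2d}$.

```python
import numpy as np

def pairwise_distance_summary(d, n_pairs=200, seed=0):
    """Mean and sd of ||Z1 - Z2|| over n_pairs independent pairs in dimension d.

    n_pairs and seed are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))     # drawn independently of z1
    dist = np.linalg.norm(z1 - z2, axis=1)
    return dist.mean(), dist.std()

for d in (10, 100, 1000, 10000):
    mean_dist, sd_dist = pairwise_distance_summary(d)
    print(f"d={d:6d}  sqrt(2d)={np.sqrt(2 * d):8.3f}  "
          f"mean dist={mean_dist:8.3f}  sd dist={sd_dist:.3f}")
```

As with the distance to the origin, the mean grows like $\sqrt{2d}$ while the spread stays of order 1, so relative to its size the distance becomes essentially non-random.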
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim’ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim’al Standard Normal dist’n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim’al Angles (as $d \to \infty$), As Vectors From Origin (picture of $Z_1$ and $Z_2$ thanks to members.tripod.com):
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
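A small simulation also illustrates the near-orthogonality. This is a hedged sketch: `angle_summary`, the number of pairs, and the seed are assumptions made for the example. It computes the angle in degrees between independent standard normal vectors; the spread around 90 degrees should shrink roughly like $d^{-1/2}$.

```python
import numpy as np

def angle_summary(d, n_pairs=200, seed=0):
    """Mean and sd (in degrees) of Angle(Z1, Z2) for independent N_d(0, I_d) pairs.

    n_pairs and seed are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cos = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # Clip guards against rounding slightly outside [-1, 1] before arccos.
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angles.mean(), angles.std()

for d in (10, 100, 1000, 10000):
    mean_angle, sd_angle = angle_summary(d)
    print(f"d={d:6d}  mean angle={mean_angle:7.3f}  sd angle={sd_angle:.3f}")
```

Each hundredfold increase in $d$ should cut the standard deviation of the angle by about a factor of ten.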
HDLSS Asy’s: Geometrical Represent’n
Hall, Marron & Neeman (2005)
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
- Hyperplane through 0, of dimension $n$
- Points are “nearly equidistant to 0”, & dist $\sqrt{d}$
- Within plane, can “rotate towards $\sqrt{d}$ × Unit Simplex”
- All Gaussian data sets are “near Unit Simplex Vertices” (Modulo Rotation)
- “Randomness” appears only in rotation of simplex
Study Hyperplane Generated by Data: $(n-1)$-dimensional hyperplane
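The simplex picture can be checked numerically. The sketch below is an illustration under assumed names and settings (`scaled_geometry`, $n = 5$, $d = 100000$, and the seed are not from the slides). After dividing the data by $\sqrt{d}$, the matrix of inner products should be close to the identity and all pairwise distances close to $\sqrt{2}$, which is exactly the geometry of the unit simplex vertices $e_1, \dots, e_n$ up to rotation.

```python
import numpy as np

def scaled_geometry(n, d, seed=0):
    """Inner products and pairwise distances of n draws from N_d(0, I_d), scaled by 1/sqrt(d).

    Returns (gram, dist), both n x n arrays. The seed is an illustrative choice.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d)) / np.sqrt(d)   # rescale so lengths are near 1
    gram = z @ z.T                                  # inner products between points
    sq = np.diag(gram)                              # squared lengths
    dist2 = sq[:, None] + sq[None, :] - 2.0 * gram  # squared pairwise distances
    dist = np.sqrt(np.clip(dist2, 0.0, None))       # clip removes tiny negative rounding
    return gram, dist

gram, dist = scaled_geometry(n=5, d=100_000)
print(np.round(gram, 3))   # close to the 5 x 5 identity
print(np.round(dist, 3))   # off-diagonal entries close to sqrt(2)
```

Changing the seed gives a different data set, but the rounded matrices should look the same, in line with the statement that the randomness shows up only in the rotation.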
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions
Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless
nlim
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insights
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
0limlimlimlim
dndn
HDLSS Asymptotics
Modern Mathematical Statistics Based on asymptotic analysis Real Reasons
Approximation provides insightsCan find simple underlying structureIn complex situations
Thus various flavors are fine
Even desirable (find additional insights)
0limlimlimlim
dndn
Personal Observations
HDLSS world ishellip
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
- Paradox resolved by
density w r t Lebesgue Measure
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt Can Show Puts ~ All Weight Near
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
Theorem Impossible (over many factors)
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
d dd INZ 0~21Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
(as )
Distance tends to non-random constant
d
d
dd INZ 0~2
)1(221 pOdZZ
1Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
)1(221 pOdZZ
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to Where do they all go
(we can only perceive 3 dimrsquons)
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles, As Vectors From Origin (as $d \to \infty$):
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
(Figure thanks to members.tripod.com)
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
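The near-orthogonality can be checked numerically. The following is a sketch under the slide's assumptions (independent standard normal vectors); the helper name `angle_degrees`, the seed, and the dimensions are hypothetical choices made for illustration.

```python
import math
import random


def angle_degrees(u, v):
    """Angle between vectors u and v (viewed from the origin), in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.degrees(math.acos(cosine))


rng = random.Random(42)  # fixed seed so the run is repeatable
angle_table = []
for d in (10, 1000, 100000):
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    angle = angle_degrees(z1, z2)
    angle_table.append((d, angle))
    # Deviation from 90 degrees shrinks roughly like d ** -0.5.
    print(f"d={d:>6}  angle={angle:8.3f} degrees")
```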
HDLSS Asy's Geometrical Represent'n
Hall, Marron & Neeman (2005)
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", & dist $\sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d}$ Unit Simplex"
- All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
- "Randomness" appears only in rotation of simplex
HDLSS Asy's Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data:
- $n - 1$ dimensional hyperplane
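The simplex picture can be illustrated by checking that, for a small Gaussian sample in high dimension, all norms are close to $\sqrt{d}$ and all pairwise distances are close to $\sqrt{2d}$. This is a rough sketch only; the sample size `n = 5`, the dimension `d = 20000`, the seed, and the helper names are assumptions made for the example.

```python
import itertools
import math
import random


def norm(v):
    """Euclidean length of v."""
    return math.sqrt(sum(x * x for x in v))


def distance(u, v):
    """Euclidean distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(2005)  # fixed seed so the run is repeatable
n, d = 5, 20000
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

# Ratios near 1 mean: equidistant from 0 and equidistant from each other,
# i.e. the points sit near the vertices of a regular simplex.
norm_ratios = [norm(z) / math.sqrt(d) for z in points]
dist_ratios = [
    distance(u, v) / math.sqrt(2 * d)
    for u, v in itertools.combinations(points, 2)
]
print("norm / sqrt(d):     ", [round(r, 3) for r in norm_ratios])
print("distance / sqrt(2d):", [round(r, 3) for r in dist_ratios])
```

Equal pairwise distances among $n$ points are exactly what characterizes the vertices of a regular simplex, so only the orientation of that simplex is left to vary from sample to sample.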
HDLSS Asymptotics
Modern Mathematical Statistics:
- Based on asymptotic analysis
- I.e. Uses limiting operations
- Almost always $\lim_{n \to \infty}$
- Occasional misconceptions:
  - Indicates behavior for large samples
  - Thus only makes sense for "large" samples
  - Models phenomenon of "increasing data"
  - So other flavors are useless
HDLSS Asymptotics
Modern Mathematical Statistics:
- Based on asymptotic analysis
- Real Reasons:
  - Approximation provides insights
  - Can find simple underlying structure
  - In complex situations
- Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\cdot \to 0}$
- Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations: HDLSS world is...
- Surprising (many times) [Think I've got it, and then ...]
- Mathematically Beautiful (?)
- Practically Relevant
HDLSS Asymptotics Simple Paradoxes
Study Ideas From: Hall, Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For $d$ dim'al Standard Normal dist'n:
$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are Data? Near Peak of Density?
(Figure thanks to psycnet.apa.org)
Euclidean Distance to Origin, as $d \to \infty$ (measure how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim \chi^2_d$, so $\|Z\|^2 = d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
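The claim that $\|Z\|$ concentrates at $\sqrt{d}$ with $O_p(1)$ fluctuation can be seen in a small experiment. This is a sketch only; the helper `norm_of_gaussian`, the seed, the dimensions, and the 20 replications per dimension are illustrative assumptions.

```python
import math
import random
import statistics


def norm_of_gaussian(d, rng):
    """Length of a single N_d(0, I_d) draw."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


rng = random.Random(1)  # fixed seed so the run is repeatable
norm_summary = {}
for d in (10, 100, 10000):
    norms = [norm_of_gaussian(d, rng) for _ in range(20)]
    norm_summary[d] = (statistics.mean(norms), statistics.stdev(norms))
    mean_norm, spread = norm_summary[d]
    # The mean tracks sqrt(d); the spread does not grow with d.
    print(f"d={d:>5}  mean={mean_norm:8.3f}  sqrt(d)={math.sqrt(d):8.3f}  sd={spread:.3f}")
```

No simulated point lands near the origin once $d$ is large, even though the density is highest there.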
HDLSS Asymptotics Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w. r. t. Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by: density w. r. t. Lebesgue Measure
  - Consider Volume of Unit Sphere in $\mathbb{R}^d$
  - Find As Integral In Sph'l Coordinates
  - Look At Integrand w. r. t. radius $r$
  - Can Show: Puts ~ All Weight Near $r = 1$
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by: density w. r. t. Lebesgue Measure
  - Lebesgue Measure Pushes Mass Out
  - Density Pulls Data In
  - $\sqrt{d}$ Is The Balance Point
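The balance point can be located with a short calculation. This is a sketch using the standard density of the length of a standard normal vector; the normalizing constant $c_d$ and the symbol $f$ are notation introduced here, not taken from the slides.

```latex
% Density of the radius r = ||Z|| for Z ~ N_d(0, I_d):
% r^{d-1} comes from Lebesgue measure (surface area of the sphere of radius r),
% exp(-r^2/2) comes from the Gaussian density.
\[
  f(r) = c_d \, r^{d-1} e^{-r^2/2}, \qquad r > 0 .
\]
% Setting the derivative of log f to zero gives the mode:
\[
  \frac{d}{dr} \log f(r) = \frac{d-1}{r} - r = 0
  \quad\Longleftrightarrow\quad
  r = \sqrt{d-1} \approx \sqrt{d} .
\]
```

The factor $r^{d-1}$ grows with $r$ (mass pushed out) and $e^{-r^2/2}$ decays (data pulled in), so their product peaks near $\sqrt{d}$.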
HDLSS Asymptotics Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: "Average People"
Parents Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
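One way to put numbers on the "Average People" remark is to ask how likely it is to be within one standard deviation of the mean on every one of $d$ independent standard normal factors. Reading "average" this way is an interpretive assumption made for this sketch, not a definition from the slides; the variable names are hypothetical.

```python
import math

# P(|Z_j| <= 1) for one standard normal factor, via the error function.
p_one_factor = math.erf(1.0 / math.sqrt(2.0))

# With d independent factors, being "average" on all of them has
# probability p_one_factor ** d, which collapses quickly as d grows.
average_on_all = {d: p_one_factor ** d for d in (1, 10, 100, 1000)}
for d, prob in average_on_all.items():
    print(f"d={d:>4}  P(average on every factor) = {prob:.3e}")
```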
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
d dd INZ 0~21Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
(as )
Distance tends to non-random constant
d
d
dd INZ 0~2
)1(221 pOdZZ
1Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
)1(221 pOdZZ
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to Where do they all go
(we can only perceive 3 dimrsquons)
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
As Vectors From Origin
Thanks to memberstripodcom
d
d
dd INZ 0~21Z
1198851
119885 2
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
- Again 1st order structure is non-random
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)
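One way to see the simplex picture numerically is through the Gram matrix of the rescaled points $Z_i / \sqrt{d}$: if it is close to the identity, the rescaled points are nearly orthonormal, so up to rotation they sit near the unit simplex vertices $e_1, \ldots, e_n$. The sketch below assumes illustrative values of $n$ and $d$, and the helper name `scaled_gram` is hypothetical.

```python
import numpy as np

def scaled_gram(n, d, seed=0):
    """Gram matrix of Z_i / sqrt(d) for Z_1, ..., Z_n ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    return z @ z.T / d

# Diagonal entries near 1: every point is at distance about sqrt(d) from 0.
# Off-diagonal entries near 0: the points are nearly mutually orthogonal.
g = scaled_gram(n=5, d=100_000)
print(np.round(g, 2))
```

Because the Gram matrix is unchanged by rotating the data, whatever it fails to capture is exactly the random rotation of the simplex.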
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
HDLSS Asymptotics
Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations
Thus various flavors are fine:
$\lim_{d \to \infty}$, $\lim_{n \to \infty}$, $\lim_{d, n \to \infty}$, $\lim_{\,\cdot\, \to 0}$
Even desirable (find additional insights)
HDLSS Asymptotics
Personal Observations:
HDLSS world is…
Surprising (many times)
[Think I've got it, and then …]
Mathematically Beautiful (?)
Practically Relevant
HDLSS Asymptotics: Simple Paradoxes
Study Ideas From:
Hall, Marron and Neeman (2005)
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n:
$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$
Where are Data?
Near Peak of Density?
[Figure: Gaussian density. Thanks to psycnet.apa.org]
Euclidean Distance to Origin (as $d \to \infty$)
(measure how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
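The step from the squared norm to the norm itself can be filled in with a short expansion. This is a sketch of the standard argument, with $W_d$ introduced here as notation for the standardized squared norm (it does not appear in the slides).

```latex
% Each Z_j^2 has mean 1 and variance 2, so by the central limit theorem
\[
  \|Z\|^2 = d + \sqrt{2d}\, W_d ,
  \qquad W_d \xrightarrow{\;d\;} N(0, 1) .
\]
% Take square roots and expand (1 + x)^{1/2} = 1 + x/2 + O(x^2):
\[
  \|Z\|
  = \sqrt{d}\,\Bigl(1 + \sqrt{2/d}\; W_d\Bigr)^{1/2}
  = \sqrt{d} + \frac{W_d}{\sqrt{2}} + O_p\bigl(d^{-1/2}\bigr)
  = \sqrt{d} + O_p(1) .
\]
```

So the fluctuation of $\|Z\|$ around $\sqrt{d}$ stays of order 1 (standard deviation near $1/\sqrt{2}$) however large $d$ becomes.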
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density?
- Paradox resolved by: density w.r.t. Lebesgue Measure
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. $r$
Can Show: Puts ~ All Weight Near $r = 1$
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
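The balance between the two effects can be written out explicitly. The following is a sketch of the usual calculation, with $R = \|Z\|$ and $f_R$ introduced here as notation; it is offered as one way to read the slide, not as the author's own derivation.

```latex
% Volume of the unit ball in spherical coordinates (S_{d-1} = surface area
% of the unit sphere): the radial integrand r^{d-1} piles up near r = 1.
\[
  \mathrm{Vol}(B_d) = S_{d-1} \int_0^1 r^{d-1}\, dr ,
  \qquad
  \frac{\int_{1-\varepsilon}^{1} r^{d-1}\, dr}{\int_0^1 r^{d-1}\, dr}
  = 1 - (1 - \varepsilon)^d \;\longrightarrow\; 1 .
\]
% For Z ~ N_d(0, I_d), the radius R = ||Z|| has density
\[
  f_R(r) \;\propto\;
  \underbrace{r^{d-1}}_{\text{Lebesgue measure pushes out}}
  \;\cdot\;
  \underbrace{e^{-r^2/2}}_{\text{density pulls in}} ,
\]
% which is maximized where the log-derivative vanishes:
\[
  \frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0
  \quad\Longrightarrow\quad
  r = \sqrt{d-1} \approx \sqrt{d} .
\]
```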
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence:
"Average People"
Parents' Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
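A small calculation gives a feel for the "average people" remark. The sketch below rests on an assumption that is not in the slides: "average" is read as lying within one standard deviation of the mean on every one of $d$ independent standard normal factors. The function name `prob_all_within_one_sd` is hypothetical.

```python
import math

def prob_all_within_one_sd(d):
    """P(|Z_j| <= 1 for every j = 1..d), with Z_j independent N(0, 1).

    Assumption (not from the slides): "average" means within one sd
    of the mean on each factor separately.
    """
    p_one = math.erf(1 / math.sqrt(2))  # P(|Z_j| <= 1), roughly 0.68
    return p_one ** d

# The probability decays geometrically in the number of factors.
for d in (1, 10, 50, 100):
    print(f"d={d:4d}  P(average on every factor) = {prob_all_within_one_sd(d):.3e}")
```

Other readings of "average" (for instance a bound on $\|Z\|$ itself) lead to the same qualitative conclusion, since $\|Z\|$ grows like $\sqrt{d}$.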
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
d dd INZ 0~21Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
(as )
Distance tends to non-random constant
d
d
dd INZ 0~2
)1(221 pOdZZ
1Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
)1(221 pOdZZ
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to Where do they all go
(we can only perceive 3 dimrsquons)
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
As Vectors From Origin
Thanks to memberstripodcom
d
d
dd INZ 0~21Z
1198851
119885 2
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
- Again 1st order structure is non-random
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension n
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
n
d ddn INZZ 0~1
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
(Modulo Rotation)
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
ldquoRandomnessrdquo appears
only in rotation of simplex
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Hyperplane Generated by Data
dimensional hyperplane1n
d ddn INZZ 0~1
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
HDLSS Asymptotics
Personal Observations
HDLSS world ishellip
Surprising (many times)
[Think Irsquove got it and then hellip]
Mathematically Beautiful ()
Practically Relevant
HDLSS Asymptotics
HDLSS Asymptotics Simple Paradoxes
Study Ideas From
Hall Marron and Neeman (2005)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquond
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Where are Data
Near Peak of Density
Thanks to psycnetapaorg
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
(measure how close to peak)
d
d
dd
d
IN
Z
Z
Z 0~1
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
Euclidean Distance to Origin (as )
d
d
dd
d
IN
Z
Z
Z 0~1
)1(pOdZ
212
1
2 ~ dOdZZ pd
d
j j
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
  Consider Volume of Unit Sphere in $\mathbb{R}^d$
  Find As Integral In Spherical Coordinates
  Look At Integrand w.r.t. the radius
  Can Show Puts ~ All Weight Near the surface (radius 1)
  Lebesgue Measure Pushes Mass Out, Density Pulls Data In
  $\sqrt{d}$ Is The Balance Point
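Both halves of this resolution can be made concrete with a short calculation. The sketch below is an illustration under stated assumptions, not part of the original slides. The helper names `shell_fraction` and `radial_mode` are hypothetical. It also assumes the standard radial density of $\|Z\|$, proportional to $r^{d-1} e^{-r^2/2}$, where the factor $r^{d-1}$ comes from Lebesgue measure and $e^{-r^2/2}$ from the Gaussian density.

```python
import math


def shell_fraction(d, eps):
    """Fraction of the unit ball's volume in R^d lying within eps of its surface.

    Volume scales like r^d, so the inner ball of radius 1 - eps
    holds a share (1 - eps)^d of the total.
    """
    return 1.0 - (1.0 - eps) ** d


def radial_mode(d, grid=20000):
    """Grid-search the maximizer of r^(d-1) * exp(-r^2 / 2) over r > 0."""
    r_max = 2.0 * math.sqrt(d) + 5.0
    best_r, best_log = None, -math.inf
    for i in range(1, grid + 1):
        r = r_max * i / grid
        # Work on the log scale to avoid overflow for large d.
        log_density = (d - 1) * math.log(r) - 0.5 * r * r
        if log_density > best_log:
            best_r, best_log = r, log_density
    return best_r


if __name__ == "__main__":
    for d in (3, 100, 1000):
        print(f"d={d:5d}  outer 1% shell holds {shell_fraction(d, 0.01):.5f} of the volume, "
              f"radial mode={radial_mode(d):.3f}, sqrt(d)={math.sqrt(d):.3f}")
```

The first function shows Lebesgue measure pushing mass outward, since almost all of the volume sits in a thin outer shell once $d$ is large. The second shows the balance point: setting the derivative of $(d-1)\log r - r^2/2$ to zero gives $r = \sqrt{d-1} \approx \sqrt{d}$.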
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence:
"Average People"
Parents' Lament:
Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
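One way to put a number on "impossible over many factors" is to ask how likely it is that a person is typical on every one of $d$ independent traits at once. This reading is an assumption made for illustration, not a statement from the slides. Each trait is modelled as an independent standard normal, and the function name `prob_all_traits_typical` is hypothetical.

```python
import math


def prob_all_traits_typical(d, width=1.0):
    """P(|Z_j| <= width for every j = 1..d), with Z_j independent N(0, 1)."""
    # For a single trait, P(|Z_j| <= width) = erf(width / sqrt(2)).
    p_single = math.erf(width / math.sqrt(2.0))
    # Independence turns the joint probability into a product.
    return p_single ** d


if __name__ == "__main__":
    for d in (1, 10, 100):
        print(f"d={d:4d}  P(all traits within 1 sd) = {prob_all_traits_typical(d):.3e}")
```

A single trait is within one standard deviation of the mean about 68% of the time, but the joint probability decays geometrically in $d$, so nobody is average on all factors when there are many of them.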
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
Distance tends to non-random constant:
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go?
(we can only perceive 3 dim'ns)
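The claim that all pairwise distances among $Z_1, \dots, Z_n$ settle near $\sqrt{2d}$ can be checked directly. The following is a minimal sketch with hypothetical names (`pairwise_distances`) and arbitrary choices of $n$, $d$ and seed. It needs Python 3.8 or later for `math.dist`.

```python
import math
import random


def pairwise_distances(n, d, seed=0):
    """Euclidean distances between all pairs among n draws from N_d(0, I_d)."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    dists = []
    for i in range(n):
        for j in range(i + 1, n):
            dists.append(math.dist(points[i], points[j]))
    return dists


if __name__ == "__main__":
    n = 5
    for d in (10, 100, 2000):
        dists = pairwise_distances(n, d)
        target = math.sqrt(2 * d)
        # Report the range of the ratios distance / sqrt(2d) over all pairs.
        ratios = [x / target for x in dists]
        print(f"d={d:5d}  sqrt(2d)={target:7.3f}  "
              f"min ratio={min(ratios):.3f}  max ratio={max(ratios):.3f}")
```

For small $d$ the ratios scatter widely; for large $d$ every pair sits at nearly the same distance, which is why the points end up looking like the vertices of a regular simplex.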
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why?
(we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin:
[Figure: vectors $Z_1$ and $Z_2$ drawn from the origin; image thanks to members.tripod.com]
$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal?
- Where do they all go?
  (again our perceptual limitations)
- Again 1st order structure is non-random
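The near-orthogonality of independent Gaussian vectors is also easy to simulate. This sketch is illustrative only; `angle_degrees` and `sample_angles` are hypothetical helper names, and the number of pairs and the seed are arbitrary.

```python
import math
import random


def angle_degrees(u, v):
    """Angle between u and v, viewed as vectors from the origin, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def sample_angles(d, n_pairs=50, seed=0):
    """Angles between n_pairs independent pairs (Z_1, Z_2) drawn from N_d(0, I_d)."""
    rng = random.Random(seed)
    angles = []
    for _ in range(n_pairs):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_degrees(z1, z2))
    return angles


if __name__ == "__main__":
    for d in (10, 100, 1000):
        worst = max(abs(a - 90.0) for a in sample_angles(d))
        print(f"d={d:5d}  largest deviation from 90 degrees: {worst:6.2f}")
```

The largest deviation from $90^\circ$ shrinks roughly like $d^{-1/2}$, in line with the $O_p(d^{-1/2})$ term.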
HDLSS Asy's: Geometrical Represent'n
Hall, Marron & Neeman (2005)
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\approx \sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex
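The simplex picture can be verified through the Gram matrix of the data. A point configuration is determined, up to rotation and reflection, by its matrix of inner products, so if $\langle Z_i, Z_k \rangle / d$ is close to the identity matrix, then the points $Z_i / \sqrt{d}$ are close to a rotated copy of the unit simplex vertices $e_1, \dots, e_n$. The sketch below is an illustration under that reasoning, not code from the source; `scaled_gram` and `max_deviation_from_identity` are hypothetical names.

```python
import random


def scaled_gram(n, d, seed=0):
    """Matrix of inner products <Z_i, Z_k> / d for n draws from N_d(0, I_d)."""
    rng = random.Random(seed)
    z = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[sum(a * b for a, b in zip(z[i], z[k])) / d for k in range(n)]
            for i in range(n)]


def max_deviation_from_identity(gram):
    """Largest absolute entrywise difference between gram and the identity."""
    n = len(gram)
    return max(abs(gram[i][k] - (1.0 if i == k else 0.0))
               for i in range(n) for k in range(n))


if __name__ == "__main__":
    n = 4
    for d in (10, 100, 5000):
        deviation = max_deviation_from_identity(scaled_gram(n, d))
        print(f"d={d:5d}  max |Gram/d - I| = {deviation:.4f}")
```

Diagonal entries near 1 correspond to "nearly equidistant to 0" at distance $\approx \sqrt{d}$, and off-diagonal entries near 0 correspond to near-orthogonality. What stays random is only the orientation of the simplex.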
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data:
$(n - 1)$-dimensional hyperplane
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$):
$\|Z\|^2 = \sum_{j=1}^d Z_j^2 = d + O_p(d^{1/2})$, so $\|Z\| = \sqrt{d} + O_p(1)$
(measure how close to peak)
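A short worked derivation of the distance-to-origin rate, added here as a sketch (the slides state only the result). It uses the standard facts $E Z_j^2 = 1$ and $\operatorname{Var}(Z_j^2) = 2$ for a standard normal coordinate.

```latex
\begin{align*}
E\|Z\|^2 &= \sum_{j=1}^d E Z_j^2 = d, \qquad
\operatorname{Var}\|Z\|^2 = \sum_{j=1}^d \operatorname{Var}(Z_j^2) = 2d, \\
\|Z\|^2 &= d + O_p(d^{1/2}), \\
\|Z\| &= \sqrt{d}\,\bigl(1 + O_p(d^{-1/2})\bigr)^{1/2}
       = \sqrt{d}\,\bigl(1 + O_p(d^{-1/2})\bigr)
       = \sqrt{d} + O_p(1).
\end{align*}
```

The fluctuation of $\|Z\|$ stays bounded while its typical size grows like $\sqrt{d}$, which is why the distance looks non-random to first order.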
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by density w.r.t. Lebesgue Measure
HDLSS Asymptotics: Simple Paradoxes
- Paradox resolved by density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Spherical Coordinates
Look At Integrand w.r.t. radius $r$
Can Show Puts ~ All Weight Near $r = 1$ (the surface)
Lebesgue Measure Pushes Mass Out, Density Pulls Data In, $\sqrt{d}$ Is The Balance Point
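One way to make the balance point explicit is to write the density of the radius $R = \|Z\|$. This is a sketch under the standard polar-coordinates change of variables, not a formula from the slides: the factor $r^{d-1}$ comes from Lebesgue measure (surface area of the sphere of radius $r$) and the factor $e^{-r^2/2}$ from the Gaussian density.

```latex
\begin{align*}
f_R(r) &\propto r^{d-1} e^{-r^2/2}, \qquad r > 0, \\
\frac{d}{dr} \log f_R(r) &= \frac{d-1}{r} - r = 0
\quad\Longrightarrow\quad r = \sqrt{d-1} \approx \sqrt{d}.
\end{align*}
```

The growing factor $r^{d-1}$ pushes mass outward, the decaying factor $e^{-r^2/2}$ pulls it back, and the two balance at radius about $\sqrt{d}$ even though the Gaussian density itself peaks at the origin.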
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: "Average People"
Parents' Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$): $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$ (independent $X_1$, $X_2$)
Can extend to $Z_1, \dots, Z_n$
Where do they all go? (we can only perceive 3 dim'ns)
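The two distance claims can be checked by simulation. The following is a minimal sketch using only the Python standard library; the helper names (`standard_normal_vector`, `euclidean_distance`, `distance_summary`) and the choices $d = 10000$, $n = 5$ are assumptions made for illustration.

```python
import math
import random


def standard_normal_vector(d, rng):
    """One draw from N_d(0, I_d) as a list of floats."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_summary(d, n=5, seed=0):
    """Return (norms, pairwise distances) for n independent N_d(0, I_d) draws."""
    rng = random.Random(seed)
    zs = [standard_normal_vector(d, rng) for _ in range(n)]
    origin = [0.0] * d
    norms = [euclidean_distance(z, origin) for z in zs]
    pairwise = [euclidean_distance(zs[i], zs[k])
                for i in range(n) for k in range(i + 1, n)]
    return norms, pairwise


if __name__ == "__main__":
    d = 10000
    norms, pairwise = distance_summary(d)
    print("sqrt(d)  =", round(math.sqrt(d), 2),
          "norms:", [round(v, 1) for v in norms])
    print("sqrt(2d) =", round(math.sqrt(2 * d), 2),
          "pairwise:", [round(v, 1) for v in pairwise])
```

All norms should land within a few units of $\sqrt{d}$ and all pairwise distances within a few units of $\sqrt{2d}$, with spread that does not grow with $d$.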
HDLSS Asymptotics: Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics: Simple Paradoxes
For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin: $\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
[Figure: vectors $Z_1$ and $Z_2$ drawn from the origin. Thanks to members.tripod.com]
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
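A brief derivation of the angle rate, given as a sketch (the slides state only the result). It combines $\|Z_i\| = \sqrt{d} + O_p(1)$ with the fact that the inner product of two independent standard normal vectors has mean 0 and variance $d$; the error term in the last line is in radians.

```latex
\begin{align*}
\langle Z_1, Z_2 \rangle &= \sum_{j=1}^d Z_{1j} Z_{2j} = O_p(d^{1/2}),
\qquad \|Z_1\|\,\|Z_2\| = d\,\bigl(1 + O_p(d^{-1/2})\bigr), \\
\cos \mathrm{Angle}(Z_1, Z_2)
  &= \frac{\langle Z_1, Z_2 \rangle}{\|Z_1\|\,\|Z_2\|} = O_p(d^{-1/2}), \\
\mathrm{Angle}(Z_1, Z_2) &= \arccos\bigl(O_p(d^{-1/2})\bigr)
  = 90^\circ + O_p(d^{-1/2}).
\end{align*}
```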
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension n
d ddn INZZ 0~1
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
n
d ddn INZZ 0~1
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
(Modulo Rotation)
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Subspace Generated by Data
Hyperplane through 0
of dimension
Points are ldquonearly equidistant to 0rdquo
amp dist
Within plane can
ldquorotate towards Unit Simplexrdquo
All Gaussian data sets are
ldquonear Unit Simplex Verticesrdquo
ldquoRandomnessrdquo appears
only in rotation of simplex
n
d ddn INZZ 0~1
d
d
Hall Marron amp Neeman (2005)
HDLSS Asyrsquos Geometrical Representrsquon
Assume let
Study Hyperplane Generated by Data
dimensional hyperplane1n
d ddn INZZ 0~1
HDLSS Asymptotics Simple Paradoxes
As
-Data lie roughly on surface of sphere
with radius
- Yet origin is point of highest density
- Paradox resolved by
density w r t Lebesgue Measure
d
)1(pOdZ
d
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates
Look At Integrand wrt Can Show Puts ~ All Weight Near
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
Theorem Impossible (over many factors)
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
d dd INZ 0~21Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
(as )
Distance tends to non-random constant
d
d
dd INZ 0~2
)1(221 pOdZZ
1Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go? (we can only perceive 3 dim'ns)
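The $\sqrt{2d}$ distance between two independent standard normal vectors can also be checked numerically. This is a hedged sketch, not taken from the slides; the function names, the number of pairs, and the seed are illustrative assumptions.

```python
import math
import random


def gaussian_vector(d, rng):
    """One draw from N_d(0, I_d) as a list of floats."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_pair_distance(d, n_pairs=100, seed=1):
    """Average distance over n_pairs independent pairs (Z1, Z2)."""
    rng = random.Random(seed)
    dists = [
        distance(gaussian_vector(d, rng), gaussian_vector(d, rng))
        for _ in range(n_pairs)
    ]
    return sum(dists) / n_pairs


for d in (10, 100, 1000):
    print(f"d={d:5d}  sqrt(2d)={math.sqrt(2 * d):7.3f}  mean dist={mean_pair_distance(d):7.3f}")
```

Each coordinate of $Z_1 - Z_2$ has variance 2, which is where the extra factor $\sqrt{2}$ relative to the norm of a single vector comes from.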
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
HDLSS Asymptotics Simple Paradoxes
For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin:
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
[Figure: vectors $Z_1$ and $Z_2$ drawn from the origin. Thanks to members.tripod.com]
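The near-orthogonality of independent Gaussian vectors can be illustrated the same way. The sketch below, using only the standard library, estimates the mean and spread of the angle between independent pairs; all names and sample sizes are assumptions made for this example.

```python
import math
import random


def angle_degrees(u, v):
    """Angle in degrees between two nonzero vectors, viewed from the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def angle_summary(d, n_pairs=100, seed=2):
    """Sample mean and standard deviation of the angle over independent pairs."""
    rng = random.Random(seed)
    angles = []
    for _ in range(n_pairs):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_degrees(z1, z2))
    mean = sum(angles) / n_pairs
    sd = math.sqrt(sum((a - mean) ** 2 for a in angles) / (n_pairs - 1))
    return mean, sd


for d in (10, 100, 1000):
    mean, sd = angle_summary(d)
    print(f"d={d:5d}  mean angle={mean:7.2f} deg  sd={sd:5.2f} deg")
```

The spread should shrink roughly like $d^{-1/2}$, matching the $O_p(d^{-1/2})$ error term.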
HDLSS Asy's Geometrical Represent'n
Hall, Marron & Neeman (2005)
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", & dist. $\approx \sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
- All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
- "Randomness" appears only in rotation of simplex
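One way to probe the simplex picture numerically: the unit simplex vertices $e_1, \ldots, e_n$ have the identity as their Gram matrix, and any set of vectors with Gram matrix equal to the identity is a rotation of those vertices. So if the data, rescaled by $1/\sqrt{d}$, have a Gram matrix close to the identity, they sit near a rotated unit simplex. The sketch below checks this for a small $n$; the function names, $n = 4$, $d = 2000$, and the seed are all assumptions made for illustration.

```python
import random


def scaled_gram_matrix(n, d, seed=3):
    """Gram matrix of n draws from N_d(0, I_d), each rescaled by 1/sqrt(d)."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    # Entry (i, j) is <Z_i, Z_j> / d, the inner product of the rescaled vectors.
    return [
        [sum(a * b for a, b in zip(data[i], data[j])) / d for j in range(n)]
        for i in range(n)
    ]


def max_deviation_from_identity(gram):
    """Largest absolute entrywise difference between gram and the identity."""
    n = len(gram)
    return max(
        abs(gram[i][j] - (1.0 if i == j else 0.0))
        for i in range(n)
        for j in range(n)
    )


gram = scaled_gram_matrix(n=4, d=2000)
print(max_deviation_from_identity(gram))
```

A small deviation means squared lengths are close to $d$ and pairwise inner products are negligible next to $d$, so the only remaining randomness is the orientation of the simplex.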
HDLSS Asy's Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data: $n - 1$ dimensional hyperplane
HDLSS Asymptotics Simple Paradoxes
- Paradox resolved by
density w r t Lebesgue Measure
Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
As
Important Philosophical Consequence
ldquoAverage Peoplerdquo
Parents Lament
Why Canrsquot I Have Average Children
Theorem Impossible (over many factors)
d )1(pOdZ
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
d dd INZ 0~21Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
Euclidean Dist Between and
(as )
Distance tends to non-random constant
d
d
dd INZ 0~2
)1(221 pOdZZ
1Z
1Z 2Z
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
)1(221 pOdZZ
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant
bullFactor since
Can extend to Where do they all go
(we can only perceive 3 dimrsquons)
)1(221 pOdZZ
nZZ
1
22
2121 XsdXsdXXsd
2
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why
o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World
(we can only perceive 3 dimrsquons)
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
As Vectors From Origin
Thanks to memberstripodcom
d
d
dd INZ 0~21Z
1198851
119885 2
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
- Again 1st order structure is non-random
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0,
of dimension $n$
Points are "nearly equidistant to 0",
& dist $\sim \sqrt{d}$
Within plane, can
"rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are
"near Unit Simplex Vertices"
(Modulo Rotation)
"Randomness" appears
only in rotation of simplex
Hall, Marron & Neeman (2005)
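One way to see the simplex picture numerically: the unit simplex vertices $e_1, \dots, e_n$ have the identity as their Gram matrix, and a Gram matrix fixes a point configuration up to rotation. So if the Gram matrix of $Z_i / \sqrt{d}$ is close to the identity, the scaled data sit near a rotated unit simplex. The sketch below is an illustration added here, not taken from Hall, Marron & Neeman (2005); the helper `scaled_gram`, the seed, and the values of `n` and `d` are assumptions.

```python
import random


def scaled_gram(n, d, rng):
    """Gram matrix of Z_1/sqrt(d), ..., Z_n/sqrt(d), with Z_i ~ N_d(0, I_d)."""
    data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    # <Z_i/sqrt(d), Z_j/sqrt(d)> = <Z_i, Z_j> / d
    return [[sum(a * b for a, b in zip(zi, zj)) / d for zj in data]
            for zi in data]


rng = random.Random(1)  # fixed seed, arbitrary
n, d = 4, 20000         # small sample size, high dimension (HDLSS)
gram = scaled_gram(n, d, rng)

# Largest entrywise deviation from the identity matrix, i.e. from the
# Gram matrix of the unit simplex vertices e_1, ..., e_n.
max_dev = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
              for i in range(n) for j in range(n))

for row in gram:
    print("  ".join(f"{x: .3f}" for x in row))
print(f"max deviation from identity: {max_dev:.4f}")
```

Re-running with a different seed changes the data (a different rotation of the simplex) but should leave the Gram matrix close to the identity, which matches the statement that randomness appears only in the rotation.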
HDLSS Asy's: Geometrical Represent'n
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
HDLSS Asymptotics: Simple Paradoxes
As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence:
"Average People"
Parents Lament:
Why Can't I Have Average Children?
Theorem: Impossible (over many factors)
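A short sketch of why the norm concentrates, assuming $Z \sim N_d(0, I_d)$ as on the neighboring slides and using only standard facts about the chi-squared distribution (the slide states the result without a derivation, so this is an added gloss):

```latex
\begin{align*}
\|Z\|^2 &= \sum_{j=1}^{d} Z_j^2 \sim \chi^2_d,
  \qquad E\,\|Z\|^2 = d, \quad \operatorname{Var}\,\|Z\|^2 = 2d, \\
\|Z\|^2 &= d + O_p(d^{1/2}), \\
\|Z\| &= \sqrt{d}\,\bigl(1 + O_p(d^{-1/2})\bigr)^{1/2}
       = \sqrt{d}\,\bigl(1 + O_p(d^{-1/2})\bigr)
       = \sqrt{d} + O_p(1).
\end{align*}
```

So a point that is "average" in every coordinate (the origin) is at distance about $\sqrt{d}$ from essentially every observation, which is the sense in which average people are impossible over many factors.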
HDLSS Asymptotics: Simple Paradoxes
For $d$-dim'al Standard Normal dist'n:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go?
(we can only perceive 3 dim'ns)
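The distance statement can be checked the same way. This is a minimal sketch under illustrative assumptions (the function name `distance`, the seed and the dimensions are not from the slides), using only the Python standard library.

```python
import math
import random


def distance(d, rng):
    """Euclidean distance between two independent N_d(0, I_d) draws."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))


rng = random.Random(2)  # fixed seed, arbitrary
results = {}
for d in (10, 1000, 20000):
    observed = distance(d, rng)
    target = math.sqrt(2 * d)  # the non-random constant sqrt(2d)
    results[d] = (observed, target)
    print(f"d = {d:6d}: dist = {observed:8.2f}, sqrt(2d) = {target:8.2f}, "
          f"difference = {observed - target:6.2f}")
```

The difference column should stay of order 1 while both the distance and $\sqrt{2d}$ grow, which is what the $O_p(1)$ remainder expresses.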
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
As Vectors From Origin
Thanks to memberstripodcom
d
d
dd INZ 0~21Z
1198851
119885 2
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asymptotics Simple Paradoxes
For dimrsquoal Standard Normal distrsquon
indep of
High dimrsquoal Angles (as )
- Everything is orthogonal
- Where do they all go
(again our perceptual limitations)
- Again 1st order structure is non-random
d
d
dd INZ 0~2
)(90 2121
dOZZAngle p
1Z
HDLSS Asy's Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
- All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
- "Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)
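One way to see the "unit simplex" statement: if the rows of a data matrix are $Z_1, \ldots, Z_n$, then the Gram matrix divided by $d$ should be close to the identity $I_n$, i.e. after scaling by $1/\sqrt{d}$ the points behave like orthonormal vectors, which are the unit simplex vertices up to rotation. The sketch below assumes this reading; the function name `simplex_summary` is hypothetical.

```python
import numpy as np


def simplex_summary(n, d, seed=0):
    """Return (||Z_i|| / sqrt(d) for each i, largest |off-diagonal| of Gram / d).

    Illustrative helper, not from the slides.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))          # rows are Z_1, ..., Z_n
    gram = z @ z.T / d                       # should approach I_n as d grows
    norms = np.sqrt(np.diag(gram))           # scaled distances to the origin
    off_diag = gram[~np.eye(n, dtype=bool)]  # scaled inner products Z_i . Z_j / d
    return norms, np.abs(off_diag).max()


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        norms, worst = simplex_summary(5, d)
        print(d, np.round(norms, 3), round(worst, 3))
```

As $d$ grows the scaled norms should approach 1 and the scaled inner products should approach 0.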
HDLSS Asymptotics Simple Paradoxes
Distance tends to non-random constant:
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
- Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go?
(we can only perceive 3 dim'ns)
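A short worked derivation of the distance result, filling in steps the slide leaves implicit. It uses only independence of $Z_1$ and $Z_2$ and the moments of a scaled chi-squared variable; the coordinate notation $Z_{ij}$ is introduced here for illustration.

```latex
\begin{align*}
\|Z_1 - Z_2\|^2 &= \sum_{j=1}^{d} (Z_{1j} - Z_{2j})^2,
  \qquad Z_{1j} - Z_{2j} \sim N(0, 2), \\
E\,\|Z_1 - Z_2\|^2 &= 2d, \qquad
  \operatorname{Var}\|Z_1 - Z_2\|^2 = 8d, \\
\|Z_1 - Z_2\|^2 &= 2d + O_p(d^{1/2}), \\
\|Z_1 - Z_2\| &= \sqrt{2d}\,\bigl(1 + O_p(d^{-1/2})\bigr)^{1/2}
  = \sqrt{2d} + O_p(1).
\end{align*}
```

The same argument applies to every pair among $Z_1, \ldots, Z_n$ when $n$ is fixed.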
HDLSS Asymptotics Simple Paradoxes
Ever Wonder Why?
(we can only perceive 3 dim'ns)
- Perceptual System from Ancestors
- They Needed to Find Food
- Food Exists in 3-d World
HDLSS Asymptotics Simple Paradoxes
For $d$-dim'al Standard Normal dist'n:
$Z_1, Z_2 \sim N_d(0, I_d)$, $Z_1$ indep. of $Z_2$
High dim'al Angles (as $d \to \infty$):
As Vectors From Origin
[Figure: $Z_1$ and $Z_2$ drawn as vectors from the origin. Thanks to members.tripod.com]
HDLSS Asy's Geometrical Represent'n
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data:
- $n - 1$ dimensional hyperplane
- Points are pairwise equidistant, dist $\sim \sqrt{2d}$
- Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
- Again "randomness in data" is only in rotation
- Surprisingly rigid structure in random data
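The pairwise-equidistance statement can be illustrated with a few lines of NumPy. This is a sketch under the stated Gaussian assumption; `scaled_pairwise_distances` is a hypothetical name. It returns all pairwise distances divided by $\sqrt{2d}$, which should cluster tightly around 1 for large $d$.

```python
import numpy as np


def scaled_pairwise_distances(n, d, seed=0):
    """All n(n-1)/2 pairwise distances among Z_1..Z_n, divided by sqrt(2d).

    Illustrative helper, not from the slides.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))          # rows are Z_1, ..., Z_n
    i, j = np.triu_indices(n, k=1)           # index pairs with i < j
    dist = np.linalg.norm(z[i] - z[j], axis=1)
    return dist / np.sqrt(2 * d)


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        r = scaled_pairwise_distances(6, d)
        print(d, round(r.min(), 3), round(r.max(), 3))
```

A narrow range between the minimum and maximum ratio is what "vertices of a regular $n$-hedron" looks like numerically.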
HDLSS Asy's Geometrical Represen'tion
Simulation View: study "rigidity after rotation"
- Simple 3 point data sets
- In dimensions d = 2, 20, 200, 20000
- Generate hyperplane of dimension 2
- Rotate that to plane of screen
- Rotate within plane, to make "comparable"
- Repeat 10 times, use different colors
HDLSS Asy's Geometrical Represen'tion
Simulation View: Shows "Rigidity after Rotation"
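The simulation steps can be sketched as follows. This is one possible reading of the recipe, not the code behind the slides: the function names, the use of an SVD to find the 2-dimensional hyperplane, the choice of putting point 0 on the vertical axis, and the division by $\sqrt{2d}$ (so that side lengths are comparable across dimensions) are all assumptions made for illustration.

```python
import numpy as np


def triangle_in_screen_plane(d, rng):
    """Draw 3 points from N_d(0, I_d) and return 2-d coordinates, scaled by sqrt(2d)."""
    z = rng.standard_normal((3, d))
    centered = z - z.mean(axis=0)            # the plane through the 3 points
    # SVD: u * s gives coordinates in an orthonormal basis of that plane,
    # so pairwise distances are preserved ("rotate to plane of screen")
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    xy = u[:, :2] * s[:2]
    # rotate within the plane so point 0 lies on the positive vertical axis
    theta = np.pi / 2 - np.arctan2(xy[0, 1], xy[0, 0])
    c, sn = np.cos(theta), np.sin(theta)
    xy = xy @ np.array([[c, -sn], [sn, c]]).T
    # reflect if needed so point 1 is to the right of point 2 ("comparable")
    if xy[1, 0] < xy[2, 0]:
        xy[:, 0] *= -1
    return xy / np.sqrt(2 * d)


def side_lengths(xy):
    """Lengths of the three triangle sides."""
    pairs = ((0, 1), (0, 2), (1, 2))
    return np.array([np.linalg.norm(xy[i] - xy[j]) for i, j in pairs])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for d in (2, 20, 200, 20000):
        sides = np.array([side_lengths(triangle_in_screen_plane(d, rng))
                          for _ in range(10)])   # 10 repetitions per dimension
        print(d, round(sides.min(), 3), round(sides.max(), 3))
```

Plotting the ten triangles per dimension in different colors would give the kind of picture the slide describes; here the shrinking range of side lengths stands in for the plot.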
- Interesting Statistical Problem
- Statistical Significance of Clusters
- SigClust Gaussian null distribut'n
- SigClust Gaussian null distribut'n (2)
- SigClust Gaussian null distribut'n (3)
- SigClust Gaussian null distribut'n (4)
- SigClust Gaussian null distribut'n (5)
- SigClust Gaussian null distribut'n (6)
- SigClust Gaussian null distribut'n (7)
- SigClust Estimation of Background Noise
- Q-Q plots
- Q-Q plots (2)
- Q-Q plots (3)
- Q-Q plots (4)
- Q-Q plots (5)
- Q-Q plots (6)
- Alternate Terminology
- Q-Q plots (7)
- Q-Q plots (8)
- Q-Q plots (9)
- Q-Q plots (10)
- Q-Q plots (11)
- SigClust Estimation of Background Noise (2)
- SigClust Estimation of Background Noise (3)
- SigClust Estimation of Background Noise (4)
- SigClust Gaussian null distribut'n (8)
- SigClust Estimation of Eigenval's
- SigClust Estimation of Eigenval's (2)
- SigClust Estimation of Eigenval's (3)
- SigClust Estimation of Eigenval's (4)
- SigClust Estimation of Eigenval's (5)
- SigClust Gaussian null distribution - Simulation
- SigClust Gaussian null distribution - Simulation (2)
- An example (details to follow)
- SigClust Modalities
- SigClust Modalities (2)
- SigClust Real Data Results
- Perou 500 PCA View – real clusters
- Perou 500 DWD Dir'ns View – real clusters
- Perou 500 – Fundamental Question
- SigClust Results for Luminal A vs Luminal B
- SigClust Results for Luminal A vs Luminal B (2)
- SigClust Results for Luminal A vs Luminal B (3)
- SigClust Results for Luminal A vs Luminal B (4)
- SigClust Results for Luminal A vs Luminal B (5)
- SigClust Real Data Results (2)
- SigClust Real Data Results (3)
- SigClust Real Data Results (4)
- SigClust Real Data Results (5)
- SigClust Overview
- SigClust Overview (2)
- SigClust Overview (3)
- SigClust Overview (4)
- SigClust Open Problems
- HDLSS Asymptotics
- HDLSS Asymptotics (2)
- HDLSS Asymptotics (3)
- HDLSS Asymptotics (4)
- HDLSS Asymptotics (5)
- HDLSS Asymptotics (6)
- HDLSS Asymptotics (7)
- HDLSS Asymptotics (8)
- HDLSS Asymptotics (9)
- HDLSS Asymptotics (10)
- HDLSS Asymptotics (11)
- HDLSS Asymptotics (12)
- HDLSS Asymptotics (13)
- HDLSS Asymptotics (14)
- HDLSS Asymptotics (15)
- HDLSS Asymptotics (16)
- HDLSS Asymptotics Simple Paradoxes
- HDLSS Asymptotics Simple Paradoxes (2)
- HDLSS Asymptotics Simple Paradoxes (3)
- HDLSS Asymptotics Simple Paradoxes (4)
- HDLSS Asymptotics Simple Paradoxes (5)
- HDLSS Asymptotics Simple Paradoxes (6)
- HDLSS Asymptotics Simple Paradoxes (7)
- HDLSS Asymptotics Simple Paradoxes (8)
- HDLSS Asymptotics Simple Paradoxes (9)
- HDLSS Asymptotics Simple Paradoxes (10)
- HDLSS Asymptotics Simple Paradoxes (11)
- HDLSS Asymptotics Simple Paradoxes (12)
- HDLSS Asymptotics Simple Paradoxes (13)
- HDLSS Asymptotics Simple Paradoxes (14)
- HDLSS Asymptotics Simple Paradoxes (15)
- HDLSS Asymptotics Simple Paradoxes (16)
- HDLSS Asymptotics Simple Paradoxes (17)
- HDLSS Asymptotics Simple Paradoxes (18)
- HDLSS Asymptotics Simple Paradoxes (19)
- HDLSS Asymptotics Simple Paradoxes (20)
- HDLSS Asymptotics Simple Paradoxes (21)
- HDLSS Asymptotics Simple Paradoxes (22)
- HDLSS Asymptotics Simple Paradoxes (23)
- HDLSS Asymptotics Simple Paradoxes (24)
- HDLSS Asymptotics Simple Paradoxes (25)
- HDLSS Asymptotics Simple Paradoxes (26)
- HDLSS Asymptotics Simple Paradoxes (27)
- HDLSS Asymptotics Simple Paradoxes (28)
- HDLSS Asymptotics Simple Paradoxes (29)
- HDLSS Asymptotics Simple Paradoxes (30)
- HDLSS Asymptotics Simple Paradoxes (31)
- HDLSS Asy's Geometrical Represent'n
- HDLSS Asy's Geometrical Represent'n (2)
- HDLSS Asy's Geometrical Represent'n (3)
- HDLSS Asy's Geometrical Represent'n (4)
- HDLSS Asy's Geometrical Represent'n (5)
- HDLSS Asy's Geometrical Represent'n (6)
- HDLSS Asy's Geometrical Represent'n (7)
- HDLSS Asy's Geometrical Represent'n (8)
- HDLSS Asy's Geometrical Represent'n (9)
- HDLSS Asy's Geometrical Represent'n (10)
- HDLSS Asy's Geometrical Represent'n (11)
- HDLSS Asy's Geometrical Represen'tion
- HDLSS Asy's Geometrical Represen'tion (2)
- HDLSS Asy's Geometrical Represen'tion (3)
- HDLSS Asy's Geometrical Represen'tion (4)
- HDLSS Asy's Geometrical Represen'tion (5)
- HDLSS Asy's Geometrical Represen'tion (6)