TRANSCRIPT
Sparse Representation and Efficient Inference for SDSS Spectra

Joseph Richards ([email protected])
Department of Statistics, Carnegie Mellon University
International Computational Astrostatistics Group, www.incagroup.org

ADA V Presentation 09/05/2008 – p.1
Motivation

Astronomical spectra are data in a very high dimensional space.

Our Goal: Estimate physically-important parameters for a large database of galaxies through their spectra.
Sloan Digital Sky Survey (SDSS) spectra: 3850 wavelength bins per spectrum
[Figure: three example SDSS spectra; flux (10^−17 erg/s/cm^2/Å) vs. wavelength (4000–8500 Å)]
Outline

Methods
Principal Components Analysis
Diffusion Map
Adaptive Regression

Applications
Redshift prediction
Age and Metallicity estimation

Future Work
Intro: Dimensionality Reduction

In high dimensions, determining relationships between data points is difficult (the curse of dimensionality).
Explore whether there is a simple, lower-dimensional representation of the data.
Make statistical inferences with this lower-dimensional parameterization of the data.
In reducing dimensionality, what properties of the original data set do we wish to retain?
Methods: Principal Components Analysis (PCA)

For data X (n × p), compute Σ = Cov(X). Perform the spectral decomposition of Σ:

U^T Σ U = Λ

Λ is the diagonal matrix of eigenvalues of Σ; U is a matrix of the corresponding eigenvectors.
Project the data X onto the eigenvectors corresponding to the largest m ≪ p eigenvalues.
PCA projects the original data onto the m-dimensional hyperplane that maximizes the variance of the new data.
The low-dimensional PCA representation optimally captures all Euclidean distances between the original data points.
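As a concrete illustration of the procedure above, here is a minimal Python/NumPy sketch (function and variable names are my own, not from the talk) that eigendecomposes the sample covariance and projects onto the top-m eigenvectors:

```python
import numpy as np

def pca_project(X, m):
    """Project the rows of X (n x p) onto the top-m principal components."""
    Xc = X - X.mean(axis=0)              # center the data
    Sigma = np.cov(Xc, rowvar=False)     # p x p covariance matrix
    eigvals, U = np.linalg.eigh(Sigma)   # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]    # sort eigenvalues, largest first
    U_m = U[:, order[:m]]                # top-m eigenvectors (p x m)
    return Xc @ U_m                      # n x m projected data

# Toy usage: 100 points in 5 dimensions reduced to 2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = pca_project(X, 2)
print(Y.shape)   # (100, 2)
```

By construction, the first projected coordinate carries at least as much variance as the second.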
PCA Example

[Figure from The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman, p. 488]
Drawbacks to PCA

PCA is a global technique:
1) it fails to capture local heterogeneous features within a data point
2) it preserves all (Euclidean) distances between data points
PCA is linear; it projects data onto a hyperplane.
It is not robust to noise or outliers.

Better strategy: capture the intrinsic geometry of the data set (i.e. the connectivity of the data).
Methods: Diffusion Maps

Goal: learn the intrinsic (lower-dimensional) geometry of a dataset in R^p.
In high dimensions, distance measures are usually only relevant locally.

Diffusion Map Idea:
Take a distance measure that makes sense locally, e.g. s(x, y) = ‖x − y‖_2
Use s to define a distance metric, D(x, y), that reflects the connectivity of data points x and y.
Preserve this distance when reducing dimensionality.
Diffusion Maps: Formulation

Start with a local weight matrix,

w(x, y) = exp(−s^2(x, y) / ε)

where ε is a tuning parameter. The diffusion kernel is

p1(x, y) = w(x, y) / Σ_z w(x, z)

p1(x, y) can be interpreted as the transition probability of a Markov random walk on the data.
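The weight matrix and kernel above can be sketched in a few lines of NumPy; this is a minimal illustration assuming s(x, y) is the Euclidean distance (names are my own):

```python
import numpy as np

def diffusion_kernel(X, eps):
    """Build the Markov transition matrix p1 from Gaussian local weights."""
    # pairwise squared distances s^2(x, y)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / eps)                     # local weight matrix w(x, y)
    return W / W.sum(axis=1, keepdims=True)   # row-normalize: p1(x, y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
P = diffusion_kernel(X, eps=1.0)
# each row of P is a probability distribution over the data points
print(np.allclose(P.sum(axis=1), 1.0))   # True
```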
Diffusion Maps: Formulation

Running this Markov chain forward in time, t, is equivalent to propagating the local influence of each data point with its neighbors.
Define the diffusion distance (Coifman and Lafon [2006]) D_t for time t > 0 as

D_t^2(x, y) = ‖p_t(x, ·) − p_t(y, ·)‖_2^2 = Σ_{z∈Ω} (p_t(x, z) − p_t(y, z))^2

D_t(x, y) is closely related to p_t(x, y).
The diffusion distance, D_t, between two points is related to their connectivity.
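To make the definition concrete, here is a small sketch (my own names, not from the talk) that evaluates D_t directly by powering the transition matrix:

```python
import numpy as np

def diffusion_distance(P, t, x, y):
    """Diffusion distance D_t(x, y) computed directly from its definition.

    P is the one-step transition matrix p1; running the chain for t steps
    gives p_t = P^t, and D_t is the L2 distance between the rows of p_t
    for points x and y. Illustrative only (O(n^3) per matrix power).
    """
    Pt = np.linalg.matrix_power(P, t)     # p_t(., .)
    return float(np.sqrt(((Pt[x] - Pt[y]) ** 2).sum()))

# toy symmetric 3-state chain
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
print(round(diffusion_distance(P, 2, 0, 1), 3))   # 0.693
```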
Ex: Diffusion vs. Euclidean Distances

[Figure: 2-D scatter of data points with two points labeled A and B]

Diffusion distance more accurately describes the amount of dissimilarity between A and B (Lafon and Lee [2006]).
Diffusion Maps: Dimension Reduction

Perform the spectral decomposition of P^t:

p_t(x, y) = Σ_{j≥0} λ_j^t ψ_j(x) φ_j(y)

where, as in PCA, we retain the m eigenvectors corresponding to the m largest nontrivial eigenvalues.
The m-dimensional diffusion map of the data is

Ψ_t : x ↦ (λ_1^t ψ_1(x), λ_2^t ψ_2(x), ..., λ_m^t ψ_m(x))

where D_t^2(x, y) ≈ ‖Ψ_t(x) − Ψ_t(y)‖^2.
Instead of capturing Euclidean distances (PCA), the low-dimensional representation captures diffusion distances.
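The embedding above can be sketched as follows; this is a minimal version under the assumption that P is built from a symmetric weight matrix (so its spectrum is real), with all names my own:

```python
import numpy as np

def diffusion_map(P, m, t=1):
    """m-dimensional diffusion map Psi_t via the spectral decomposition of P.

    The trivial eigenpair (lambda_0 = 1, constant eigenvector) is dropped,
    and the next m right eigenvectors are scaled by lambda_j^t.
    """
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)                # largest eigenvalue first
    vals = eigvals.real[order]
    vecs = eigvecs.real[:, order]
    return vecs[:, 1:m + 1] * vals[1:m + 1] ** t     # skip trivial j = 0

# toy usage: Gaussian weights on random data, row-normalized to a Markov matrix
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 10))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq / 2.0)
P = W / W.sum(axis=1, keepdims=True)
Psi = diffusion_map(P, m=3)
print(Psi.shape)   # (40, 3)
```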
Results: z prediction

Sample of 3846 SDSS galaxy spectra, colored by log(z_SDSS)

[Figure: 3-D embeddings of the spectra, colored by log(redshift). Left: PC Map (axes PC1, PC2, PC3; L2 distances). Right: Diffusion Map (axes λ_1ψ_1, λ_2ψ_2, λ_3ψ_3).]
Methods: Adaptive Regression

Any smooth regression function r can be written as

r(x) = Σ_{j=1}^∞ β_j ψ_j(x)

where ψ_1, ψ_2, ... form an orthonormal basis and x ∈ R^p.
Generally, the choice of ψ_j is arbitrary (e.g. a high-dimensional cosine basis, wavelets, curvelets, etc.).
Regression function estimate: r̂(x) = Σ_{j=1}^J β̂_j ψ_j(x)
Adaptive framework: use the learned orthogonal function estimates ψ̂_1, ..., ψ̂_m for X ⊂ R^p from PCA or diffusion maps.
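A minimal sketch of the adaptive framework above: the columns of Psi stand in for the learned basis functions ψ_j evaluated at the data (from PCA or a diffusion map), and the β_j are estimated by ordinary least squares. All names here are illustrative, not from the talk:

```python
import numpy as np

def fit_adaptive(Psi, y, J):
    """Least-squares estimate of the first J basis coefficients beta_j.

    J would be chosen by cross-validation in practice.
    """
    A = np.column_stack([np.ones(len(y)), Psi[:, :J]])  # intercept + J coords
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, A @ beta                  # coefficients, fitted values

# toy usage: response depending on the first two coordinates
rng = np.random.default_rng(3)
Psi = rng.normal(size=(200, 20))           # stand-in for learned coordinates
y = 2.0 * Psi[:, 0] - Psi[:, 1] + 0.1 * rng.normal(size=200)
beta, fitted = fit_adaptive(Psi, y, J=5)   # beta[1] ~ 2, beta[2] ~ -1
```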
Results: z prediction

CV Pred. Risk versus J

[Figure: CV prediction risk vs. number of basis functions, for the PC and Diffusion Map bases]

Find PC and diffusion map coefficients for all 3846 spectra.
Regress only on spectra with z_SDSS CL > 0.99 (2796 galaxies, 73%).
Choose the diffusion map ε that minimizes CV prediction risk.
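The model-selection step above (choosing the dimension J that minimizes CV prediction risk) can be sketched as follows; this is my own minimal version, not the authors' code:

```python
import numpy as np

def cv_risk(Psi, y, J, n_folds=10, seed=0):
    """10-fold CV prediction risk for a linear fit on the first J coordinates."""
    n = len(y)
    fold = np.random.default_rng(seed).permutation(n) % n_folds
    errs = []
    for k in range(n_folds):
        tr, te = fold != k, fold == k
        A_tr = np.column_stack([np.ones(tr.sum()), Psi[tr, :J]])
        A_te = np.column_stack([np.ones(te.sum()), Psi[te, :J]])
        beta, *_ = np.linalg.lstsq(A_tr, y[tr], rcond=None)
        errs.append(((y[te] - A_te @ beta) ** 2).mean())  # held-out squared error
    return float(np.mean(errs))

# choose J minimizing the CV risk over a grid (true model uses 2 coordinates)
rng = np.random.default_rng(4)
Psi = rng.normal(size=(300, 30))
y = Psi[:, 0] + 0.5 * Psi[:, 1] + 0.1 * rng.normal(size=300)
risks = {J: cv_risk(Psi, y, J) for J in (1, 2, 5, 20)}
best_J = min(risks, key=risks.get)   # J = 1 underfits, so best_J > 1
```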
Results: z prediction

Predicted log(1 + z) vs. log(1 + z_SDSS)
Optimal model: 20 diffusion map coordinates

[Figure: 10-fold CV predicted log(1 + z) vs. SDSS log(1 + z). Left: the 2796 galaxies with z_SDSS CL > 0.99. Right: galaxies with z_SDSS CL ≤ 0.99, split into 0.95 < CL < 0.99 (663 galaxies), 0.9 < CL < 0.95 (246 galaxies), and CL < 0.9 (141 galaxies).]
Problem 2: Age & Metallicity

Our main goal: Estimate the age and metallicity of SDSS galaxies.
Training set: 1146 synthetic spectra from Bruzual and Charlot [2003] population synthesis models (Cid Fernandes et al. [2005], www.starlight.ufsc.br).
For each synthetic spectrum, we know age & Z.
Maps for synthetic spectra

[Figure: 3-D embeddings of the simulated spectra, colored by log(age). Left: PC Map (axes PC1, PC2, PC3; L2 distances). Right: Diffusion Map of the STARLIGHT spectra (axes λ_1ψ_1, λ_2ψ_2, λ_3ψ_3; log10(age) in color).]
Results: Age, Z prediction

[Figure: STARLIGHT actual vs. fitted log(Age) (left) and log(Z) (right), for the best fit (ε = 0.02, 188 variables chosen by CV).]
Future Work

Redshift Prediction
Apply redshift prediction techniques to more SDSS data.
Predict redshift using continuum-subtracted spectra.

Age & Metallicity Estimation
Predict ages and metallicities for SDSS galaxies.
Compare results to the template-matching method of Cid Fernandes et al. [2005].
Conclusions

1. Diffusion map is an effective and efficient dimensionality reduction technique.
2. It outperforms linear methods (PCA) in regression tasks on real astronomical data sets.
3. Adaptive regression is used to perform statistical inference and choose the optimal dimensionality.
4. The methods are applicable to a wide variety of astronomical data sets (e.g. spectra, images, multi-wavelength data sets) and a variety of tasks (e.g. regression, classification, clustering).
References

G. Bruzual and S. Charlot. Stellar population synthesis at the resolution of 2003. MNRAS, 344:1000–1028, Oct. 2003.

R. Cid Fernandes, A. Mateus, L. Sodré, G. Stasinska, and J. M. Gomes. Semi-empirical analysis of Sloan Digital Sky Survey galaxies - I. Spectral synthesis method. MNRAS, 358:363–378, Apr. 2005.

R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, July 2006. URL http://dx.doi.org/10.1016/j.acha.2006.04.006.

S. Lafon and A. B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), September 2006.
Appendix: Epsilon Selection

[Figure: minimum CV prediction risk vs. log10(ε)]
Appendix: Eigenvalue Dropoff

[Figure: eigenvalue dropoff for the PC Map (left) and the Diffusion Map (right) of the SDSS spectra]
Appendix: Kernel Regression

[Figure: SDSS log(z) vs. CV predicted log(z). a) Linear regression on 3 diffusion coordinates (CV risk = 36.6). b) Kernel regression on 3 diffusion coordinates (CV risk = 36.2).]
Appendix: SDSS-SL calibration

The following steps are performed to compare synthetic and SDSS spectra:
1. SDSS spectra are brought to rest-frame wavelengths (based on the SDSS estimate of z)
2. Emission lines in SDSS spectra are masked out
3. SL spectra are rebinned to match the SDSS binning
4. SL and SDSS spectra are normalized to sum to 1 in the overlapping wavelength range
5. The L2 distance is computed between all SL and SDSS spectra, with either
   a. SDSS flux errors accounted for, or
   b. SDSS spectra smoothed first
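The calibration steps above can be sketched in a few lines. This is a simplified illustration (hypothetical function and variable names, linear-interpolation rebinning, flux errors and smoothing ignored, and the SL grid assumed to cover the SDSS rest-frame range):

```python
import numpy as np

def calibrate_and_compare(sdss_wave, sdss_flux, z, sl_wave, sl_flux, mask):
    """Compare one SDSS spectrum with one synthetic (SL) spectrum."""
    rest_wave = sdss_wave / (1.0 + z)          # 1. shift to rest-frame wavelengths
    keep = ~mask                               # 2. drop masked emission-line bins
    rest_wave, flux = rest_wave[keep], sdss_flux[keep]
    sl_on_grid = np.interp(rest_wave, sl_wave, sl_flux)   # 3. rebin SL spectrum
    flux = flux / flux.sum()                   # 4. normalize both to sum to 1
    sl_on_grid = sl_on_grid / sl_on_grid.sum()
    return np.sqrt(((flux - sl_on_grid) ** 2).sum())      # 5. L2 distance

# sanity check: a spectrum compared with its own redshifted version -> distance ~ 0
wave = np.linspace(4000.0, 8000.0, 200)        # rest-frame SL wavelength grid
flux = 1.0 + np.exp(-((wave - 6000.0) / 300.0) ** 2)
z = 0.1
d = calibrate_and_compare(wave * (1 + z), flux, z, wave, flux,
                          np.zeros(wave.size, dtype=bool))
```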