TRANSCRIPT
Sparse Representation and Efficient Inference for SDSS Spectra

Joseph Richards ([email protected])
Department of Statistics, Carnegie Mellon University
International Computational Astrostatistics Group, www.incagroup.org

ADA V Presentation 09/05/2008 – p.1
Motivation

Astronomical spectra are data in a very high dimensional space.

Our Goal: Estimate physically-important parameters for a large database of galaxies through their spectra.
Sloan Digital Sky Survey (SDSS) spectra: 3850 wavelength bins per spectrum
[Figure: three example SDSS spectra; flux (10^−17 erg/s/cm^2/Å) vs. wavelength (4000–8500 Å)]
Outline

Methods
Principal Components Analysis
Diffusion Map
Adaptive Regression

Applications
Redshift prediction
Age and Metallicity estimation

Future Work
Intro: Dimensionality Reduction

In high dimensions, determining relationships between data points is difficult (the curse of dimensionality).
Explore whether there is a simple, lower-dimensional representation of the data.
Make statistical inferences with this lower-dimensional parameterization of the data.
In reducing dimensionality, what properties of the original data set do we wish to retain?
Methods: Principal Components Analysis (PCA)

For data X (n × p), compute Σ = Cov(X). Perform the spectral decomposition of Σ:

U^T Σ U = Λ

Λ is the diagonal matrix of eigenvalues of Σ; U is a matrix of the corresponding eigenvectors.
Project the data X onto the eigenvectors corresponding to the largest m ≪ p eigenvalues.
PCA projects the original data onto the m-dimensional hyperplane that maximizes the variance of the new data.
The low-dimensional PCA representation optimally captures all Euclidean distances between the original data points.
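As a concrete illustration of the procedure above, here is a minimal Python/NumPy sketch (function and variable names are my own, not from the talk) that eigendecomposes the sample covariance and projects onto the top-m eigenvectors:

```python
import numpy as np

def pca_project(X, m):
    """Project the rows of X (n x p) onto the top-m principal components."""
    Xc = X - X.mean(axis=0)              # center the data
    Sigma = np.cov(Xc, rowvar=False)     # p x p covariance matrix
    eigvals, U = np.linalg.eigh(Sigma)   # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]    # sort eigenvalues, largest first
    U_m = U[:, order[:m]]                # top-m eigenvectors (p x m)
    return Xc @ U_m                      # n x m projected data

# Toy usage: 100 points in 5 dimensions reduced to 2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = pca_project(X, 2)
print(Y.shape)   # (100, 2)
```

By construction, the first projected coordinate carries at least as much variance as the second.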
PCA Example

[Figure from The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman, p. 488]
Drawbacks to PCA

PCA is a global technique:
1) it fails to capture local heterogeneous features within a data point
2) it preserves all (Euclidean) distances between data points
PCA is linear; it projects data onto a hyperplane.
It is not robust to noise or outliers.

Better strategy: capture the intrinsic geometry of the data set (i.e. the connectivity of the data).
Methods: Diffusion Maps

Goal: learn the intrinsic (lower-dimensional) geometry of a dataset in R^p.
In high dimensions, distance measures are usually only relevant locally.

Diffusion Map Idea:
Take a distance measure that makes sense locally, e.g. s(x, y) = ‖x − y‖_2
Use s to define a distance metric, D(x, y), that reflects the connectivity of data points x and y.
Preserve this distance when reducing dimensionality.
Diffusion Maps: Formulation

Start with a local weight matrix,

w(x, y) = exp(−s^2(x, y) / ε)

where ε is a tuning parameter. The diffusion kernel is

p1(x, y) = w(x, y) / Σ_z w(x, z)

p1(x, y) can be interpreted as the transition probability of a Markov random walk on the data.
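The weight matrix and kernel above can be sketched in a few lines of NumPy; this is a minimal illustration assuming s(x, y) is the Euclidean distance (names are my own):

```python
import numpy as np

def diffusion_kernel(X, eps):
    """Build the Markov transition matrix p1 from Gaussian local weights."""
    # pairwise squared distances s^2(x, y)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / eps)                     # local weight matrix w(x, y)
    return W / W.sum(axis=1, keepdims=True)   # row-normalize: p1(x, y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
P = diffusion_kernel(X, eps=1.0)
# each row of P is a probability distribution over the data points
print(np.allclose(P.sum(axis=1), 1.0))   # True
```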
Diffusion Maps: Formulation

Running this Markov chain forward in time, t, is equivalent to propagating the local influence of each data point with its neighbors.
Define the diffusion distance (Coifman and Lafon [2006]) D_t for time t > 0 as

D_t^2(x, y) = ‖p_t(x, ·) − p_t(y, ·)‖_2^2 = Σ_{z∈Ω} (p_t(x, z) − p_t(y, z))^2

D_t(x, y) is closely related to p_t(x, y).
The diffusion distance, D_t, between two points is related to their connectivity.
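To make the definition concrete, here is a small sketch (my own names, not from the talk) that evaluates D_t directly by powering the transition matrix:

```python
import numpy as np

def diffusion_distance(P, t, x, y):
    """Diffusion distance D_t(x, y) computed directly from its definition.

    P is the one-step transition matrix p1; running the chain for t steps
    gives p_t = P^t, and D_t is the L2 distance between the rows of p_t
    for points x and y. Illustrative only (O(n^3) per matrix power).
    """
    Pt = np.linalg.matrix_power(P, t)     # p_t(., .)
    return float(np.sqrt(((Pt[x] - Pt[y]) ** 2).sum()))

# toy symmetric 3-state chain
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
print(round(diffusion_distance(P, 2, 0, 1), 3))   # 0.693
```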
Ex: Diffusion vs. Euclidean Distances

[Figure: 2-D scatter of data points with two points labeled A and B]

Diffusion distance more accurately describes the amount of dissimilarity between A and B (Lafon and Lee [2006]).
Diffusion Maps: Dimension Reduction

Perform the spectral decomposition of P^t:

p_t(x, y) = Σ_{j≥0} λ_j^t ψ_j(x) φ_j(y)

where, as in PCA, we retain the m eigenvectors corresponding to the m largest nontrivial eigenvalues.
The m-dimensional diffusion map of the data is

Ψ_t : x ↦ (λ_1^t ψ_1(x), λ_2^t ψ_2(x), ..., λ_m^t ψ_m(x))

where D_t^2(x, y) ≈ ‖Ψ_t(x) − Ψ_t(y)‖^2.
Instead of capturing Euclidean distances (PCA), the low-dimensional representation captures diffusion distances.
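The embedding above can be sketched as follows; this is a minimal version under the assumption that P is built from a symmetric weight matrix (so its spectrum is real), with all names my own:

```python
import numpy as np

def diffusion_map(P, m, t=1):
    """m-dimensional diffusion map Psi_t via the spectral decomposition of P.

    The trivial eigenpair (lambda_0 = 1, constant eigenvector) is dropped,
    and the next m right eigenvectors are scaled by lambda_j^t.
    """
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)                # largest eigenvalue first
    vals = eigvals.real[order]
    vecs = eigvecs.real[:, order]
    return vecs[:, 1:m + 1] * vals[1:m + 1] ** t     # skip trivial j = 0

# toy usage: Gaussian weights on random data, row-normalized to a Markov matrix
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 10))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq / 2.0)
P = W / W.sum(axis=1, keepdims=True)
Psi = diffusion_map(P, m=3)
print(Psi.shape)   # (40, 3)
```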
Results: z prediction

Sample of 3846 SDSS galaxy spectra, colored by log(z_SDSS)

[Figure: 3-D embeddings of the spectra, colored by log(redshift). Left: PC Map (axes PC1, PC2, PC3; L2 distances). Right: Diffusion Map (axes λ_1ψ_1, λ_2ψ_2, λ_3ψ_3).]
Methods: Adaptive Regression

Any smooth regression function r can be written as

r(x) = Σ_{j=1}^∞ β_j ψ_j(x)

where ψ_1, ψ_2, ... form an orthonormal basis and x ∈ R^p.
Generally, the choice of ψ_j is arbitrary (e.g. a high-dimensional cosine basis, wavelets, curvelets, etc.).
Regression function estimate: r̂(x) = Σ_{j=1}^J β̂_j ψ_j(x)
Adaptive framework: use the learned orthogonal function estimates ψ̂_1, ..., ψ̂_m for X ⊂ R^p from PCA or diffusion maps.
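A minimal sketch of the adaptive framework above: the columns of Psi stand in for the learned basis functions ψ_j evaluated at the data (from PCA or a diffusion map), and the β_j are estimated by ordinary least squares. All names here are illustrative, not from the talk:

```python
import numpy as np

def fit_adaptive(Psi, y, J):
    """Least-squares estimate of the first J basis coefficients beta_j.

    J would be chosen by cross-validation in practice.
    """
    A = np.column_stack([np.ones(len(y)), Psi[:, :J]])  # intercept + J coords
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, A @ beta                  # coefficients, fitted values

# toy usage: response depending on the first two coordinates
rng = np.random.default_rng(3)
Psi = rng.normal(size=(200, 20))           # stand-in for learned coordinates
y = 2.0 * Psi[:, 0] - Psi[:, 1] + 0.1 * rng.normal(size=200)
beta, fitted = fit_adaptive(Psi, y, J=5)   # beta[1] ~ 2, beta[2] ~ -1
```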
Results: z prediction

CV Pred. Risk versus J

[Figure: CV prediction risk vs. number of basis functions, for the PC and Diffusion Map bases]

Find PC and diffusion map coefficients for all 3846 spectra.
Regress only on spectra with z_SDSS CL > 0.99 (2796 galaxies, 73%).
Choose the diffusion map ε that minimizes CV prediction risk.
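The model-selection step above (choosing the dimension J that minimizes CV prediction risk) can be sketched as follows; this is my own minimal version, not the authors' code:

```python
import numpy as np

def cv_risk(Psi, y, J, n_folds=10, seed=0):
    """10-fold CV prediction risk for a linear fit on the first J coordinates."""
    n = len(y)
    fold = np.random.default_rng(seed).permutation(n) % n_folds
    errs = []
    for k in range(n_folds):
        tr, te = fold != k, fold == k
        A_tr = np.column_stack([np.ones(tr.sum()), Psi[tr, :J]])
        A_te = np.column_stack([np.ones(te.sum()), Psi[te, :J]])
        beta, *_ = np.linalg.lstsq(A_tr, y[tr], rcond=None)
        errs.append(((y[te] - A_te @ beta) ** 2).mean())  # held-out squared error
    return float(np.mean(errs))

# choose J minimizing the CV risk over a grid (true model uses 2 coordinates)
rng = np.random.default_rng(4)
Psi = rng.normal(size=(300, 30))
y = Psi[:, 0] + 0.5 * Psi[:, 1] + 0.1 * rng.normal(size=300)
risks = {J: cv_risk(Psi, y, J) for J in (1, 2, 5, 20)}
best_J = min(risks, key=risks.get)   # J = 1 underfits, so best_J > 1
```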
Results: z prediction

Predicted log(1 + z) vs. log(1 + z_SDSS)
Optimal model: 20 diffusion map coordinates

[Figure: 10-fold CV predicted log(1 + z) vs. SDSS log(1 + z). Left: the 2796 galaxies with z_SDSS CL > 0.99. Right: galaxies with z_SDSS CL ≤ 0.99, split into 0.95 < CL < 0.99 (663 galaxies), 0.9 < CL < 0.95 (246 galaxies), and CL < 0.9 (141 galaxies).]
Problem 2: Age & Metallicity

Our main goal: Estimate the age and metallicity of SDSS galaxies.
Training set: 1146 synthetic spectra from Bruzual and Charlot [2003] population synthesis models (Cid Fernandes et al. [2005], www.starlight.ufsc.br).
For each synthetic spectrum, we know age & Z.
Maps for synthetic spectra

[Figure: 3-D embeddings of the simulated spectra, colored by log(age). Left: PC Map (axes PC1, PC2, PC3; L2 distances). Right: Diffusion Map of the STARLIGHT spectra (axes λ_1ψ_1, λ_2ψ_2, λ_3ψ_3; log10(age) in color).]
Results: Age, Z prediction

[Figure: STARLIGHT actual vs. fitted log(Age) (left) and log(Z) (right), for the best fit (ε = 0.02, 188 variables chosen by CV).]
Future Work

Redshift Prediction
Apply redshift prediction techniques to more SDSS data.
Predict redshift using continuum-subtracted spectra.

Age & Metallicity Estimation
Predict ages and metallicities for SDSS galaxies.
Compare results to the template-matching method of Cid Fernandes et al. [2005].
Conclusions

1. Diffusion map is an effective and efficient dimensionality reduction technique.
2. It outperforms linear methods (PCA) in regression tasks on real astronomical data sets.
3. Adaptive regression is used to perform statistical inference and choose the optimal dimensionality.
4. The methods are applicable to a wide variety of astronomical data sets (e.g. spectra, images, multi-wavelength data sets) and a variety of tasks (e.g. regression, classification, clustering).
References

G. Bruzual and S. Charlot. Stellar population synthesis at the resolution of 2003. MNRAS, 344:1000–1028, Oct. 2003.

R. Cid Fernandes, A. Mateus, L. Sodré, G. Stasinska, and J. M. Gomes. Semi-empirical analysis of Sloan Digital Sky Survey galaxies - I. Spectral synthesis method. MNRAS, 358:363–378, Apr. 2005.

R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, July 2006. URL http://dx.doi.org/10.1016/j.acha.2006.04.006.

S. Lafon and A. B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), September 2006.
Appendix: Epsilon Selection

[Figure: minimum CV prediction risk vs. log10(ε)]
Appendix: Eigenvalue Dropoff

[Figure: eigenvalue dropoff for the PC Map (left) and the Diffusion Map (right) of the SDSS spectra]
Appendix: Kernel Regression

[Figure: SDSS log(z) vs. CV predicted log(z). a) Linear regression on 3 diffusion coordinates (CV risk = 36.6). b) Kernel regression on 3 diffusion coordinates (CV risk = 36.2).]
Appendix: SDSS-SL calibration

The following steps are performed to compare synthetic and SDSS spectra:
1. SDSS spectra are brought to rest-frame wavelengths (based on the SDSS estimate of z)
2. Emission lines in SDSS spectra are masked out
3. SL spectra are rebinned to match the SDSS binning
4. SL and SDSS spectra are normalized to sum to 1 in the overlapping wavelength range
5. The L2 distance is computed between all SL and SDSS spectra, with either
   a. SDSS flux errors accounted for, or
   b. SDSS spectra smoothed first
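The calibration steps above can be sketched in a few lines. This is a simplified illustration (hypothetical function and variable names, linear-interpolation rebinning, flux errors and smoothing ignored, and the SL grid assumed to cover the SDSS rest-frame range):

```python
import numpy as np

def calibrate_and_compare(sdss_wave, sdss_flux, z, sl_wave, sl_flux, mask):
    """Compare one SDSS spectrum with one synthetic (SL) spectrum."""
    rest_wave = sdss_wave / (1.0 + z)          # 1. shift to rest-frame wavelengths
    keep = ~mask                               # 2. drop masked emission-line bins
    rest_wave, flux = rest_wave[keep], sdss_flux[keep]
    sl_on_grid = np.interp(rest_wave, sl_wave, sl_flux)   # 3. rebin SL spectrum
    flux = flux / flux.sum()                   # 4. normalize both to sum to 1
    sl_on_grid = sl_on_grid / sl_on_grid.sum()
    return np.sqrt(((flux - sl_on_grid) ** 2).sum())      # 5. L2 distance

# sanity check: a spectrum compared with its own redshifted version -> distance ~ 0
wave = np.linspace(4000.0, 8000.0, 200)        # rest-frame SL wavelength grid
flux = 1.0 + np.exp(-((wave - 6000.0) / 300.0) ** 2)
z = 0.1
d = calibrate_and_compare(wave * (1 + z), flux, z, wave, flux,
                          np.zeros(wave.size, dtype=bool))
```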