
Page 1

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 1

Applications of machine learning in astronomy

Introduction

Coryn Bailer-Jones
http://www.mpia.de/homes/calj/amla.html

Page 2

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy

Today

what is machine learning?
goals of seminar
format of seminars
decide upon and assign topics

2

Page 3

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 3

Example 1: spatial clustering

arXiv:0710.3691v1 [astro-ph] 19 Oct 2007

Mon. Not. R. Astron. Soc. 000, 1–10 (2007) Printed 2 February 2008 (MN LATEX style file v2.2)

A MST algorithm for source detection in γ-ray images

Riccardo Campana,1⋆ Enrico Massaro,1,2⋆ Dario Gasparrini,2,3,4 Sara Cutini,2,3,4 and Andrea Tramacere5

1 Department of Physics, University of Rome “La Sapienza”, Piazzale A. Moro 2, I-00185, Rome, Italy
2 ASI Science Data Center (ASDC), c/o ESRIN, Via G. Galilei, I-00044, Frascati, Italy
3 Department of Physics, University of Perugia, Via A. Pascoli, I-06123, Perugia, Italy
4 INAF personnel resident at ASDC under ASI contract I/024/05/1
5 Stanford Linear Accelerator Center, 2575 Sand Hill Road, Menlo Park, CA-94025, USA

Accepted 2007 October 19. Received 2007 September 25; in original form 2007 June 05.

ABSTRACT

We developed a source detection algorithm based on the Minimal Spanning Tree (MST), a graph-theoretical method useful for finding clusters in a given set of points. This algorithm is applied to γ-ray bidimensional images, where the points correspond to the arrival directions of photons, and the possible sources are associated with the regions where they cluster. Some filters to select these clusters and to reduce the spurious detections are introduced. An empirical study of the statistical properties of the MST on random fields is carried out in order to derive criteria to estimate the best filter values. We also introduce two parameters useful to verify the goodness of candidate sources. To show how the MST algorithm works in practice, we present an application to an EGRET observation of the Virgo field, at high galactic latitude and with a low and rather uniform background, in which several sources are detected.

Key words: gamma rays: observations – methods: data analysis

1 INTRODUCTION

Telescopes for satellite-based high-energy γ-ray astronomy detect individual photons by means of the electron-positron pair that they generate through the detector. From the pair trajectories it is possible to reconstruct the original direction of the photon with an uncertainty that decreases with the energy, from a few degrees below 100 MeV to less than a degree above 1 GeV. This technique was applied to the past γ-ray observatories SAS-2 (Fichtel et al. 1975), COS-B (Bennett 1990) and EGRET-CGRO (Kanbach et al. 1988; Thompson et al. 1993), all equipped with spark chambers. Pair tracking is also used in the current AGILE mission (Tavani et al. 2006) and in the LAT telescope on board the next GLAST mission, both employing silicon microstrip detectors (Gehrels et al. 1999). The resulting product is an image where each photon is associated with a direction in the sky: discrete sources thus correspond to regions in which a number of photons higher than those found in the surroundings are observed. When the size of this region is consistent with the instrumental Point Spread Function the source is considered point-like; otherwise it can be extended or a group of near sources.

⋆ E-mail addresses: [email protected], [email protected]

Various algorithms are applied to the detection of point-like or extended sources in γ-ray astronomy: the most extensively used one is based on the Maximum Likelihood (Mattox et al. 1996), whereas others based on Wavelet Transform analysis (Damiani et al. 1997), Optimal Filter (Sanz et al. 2001), Scale-Adaptive Filter (Herranz et al. 2002), etc., were variously applied to real and simulated data to study their performances. Some of them are based on deconvolution techniques of the instrumental Point Spread Function (PSF). Many methods work directly on the pixellated images, i.e. count or intensity maps. Other methods search for clusters in the arrival directions of photons that, if statistically significant, are considered an indication of a source.

The approach considered by us is essentially a cluster search based on a minimal spanning tree (MST) algorithm. This technique has its roots in graph theory, and highlights the topological pattern of connectedness of the detected photons. Given a graph G(V, E), where V is the set of vertices (or nodes) and E is the set of weighted edges connecting them, a MST (Kruskal 1956; Prim 1957; Zahn 1971) is the tree (a subgraph of G without closed circuits) that connects all the points with the minimum total weight, defined as the sum of the weights of the tree's edges. In a data set con-
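As a concrete illustration of the MST idea (my own minimal sketch in Python, not the authors' code), one can build the tree over a set of simulated 2-D "photon" positions with scipy and read off the edge lengths:

# Minimal sketch: build an MST over random 2-D "photon" positions (illustrative only)
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(500, 2))   # random field in the unit square

dist = squareform(pdist(points))                # dense matrix of pairwise distances
mst = minimum_spanning_tree(dist)               # sparse matrix holding the N-1 tree edges

edge_lengths = mst.data                         # weights of the tree edges
lambda_m = edge_lengths.mean()                  # mean edge length, Lambda_m
print(len(edge_lengths), lambda_m)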

Page 4

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 4



Figure 1. Upper left: a set of 500 randomly generated points, with two simulated sources. Upper right: the Minimal Spanning Tree between these points. Lower left: cluster selection after separation with Λc = 1.3Λm and elimination with Nc = 7. Lower right: cluster selection with the filters Λc = Λm, Nc = 10. The added "sources", at coordinates (0.3, 0.3) and (0.7, 0.7), are marked by the diamonds. Circles are centered on the centroids of the remaining sub-trees (squares) and have radii equal to the distance of the farthest node in the sub-tree. The dot is the refined source position; see text for details.
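The "separation" and "elimination" filters named in the caption can be sketched as follows (my illustration, reusing the mst and lambda_m objects from the snippet above; the thresholds lambda_c and n_c are free parameters, e.g. 1.3·Λm and 7 as in the lower-left panel):

# Sketch of the two filters: cut MST edges longer than lambda_c (separation),
# then keep only sub-trees with at least n_c nodes (elimination). Illustrative only.
import numpy as np
from scipy.sparse.csgraph import connected_components

def select_clusters(mst, lambda_c, n_c):
    pruned = mst.copy()
    pruned.data[pruned.data > lambda_c] = 0.0            # separation: drop long edges
    pruned.eliminate_zeros()
    n_comp, labels = connected_components(pruned, directed=False)
    sizes = np.bincount(labels, minlength=n_comp)
    keep = np.flatnonzero(sizes >= n_c)                  # elimination: reject small sub-trees
    return [np.flatnonzero(labels == k) for k in keep]   # node indices of surviving clusters

# e.g. clusters = select_clusters(mst, 1.3 * lambda_m, 7)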

with a random field (upper panel) and the same frame with five sources added (lower panel): in the latter case there is a clear excess of short distances (within the clusters that mark the sources) and of long distances (between the clusters) with respect to the random case, and the histogram shows an evident left asymmetry.

A useful indicator for the presence of sources is the mean value of the MST edge length, Λm. Earlier investigations (Gilbert 1965) found that the total length of a random MST is proportional to √(A·Ntot), where A is the field area and Ntot is the total number of points. A theoretical upper limit to the proportionality constant was found to be 2^(−1/2) ≃ 0.70. Our Monte Carlo simulations showed that the constant value is rather ≃ 0.65. Therefore the mean length for a random-field MST is:

Λm ≃ 0.65 × √(A/Ntot)   (1)

Thus, if the mean length for a field deviates from this value, it is an indicator of non-random clusterization, i.e. of the presence of sources.
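Eq. (1) is easy to check numerically. A sketch (mine, not the paper's simulation code) that compares the measured mean edge length of random fields with 0.65·√(A/Ntot):

# Monte Carlo sanity check of Eq. (1): Lambda_m ~ 0.65 * sqrt(A / Ntot). Illustrative only.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mean_mst_length(n_points, side=1.0, seed=0):
    rng = np.random.default_rng(seed)
    pts = rng.uniform(0.0, side, size=(n_points, 2))
    return minimum_spanning_tree(squareform(pdist(pts))).data.mean()

n_tot, area = 2000, 1.0
measured = np.mean([mean_mst_length(n_tot, seed=s) for s in range(10)])
predicted = 0.65 * np.sqrt(area / n_tot)
print(measured, predicted)   # the two values should lie within a few per cent of each other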

Another test for the occurrence of sources is the evaluation of the skewness coefficient β3 of the distribution f(x). In the two cases of Fig. 2 we found β3 equal to 0.16 and 0.46; the higher value is due to the decrease of the mean length Λm and to the occurrence of x values greater than ≈ 2.5 when sources are present. From our simulations we found that β3 higher than ∼ 0.2 can be considered a good indicator for the presence of sources.

For an accurate study of the edge length distribution it is useful to have a simple analytical formula to be applied in the computation. Since theoretical works on this subject are not easily available in the astronomical literature, we followed a numerical approach.

First we generated a pure random frame containing 10^6 points to smooth the fluctuations in the histogram; the resulting frequency plot is given in Fig. 3. Note that, like in Fig. 2, it has a well defined mode, a small skewness and a very small tail for x > 2. Its shape is not, therefore, that of a Gaussian. An approximate formula that gives an excellent best fit, although properly it is defined on the unlimited interval [0, +∞), is a Rayleigh distribution, suppressed at large x by a factor similar to that of a Fermi-Dirac (FD) distribution:

f(x) = K (x/σ²) exp[−(x − µ)²/(2σ²)] · 1/[exp((x − c)/d) + 1]   (2)

The parameter values were found by means of a numerical best fit and the resulting formula is:

f(x) = (5/3) x exp[−(x + 0.3)²/2.16] · 1/[exp((x − 1.81)/0.156) + 1]   (3)

with a maximum error with respect to the data less than 2%.
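For a quick look at the fitted distribution, the reconstructed Eq. (3) can be evaluated directly (my sketch; the 5/3 prefactor is my reading of the garbled formula, supported by the fact that it puts the mode near 0.89 and the integral close to unity):

# Evaluate the fitted edge-length distribution of Eq. (3):
# f(x) = (5/3) x exp(-(x+0.3)^2 / 2.16) / (exp((x-1.81)/0.156) + 1)
# The 5/3 prefactor is an assumption of this sketch.
import numpy as np

def f_edge(x):
    rayleigh_like = (5.0 / 3.0) * x * np.exp(-(x + 0.3) ** 2 / 2.16)
    fd_cutoff = 1.0 / (np.exp((x - 1.81) / 0.156) + 1.0)
    return rayleigh_like * fd_cutoff

x = np.linspace(0.0, 4.0, 4001)
fx = f_edge(x)
print(x[np.argmax(fx)])            # mode, expected near 0.9
print(fx.sum() * (x[1] - x[0]))    # normalization, expected close to 1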

Page 5

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 5


[Figure 2: two histogram panels, "Random field" and "Source field"; x-axis: Edge length / Mean length, y-axis: Frequency.]

Figure 2. Upper panel: histogram of the MST edge length, in units of the mean length, for a random field with 1675 points. Lower panel: histogram of the MST edge length, in units of the mean length, for the same field in which some strong sources have been added. Note that there is a large left-side asymmetry with respect to the random field.

We computed the values of the mode, the median, the variance and other moments from this distribution, and found 0.892 and 0.952 respectively for the first two, a variance equal to 0.208, whereas the skewness and the kurtosis are 0.080 and 2.439, respectively.

Another fitting formula can be obtained from Pearson distributions (Smart 1958, chap. 7), again suppressed at large x values by a FD factor:

f(x) = K (x/a1)^(b·a1) · [1 + a1/a2 − x/a2]^(b·a2) · 1/[exp((x − c)/d) + 1]   (4)

where K is a normalization factor, a1 is the value of the mode, b and a2 are free parameters, and c is the cut-off scale. Differently from Eq. (2), this distribution is defined in the finite interval x ∈ [0, a1 + a2]. Considering that values of x larger than 3.0 are extremely rare, we imposed the condition a1 + a2 = 3.2 and evaluated the remaining parameters. A very good fit was obtained for a1 = 0.91, b = 1.25, c = 1.8, d = 0.18 and the normalisation factor K = 0.7676.

Figure 3. Histogram of the MST edge length frequency, in units of the mean length, for a random field with 10^6 points. Also plotted is Eq. (3).

The edge distribution can be useful for the choice of the separation parameter Λc. From Eq. (3) and Figure 3, we can see that the choice of a low Xc = Λc/Λm, for instance the value of 0.37, implies that about 90% of edges will be eliminated, and the majority of remaining clusters will have a number of nodes too small to satisfy the elimination criteria. A good choice is to use a value close to unity: we found from our simulations that the best range for Xc is between 0.8 and 1.2, corresponding to cumulative probabilities of 0.384 and 0.683, respectively. In fact, although the probability to find an edge smaller than ∼ Λm is still large, it is unlikely that a high number of these edges will belong to a single remaining cluster, and they are therefore rejected by the subsequent filtering.
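The quoted cumulative probabilities can be reproduced by integrating Eq. (3) numerically (my sketch, using the same reconstructed f(x) as above): Xc = 0.8 and Xc = 1.2 should give roughly 0.38 and 0.68.

# Cumulative probability that an MST edge is shorter than Xc * Lambda_m,
# from a numerical integral of the fitted distribution of Eq. (3). Illustrative only.
import numpy as np

def f_edge(x):
    return (5.0 / 3.0) * x * np.exp(-(x + 0.3) ** 2 / 2.16) / (np.exp((x - 1.81) / 0.156) + 1.0)

def cumulative(x_c, n=20000):
    x = np.linspace(0.0, x_c, n)
    return f_edge(x).sum() * (x[1] - x[0])

print(cumulative(0.8), cumulative(1.2))   # expected: about 0.38 and 0.68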

3.2 Distribution of the number of sub-trees for a given Λc in a random field

As shown by Di Gesu and Sacco (1983), the expected total number of clusters obtained by cutting a random, 2-dimensional MST having Ntot points, at an edge length Λc, is given by:

N = 1 + (Ntot − 1) exp(−π Λc² Ntot/A)   (5)

where Ntot/A is the density of nodes, which according to Eq. (1) is proportional to 1/Λm². This is a monotonically decreasing function, and we verified with Monte Carlo simulations the consistency of this result.

We used a different approach, directly based on the calculated mean edge length, and considered another distribution, useful for selecting the best Nc parameter: that of the number of clusters as a function of the number of nodes after the application of a separation at the edge length Λc. We computed several distributions in random fields via Monte Carlo simulations and found that they can be well described by an exponential function:

T(Nn) = F(Xc) · Ntot · exp(−κ(Xc) Nn)   (6)

where T(Nn) is the total number of sub-trees having Nn nodes each and Xc = Λc/Λm. Some examples, corresponding to different choices of the cut length Λc, are shown in Fig. 4. We see how the mean number of big clusters decreases when the cut length becomes smaller than the mean MST edge length: that is explained by the fact that separating at

Page 6

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 6

Example 2: source classification with images

1995PASP..107..279S

Page 7

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 7

1995PASP..107..279S

Train/test set: 2259 objects (60% CRs)

20 image parameters per source
• 9 fluxes in a 3x3 grid
• 3 based on PSF fit
• 8 other statistics (e.g. sd in grid, flux ratio at different radii)

Page 8

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 8

1995PASP..107..279S

• Classification tree: supervised learning method
• Digital classification converted to classes using class fractions in each “leaf” in training set
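A minimal sketch of this kind of classification tree (mine, with synthetic stand-in data and scikit-learn rather than the software used in the 1995 paper); predict_proba returns exactly the class fractions in the leaf that an object falls into:

# Classification tree on ~20 image parameters (e.g. cosmic ray vs. real source).
# Synthetic stand-in data; feature meanings and labels are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(2259, 20))               # 20 image parameters per object
y = (rng.random(2259) < 0.6).astype(int)      # ~60% "cosmic ray" labels (random here)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)
tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X_tr, y_tr)

print(tree.score(X_te, y_te))                 # accuracy on held-out objects
print(tree.predict_proba(X_te[:3]))           # class fractions in each object's leaf
print(tree.tree_.node_count)                  # model complexity: number of nodes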

Page 9

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 9

1995PASP..107..279S

1995PASP..107..279S

• trees built stochastically: report sd over 10 runs
• model complexity reported as mean number of nodes
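One way to reproduce that style of reporting (my sketch; the random splitter and varying seeds stand in for whatever source of stochasticity the paper actually used):

# Repeat the stochastic tree construction 10 times and report the spread:
# mean and sd of test accuracy, plus the mean number of nodes (model complexity).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(2259, 20))
y = (rng.random(2259) < 0.6).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

acc, nodes = [], []
for seed in range(10):
    t = DecisionTreeClassifier(splitter="random", min_samples_leaf=10,
                               random_state=seed).fit(X_tr, y_tr)
    acc.append(t.score(X_te, y_te))
    nodes.append(t.tree_.node_count)

print(np.mean(acc), np.std(acc))   # accuracy: mean and sd over 10 runs
print(np.mean(nodes))              # mean number of nodes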

Page 10

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 10

1995PASP..107..279S

• better results with a reduced feature set (!)
• other alternative methods could be: logistic regression, LDA/QDA, support vector machine, neural network

1995PASP..107..279S

Page 11

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy

Example 3: classification with kernel density estimation


EFFICIENT PHOTOMETRIC SELECTION OF QUASARS FROM THE SLOAN DIGITAL SKY SURVEY: 100,000 z < 3 QUASARS FROM DATA RELEASE ONE

Gordon T. Richards,1 Robert C. Nichol,2 Alexander G. Gray,3 Robert J. Brunner,4 Robert H. Lupton,1

Daniel E. Vanden Berk,5 Shang Shan Chong,2 Michael A. Weinstein,6, 7 Donald P. Schneider,6

Scott F. Anderson,8 Jeffrey A. Munn,9 Hugh C. Harris,9 Michael A. Strauss,1

Xiaohui Fan,10 James E. Gunn,1 Zeljko Ivezic,1,8 Donald G. York,11,12

J. Brinkmann,13 and Andrew W. Moore3

Received 2004 June 1; accepted 2004 August 19

ABSTRACT

We present a catalog of 100,563 unresolved, UV-excess (UVX) quasar candidates to g = 21 from 2099 deg² of the Sloan Digital Sky Survey (SDSS) Data Release One (DR1) imaging data. Existing spectra of 22,737 sources reveal that 22,191 (97.6%) are quasars; accounting for the magnitude dependence of this efficiency, we estimate that 95,502 (95.0%) of the objects in the catalog are quasars. Such a high efficiency is unprecedented in broadband surveys of quasars. This "proof-of-concept" sample is designed to be maximally efficient, but still has 94.7% completeness to unresolved, g ≲ 19.5, UVX quasars from the DR1 quasar catalog. This efficient and complete selection is the result of our application of a probability density type analysis to training sets that describe the four-dimensional color distribution of stars and spectroscopically confirmed quasars in the SDSS. Specifically, we use a nonparametric Bayesian classification, based on kernel density estimation, to parameterize the color distribution of astronomical sources, allowing for fast and robust classification. We further supplement the catalog by providing photometric redshifts and matches to FIRST/VLA, ROSAT, and USNO-B sources. Future work needed to extend this selection algorithm to larger redshifts, fainter magnitudes, and resolved sources is discussed. Finally, we examine some science applications of the catalog, particularly a tentative quasar number counts distribution covering the largest range in magnitude (14.2 < g < 21.0) ever made within the framework of a single quasar survey.
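The core of the method, nonparametric Bayesian classification with kernel density estimation, can be sketched in a few lines (my toy example with made-up Gaussian colour distributions, not the authors' NBC code; bandwidth and prior are assumptions): estimate the 4-D colour density of each training class and combine the densities with Bayes' rule.

# Toy kernel-density classification of quasars vs. stars in a 4-D colour space
# (u-g, g-r, r-i, i-z). All numbers below are illustrative, not SDSS values.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
colours_star = rng.normal(loc=[1.2, 0.5, 0.2, 0.1], scale=0.2, size=(5000, 4))
colours_qso = rng.normal(loc=[0.2, 0.2, 0.1, 0.0], scale=0.2, size=(1000, 4))

kde_star = KernelDensity(bandwidth=0.1).fit(colours_star)   # class-conditional densities
kde_qso = KernelDensity(bandwidth=0.1).fit(colours_qso)
prior_qso = len(colours_qso) / (len(colours_qso) + len(colours_star))

def p_quasar(colours):
    # posterior P(quasar | colours) from the two KDE likelihoods and the class prior
    l_q = np.exp(kde_qso.score_samples(colours)) * prior_qso
    l_s = np.exp(kde_star.score_samples(colours)) * (1.0 - prior_qso)
    return l_q / (l_q + l_s)

test = np.array([[0.2, 0.2, 0.1, 0.0], [1.2, 0.5, 0.2, 0.1]])
print(p_quasar(test))   # high for quasar-like colours, low for star-like colours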

Subject headings: catalogs – quasars: general

Online material: machine-readable table

1. INTRODUCTION

Since the discovery of quasars (Schmidt 1963), ambitious surveys (e.g., Schmidt & Green 1983; Foltz et al. 1987; Boyle et al. 2000; York et al. 2000) have caused the number of known quasars to rise from one to tens of thousands. Yet even in this day of very large surveys and deep digital imaging, we are still far from identifying the more than 1.6 million z < 3 quasars that are expected to fill the celestial sphere to g ≈ 21. The problem lies not in covering enough of the sky to faint

enough magnitudes, but rather in the efficient separation of quasars from other astronomical sources. Current algorithms are typically more than 60% efficient for UV-excess (UVX) quasars to relatively bright magnitudes, but the selection efficiency drops toward fainter magnitudes where the photometric errors are largest and most of the observable objects reside. Further complicating the issue is the need to obtain spectra for each candidate.14 Thus, surveys of quasars would benefit considerably from algorithms with selection efficiencies that mitigate the need for confirming spectra. We describe such an algorithm based on the photometric data of the Sloan Digital Sky Survey (SDSS; York et al. 2000).

Optical surveys for quasars, including the SDSS, typically rely on simple color cuts in two or more colors to select objects that are likely to be quasars and to reject objects that are unlikely to be quasars. The color selection part of the SDSS's quasar algorithm (Richards et al. 2002) is essentially two three-dimensional color selection algorithms. One branch of the algorithm uses the ugri bands to identify UVX quasars, the other uses the griz bands to identify z > 3 quasars.

Another way to select quasars from imaging data is to use known quasars to determine what regions of color space quasars occupy. Once these regions have been identified, spectroscopic quasar target selection involves simply observing objects from those regions of color space that are most likely to yield quasars (or perhaps least likely to yield a significant number of contaminants). At the beginning of the SDSS

14 X-ray to optical flux ratios may also suffice, but X-ray detections can take just as long to obtain.


1 Princeton University Observatory, Peyton Hall, Princeton, NJ 08544.
2 Department of Physics, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15232.
3 Robotics Institute, Carnegie Mellon University, 3128 Newell-Simon Hall, 5000 Forbes Avenue, Pittsburgh, PA 15213-3891.
4 Department of Astronomy, University of Illinois at Urbana-Champaign, 1002 West Green Street, Urbana, IL 61801-3080.
5 Department of Physics and Astronomy, University of Pittsburgh, 3941 O'Hara Street, Pittsburgh, PA 15260.
6 Department of Astronomy and Astrophysics, The Pennsylvania State University, 525 Davey Laboratory, University Park, PA 16802.
7 Department of Physics, Astronomy, and Geophysics, Connecticut College, Box 5622, 270 Mohegan Avenue, New London, CT 06320.
8 Department of Astronomy, University of Washington, Box 351580, Seattle, WA 98195.
9 US Naval Observatory, Flagstaff Station, P.O. Box 1149, Flagstaff, AZ 86002-1149.
10 Steward Observatory, University of Arizona, 933 North Cherry Avenue, Tucson, AZ 85721.
11 Department of Astronomy and Astrophysics, The University of Chicago, 5640 South Ellis Avenue, Chicago, IL 60637.
12 Enrico Fermi Institute, The University of Chicago, 5640 South Ellis Avenue, Chicago, IL 60637.
13 Apache Point Observatory, P.O. Box 59, Sunspot, NM 88349.


The Astrophysical Journal Supplement Series, 155:257–269, 2004 December. © 2004. The American Astronomical Society. All rights reserved. Printed in U.S.A.

Page 12

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy

(2003) catalog and that we fully expect that (1) the catalog will be more incomplete with fainter magnitudes and that (2) the incompleteness of the whole catalog will also be a function of redshift and color. In particular, the fact that we do not include magnitude as an explicit parameter in our selection algorithm (other than limiting the magnitude ranges), and the fact that the colors of stars appear to be a stronger function of magnitude than the colors of quasars, means that there are regions of color space where we are likely to be more incomplete as a result of our desire to be as efficient as possible. Utilization of the magnitudes (see § 5) in future applications of the algorithm should improve the completeness in such regions.
We have additionally tested the completeness of the algorithm using simulations. Application of the algorithm to simulated quasar colors constructed similarly to those of Fan

Fig. 2. Color-color distribution of the 831,600 initial unresolved UVX sources. Blue dots and contours are those objects classified as stars. Black dots and contours are objects classified as quasars. Red dots and contours are objects initially classified as quasars but were rejected by our cut on the stellar density. Contours are a fraction of the peak in each class.

TABLE 1
NBC Quasar Candidate Catalog

Number  Name (SDSS J)        R.A. (deg)  Decl. (deg)  Obj. ID              Row     Col.     u       g       r       i       z
(1)     (2)                  (3)         (4)          (5)                  (6)     (7)      (8)     (9)     (10)    (11)    (12)

1       000001.88−094652.1   0.0078478   −9.7811413   1-1729-21-4-83-116   370.57  1729.17  19.781  19.530  19.335  19.401  19.407
2       000002.21−094956.0   0.0092176   −9.8322327   1-1729-21-4-83-118   389.98  1264.98  20.396  20.281  20.296  20.209  20.152
3       000006.53+003055.2   0.0272316   0.5153435    1-3325-20-5-108-117  656.47  978.59   20.405  20.459  20.336  20.100  20.076
4       000007.58+002943.3   0.0316062   0.4953686    1-3325-20-5-108-131  696.30  797.03   21.085  20.440  20.471  20.336  19.958
5       000008.13+001634.6   0.0339044   0.2762998    1-2662-20-4-283-149  253.50  673.27   20.240  20.201  19.949  19.498  19.194

Notes. Table 1 is available in its entirety in the electronic edition of the Astrophysical Journal Supplement. A portion is shown here for guidance regarding its form and content. The machine-readable version contains additional columns.


Page 13

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy

Example 4: inferring stellar parameters from spectra

11

A&A 467, 1373–1387 (2007)
DOI: 10.1051/0004-6361:20077334
© ESO 2007

Astronomy & Astrophysics

Estimation of stellar atmospheric parameters from SDSS/SEGUE spectra

P. Re Fiorentin1, C. A. L. Bailer-Jones1, Y. S. Lee2, T. C. Beers2, T. Sivarani2, R. Wilhelm3,C. Allende Prieto4, and J. E. Norris5

1 Max Planck Institut für Astronomie, Königstuhl 17, 69117 Heidelberg, Germany; e-mail: [email protected]
2 Department of Physics & Astronomy, CSCE: Center for the Study of Cosmic Evolution, and JINA: Joint Institute for Nuclear Astrophysics, Michigan State University, East Lansing, MI 48824, USA
3 Department of Physics, Texas Tech University, Lubbock, TX 79409, USA
4 Department of Astronomy, University of Texas, Austin, TX 78712, USA
5 Research School of Astronomy and Astrophysics, Australian National University, Weston, ACT 2611, Australia

Received 20 February 2007 / Accepted 8 March 2007

ABSTRACT

We present techniques for the estimation of stellar atmospheric parameters (Teff, log g, [Fe/H]) for stars from the SDSS/SEGUE survey. The atmospheric parameters are derived from the observed medium-resolution (R = 2000) stellar spectra using non-linear regression models trained either on (1) pre-classified observed data or (2) synthetic stellar spectra. In the first case we use our models to automate and generalize parametrization produced by a preliminary version of the SDSS/SEGUE Spectroscopic Parameter Pipeline (SSPP). In the second case we directly model the mapping between synthetic spectra (derived from Kurucz model atmospheres) and the atmospheric parameters, independently of any intermediate estimates. After training, we apply our models to various samples of SDSS spectra to derive atmospheric parameters, and compare our results with those obtained previously by the SSPP for the same samples. We obtain consistency between the two approaches, with RMS deviations on the order of 150 K in Teff, 0.35 dex in log g, and 0.22 dex in [Fe/H].
The models are applied to pre-processed spectra, either via Principal Component Analysis (PCA) or a Wavelength Range Selection (WRS) method, which employs a subset of the full 3850–9000 Å spectral range. This is both for computational reasons (robustness and speed), and because it delivers higher accuracy (better generalization of what the models have learned). Broadly speaking, PCA is demonstrated to deliver more accurate atmospheric parameters when the training data are the actual SDSS spectra with previously estimated parameters, whereas WRS appears superior for the estimation of log g via synthetic templates, especially for lower signal-to-noise spectra. From a subsample of some 19 000 stars with previous determinations of the atmospheric parameters, the accuracies of our predictions (mean absolute errors) for each parameter are Teff to 170/170 K, log g to 0.36/0.45 dex, and [Fe/H] to 0.19/0.26 dex, for methods (1) and (2), respectively. We measure the intrinsic errors of our models by training on synthetic spectra and evaluating their performance on an independent set of synthetic spectra. This yields RMS accuracies of 50 K, 0.02 dex, and 0.03 dex on Teff, log g, and [Fe/H], respectively.
Our approach can be readily deployed in an automated analysis pipeline, and can easily be retrained as improved stellar models and synthetic spectra become available. We nonetheless emphasise that this approach relies on an accurate calibration and pre-processing of the data (to minimize mismatch between the real and synthetic data), as well as sensible choices concerning feature selection. From an analysis of cluster candidates with available SDSS spectroscopy (M 15, M 13, M 2, and NGC 2420), and assuming the age, metallicity, and distances given in the literature are correct, we find evidence for small systematic offsets in Teff and/or log g for the parameter estimates from the model trained on real data with the SSPP. Thus, this model turns out to derive more precise, but less accurate, atmospheric parameters than the model trained on synthetic data.

Key words. surveys – methods: data analysis – methods: statistical – stars: fundamental parameters

1. Introduction

The nature of the stellar populations of the Milky Way galaxy remains an important issue for astrophysics, because it addresses the question of galaxy formation and evolution and the evolution of the chemical elements. To date, however, studies of the stellar populations, kinematics, and chemical abundances in the Galaxy have mostly been limited by small number statistics.

Fortunately, this state of affairs is rapidly changing. The Sloan Digital Sky Survey (SDSS; York et al. 2000) has imaged over 8000 square degrees of the northern Galactic cap (above |b| = 40°) in the ugriz photometric system for some 100 million stars. Imaging data are produced simultaneously (Fukugita et al. 1996; Gunn et al. 1998, 2006; Hogg et al. 2001; Abazajian et al. 2005; Adelman-McCarthy et al. 2007) and processed through pipelines to measure photometric and astrometric properties (Lupton et al. 1987; Stoughton et al. 2002; Smith et al. 2002; Tucker et al. 2002; Pier et al. 2003; Ivézic et al. 2004) and to select targets for spectroscopic follow-up. Of even greater importance, some 200 000 medium-resolution stellar spectra have been obtained during the course of SDSS-I (the original survey).

The SDSS-II project, which includes SEGUE (Sloan Extension for Galactic Understanding and Exploration), is obtaining some 3500 square degrees of additional imaging data at lower Galactic latitudes, in order to better explore the interface between the thick-disk and halo populations between


• supervised learning: spectra to stellar APs
• nonlinear, multidimensional regression (ANN)
• dimensionality reduction via PCA
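A compact sketch of that pipeline (mine, with random stand-in spectra; scikit-learn's MLPRegressor plays the role of the paper's ANN, and the component count and network size are arbitrary): PCA compresses each spectrum, then a nonlinear regressor maps the components onto (Teff, log g, [Fe/H]).

# PCA pre-processing + neural-network regression from spectra to atmospheric parameters.
# Synthetic placeholder data; all array sizes and hyperparameters are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n_stars, n_pixels = 2000, 1000
spectra = rng.normal(size=(n_stars, n_pixels))   # placeholder flux vectors
params = rng.normal(size=(n_stars, 3))           # placeholder (Teff, log g, [Fe/H])

model = make_pipeline(
    PCA(n_components=25),                        # compress each spectrum to 25 components
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
)
model.fit(spectra[:1500], params[:1500])
pred = model.predict(spectra[1500:])
print(np.sqrt(np.mean((pred - params[1500:]) ** 2, axis=0)))   # RMS error per parameter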

Page 14

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 12


Fig. 2. Reconstruction of SDSS/SEGUE spectra by projection onto synthetic principal components. In each row, the spectrum on the left is the original and the following show the reconstruction using increasing numbers of principal components. The residual spectrum (original minus reconstructed) is shown at the bottom of each panel. The quoted atmospheric parameters are taken from a preliminary version of the processing pipeline SSPP.

Fig. 3. As Fig. 2 but for principal components built from real spectra.

Page 15

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 13


Fig. 7. Atmospheric parameter estimation with the SR model. Comparison between our derived log Teff, log g, [Fe/H] and those estimated by a preliminary version of the SSPP for a set of 38 731 stars. The perfect correlation and a linear fit to the data are shown with the solid and dashed lines respectively. The distributions of the residuals (model minus SSPP) are shown in the bottom panels.

3900–4400 Å, 4820–5000 Å, 5155–5350 Å and 8500–8700 Å). This is perhaps not unexpected, since essentially all of the methods that are used by the SSPP to define the target log g values use only these restricted wavelength ranges. This may also indicate that the gravity signature in real stars outside of the wavelength regions selected above behaves differently from the signature in the synthetic spectra. Either way, the excluded regions show less sensitivity to log g, so for this parameter these regions do not add information, only data that are uncorrelated with the parameter of interest (so are effectively just noise). It is also possible, of course, that the PCA may be filtering out subtle (weak) features which are strong predictors of log g.

Based on the above considerations, our final model uses PCA for estimating Teff and [Fe/H] and WRS for estimating log g. A separate model is used for estimating each parameter (although the [Fe/H] model also predicts the other two, the results of which are disregarded).

Figure 7 compares our model atmospheric parameter estimates with those from the preliminary SSPP for the 38 731 stars in the evaluation set. While the overall consistency between the two models is reasonably good, we (again) notice discrepancies at the extreme parameter values, in particular for Teff. This is sometimes an indication that the model has not been well trained, i.e., it has not located a good local minimum of the error function (it can never be shown that the global minimum has been found with anything but an exhaustive search). However, there are inevitably problems with spectral mismatch, in the sense that the synthetic spectra do not reproduce all of the complexities of the spectra of real stars. The absence of

Table 3. Mean absolute discrepancies (between our SR model and the SSPP) calculated on the evaluation set of 38 731 real spectra (see also Fig. 7). Our models use PCA pre-processing for estimating log Teff and [Fe/H] and WRS pre-processing for estimating log g; for the latter, PCA results are shown for comparison. Separate models were applied for low and high SNR spectra (the transition being at SNR = 35/1).

Method        SNR    E_log Teff   E_log g   E_[Fe/H]
PCA (25+25)          0.0138       0.4288    0.2606
              low    0.0143       0.7549    0.3023
              high   0.0136       0.3465    0.2384
WRS                  –            0.4459    –
              low    –            0.4495    –
              high   –            0.4450    –

several molecular species in the linelists for the synthetic spectra may also be contributing to this problem, especially for cooler stars where they are expected to be more important. For the determination of metallicity, we observe that our model predicts lower metallicities for the lowest metallicity stars. This is probably a consequence of the lack of synthetic samples between −4.0 < [Fe/H] < −2.5 (see Fig. 1) in our current grid.

Table 3 shows the global results (averaged over all stars and atmospheric parameters). It is interesting that the WRS pre-processing results in little difference in the log g discrepancy for the low and high SNR regimes. Results for gravity, metallicity, and effective temperature ranges – dwarfs/giants, low/high metallicity, and cool/warm stars – are listed in Table 7 and in Table 8, and visualized in Fig. 8. We note that, in the estimation

Page 16

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 14


[Figure 12 panels: Mg vs. g−r colour–magnitude diagram, and log g vs. log Teff for the RR and SR models.]

Fig. 12. The left panel shows the colour–magnitude diagram for M 15, and the two other panels the distribution of atmospheric parameters log g vs. log Teff from the RR (middle) and SR (right) models. Of the stars selected as candidates, the asterisks denote main sequence metal-poor stars, and the filled dots the members based on the radial velocity constraint. Confirmed and doubtful members assigned in a preliminary analysis are coloured grey and marked with a plus sign, respectively. Overplotted are isochrones from Girardi et al. (2004) with metallicities and ages which bracket the values given in Table 4, i.e. at Z = 0.0001 (solid) and Z = 0.0004 (dashed) for ages of 12.59 Gyr (black) and 14.13 Gyr (grey).


Fig. 13. As Fig. 12. Top: M 13 (globular cluster) candidates. Centre: M 2 (globular cluster) candidates. Bottom: NGC 2420 (open cluster) candidates. For the globular clusters, isochrones at Z = 0.0001 (solid) and Z = 0.0004 (dashed) for ages of 12.59 Gyr (black) and 14.13 Gyr (grey); for the open cluster, isochrones at Z = 0.004 (solid) and Z = 0.008 (dashed) for ages of 3.162 Gyr (black) and 3.548 Gyr (grey).

Table 4. Globular/open clusters, literature values. The selection constraints applied for identification of likely members are labeled with ∗.

Cluster    RA, Dec (epoch J2000)      RA∗ (degree)     Dec∗ (degree)   [Fe/H] (dex)  age (Gyr)  m − M   RV (km s−1)  RV∗ (km s−1)
M 15       21h29m58.3s, +12°10′01″    322.25, 322.75   11.90, 12.40    −2.22         13.2       14.93   −110         −126, −100
M 13       16h41m41.5s, +36°27′37″    250.00, 250.90   36.10, 36.90    −1.70         12.7       14.07   −250         −262, −243
M 2        21h33m29.3s, −00°49′23″    323.10, 323.60   −1.05, −0.60    −1.53         13.0       10.49   0            −20, 20
NGC 2420   07h38m24.0s, +21°34′27″    114.40, 115.10   21.20, 22.10    −0.50         3/4        11.40   73           50, 86

Page 17

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 15

What is machine learning?

involves
− data description and interpretation
− prediction
− learning from data: inference

two general properties of ML algorithms
− can extract information from high (>3) dimensional data
− deal with general properties of the data rather than depending on some specific physical interpretation

aka
− pattern recognition, statistical data analysis, statistical learning...

non-trivial data analysis of large and/or multidimensional and/or complicated data sets

Page 18

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 16

Goals of this seminar series

Learn about machine learning methods
Learn about interesting problems in astronomy and how machine learning methods have been applied to them
Get experience in presenting, critically assessing and discussing work / articles
Learn about other branches of astronomy
Get ideas for research projects or methods to apply to your problems
Activate the brain and broaden your horizons!

Page 19

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 17

Format of each seminar

One topic per week, e.g. “stellar population fitting”
Based around published work or own work
Led by one or two volunteers
− identify and suggest literature (to me)
− make one or two presentations (~30 mins total)
Discuss topic
− specific and broader themes
− others can prepare comments in advance

Page 20

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 18

How to prepare / present seminar

Introduce topic
− context, significance etc.
Present articles / own work
− summarise goals, methods, results, conclusions
− take figures from articles
− go into maths where necessary and useful
− should be a critical assessment: assumptions; advantages and disadvantages; problems; outright errors
− mention alternative methods, conflicting results
Suggest literature to me at least two weeks in advance (I will circulate / put on web page)

Page 21

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 19

Example topics

Signal detection
Object classification
Stellar spectral parametrization
Galaxy and quasar spectral parametrization
(Galaxy) morphological classification
Time series
Spatial clustering
Population analysis
Class discovery
Large scale structure

Page 22

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy X

Sources for detailed topics / literature

keyword search in ADS, arXiv
Classification and Discovery in Large Astronomical Surveys
− http://www.mpia-hd.mpg.de/class2008/
come and talk to me

Page 23

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 20

Concepts and themes

supervised, unsupervised, semi-supervised methods
definition of models
preprocessing and feature selection
inference, learning and generalization
curse of dimensionality
model complexity control (regularization)
model assessment
model comparison and hypothesis testing
Bayesian vs. orthodox (frequentist) methods

Page 24

C.A.L. Bailer-Jones. Applications of Machine Learning in Astronomy 21

8 April
15 April
22 April
29 April
6 May
13 May
20 May
27 May
3 June
10 June
17 June
24 June
1 July
8 July