2008 NVO Summer School

Scientific Data Mining in Astronomy

Kirk D. Borne
George Mason University

kborne@gmu.edu
http://classweb.gmu.edu/kborne/

THE US NATIONAL VIRTUAL OBSERVATORY


OUTLINE

• Scientific Databases
• Some key astronomy problems
• Astronomy Data Mining examples
• Suggested Reading
• Some Data Mining Software
• Summary


10 Unique Features of Scientific Data

• Each of these characteristics requires special handling beyond what you read in standard data mining textbooks:

1. Scientific data depend on experimental equipment and conditions.
2. Scientific data have noise.
3. Scientific data have been (or need to be) calibrated.
4. Scientific units on data values are imperative.
5. Scientific databases often contain associated columns: { value, error }.
6. Scientific data values are often non-linear (log values, magnitudes, asinh).
7. History of scientific data creation, processing, and versioning is critical = Provenance.
8. Metadata, Metadata, Metadata = tells us “who, what, when, where, how”. NOTE: Semantic Metadata are becoming more important = “why”.
9. Context is critical (e.g., brightness in an optical catalog is expressed in mags, but expressed in counts/sec in an X-ray catalog, or milli-Jansky in a radio catalog).
10. Scientific data have different levels of abstraction: raw, calibrated, reduced data products, derived information, extracted knowledge, published results.

• All of this makes the “Data Preparation” phase of any scientific data mining experiment even more critical and essential.
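Items 5 and 6 can be made concrete with a small data-preparation sketch: converting a { value, error } pair from linear flux to the logarithmic magnitude scale while carrying the error along. The function name and the `zero_point` parameter are illustrative assumptions, and the error uses first-order propagation only.

```python
import math

def flux_to_mag(flux, flux_err, zero_point=0.0):
    """Convert a linear flux and its error to a (magnitude, error) pair.

    Assumes the classical definition m = zero_point - 2.5 * log10(flux)
    and first-order error propagation, valid when flux_err << flux.
    """
    if flux <= 0:
        raise ValueError("logarithmic magnitudes need a positive flux")
    mag = zero_point - 2.5 * math.log10(flux)
    # d(mag)/d(flux) = -2.5 / (ln(10) * flux), so the error scales as flux_err / flux
    mag_err = (2.5 / math.log(10)) * (flux_err / flux)
    return mag, mag_err

# Hypothetical measurement: 100 +/- 1 counts with an assumed zero point of 25
mag, mag_err = flux_to_mag(100.0, 1.0, zero_point=25.0)
print(f"{mag:.3f} +/- {mag_err:.4f} mag")
```

The check on non-positive flux is the reason asinh magnitudes (item 6) exist: a noisy faint source can have a negative measured flux, which the plain logarithm cannot represent.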


Some key astronomy problems

• Some key astronomy problems that can be addressed with data mining techniques:

• Cross-Match objects from different catalogues
• The distance problem (e.g., Photometric Redshift estimators)
• Star-Galaxy Separation
• Cosmic-Ray Detection in images
• Supernova Detection and Classification
• Morphological Classification (galaxies, AGN, gravitational lenses, ...)
• Class and Subclass Discovery (brown dwarfs, methane dwarfs, ...)
• Dimension Reduction = Correlation Discovery
• Learning Rules for improved classifiers
• Classification of massive data streams
• Real-time Classification of Astronomical Events
• Clustering of massive data collections
• Novelty, Anomaly, Outlier Detection in massive databases
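The first problem in this list, cross-matching, can be sketched in its simplest positional form: pair each object with the nearest object in a second catalogue within a fixed search radius. The catalogue layout `(id, ra, dec)`, the function names, and the 1-arcsecond default radius are assumptions for illustration; a brute-force double loop like this one would not scale to survey-sized catalogues.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation via the haversine formula (degrees in, arcsec out)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0

def cross_match(cat_a, cat_b, radius_arcsec=1.0):
    """Map each id in cat_a to the nearest cat_b id within the radius, else None."""
    matches = {}
    for id_a, ra_a, dec_a in cat_a:
        best_id, best_sep = None, radius_arcsec
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_sep_arcsec(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        matches[id_a] = best_id
    return matches

# Hypothetical catalogues: (id, RA in degrees, Dec in degrees)
optical = [("a1", 10.0, 0.0), ("a2", 50.0, 20.0)]
infrared = [("b1", 10.0, 0.0002), ("b2", 120.0, -5.0)]
print(cross_match(optical, infrared))
```

A fixed radius ignores the positional uncertainties of each catalogue; the probabilistic approaches in the reading list address exactly that limitation.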


Classification Methods: Decision Trees, Neural Networks, SVM (Support Vector Machines)

There are 2 Classes!

How do you ...
- Separate them?
- Distinguish them?
- Learn the rules?
- Classify them?

Apply Kernel (SVM)
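The "apply kernel" idea can be illustrated without a full SVM. In the sketch below, two hypothetical classes sit on concentric rings, so no straight line separates them in the (x, y) plane. Adding the feature r² = x² + y², which a degree-2 polynomial kernel includes implicitly, makes a single threshold sufficient. This is an explicit feature map on made-up data, not a margin-maximizing SVM.

```python
import math

def ring(radius, n=12):
    """n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

inner = ring(0.5)   # class A (hypothetical data)
outer = ring(2.0)   # class B, surrounding class A

def lift(point):
    """Lifted feature r^2 = x^2 + y^2."""
    x, y = point
    return x * x + y * y

# Place the decision threshold halfway between the two classes in the lifted feature
threshold = (max(map(lift, inner)) + min(map(lift, outer))) / 2

def classify(point):
    """Linear rule in the lifted space: one cut on r^2."""
    return "A" if lift(point) < threshold else "B"

print(classify((0.1, 0.2)), classify((1.8, -1.0)))
```

A real SVM would choose the separating surface by maximizing the margin and would evaluate the kernel on pairs of points rather than computing the lifted coordinates directly.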


Decision Tree Classification Example: SKICAT Star-Galaxy Discrimination
Reference: ftp://iraf.noao.edu/iraf/conf/web/adass_proc/adass_95/yooj/yooj.html
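A decision tree is built from repeated single-attribute splits. The sketch below learns one such split (a "decision stump") by minimizing Gini impurity. The attribute name `concentration` and its values are invented for illustration; this is not the SKICAT feature set or its tree-induction algorithm.

```python
def gini(labels):
    """Gini impurity for two classes: 2 * p * (1 - p), where p is the star fraction."""
    if not labels:
        return 0.0
    p = sum(1 for label in labels if label == "star") / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best cut on a single attribute."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between identical values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val <= thr]
        right = [lab for val, lab in pairs if val > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (thr, score)
    return best

# Hypothetical light-concentration attribute: point-like stars score high
concentration = [0.90, 0.85, 0.80, 0.95, 0.30, 0.40, 0.50, 0.45]
labels = ["star"] * 4 + ["galaxy"] * 4
threshold, impurity = best_split(concentration, labels)
print(threshold, impurity)
```

A full tree applies the same search recursively to each side of the cut, over all available attributes, until the leaves are pure enough.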


Decision Tree Classification Example: Classification of candidates for new supernovae in galaxies
Reference: http://spiff.rit.edu/richmond/sdss/sn_survey/scan_manual/sn_scan.html


Clustering is used to discover the different unique groupings (classes) of attribute values.
The case shown below is not obvious: one or two groups?


This case is easier: there are two groups.
(In fact, this is the same set of data elements as shown on the previous slide, but plotted here using a different attribute.)
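The point of these two slides, that the same objects can look like one group in one attribute and two groups in another, can be sketched with a two-cluster k-means run separately on each attribute. The data values and the `separation` score (gap between cluster centers divided by the mean within-cluster spread) are ad hoc choices for illustration, not a standard statistic.

```python
def kmeans_1d(values, iters=20):
    """Two-cluster k-means in one dimension; returns (centers, labels)."""
    centers = [min(values), max(values)]
    labels = []
    for _ in range(iters):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return centers, labels

def separation(values):
    """Ad hoc score: distance between centers over mean within-cluster spread."""
    centers, labels = kmeans_1d(values)
    spread = sum(abs(v - centers[lab]) for v, lab in zip(values, labels)) / len(values)
    return abs(centers[1] - centers[0]) / spread if spread > 0 else float("inf")

# Hypothetical objects, each with two attributes (a, b)
objects = [(1.0, 0.1), (2.0, 0.2), (3.0, 0.1), (4.0, 0.2),
           (1.5, 5.1), (2.5, 5.2), (3.5, 5.1), (4.5, 5.2)]
attr_a = [a for a, _ in objects]
attr_b = [b for _, b in objects]

print(separation(attr_a), separation(attr_b))
```

Attribute `a` spreads the objects almost uniformly, so its score is low; attribute `b` splits them cleanly, so its score is much higher.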


Clustering in multiple dimensions: colors combined from SDSS & 2MASS magnitudes


Clustering: Class Discovery and Rule Learning

• Clusters and the separation of classes depend on which attributes (dimensions) are chosen to be projected, as in the following star-galaxy discrimination test:


• Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf

(Figure panels: one projection labeled “Not good”, one labeled “Good”.)


Semisupervised Learning: Outlier Detection

• Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf


A demonstration of a generic machine-assisted discovery problem — data mapping and a search for outliers.

This schematic illustration is of the clustering problem in a parameter space given by three object attributes: P1, P2, and P3.

In this example, most of the data points are assumed to be contained in three dominant clusters (DC1, DC2, and DC3).

However, one may want to discover less populated clusters (e.g., small groups or even isolated points), some of which may be too sparsely populated, or lie too close to one of the major data clouds.

In some cases, negative clusters (holes) may exist in one of the major data clusters.
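One simple way to search such a parameter space for isolated points is to flag objects whose nearest-neighbor distance is far larger than is typical. The sketch below does this in a three-attribute space (P1, P2, P3). The data, the function names, and the factor-of-5 cutoff are assumptions, and this approach would not find small groups or holes inside a dominant cluster.

```python
import math
import statistics

def nn_distances(points):
    """Distance from each point to its nearest other point."""
    return [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
            for i, p in enumerate(points)]

def find_outliers(points, factor=5.0):
    """Indices of points whose nearest neighbor is more than factor x the median distance away."""
    dists = nn_distances(points)
    cutoff = factor * statistics.median(dists)
    return [i for i, d in enumerate(dists) if d > cutoff]

# Hypothetical (P1, P2, P3) values: two tight clusters plus one isolated object
points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1), (0.1, 0.1, 0.0),
    (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0),
    (2.5, 9.0, 1.0),
]
print(find_outliers(points))
```

Attributes with different units or ranges would need rescaling before the distances are meaningful, which ties back to the data-preparation points earlier in the talk.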


Outlier Detection: Serendipitous Discovery of Rare or New Objects & Events


Principal Components Analysis & Independent Components Analysis

Cepheid Variables: Cosmic Yardsticks
-- One Correlation
-- Two Classes!

... Class Discovery!
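To show what PCA extracts from a single strong correlation, here is a minimal two-attribute version that uses the closed-form eigenvalues of a 2×2 covariance matrix. The data are invented to follow y ≈ 2x and do not represent Cepheid measurements; the function name is an assumption.

```python
import math

def pca_2d(xs, ys):
    """Return ((lam1, lam2), first principal axis) for two attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]]
    mean = (sxx + syy) / 2
    half = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = mean + half, mean - half
    # Eigenvector for lam1 solves (sxx - lam1) * vx + sxy * vy = 0
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)

# Hypothetical correlated measurements
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
(lam1, lam2), axis = pca_2d(xs, ys)
explained = lam1 / (lam1 + lam2)
print(f"PC1 explains {explained:.1%} of the variance; slope = {axis[1] / axis[0]:.2f}")
```

When one component captures nearly all the variance, the two attributes are effectively one-dimensional. Objects that sit far from that axis, or that form a second parallel sequence, are candidates for a separate class.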


Example: SOM (Self-Organizing Map)

• The SOM (Self-Organizing Map) is one technique for organizing information in a database based upon links between concepts.

• It can be used to find hidden relationships and patterns in more complex data collections, usually based on links between keywords or metadata.


Mega-Flares on normal Sun-like stars = a star like our Sun increased in brightness 300X in one night! … say what??

Exploring the Time Domain

Astronomy Data Mining in Action


Example: The Thinking Telescope

Sample Data Mining Applications (credit: http://www.thinkingtelescopes.lanl.gov/ ):
• Automated Feature Extraction: Real-time identification of artifacts and transients in direct and difference images.
• Classifiers: Automated classification of celestial objects based on temporal and spectral properties.
• Anomaly Detection: Real-time recognition of important deviations from normal behavior for persistent sources.


From Sensors to Sense

From Data to Knowledge: from sensors to sense

Data → Information → Knowledge


VOEventNet: a Rapid-Response Telescope Grid

(Diagram labels: GRB satellites; Palomar-Quest; PQ next-day pipelines; PQ Event Factory; catalog; baseline sky; known variables; known asteroids; SDSS; 2MASS; remote archives; Event Synthesis Engine; VOEvent database; VOEventNet; Pairitel; Palomar 60”; Raptor; eStar)

Reference: http://voeventnet.caltech.edu/


Learning From Archived Temporal Data (Time Series): Classify New Data (Bayes Analysis or Markov Modeling)
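One reading of the Markov-modeling route is sketched below: fit a first-order transition matrix per class from archived, discretized light curves, then assign a new series to the class under which it is most likely. The class names, the two-state encoding (L = low, H = high brightness), and the add-one smoothing are illustrative assumptions, and equal class priors are assumed.

```python
import math

STATES = ("L", "H")

def fit_markov(sequences, alpha=1.0):
    """Estimate transition probabilities with additive smoothing alpha."""
    counts = {s: {t: alpha for t in STATES} for s in STATES}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {s: {t: counts[s][t] / sum(counts[s].values()) for t in STATES}
            for s in STATES}

def log_likelihood(seq, model):
    """Log-probability of the observed transitions under one class model."""
    return sum(math.log(model[a][b]) for a, b in zip(seq, seq[1:]))

def classify(seq, models):
    """Pick the class whose model gives the highest likelihood (equal priors assumed)."""
    return max(models, key=lambda name: log_likelihood(seq, models[name]))

# Hypothetical archived training series for two classes
models = {
    "variable": fit_markov(["LHLHLHLH", "HLHLHLHL"]),
    "steady": fit_markov(["LLLLLLLL", "HHHHHHHH", "LLLLHHHH"]),
}
print(classify("LHLHLH", models), classify("LLLHHH", models))
```

With unequal class frequencies, a Bayes analysis would add the log prior of each class to its log-likelihood before taking the maximum.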


Photometric-Redshift Estimation

Photometric vs. Spectroscopic Redshift Estimates:
• Left panel: standard technique
• Right panel: Machine Learning (data mining) application
• Reference: http://arxiv.org/abs/0710.4482
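As a generic illustration of a machine-learning redshift estimator, and not necessarily the method of the reference above, the sketch below uses k-nearest-neighbor regression: the redshift of a new object is the mean spectroscopic redshift of the training objects closest to it in color space. The colors, redshifts, and function name are invented.

```python
import math

def knn_photoz(colors, training, k=3):
    """Mean spectroscopic redshift of the k training objects nearest in color space."""
    nearest = sorted(training, key=lambda row: math.dist(colors, row[0]))[:k]
    return sum(z for _, z in nearest) / len(nearest)

# Hypothetical training set: ((color1, color2), spectroscopic redshift)
training = [
    ((0.2, 0.1), 0.05), ((0.3, 0.2), 0.10), ((0.4, 0.3), 0.15),
    ((1.0, 0.9), 0.50), ((1.1, 1.0), 0.55), ((1.2, 1.1), 0.60),
]
print(knn_photoz((0.3, 0.2), training))
print(knn_photoz((1.1, 1.0), training))
```

The estimate can never fall outside the redshift range of the training set, so the spectroscopic sample has to cover the colors and redshifts of the objects being estimated.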


Star-Galaxy Separation in Clustered Feature Space

* = star
• = galaxy

http://arxiv.org/abs/astro-ph/9508012


Bayesian Probabilistic Estimation for Catalog Cross-Matching

• Reference: http://arxiv.org/abs/astro-ph/0605216


Fundamental Plane for 156,000 cross-matched Sloan+2MASS Elliptical Galaxies: plot shows variance captured by first 2 Principal Components as a function of local galaxy density.

(Plot axes: x = Local Galaxy Density, low to high; y = % of variance captured by PC1+PC2)

Reference: Borne, Dutta, Giannella, Kargupta, & Griffin 2008


Suggested Reading: Data Mining in Astronomy

• Djorgovski et al. 2000, Searches for Rare and New Types of Objects. http://arxiv.org/abs/astro-ph/0012453
• Djorgovski et al. 2000, Exploration of Large Digital Sky Surveys. http://arxiv.org/abs/astro-ph/0012489
• Djorgovski et al. 2001, Exploration of Parameter Spaces in a Virtual Observatory. http://arxiv.org/abs/astro-ph/0108346
• Mining the Sky, 2001, published proceedings of ESO conference.
• Suchkov et al. 2003, Automated Object Classification with ClassX. astro-ph/0210407
• Suchkov, Hanisch, & Margon 2005, A Census of Object Types and Redshift Estimates in the SDSS Photometric Catalog from a Trained Decision Tree Classifier. http://adsabs.harvard.edu/abs/2005AJ....130.2439S
• Giannella et al. 2006, Distributed Data Mining for Astronomy Catalogs. http://www.cs.umbc.edu/~hillol/PUBS/Papers/Astro.pdf
• Rohde et al. 2006, Matching of Catalogues by Probabilistic Pattern Classification. http://adsabs.harvard.edu/abs/2006MNRAS.369....2R
• Budavari & Szalay 2008, Probabilistic Cross-Identification of Astronomical Sources. http://adsabs.harvard.edu/abs/2008ApJ...679..301B


Suggested Reading, continued: Data Mining in Astronomy

• Odewahn et al. 1993, Star-Galaxy Separation with a Neural Network. 2: Multiple Schmidt Plate Fields. http://adsabs.harvard.edu/abs/1993PASP..105.1354O
• Borne 2000, Science User Scenarios for a Virtual Observatory Design Reference Mission: Science Requirements for Data Mining. astro-ph/0008307
• Brunner et al. 2001, Massive Datasets in Astronomy. astro-ph/0106481
• Gray et al. 2002, Data Mining the SDSS SkyServer Database. http://arxiv.org/abs/cs/0202014
• Odewahn et al. 2004, The Digitized Second Palomar Observatory Sky Survey (DPOSS). III. Star-Galaxy Separation. http://adsabs.harvard.edu/abs/2004AJ....128.3092O
• Ball, Brunner, et al. 2006, Robust Machine Learning Applied to Astronomical Data Sets. I. Star-Galaxy Classification of the Sloan Digital Sky Survey DR3 Using Decision Trees. http://adsabs.harvard.edu/abs/2006ApJ...650..497B
• Ball, Brunner, et al. 2007, Robust Machine Learning Applied to Astronomical Data Sets. II. Quantifying Photometric Redshifts for Quasars Using Instance-based Learning. http://adsabs.harvard.edu/abs/2007ApJ...663..774B
• Ball, Brunner, et al. 2008, Robust Machine Learning Applied to Astronomical Data Sets. III. Probabilistic Photometric Redshifts for Galaxies and Quasars in the SDSS and GALEX. http://adsabs.harvard.edu/abs/2008ApJ...683...12B


Suggested Reading, continued: Data Mining in Astronomy

• Rogers & Riess 1994, Detection and Classification of CCD Defects with an Artificial Neural Network. http://adsabs.harvard.edu/abs/1994PASP..106..532R
• Feeney et al. 2005, Automated Detection of Classical Novae with Neural Networks. http://adsabs.harvard.edu/abs/2005AJ....130...84F
• Wadadekar 2005, Estimating Photometric Redshifts Using Support Vector Machines. http://adsabs.harvard.edu/abs/2005PASP..117...79W
• Bazell & Miller 2005, Class Discovery in Galaxy Classification. http://adsabs.harvard.edu/abs/2005ApJ...618..723B
• Bazell, Miller, & SubbaRao 2006, Objective Subclass Determination of Sloan Digital Sky Survey Spectroscopically Unclassified Objects. http://adsabs.harvard.edu/abs/2006ApJ...649..678B
• Ferreras et al. 2006, A Principal Component Analysis approach to the Star Formation History of Elliptical Galaxies in Compact Groups. http://adsabs.harvard.edu/abs/2006MNRAS.370..828F
• Way & Srivastava 2006, Novel Methods for Predicting Photometric Redshifts from Broadband Photometry Using Virtual Sensors. http://adsabs.harvard.edu/abs/2006ApJ...647..102W
• Carliles et al. 2007, Photometric Redshift Estimation on SDSS Data Using Random Forests. http://arxiv.org/abs/0711.2477


Some Data Mining Software & Projects

• General data mining software packages:
– Weka (Java): http://www.cs.waikato.ac.nz/ml/weka/
– Weka4WS (Grid-enabled): http://grid.deis.unical.it/weka4ws/
– RapidMiner: http://www.rapidminer.com/

• Astronomy-specific software and/or user clients:
– VO-Neural: http://voneural.na.infn.it/
– AstroWeka: http://astroweka.sourceforge.net/
– OpenSkyQuery: http://www.openskyquery.net/
– ALADIN: http://aladin.u-strasbg.fr/
– MIRAGE: http://cm.bell-labs.com/who/tkh/mirage/
– AstroBox: http://services.china-vo.org/

• Astronomical and/or Scientific Data Mining Projects:
– GRIST: http://grist.caltech.edu/
– ClassX: http://heasarc.gsfc.nasa.gov/classx/
– LCDM: http://dposs.ncsa.uiuc.edu/
– F-MASS: http://www.itsc.uah.edu/f-mass/
– NCDM: http://www.ncdm.uic.edu/


Weka: http://www.cs.waikato.ac.nz/ml/weka/

• Weka is in your NVOSS software distribution.
• Weka is a collection of open source machine learning algorithms for data mining tasks.
• Weka algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka comes with its own GUI.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.


AstroWeka: http://astroweka.sourceforge.net/

http://www.iterating.com/products/Weka
http://weka.sourceforge.net/wekadoc/index.php/en:Knowledge_Flow_%283.4.10%29


ALADIN: http://aladin.u-strasbg.fr/


MIRAGE: http://cm.bell-labs.com/who/tkh/mirage/

Java Package for exploratory data analysis (EDA), correlation mining, and interactive pattern discovery.


Science is Knowledge Work

• Knowledge Discovery is the central theme of science.

• Knowledge Discovery in Databases (KDD) is the killer app for large scientific databases.

• Therefore, KDD (i.e., Data Mining) is an essential tool, since “big-data” science is here to stay (at petabytes and beyond).

Data → Information → Knowledge


Scientific Knowledge Discovery


Heliophysics Space Weather Example


Sun-Earth Space Environment – Rich Source of Heliophysical Phenomena


Multi-point Observations and Models of Space Plasmas Deliver a Deluge of Physical Measurements


Heliophysics Space Weather Example

CME = Coronal Mass Ejection
SEP = Solar Energetic Particle


Data Mining: It is more than just connecting the dots

Reference: http://homepage.interaccess.com/~purcellm/lcas/Cartoons/cartoons.htm


Sample Astronomy Data Mining Application Ideas for your Projects

– Neural Network for Pixel Classification: Event Detection and Prediction (e.g., Supernova or Cosmic-ray hit?)

– Bayesian Network for Object Classification (star or galaxy?)

– PCA for finding Fundamental Planes of Galaxy Parameters

– PCA (weakest component) for Outlier Detection: anomalies, novel discoveries, new objects

– Link Analysis (Association Mining) for Causal Event Detection (e.g., linking optical transients with gamma-ray events)

– Clustering analysis: Spatial, Temporal, or any scientific database parameters

– Markov models: Temporal mining, classification, and prediction from time series data