
Introduction to Data Mining (in Astronomy)

ADASS 2007 Tutorial

Sabine McConnell
Department of Computer Science/Studies

Trent University

Outline
• Introduction
• The Data
• Classification
• Clustering
• Evaluation of Results
• Increasing the Accuracy
• Some Issues and Concerns
• Weka
• References

What is data mining?

“The non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.”

(Piatetsky-Shapiro)

“The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”

(Hand)

Data mining is a combination of:

- machine learning
- statistics
- databases
- visualization
- application domain


(Some) Applications of data-mining techniques

• Science: bioinformatics, discovery of drugs, astronomy

• Government: law enforcement, income tax, anti-terror

• Business: Market basket analysis, targeted marketing

• Engineering: Satellite navigation

Data mining in astronomy

• classification of stars, galaxies, and planetary nebulae, based on both images and spectral parameters
• star/galaxy separation
• forecasting of sunspots and of geomagnetic storms from solar wind
• forecasting of seeing
• gravitational wave signal detection
• antimatter search in cosmic rays
• selection of quasar candidates
• detection of expanding HI shells
• ...many more

View of the Dataset (= Matrix)

object ID  _RAJ2000    _DEJ2000   distance  flags  x size  y size  U-B   error   Bar?  class
134633     00 03 09.1  +21 57 34  398       A      1629    1654    14.4  low     no    Irr
3555432    00 03 48.8  +07 28 45  113       D      939     1332    14    medium  yes   Spiral
3432223    00 03 58.6  +20 45 07  835       A      1713    2219    12.7  low     no    Ell
124123     00 05 53.0  +22 32 14  398       A      1092    1400    0     low     no    Irr
333456     00 06 21.4  +17 26 03  398       A      1121    1419    15.1  low     no    Irr
3355478    00 07 16.7  +27 42 31  398       A      1343    1810    13.4  high    no    Spiral
875        00 07 16.1  +08 18 03  879       A      1095    1281    14.6  medium  yes   Spiral
33378      00 08 10.7  +27 00 15  578       A      1154    1493    14.4  high    no    Irr
569433     00 08 20.5  +40 37 54  398       A      1661    1683    0     low     no    Irr
3321347    00 09 54.3  +25 55 28  778       A      1961    2180    12.5  low     no    Spiral
5464648    00 10 47.7  +33 21 18  79        B      929     1359    13.5  high    no    Ell
454345476  00 12 49.9  +77 47 44  398       A      1393    1671    0     low     no    Irr
4646788    00 13 27.5  +17 29 16  398       A      1141    1573    14.2  medium  yes   Spiral

• levels of measurement (nominal, ordinal, interval, ratio)
• numeric vs. categorical

The data-mining process (knowledge discovery in databases)


Data Preparation: Preprocessing and Algorithms
• Neural networks like data to be scaled
• Decision trees do not care about scaling, but work better with discrete attributes that have small numbers of possible values
• Neural networks can handle irrelevant or redundant attributes, while these may lead to large decision trees
• Neural networks do not like noisy data, especially for small datasets, while decision trees do not care much about noise
• Nearest-neighbour approaches can handle noise if a certain parameter is adjusted
• Distance-based approaches do not work well if the attributes are not equally weighted, and typically work with numerical data only
• Expectation-Maximization approaches can deal with missing data, but k-means techniques require substitution of missing data
• ...

A Comparison of Neural Network Algorithms and Preprocessing Methods for Star-Galaxy Discrimination, D. Bazell and Y. Peng, Astrophysical Journal Supplement Series:47-55, May 1998

Data Preparation Issues

• transformation of attribute types
• selection of attributes
• transformation of attributes
• normalization of attribute values
• sampling
• missing values

Data preparation: transformation of attribute types

• categorical to numeric
• numeric to categorical


Transformation: categorical to numeric

• map to circle, sphere, or hypersphere
  – may work if the categories are ordinal (e.g. days of the week)
  – usually produces poor results otherwise
• map to generalized tetrahedron
  – to uniquely represent k possible attribute values, we need k new attributes
  – example: an attribute with three possible values (circle, square, triangle) maps to three new attributes with the values (1,0,0) for circle, (0,1,0) for square, and (0,0,1) for triangle
  – works for both ordinal and nominal data (see the sketch below)
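The tetrahedron mapping is what is now usually called one-hot encoding; a minimal Python sketch (not from the tutorial; function name and sample values are illustrative):

# One-hot ("generalized tetrahedron") encoding: k categories -> k binary attributes.
def one_hot(values):
    """Map each categorical value to a k-dimensional 0/1 vector."""
    categories = sorted(set(values))          # fix an ordering of the k categories
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

shapes = ["circle", "square", "triangle", "circle"]
print(one_hot(shapes))
# [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]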

Transformation: numeric to categorical

• some data-mining algorithms require data to be categorical
• we may have to transform continuous attributes into categorical attributes: discretization
• or transform continuous and discrete data into binary data: binarization
• we also have to distinguish between unsupervised (no use of class information) and supervised (use of class information) discretization methods
  – unsupervised: equal-width or equal-frequency binning, k-means, visual inspection
  – supervised: use some measure of impurity of the bins
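A Python sketch of the two unsupervised binning methods (not from the tutorial; numpy-based, function names and values illustrative):

import numpy as np

def equal_width_bins(x, k):
    """Unsupervised discretization: split the range of x into k equal-width bins."""
    edges = np.linspace(x.min(), x.max(), k + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, k - 1)

def equal_frequency_bins(x, k):
    """Unsupervised discretization: bins hold (roughly) the same number of samples."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    return np.clip(np.digitize(x, edges[1:-1]), 0, k - 1)

x = np.array([0.1, 0.2, 0.3, 5.0, 5.1, 9.9])
print(equal_width_bins(x, 3))      # [0 0 0 1 1 2]
print(equal_frequency_bins(x, 3))  # [0 0 1 1 2 2]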

Data preparation: attribute selection

• Remove irrelevant or redundant attributes to reduce the dimensionality of the dataset
• Preserve the probability distribution of the classes present in the data as much as possible
  – Filter approach: start with the empty set, add attributes one at a time
  – Wrapper approach: start with the full set, remove attributes one at a time
  – Reduce search time by combining the two methods
  – alternative: use the upper levels of a decision tree, provided there are class labels in the data

Data preparation: transformation of attributes

• Two popular methods:
  – Wavelet transforms
  – Principal component analysis
• express the data in terms of new attributes
• reduce the number of attributes by truncating
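A numpy sketch of PCA-based truncation (not from the tutorial; function name and toy data illustrative):

import numpy as np

def pca_reduce(X, d):
    """Project an n x p data matrix X onto its first d principal components."""
    Xc = X - X.mean(axis=0)                      # centre each attribute
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                         # truncate to d new attributes

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(pca_reduce(X, 2).shape)                    # (100, 2)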


Data preparation: normalization

• min-max normalization
• z-score normalization (standardization)
• normalization by decimal scaling
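Sketches of the three normalizations in Python (not from the tutorial; numpy-based, toy values):

import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    """Min-max normalization to [new_min, new_max]."""
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    """Z-score normalization (standardization): zero mean, unit variance."""
    return (x - x.mean()) / x.std()

def decimal_scaling(x):
    """Divide by the smallest power of 10 that brings all values into (-1, 1)."""
    j = int(np.ceil(np.log10(np.abs(x).max())))
    return x / 10 ** j

x = np.array([120.0, 350.0, 980.0])
print(min_max(x))          # [0.         0.26744186 1.        ]
print(decimal_scaling(x))  # [0.12 0.35 0.98]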

Data preparation: sampling

• Reduce the number of objects (rows) in the dataset
  – simple random sample without replacement
  – simple random sample with replacement
  – cluster sample
  – stratified sample

Data preparation: sampling

• stratified sample: preserves the original distribution of classes
• undersampling/oversampling: equalizes the distribution of classes

Missing values

• data may be missing completely at random, missing at random, or not missing at random (censored)
• depending on why the data is missing, we can use
  – casewise data deletion
  – mean substitution
  – regression
  – hot-deck methods
  – maximum-likelihood methods
  – multiple imputation
  – ...
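As an illustration, mean substitution replaces each missing entry with its column's observed mean; a numpy sketch (not from the tutorial; function name illustrative):

import numpy as np

def mean_substitute(X):
    """Replace NaNs in each column with that column's observed mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    missing = np.isnan(X)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    return X

X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
print(mean_substitute(X))
# [[ 1. 10.]
#  [ 2. 20.]
#  [ 3. 15.]]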


Building the model

Data-mining categories

• Classification
• Clustering
• Visualization
• Association Rule Mining
• Summarization
• Outlier detection
• Deviation detection
• ...

Models vs. Patterns

• Models:
  – large-scale description of the data
  – describe/predict/summarize the most common cases
• Patterns:
  – small scale
  – local models
  – association rules, outliers
  – often the most interesting objects

Predictive vs. Descriptive Techniques

• Data-mining techniques can be either
  - predictive (supervised)
  - descriptive (unsupervised)
• predictive: predict a (discrete) class attribute based on the other attribute values. This is like learning from a teacher. → classification
• descriptive: discover the structure of the data without prior knowledge of class labels → clustering
• evolving area: semi-supervised (combines predictive and descriptive methods)


Example: Automated morphological classification of APM galaxies by supervised artificial neural networks, Naim et al., MNRAS 275, 567-590 (1995)

• 830 galaxy images (diameter limited) from the APM Equatorial Catalogue of Galaxies
• 24 parameters (inputs), including ellipticity, surface brightness, bulge size, arm number, length, and intensity
• output: Revised Hubble Type of the galaxy
• galaxies classified by 6 human experts according to the Revised Hubble System, and by a supervised neural network
• result: the rms error for classification by the networks, compared with the mean types of the expert classifications (1.8 Revised Hubble Types), is comparable to the rms dispersion between the experts

Predictive data mining: classification

(Learn a model to predict future target variables)

Given a set of points from known classes, what is the class of a new point? Is the new point a star or a galaxy?

[figure: scatter plot of two labelled classes, galaxies and stars]

Predictive data mining: classification (Decision Trees)

if y > 2 then
    if x > 5 then blue
    else
        if x > 4 then red
        else blue
else
    if x > 2 then red
    else blue

[figure: the corresponding partition of the x-y plane, and the decision tree with the splits y > 2, x > 5, x > 4, and x > 2]

Decision Trees: choosing a splitting criterion

entropy: $\mathrm{entropy}(t) = -\sum_{i=0}^{c-1} p(i|t)\,\log_2 p(i|t)$

gini: $\mathrm{gini}(t) = 1 - \sum_{i=0}^{c-1} \left[p(i|t)\right]^2$

classification error: $\mathrm{classification\_error}(t) = 1 - \max_i\left[p(i|t)\right]$
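All three criteria are functions of the class proportions p(i|t) at a node; a small Python sketch (not from the tutorial; numpy-based, names illustrative):

import numpy as np

def entropy(p):
    """entropy(t) = -sum_i p(i|t) log2 p(i|t); by convention 0*log(0) = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """gini(t) = 1 - sum_i p(i|t)^2."""
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

def classification_error(p):
    """classification_error(t) = 1 - max_i p(i|t)."""
    return 1.0 - np.max(np.asarray(p, dtype=float))

p = [3 / 7, 4 / 7]                 # class proportions at a node
print(entropy(p), gini(p), classification_error(p))
# 0.985... 0.489... 0.428...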


Decision tree: measuring the impurity of a node

goal: a large change in impurity I after the split

$\mathrm{gain} = I(\mathrm{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N}\, I(v_j)$

(called information gain when entropy is the impurity measure)

Worked example: a parent node with 3 class-A and 4 class-B samples is split into a left child with 2 A, 1 B and a right child with 1 A, 3 B.

              class A   class B
parent        3/7       4/7
left child    2/3       1/3
right child   1/4       3/4

$\mathrm{gini}(\mathrm{parent}) = 1 - (3/7)^2 - (4/7)^2 \approx 0.49$

$\mathrm{gini}(\mathrm{left\ child}) = 1 - (2/3)^2 - (1/3)^2 \approx 0.44$

$\mathrm{gini}(\mathrm{right\ child}) = 1 - (1/4)^2 - (3/4)^2 = 0.375$

$\mathrm{gain} = 0.49 - \frac{3}{7}(0.44) - \frac{4}{7}(0.375) \approx 0.09$

repeat for all possible splits and choose the best split.
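A quick check of the worked example in Python (a sketch, not tutorial code):

def gini(p):
    return 1.0 - sum(q ** 2 for q in p)

# Parent: 3 class-A and 4 class-B samples; split into a left child (2 A, 1 B)
# and a right child (1 A, 3 B), as in the example above.
g_parent = gini([3 / 7, 4 / 7])    # 24/49 ≈ 0.49
g_left = gini([2 / 3, 1 / 3])      # 4/9  ≈ 0.44
g_right = gini([1 / 4, 3 / 4])     # 6/16 = 0.375
gain = g_parent - (3 / 7) * g_left - (4 / 7) * g_right
print(round(gain, 3))              # 0.085 (≈ 0.09)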

Decision Trees: extensions

• oblique decision trees: allow test conditions that involve multiple attributes
• regression trees: the value assigned to a datum is the average of the values in the node
• random forests: build multiple decision trees that include a random factor when choosing the attributes to split on

Characteristics of Decision Trees
• decision boundaries are typically axis-parallel
• can handle both numeric and nominal attributes
• for nominal attributes, decision trees tend to favor the selection of attributes with larger numbers of possible values as splitting criteria
• the runtime is dominated by the sorting of numeric attributes; classification is therefore fairly fast in typical settings
• can easily be converted to (possibly suboptimal) rule sets
• pruning of trees is recommended to reduce their complexity; the pruning strategy is more important than the choice of splitting criterion
• robust to noise


Example: Decision Trees for Automated Identification of Cosmic-Ray Hits in Hubble Space Telescope Images, Salzberg et al., Publications of the Astronomical Society of the Pacific 107:279-288, March 1995

• oblique decision tree, starting at random locations for the hyperplanes
• overcomes local maxima by perturbing the hyperplanes and restarting the search at a new location
• compares results from 5 different decision trees
• reduction of the feature set; use of decision trees to confirm labeling
• over 95% accuracy for single, unpaired images

Predictive data mining: classification (Neural Networks)

- more complex borders
- more accurate
- may overfit the data

[figure: feed-forward network with data entering an input layer, passing through hidden layers, and leaving at an output layer]

Neural Networks: Backpropagation

randomly initialize weights
repeat until stopping criterion satisfied
    for each sample do
        1. present sample to input nodes
        2. propagate data through layers, using weights and activation functions
        3. calculate results at output nodes
        4. determine error at output nodes
        5. propagate error backwards to adjust the weights
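A minimal numpy sketch of this loop (not from the tutorial): one hidden layer of 4 units, sigmoid activations, toy data, batch weight updates rather than the per-sample updates shown above, and a fixed number of epochs as the stopping criterion:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                        # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)[:, None]   # toy binary target

# randomly initialize weights (one hidden layer of 4 units)
W1, W2 = rng.normal(size=(2, 4)) * 0.5, rng.normal(size=(4, 1)) * 0.5

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(1000):                 # stopping criterion: fixed epoch count
    h = sigmoid(X @ W1)                   # propagate through the hidden layer
    out = sigmoid(h @ W2)                 # results at the output nodes
    err = out - y                         # error at the output nodes
    # propagate the error backwards and adjust the weights (learning rate 0.1)
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.1 * h.T @ d_out
    W1 -= 0.1 * X.T @ d_h

print(np.mean((out > 0.5) == (y > 0.5)))  # training accuracy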

Neural Networks: Extensions

• Madalines
• Adaptive Multilayer Networks
• Prediction Networks
• Winner-Take-All Networks
• Counterpropagation Networks
• Learning Vector Quantizers
• Principal Component Analysis Networks
• Hopfield Networks
• ...


Applications of Neural Networks in Astronomy:

• star/galaxy separation
• spectral and morphological classification of galaxies
• spectral classification of stars
• determining the number of binary stars in a cluster
• reducing input dimensionality
• classification of planetary nebulae
• predictions of solar flux and sunspots
• classification of asteroid spectra
• adaptive optics
• spacecraft control
• interpolation of the HI distribution in Perseus
• classification of white dwarfs
• detection and classification of CCD defects
• search for antimatter
• ...

Example: The use of Neural Networks to probe the structure of the nearby universe, d'Abrusco et al., to appear in the proceedings of the Astronomical Data Analysis IV workshop held in Marseille in 2006

• supervised neural network applied to SDSS data
• training data: spectroscopic, containing 449 370 galaxies
• training data divided into training, validation, and test sets
• output: distance estimates for roughly 30 million galaxies distributed over 8 000 sq. deg.
• provides a list of candidate AGN and QSOs

Characteristics of Artificial Neural Networks

• slow
• poor interpretability of results
• able to approximate any target function
• can learn to ignore irrelevant or redundant attributes
• easy to parallelize
• may converge to a local minimum because of greedy optimization, but convergence to the global minimum can be approached through simulated annealing
• choice of network structure is non-trivial and time-consuming
• sensitive to noise (a validation set may help here)

Lazy learners: nearest-neighbour techniques

Lazy learners do not build models: when a new datum is to be classified, it is assigned the majority class of its neighbours.
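A minimal sketch of this idea in Python (not from the tutorial; the toy star/galaxy data are made up):

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Assign x_new the majority class among its k nearest training neighbours."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y_train = ["star", "star", "galaxy", "galaxy"]
print(knn_classify(X_train, y_train, np.array([4.8, 5.2])))   # galaxy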


Characteristics of Nearest-Neighbour Algorithms

• slow
• does not work well with noisy data
• does not provide the user with a model
• new data can easily be incorporated because of the lack of a model
• easy to parallelize
• may not work well if attributes are not equally relevant
• decision boundaries are piece-wise linear

Difference between predictive and descriptive approaches

• lack of class labels in the descriptive case: we need to establish a correspondence between clusters and real-life types of objects
• for predictive approaches, it is easier to see if there is agreement with human experts
• evaluation of descriptive approaches is much harder
• descriptive approaches avoid the bias that may be introduced by existing class labels, but introduce biases of their own (choice of distance measure, algorithm, and number of clusters)

Descriptive data mining: clustering

Goal: find clusters of similar objects (find groups of similar galaxies)

- which algorithm should I use?
- when are objects similar?

Overview of Clustering Techniques

• major distinction: partitioning-based vs. hierarchical methods (fixed number vs. variable number of clusters)
• hierarchical methods are further divided into agglomerative and divisive clustering
  – agglomerative methods initially assign each sample to a separate cluster, then merge the clusters that are closest to each other in successive steps
  – divisive methods start with one cluster containing all the data, then repeatedly split the cluster(s) until each sample belongs to a separate cluster

[figure: dendrogram produced by hierarchical clustering, next to a partition-based grouping of the same data]


Distance measures for objects

• Manhattan distance
• Euclidean distance
• Squared Euclidean distance
• Chebychev distance
• Hamming distance
• Percent disagreement
• ...

Distance measures for clusters

• minimum distance (single linkage, nearest neighbour):
  $d_{\min} = \min |p - p'|$, where p and p' are from different clusters
• maximum distance (complete linkage, farthest neighbour):
  $d_{\max} = \max |p - p'|$, where p and p' are from different clusters
• mean distance:
  $d_{\mathrm{mean}} = |m_i - m_j|$, where m indicates a cluster center
• average distance:
  $d_{\mathrm{average}} = \frac{1}{n_i n_j} \sum_p \sum_{p'} |p - p'|$, where p and p' are from different clusters

the choice of distance measure for clusters will determine the cluster shape!
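A Python sketch that computes all four cluster distances for two small clusters, using Euclidean distance between objects (not from the tutorial; names and data illustrative):

import numpy as np

def linkage_distances(A, B):
    """Distances between two clusters (rows of A and B are objects)."""
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return {
        "single (min)": pairwise.min(),                   # nearest neighbour
        "complete (max)": pairwise.max(),                 # farthest neighbour
        "mean": np.linalg.norm(A.mean(0) - B.mean(0)),    # between cluster centers
        "average": pairwise.mean(),                       # over all pairs
    }

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])
print(linkage_distances(A, B))
# {'single (min)': 3.0, 'complete (max)': 5.0, 'mean': 4.0, 'average': 4.0}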

Descriptive data mining: clustering (K-means)

(numerical data only)

1) Randomly pick k cluster centers
2) Assign every object to its nearest cluster center
3) Move each cluster center to the mean of its cluster
4) Repeat steps 2 and 3 until a stopping criterion is satisfied

(a minimal implementation is sketched after the step-by-step illustration below)

K-means algorithm, step 1: randomly choose k cluster centers
[figure: data points with the randomly chosen centers marked x]


K-means algorithm, step 2: assign each point to the closest cluster center

K-means algorithm, step 3: move the cluster centers to represent the means of the clusters

K-means algorithm, step 4: reassign the points to the closest cluster center

K-means algorithm, step 5: move cluster centers


K-means algorithm, step 6: reassign points and move cluster centers again, or terminate?
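A minimal numpy sketch of the algorithm above (not from the tutorial; toy data and seed are illustrative, and it assumes no cluster goes empty during the iterations):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: random initial centers, then alternate assign/move."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # step 1: pick k centers
    for _ in range(n_iter):                             # stopping criterion: n_iter
        # step 2: assign every object to its nearest cluster center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # step 3: move each center to the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):           # converged: terminate early
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
print(kmeans(X, 2)[1])    # two centers, one near (0, 0) and one near (5, 5)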

Characteristics of k-means

• requires a user-specified number of clusters
• often converges to a local optimum
• does not perform well in the presence of outliers and noise
• is only useful when the mean of a cluster is defined, and is therefore most often used with numerical data only
• biased towards spherical clusters
• cannot handle missing data

Other clustering approaches

• EM
• k-medoids
• model-based
• grid-based
• density-based
• ...

Evaluating the model


How can we evaluate (predictive and descriptive) models?

Evaluation methods

• holdout method: use training and test sets
• stratified holdout: preserve the class distribution
• repeated holdout
• k-fold cross-validation
• leave-one-out cross-validation
• 0.632 bootstrap

Training and test sets
• split the available data into two sets
• one set is used to build the model
• the other set is used to evaluate the model
• typical split: 2/3 of the data as the training set, the rest as the test set
• does not work well for noisy data and small datasets
• if a validation set is needed as well, the data available for training is reduced even further
• if the test set is not a representative sample of the training set, the accuracy of the model may be underestimated

Cross-Validation

• split the data into k folds
• use k-1 folds for training, 1 fold for testing
• repeat k times so each fold is used for testing once
• repeat the whole process x times and average the results
• a typical value for both x and k is 10
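A sketch of the fold bookkeeping in Python (not from the tutorial; function name and values illustrative):

import numpy as np

def k_fold_indices(n, k, seed=0):
    """Split n sample indices into k folds; yield (train, test) index arrays."""
    idx = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(idx, k):
        yield np.setdiff1d(idx, fold), fold   # k-1 folds train, 1 fold tests

for train, test in k_fold_indices(10, 5):
    print(len(train), len(test))              # 8 2 on every line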


Increasing the accuracy

• boosting
• bagging
• randomization
• ensembles

Bias-Variance Decomposition

• the classification error is the sum of the bias, the variance, and the Bayes error rate:

$e_c = \mathrm{bias} + \mathrm{variance} + e_B$

• bias: measures how close the classifier will be, on average, to the function to be learned
• variance: measures how much the estimates of the classifier vary with changes in the dataset
• Bayes error rate: the minimum error rate, associated with the Bayes-optimal classifier

Increasing accuracy: bagging

• reduces variance
• sample with replacement to create multiple datasets
• train a model on each dataset to produce multiple models
• combine the individual models to produce an overall model
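A Python sketch of the resample-and-vote idea (not from the tutorial; the 1-nearest-neighbour base learner is an arbitrary illustrative choice, and in practice an unstable learner such as an unpruned decision tree benefits more from bagging):

import numpy as np
from collections import Counter

def bagging_predict(X, y, x_new, n_models=25, seed=0):
    """Train one base model per bootstrap sample; combine by majority vote."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        boot = rng.integers(0, len(X), len(X))   # sample with replacement
        Xb, yb = X[boot], y[boot]
        nearest = np.argmin(np.linalg.norm(Xb - x_new, axis=1))  # 1-NN base model
        votes.append(yb[nearest])                # each model casts one vote
    return Counter(votes).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]])
y = np.array(["star", "star", "galaxy", "galaxy"])
print(bagging_predict(X, y, np.array([4.9, 5.1])))   # galaxy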

Increasing accuracy: boosting

• builds multiple models from the dataset
• each datum is associated with a weight
• weights are adjusted over time:
  – decrease the weight for data that are easy to classify
  – increase the weight for data that are hard to classify
  – build another model
• the final model is constructed from all the models, weighted by a score


Increasing accuracy: ensembles

• generalizes the idea of bagging
• build multiple models that can vary in
  – the input data
  – initial parameters: starting points, number of clusters, ...
  – learning algorithms
• can be very powerful if the learning algorithms are weak learners (the result changes substantially with a change in the dataset)

Many more data-mining techniques…

• association rules
• sequence mining
• random forests
• Support Vector Machines
• Naïve Bayes
• genetic algorithms
• ...

Example: Data Mining with Genetic Algorithms: Fitting a Galactic Model to An All-Sky Survey, Larsen and Humphreys, AJ 125:1958-1979, April 2003

• genetic algorithms: survival of the fittest
  – fitness function to evaluate the population
  – change the population over time: random mutations, crossover
  – evaluate the population at each timestep; only the fittest will survive
• derive global parameters for a Galactic model
• magnitude-limited star counts from the APS catalog
• produces model counts for multi-directional data

Step-by-step guide: data preparation

– determine the size of the dataset:
  • number of attributes
  • number of samples per class
– transform attributes if necessary
– normalize/standardize the data
– select attributes
– reduce dimensionality if possible (PCA for sparse data, DWT for data with large numbers of attributes)


Step-by-step guide: evaluate the model

• 10-fold cross-validation
• never evaluate the model on the training data
• be careful when comparing models derived with different techniques

Step-by-step guide: build the model

• descriptive techniques:
  – visualization
  – k-means algorithm
  – EM algorithm
• predictive approaches:
  – decision trees
  – neural networks
• combine both through semi-supervised learning

(Some) Data-mining concerns:

• curse of dimensionality
• local minima
• existing classifications
• distributed nature of the data
• how can we describe the models in general terms?
• can we standardize the process somehow?
• privacy issues
• missing values
• normalization issues
• multiple measurements
• noisy data
• error bars
• cost of the models?
• ...

CRISP-DM

• Cross Industry Standard Process for Data Mining
• http://www.crisp-dm.org/
• describes commonly used approaches, mainly from a business perspective
• non-proprietary, documented, industry- and tool-independent model
• describes best practices and the structure of the data-mining process, similar to our model


Predictive Model Markup Language (PMML)

• XML-based language
• defines and shares statistical and data-mining models among applications (e.g. DB2, SAS, SPSS, ...)

Example (excerpt, truncated):

<?xml version="1.0" ?>
<!DOCTYPE PMML ...>
<PMML version="2.0">
  <Header copyright="Copyright (c) 2001, Oracle Corporation. All rights reserved.">
    <Application name="Oracle 9i Data Mining" version="9.2.0" />
  </Header>
  <DataDictionary numberOfFields="1">
    <DataField name="item" optype="categorical" />
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="PETAL_LENGTH">
      <Discretize field="PETAL_LENGTH">
        <DiscretizeBin binValue="1-1.59">
          <Interval closure="closedOpen" leftMargin="1.0" rightMargin="1.59" />
        </DiscretizeBin>
        ...

Curse of Dimensionality

• the number of samples needed increases with the dimensionality of the data
• data-mining algorithms often scale more than linearly in the number of attributes

Distributed Data Mining

• Meta-learning
• Collective Data Mining Framework
• Data partitions/Ensembles

Weka Machine Learning Workbench

• available (at no cost) at http://www.cs.waikato.ac.nz/ml/weka


Weka interfaces

ARFF format:

@relation 'labor-neg-data'
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
3,4.5,4.5,5,?,40,?,?,?,?,12,'average',?,'half','yes','half','good'
2,2,2.5,?,?,35,?,?,6,'yes',12,'average',?,?,?,?,'good'
3,6.9,4.8,2.3,?,40,?,?,3,?,12,'below_average',?,?,?,?,'good'
2,3,7,?,?,38,?,12,25,'yes',11,'below_average','yes','half','yes',?,'good'

Commercial Data-Mining Software

• Clementine
• Enterprise Miner
• Insightful Miner
• Intelligent Miner
• Microsoft SQL Server 2005
• MineSet
• Oracle Data Mining
• CART
• ...


References

• Introduction to Data Mining, P. Tan, M. Steinbach, and V. Kumar, Addison Wesley, 2006

• Data Mining: Practical Machine Learning Tools and Techniques, I. Witten and E. Frank, Morgan Kaufmann, 2005

• Data Mining: Concepts and Techniques, J. Han and M. Kamber, Morgan Kaufmann, 2006

References

• http://people.trentu.ca/sabinemcconnell/
• www.kdnuggets.com
• http://www.twocrows.com/glossary.htm