Data Mining and the OptIPuter Padhraic Smyth University of California, Irvine

Post on 16-Dec-2015

Data Mining and the OptIPuter

Padhraic Smyth
University of California, Irvine

Data Mining of Spatio-Temporal Scientific Data

– Modern scientific data analysis
  • increasingly data-driven
  • data often consist of massive spatio-temporal streams

– Research focus
  • characterizing spatio-temporal structure in data
  • statistical models for object shapes, trajectories, patterns...
  • data mining from scientific data streams (NSF, OptIPuter)
  • recognition of waveforms in time-series archives (JPL, NASA)
  • inference of dynamic gene-regulation networks from data (NIH)
  • Markov models for spatio-temporal weather patterns (DOE)
  • clustering and modeling of storm trajectories (LLNL)

Image-voxel Data (“slices” of olfactory bulb in rats)

Automatic segmentation of cellular structures of interest (glomerular layer)

Thematic maps / Data mining / Scientific discovery

Image-voxel Data (remote sensing AVIRIS spectral data)

Focus of attention on wavelengths of interest

Thematic maps / Data mining / Scientific discovery

What’s wrong with this information flow?

• “One-way”
  – Flow of information is from data to scientist

• Real scientific investigation is “two-way”
  – Scientist interacts with, explores, and queries the data
  – Most current data mining/analysis tools are relatively poor at handling interaction
    • Algorithms are “black-box” and do not allow scientists to be “in the loop”
    • Algorithms have no representation of the scientist’s prior knowledge or goals (no user models)

• OptIPuter project
  – “next generation” data mining tools for effective exploration of massive 2d/3d data sets

OptIPuter focus in Data Mining

• Data
  – 2d (or multi-d) spatio-temporal image/voxel data

• Goals
  – Allow scientists to explore these massive data sets efficiently and flexibly, leveraging the OptIPuter architecture
  – Produce interactive software tools for exploring massive data:
    • automated segmentation, thematic maps, focus of interest

• Technical Challenges
  – Scaling statistical algorithms to massive data streams
  – Providing mechanisms for effective scientific interaction
  – Developing algorithms for automated “focus-of-attention”
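One small ingredient of scaling statistical algorithms to streams is computing summary statistics in a single pass, without storing the stream. The sketch below is an illustration only: the slides do not describe a specific algorithm, so the class name `RunningStats`, the choice of Welford's update, and the synthetic frames are all assumptions.

```python
import numpy as np


class RunningStats:
    """Per-pixel running mean and variance over a stream of frames (Welford's method)."""

    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)  # running sum of squared deviations from the mean

    def update(self, frame):
        """Fold one new frame into the statistics; the frame can then be discarded."""
        frame = np.asarray(frame, dtype=float)
        self.n += 1
        delta = frame - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (frame - self.mean)

    def variance(self):
        """Sample variance per pixel; NaN until at least two frames have been seen."""
        if self.n < 2:
            return np.full(self.mean.shape, np.nan)
        return self.m2 / (self.n - 1)


# Hypothetical stream: 500 small synthetic frames (not real data).
rng = np.random.default_rng(0)
frames = rng.normal(loc=1000.0, scale=5.0, size=(500, 4, 6))
stats = RunningStats(frames.shape[1:])
for frame in frames:
    stats.update(frame)
```

Memory use depends only on the frame size, not on the length of the stream, which is the property that matters when the data cannot be held in memory.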

Analysis of Extra-Tropical Cyclones

• Extra-tropical cyclone = mid-latitude storm

• Practical Importance
  – Highly damaging weather over Europe
  – Important water source in the United States

• Scientific Importance
  – Influence of climate on cyclone frequency, strength, etc.
  – Impact of cyclones on local weather patterns

[with Scott Gaffney (UCI), Andy Robertson (IRI/Columbia), Michael Ghil (UCLA)]

Sea-Level Pressure Data

– Mean sea-level pressure (SLP) on a 2.5° by 2.5° grid
– Four times a day, every 6 hours, over 20 years

Blue indicates low pressure
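To give a rough sense of scale for this data set, the arithmetic can be sketched as follows. The slide gives only the grid spacing, the sampling rate, and the record length; global coverage with both poles included, year-round sampling, and 4-byte floats are assumptions made here for illustration.

```python
# Assumptions (not stated on the slide): global coverage, poles included,
# year-round sampling, values stored as 4-byte floats.
GRID_STEP_DEG = 2.5
SAMPLES_PER_DAY = 4
YEARS = 20

n_lat = int(180 / GRID_STEP_DEG) + 1    # latitudes -90 .. 90 inclusive
n_lon = int(360 / GRID_STEP_DEG)        # longitudes 0 .. 357.5
n_times = int(YEARS * 365.25 * SAMPLES_PER_DAY)
n_values = n_lat * n_lon * n_times
size_gb = n_values * 4 / 1e9

print(f"{n_lat} x {n_lon} grid points, {n_times} time steps")
print(f"{n_values:,} pressure values, about {size_gb:.2f} GB as float32")
```

A regional or winter-only subset would be proportionally smaller.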

Winter Cyclone Trajectories

Clustering Methodology

• Mixtures of curves
  – model as mixtures of noisy linear/quadratic curves
    • note: true paths are not linear
    • use the model as a first-order approximation for clustering

• Advantages
  – allows for variable-length trajectories
  – allows coupling of other “features” (e.g., intensity)
  – provides a quantitative (e.g., predictive) model
  – [contrast with k-means, for example]
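The mixture-of-curves idea can be sketched with a small EM routine. This is a minimal illustration, not the authors' implementation: it assumes each cluster is a polynomial in time with isotropic Gaussian noise and one variance per cluster, and the names `design` and `fit_curve_mixture`, the random initialization, and the synthetic trajectories are all hypothetical.

```python
import numpy as np


def design(t, order):
    """Polynomial design matrix [1, t, t^2, ...] for time points t."""
    return np.vander(np.asarray(t, dtype=float), order + 1, increasing=True)


def fit_curve_mixture(trajs, n_clusters, order=2, n_iter=50, seed=0):
    """EM for a mixture of polynomial regression curves.

    trajs: list of (t, Y) pairs; t has shape (n_i,), Y has shape (n_i, d).
    Each whole trajectory is softly assigned to one cluster, so
    trajectories may have different lengths.
    """
    rng = np.random.default_rng(seed)
    X = [design(t, order) for t, _ in trajs]
    Y = [np.asarray(y, dtype=float) for _, y in trajs]
    d = Y[0].shape[1]
    n_pts = np.array([len(y) for y in Y])
    Xs, Ys = np.vstack(X), np.vstack(Y)
    # Random soft initial memberships, one row per trajectory.
    resp = rng.dirichlet(np.ones(n_clusters), size=len(trajs))

    for _ in range(n_iter):
        # M-step: weighted least squares and noise variance per cluster.
        betas, sigma2 = [], []
        for k in range(n_clusters):
            w = np.repeat(resp[:, k], n_pts)      # per-point weights
            sw = np.sqrt(w)[:, None]
            beta, *_ = np.linalg.lstsq(Xs * sw, Ys * sw, rcond=None)
            resid2 = ((Ys - Xs @ beta) ** 2).sum(axis=1)
            sigma2.append((w @ resid2) / (d * w.sum() + 1e-12) + 1e-9)
            betas.append(beta)
        pi = resp.mean(axis=0)

        # E-step: log-likelihood of each whole trajectory under each cluster.
        logp = np.empty((len(trajs), n_clusters))
        for k in range(n_clusters):
            for i in range(len(trajs)):
                r2 = ((Y[i] - X[i] @ betas[k]) ** 2).sum()
                logp[i, k] = (np.log(pi[k] + 1e-300)
                              - 0.5 * n_pts[i] * d * np.log(2 * np.pi * sigma2[k])
                              - 0.5 * r2 / sigma2[k])
        logp -= logp.max(axis=1, keepdims=True)   # stabilize before exponentiating
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)

    return np.array(betas), np.array(sigma2), pi, resp


# Synthetic demo (not real cyclone data): two groups of noisy linear tracks.
rng = np.random.default_rng(1)
trajs = []
for slope in (0.5, -0.5):
    for _ in range(20):
        n = rng.integers(5, 13)                   # variable trajectory length
        t = np.arange(n, dtype=float)
        Y = np.column_stack([t, slope * t]) + rng.normal(scale=0.1, size=(n, 2))
        trajs.append((t, Y))

betas, sigma2, pi, resp = fit_curve_mixture(trajs, n_clusters=2, order=1, n_iter=100)
labels = resp.argmax(axis=1)
```

Unlike k-means on fixed-length vectors, the likelihood here is summed over however many points a trajectory has, and the fitted curves give a predictive model for each cluster. Additional features such as intensity could be appended as extra columns of `Y`.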

Clusters of Trajectories

Applications

• Visualization and Exploration
  – improved understanding of cyclone dynamics

• Change Detection
  – can quantitatively compare cyclone statistics over different eras or from different models

• Linking cyclones with climate and weather
  – correlation of clusters with NAO index
  – correlation with wind speeds in Northern Europe
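One simple way to relate clusters to a climate index is to count the trajectories assigned to each cluster in each winter and correlate those counts with the index. The sketch below assumes that approach; the slides do not say how the correlation was computed, the function names are hypothetical, and the toy counts and index values are made up for illustration (they are not real NAO values).

```python
import numpy as np


def cluster_counts_per_season(seasons, labels, n_clusters):
    """Count trajectories per (season, cluster); inputs hold one entry per trajectory."""
    seasons = np.asarray(seasons)
    labels = np.asarray(labels)
    uniq = np.unique(seasons)
    counts = np.zeros((len(uniq), n_clusters), dtype=int)
    for row, s in enumerate(uniq):
        counts[row] = np.bincount(labels[seasons == s], minlength=n_clusters)
    return uniq, counts


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])


# Toy input: winter -> (count in cluster 0, count in cluster 1). Made-up numbers.
toy_counts = {1990: (1, 3), 1991: (2, 2), 1992: (3, 2), 1993: (4, 1)}
seasons, labels = [], []
for year, (c0, c1) in toy_counts.items():
    seasons += [year] * (c0 + c1)
    labels += [0] * c0 + [1] * c1

index = np.array([-1.0, -0.2, 0.4, 1.1])   # hypothetical index value per winter

years, counts = cluster_counts_per_season(seasons, labels, n_clusters=2)
r0 = pearson(counts[:, 0], index)
r1 = pearson(counts[:, 1], index)
print(f"cluster 0: r = {r0:+.2f}, cluster 1: r = {r1:+.2f}")
```

With only a couple of decades of winters, any such correlation rests on few samples, so it would normally be reported alongside a significance test.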