data assimilation vs data mining


7/23/2019 Data Assimilation vs Data Mining


Data Mining vs. Data Assimilation

S. Lakshmivarahan

School of Computer Science

University of Oklahoma

Norman, Oklahoma

[email protected]

S. Lakshmivarahan   Data Mining vs. Data Assimilation


Data Mining (DM) - early beginnings

Much of what we know in the physical sciences had its origins in astronomy, in observations of celestial objects

Thanks to the Herculean efforts of:

Copernicus (1473-1543)
Galileo (1564-1642)
Kepler (1571-1630)
Newton (1643-1727)

This is only a small sampling from a long list of pioneers


Discovery of simple laws from observations

Observations collected over decades were meticulously analyzed by hand to formulate new laws of nature

Examples:

Heliocentric system
The three laws of Kepler
Law of gravitation by Newton
Newton's three laws of motion

Within the context of the physical sciences, these are some of the earliest examples of data mining

Note: The chemical, biological and other sciences are replete with similar historical instances that illustrate the use of data mining in each of these disciplines


What is Data Mining?

DM is the process of extracting the structure or patterns that are inherent in the data/observations

These patterns provide clues about the data generating process

The ultimate goal of DM is to understand and quantify the data generating process

Since the motion of celestial objects inherently follows certain laws, the early pioneers, with their hard work and ingenuity, could discover the laws that laid the foundation of the physical sciences and engineering as we know them today


Abundance of data - revival of DM

The volume of data collected doubles every three years, thanks to technology:

Computers
Large scale storage device technology
Communication and sensor technologies

Today, interest in DM spans:

Physical sciences, biological sciences, medical sciences
Space exploration, all branches of engineering
Environmental sciences, ecology
Economics, social sciences, finance, banking and commerce
Sports and recreation
Governments, private companies

More about DM a bit later. Back to early astronomy


Development of Calculus and discovery of dynamic models

Mathematical models were introduced by combining concurrent developments in:

Physical laws - Newton's laws
Calculus, by Newton (1643-1727) and Leibniz (1646-1716), among others

This naturally led to the development of dynamic models to describe the motion of planets around the sun

With the availability of models, the potential for forecast or prediction became very clear


Discovery of Least squares - Beginnings of Data Assimilation (DA)

Gauss (1777-1855), when he was only 24 years old, used the known models of his time to take on the challenging problem of predicting when and where the celestial object called Ceres would reappear in the telescope

The model had unknown parameters that needed to be estimated

By combining the model with observations in the least squares sense, Gauss estimated the unknown parameters and so created the first assimilated model

He then used this assimilated model to accurately predict the time and location of the reappearance of the lost astronomical object


Gauss laid the foundation for DA

This work led to the development of the method of least squares as we know it today

The method of least squares still continues to dominate the theory and practice of estimation of unknown parameters

By this time Gauss had also developed the statistical analysis of the distribution of observational errors, which follows the bell shaped curve that we now call the normal or Gaussian distribution


What is Data Assimilation?

Fusion of model with data

Models are general descriptions of the underlying physical processes in question

A model represents a class, suitably parametrized

Examples:

Static regression models have unknown coefficients
Dynamic models have unknown initial/boundary conditions plus physical parameters such as the Reynolds number, the coefficient of thermal expansion of water, the specific heat of water, etc.

Data/observations reveal all the secrets of, or the truth about, the process that the model tries to capture


DA - fusion of models with data

By combining models and data, that is, by estimating the unknown parameters of the model using the data, we get a specialized instantiation of the model called the assimilated model

This assimilated model is a good tool for creating forecasts or predictions

One of the standard tools for the fusion of model and data is based on the method of least squares

The discipline of DA primarily deals with the development of methods for assimilating models with data
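
A minimal sketch of the idea, not taken from the slides: assume the model class is a straight line x(t) = a + b*t with unknown parameters a and b. Fitting it to a few made-up observations by least squares gives the assimilated model, which is then used for a forecast. The function names and the data values are illustrative assumptions.

```python
def fit_line(times, observations):
    """Least squares estimate of (intercept, slope) for the model x(t) = a + b*t."""
    n = len(times)
    t_mean = sum(times) / n
    z_mean = sum(observations) / n
    # Centered sums of squares and cross products
    s_tt = sum((t - t_mean) ** 2 for t in times)
    s_tz = sum((t - t_mean) * (z - z_mean) for t, z in zip(times, observations))
    slope = s_tz / s_tt
    intercept = z_mean - slope * t_mean
    return intercept, slope


def forecast(params, t):
    """Direct problem: run the assimilated model forward to time t."""
    intercept, slope = params
    return intercept + slope * t


# Hypothetical observations, roughly following x(t) = 1 + 2t with noise
times = [0.0, 1.0, 2.0, 3.0]
observations = [1.1, 2.9, 5.2, 6.8]

assimilated = fit_line(times, observations)   # model + data -> assimilated model
print("estimated (intercept, slope):", assimilated)
print("forecast at t = 4:", forecast(assimilated, 4.0))
```

The same pattern (choose a model class, estimate its parameters from data, then predict) carries over to dynamic models, where the unknowns are typically initial conditions and physical parameters.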


Goal of DA - generate good forecast/prediction

Predict the path of a hurricane or tornado, using one of several models plus data collected by satellites, radars, and special planes that fly into hurricanes twice a day

From the crime scene data, reconstruct the case - CSI: Miami

NTSB estimates the causes of failure using the data from the debris

Predict the potential tax revenues so that a government can develop its budget for the next year

Medical diagnosis - from symptoms to the cure


Direct vs. Inverse problems - A classification

To further explore the relation between DM and DA, we introduce a useful classification

Scientific and engineering problems can be classified into one of two types

Direct problems - Examples:

Given a polynomial p(x), evaluate it at x = 1.0
Given a differential equation and the initial condition, find the solution
Given a matrix A and a vector x, compute the vector b = Ax

Inverse problems - Examples:

Given a polynomial p(x), solve for the roots of p(x) = 0
Given a differential equation and a particular solution, find the initial condition that corresponds to the solution
Given a matrix A and a vector b, find the solution x such that Ax = b

It turns out that DM and DA naturally correspond to two types of inverse problems, and prediction is a direct problem
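
The polynomial pair above can be sketched in a few lines. This is an illustration under assumptions: the polynomial p(x) = x^2 - 2, the bracketing interval, and the choice of bisection for the inverse problem are not from the slides. The direct problem is a single pass over the coefficients, while the inverse problem needs an iterative search built on repeated direct evaluations.

```python
def evaluate(coeffs, x):
    """Direct problem: evaluate p(x) by Horner's rule (coefficients highest degree first)."""
    value = 0.0
    for c in coeffs:
        value = value * x + c
    return value


def find_root(coeffs, lo, hi, tol=1e-10):
    """Inverse problem: bisection for a root of p in [lo, hi], assuming a sign change."""
    f_lo = evaluate(coeffs, lo)
    if f_lo * evaluate(coeffs, hi) > 0:
        raise ValueError("p(lo) and p(hi) must have opposite signs")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = evaluate(coeffs, mid)
        if f_lo * f_mid <= 0:
            hi = mid                  # root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid     # root lies in [mid, hi]
    return 0.5 * (lo + hi)


p = [1.0, 0.0, -2.0]                  # hypothetical example: p(x) = x^2 - 2
print("p(1.0) =", evaluate(p, 1.0))
print("root of p in [0, 2]:", find_root(p, 0.0, 2.0))
```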


First level of inverse problems - The Core of Data Mining

At the highest level, Data Mining relates to solving the important class of inverse problems leading to the discovery of basic laws/models that are implied by the data

Examples of discovery of laws/models from data include:

Basic laws in early astronomy - Kepler, Newton
Atom models in the early 1900s
Higgs boson, the so-called God particle, in 2012
Theory of evolution by C. Darwin
Building models to identify credit card fraud
Based on the observed structure of the autocorrelation of a time series, deciding on the class and type of model that might capture the observed autocorrelation

Data Mining has been, and still continues to be, the basis for the advancement of knowledge in all of the sciences and engineering


Second level of inverse problems - The Core of Data Assimilation

Assume now that the newly discovered mathematical laws are expressed in the form of a class of models

The problem then becomes one of data assimilation, which relates to solving a second level of inverse problem that deals with the estimation of the unknown parameters of the model using the same or similar data

Examples:

Determine the weights for the links connecting the neurons in an Artificial Neural Network - minimize classification error
Estimate the sea surface temperature using satellite observations - based on Planck/Stefan's law of radiation
Estimate the amount of rain in a cloud system using radar observations - based on an empirical law
Estimate the structure of the earth - based on the anomaly of the local gravitational field - basis for geophysical exploration


Third level involves Prediction - a direct problem

Once an assimilated model is available, interest shifts to the direct problem of generating short term predictions

Examples:

Predict a lunar/solar eclipse
Predict the total revenue of a state treasury
Predict how much snow will fall in Boston due to a coastal low pressure system
Predict the amount of greenhouse gases in the atmosphere by 2025


Is it DM or DA?

First phase: At its core, DM relates to the discovery of basic knowledge - remember Kepler and Newton

This knowledge is often expressed as a law, which is encapsulated in a (mathematical) model

Emphasis then shifts to testing the goodness of the model

Second phase: At its core, DA deals with the problem of estimating the unknowns by fitting the model to data - remember Gauss

Third phase: Using the assimilated model, generate forecast products for public consumption

DM and DA are the two parts of a continuum


A classification of models

Models: Based on causality (motion of a hurricane) vs. correlation (ARMA model in time series)

Models: Explicit (ARMA model) vs. implicit (neural networks)

Models: Static (regression) vs. dynamic (ODE/PDE)

Models: Deterministic (motion of a planet) vs. stochastic (evolution of stock prices)

Models: Linear vs. nonlinear

Model time: Discrete (unemployment) vs. continuous (temperature)

Model space: Discrete (Markov chain) vs. continuous (rainfall)


Forms of Data

Data arise in various forms:

Time series data - annual rainfall, total monthly sales

Data matrix m × n - n objects (columns) and m attributes (rows)

Cross sectional data - tabular forms

Practical problems: missing data, outliers, data quality control

Note: In science and engineering, data are often of the quantitative type (permitting full blown arithmetic operations). In economics, the social sciences, etc., data could be a mixture of both quantitative and qualitative types. Algorithms for mining/assimilating qualitative data differ from those for quantitative data sets


Estimation - Over vs. under determined problems

Two scenarios arise depending on the cost of collecting observations

Over determined (OD) case - the number of observations is much larger than the number of unknowns to be estimated. Once deployed, satellites and radars deliver large amounts of data for quite a long time

Under determined (UD) case - fewer observations than unknowns. Exploration for minerals, natural gas, oil, etc.

In the OD case there is in general no exact solution, and in the UD case there are infinitely many solutions

These cases are the motivation for the definition of a solution in the least squares sense
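
The two cases can be illustrated with the smallest possible examples. This is a sketch under assumptions, not the slides' own formulation: the OD case has many equations a[i]*x = b[i] in a single unknown x, and the UD case has a single equation dot(a, x) = b in several unknowns. The function names and numbers are hypothetical. In the OD case the least squares solution minimizes the sum of squared residuals; in the UD case the usual convention is to pick, among the infinitely many solutions, the one of smallest norm.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def overdetermined_ls(a, b):
    """m equations a[i] * x = b[i], one unknown: x minimizing the sum of squared residuals.

    This is the normal equation (a^T a) x = a^T b specialized to a single column.
    """
    return dot(a, b) / dot(a, a)


def underdetermined_min_norm(a, b):
    """One equation dot(a, x) = b, n unknowns: the solution with the smallest Euclidean norm.

    This is x = a^T (a a^T)^{-1} b specialized to a single row.
    """
    scale = b / dot(a, a)
    return [scale * ai for ai in a]


# OD: three inconsistent measurements of the same unknown (x = 1, x = 2, x = 4)
x_od = overdetermined_ls([1.0, 1.0, 1.0], [1.0, 2.0, 4.0])
print("OD least squares estimate:", x_od)        # the sample mean

# UD: one equation x1 + 2*x2 = 5 in two unknowns
x_ud = underdetermined_min_norm([1.0, 2.0], 5.0)
print("UD minimum norm solution:", x_ud)
```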


Framework for DA

The estimation problem is recast as a constrained minimization problem

Constraints arise naturally:

Positivity of certain physical parameters - inequality constraint

The model itself acts as a constraint - equality constraint

Strict enforcement of constraints - strong constraint formulation - Lagrangian multiplier technique

Weak enforcement of constraints - weak constraint formulation - penalty function technique
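
To make the contrast concrete, here is a toy problem chosen for illustration (it is an assumption, not an example from the slides): minimize f(x, y) = (x - 2)^2 + (y - 2)^2 subject to x + y = 2. The strong formulation enforces the constraint exactly through a Lagrange multiplier; the weak formulation adds the penalty mu*(x + y - 2)^2 and satisfies the constraint only approximately, more closely as mu grows. Both solutions are written in closed form, derived in the comments.

```python
def strong_constraint_solution():
    """Lagrangian L = f + lam*(x + y - 2).

    Stationarity: 2(x - 2) + lam = 0 and 2(y - 2) + lam = 0, so x = y = 2 - lam/2.
    Substituting into x + y = 2 gives 4 - lam = 2.
    """
    lam = 4.0 - 2.0
    x = y = 2.0 - lam / 2.0
    return x, y, lam


def weak_constraint_solution(mu):
    """Minimizer of f + mu*(x + y - 2)^2.

    Setting both partial derivatives to zero gives x = y = t with
    (t - 2) + mu*(2t - 2) = 0, hence t = (2 + 2*mu) / (1 + 2*mu).
    """
    t = (2.0 + 2.0 * mu) / (1.0 + 2.0 * mu)
    return t, t


print("strong:", strong_constraint_solution())
for mu in (0.0, 1.0, 10.0, 1000.0):
    x, y = weak_constraint_solution(mu)
    print("weak, mu =", mu, "->", (x, y), "constraint violation:", x + y - 2.0)
```

With mu = 0 the constraint is ignored and the minimizer is (2, 2); as mu increases the weak solution approaches the strong solution (1, 1).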


Well posed vs. ill posed problems

In a well posed problem, a solution exists and is unique

In an ill-posed problem, a solution may not exist, or there may be infinitely many solutions

Many inverse problems are ill-posed

These are solved by using some form of regularization technique, such as Tikhonov regularization

Using regularization, we solve the nearest well-posed version of a given ill-posed problem

Example: Solve (A + αI) x(α) = b instead of Ax = b, for some small positive α for which (A + αI) is positive definite
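
A small sketch of this example, with a matrix chosen for illustration (an assumption, not from the slides): A = [[1, 1], [1, 1]] is singular, so Ax = b with b = (2, 2) has infinitely many solutions. Adding αI makes the shifted matrix positive definite for any α > 0, and the regularized solution is unique. The 2×2 solver uses Cramer's rule to keep the code self-contained.

```python
def solve_2x2(m, rhs):
    """Solve a 2x2 linear system by Cramer's rule; raises if the matrix is singular."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]


def tikhonov_solve(m, rhs, alpha):
    """Solve (A + alpha*I) x = b, the regularized version of A x = b."""
    shifted = [[m[0][0] + alpha, m[0][1]],
               [m[1][0], m[1][1] + alpha]]
    return solve_2x2(shifted, rhs)


A = [[1.0, 1.0], [1.0, 1.0]]   # singular: A x = b is ill-posed
b = [2.0, 2.0]

for alpha in (1.0, 0.1, 0.001):
    print("alpha =", alpha, "->", tikhonov_solve(A, b, alpha))
```

For this particular A and b, each component of x(α) equals 2/(2 + α), so as α shrinks the regularized solution approaches (1, 1), which is the minimum norm solution of the original system.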


Methods for estimation

Parametric vs. non-parametric methods

Least squares - two versions:

Unweighted least squares - orthogonal projection
Weighted least squares - oblique projections

Generalized method of moments

Maximum likelihood methods

Bayesian methods, where we combine a known prior with conditional distributions
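
The difference between the two least squares versions shows up already when estimating a single unknown from two observations of unequal quality. The sketch below assumes the common choice of weighting each observation by the inverse of its error variance; the slides do not specify the weights, and the numbers are made up.

```python
def unweighted_estimate(observations):
    """Unweighted least squares estimate of a constant: the plain average."""
    return sum(observations) / len(observations)


def weighted_estimate(observations, variances):
    """Weighted least squares estimate of a constant.

    Minimizes sum((z - x)^2 / var); the weights 1/var are an assumed, common choice.
    """
    weights = [1.0 / v for v in variances]
    return sum(w * z for w, z in zip(weights, observations)) / sum(weights)


# Hypothetical: an accurate instrument reads 10, a noisier one reads 14
observations = [10.0, 14.0]
variances = [1.0, 4.0]

print("unweighted:", unweighted_estimate(observations))
print("weighted:  ", weighted_estimate(observations, variances))
```

The weighted estimate is pulled toward the more accurate observation, while the unweighted estimate treats both readings alike.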


Optimization problems

Unimodal vs. multi modal problems

Continuous vs. discrete optimization problems

Continuous, unimodal problems are solved using:

Gradient method

Conjugate gradient method

Quasi-Newton method

Continuous multimodal and discrete optimization problems are solved using randomized techniques:

Simulated annealing

Genetic algorithms
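
As an illustration of the simplest method on this list, here is a minimal fixed-step gradient method applied to a continuous, unimodal cost function. The cost f(x, y) = (x - 1)^2 + 2(y + 3)^2, the step size, and the iteration count are assumptions chosen so that the iteration converges; they are not from the slides.

```python
def gradient_descent(grad, start, step=0.1, iterations=200):
    """Fixed-step gradient method: repeatedly move against the gradient."""
    point = list(start)
    for _ in range(iterations):
        g = grad(point)
        point = [p - step * gi for p, gi in zip(point, g)]
    return point


def grad_f(p):
    """Gradient of the hypothetical cost f(x, y) = (x - 1)^2 + 2*(y + 3)^2."""
    x, y = p
    return [2.0 * (x - 1.0), 4.0 * (y + 3.0)]


minimum = gradient_descent(grad_f, [0.0, 0.0])
print("approximate minimizer:", minimum)   # the exact minimizer is (1, -3)
```

On a multimodal cost function the same iteration would stop at whichever local minimum is nearest to the starting point, which is why randomized techniques are used in that setting.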

Methods for DM/DA - I

Time series analysis is used in:

Signal processing in EE
Medicine
Econometrics, finance

The goal is to build stochastic dynamic models in discrete time by exploiting the underlying correlation and seasonality properties of the data set

In finance, model both level and volatility

Autoregressive integrated moving average (ARIMA) models

This is one of the well developed areas in empirical modeling
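
A minimal sketch of this kind of empirical modeling, restricted to the simplest member of the ARIMA family: a first order autoregressive model x[t] = phi*x[t-1] + noise. The simulated data, the seed, and the function names are assumptions for illustration. The coefficient is estimated by least squares from the lag-one correlation in the series and then used for a one-step forecast.

```python
import random


def simulate_ar1(phi, n, seed=0):
    """Generate n samples of x[t] = phi * x[t-1] + e[t], with e[t] standard normal."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x


def fit_ar1(x):
    """Least squares estimate of phi: regress x[t] on x[t-1] without an intercept."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den


series = simulate_ar1(0.6, 2000)          # hypothetical data with true phi = 0.6
phi_hat = fit_ar1(series)
print("estimated phi:", phi_hat)
print("one-step forecast:", phi_hat * series[-1])
```

Full ARIMA modeling adds differencing and moving average terms, and dedicated statistical packages are normally used for it; the sketch only shows the fit-then-forecast pattern.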

Methods for DM/DA - II

Multivariate regression analysis (1800s) - Statistics

Data reduction using PCA (1940), ICA (1990) - Statistics

Classification using:

Clustering (1950s)
Neural networks (1950s)
Pattern recognition (1950s)
Support Vector Machines (SVM) (1980s)

Association rules

Image processing, voice recognition

Decision trees (1960)

Probabilistic reasoning in networks (1990s) - J. Pearl, Turing Award in 2012

Random field - Spatial data analysis

Commonality of approaches in DM, DA, AI, Machine Learning

Supervised learning

Learning with a teacher - Learning in Neural Networks

Learning with a probabilistic teacher - using imprecise knowledge

Unsupervised learning / learning without a teacher - clustering, adaptive control

Summary

At the first level, Data Mining seeks to uncover the basic laws that are hidden in the data. These laws are represented by models of some kind with unknown parameters

At the second level, Data Assimilation deals with the task of fusing data with models to produce an assimilated model by estimating the unknown parameters

At the third level, the assimilated model is used to produce various forecast products for public consumption

DM, DA and forecasting are the three parts of a continuum in knowledge discovery

References

J. M. Lewis, S. Lakshmivarahan and S. K. Dhall (2006) Dynamic Data Assimilation: A Least Squares Approach, Encyclopedia of Mathematics and its Applications, Volume 104, Cambridge University Press, 654 pages

J. D. Hamilton (1994) Time Series Analysis, Princeton University Press, 799 pages

P.-N. Tan, M. Steinbach and V. Kumar (2006) Introduction to Data Mining, Addison Wesley
