Scientific Applications of Data Mining

Bioinformatics Seminar

August 28, 2002

Gary Lindstrom

School of Computing

University of Utah

Outline

What is data mining?
Where has it been successfully applied?
How can it be applied to scientific applications?
Research Opportunities

What Is Data Mining?

One definition (Robert Grossman)
• Data mining is the semi-automatic discovery of patterns, associations, anomalies, structures, and changes in large data sets

Data Mining

Characteristics
• Large data, vs. small data
• Discovery, not validation
• Data driven, not hypothesis driven
• Automated, not manual application
Supported by
• Statistics, machine learning, databases, high performance computing

The Data Gap

Exponential growth of data
• More automation, greater throughput, more models, e.g. simulated
But: linear increase in number of researchers
• Sift the sand, rather than searching a sensor

Classical Data Mining Applications

Retail
• Market basket analysis
Political science
• Targeting campaign resources
Financial
• Exploiting market trends & imbalances

Decision Support Systems

Generic term for analytic and historic uses of DBs
• Contrast with: operational uses, commonly known as On-Line Transaction Processing (OLTP)
Data warehouses
• Data culled from operational DBs, with history and derived summary data

Data Warehouses vs. Databases

• Replicate data from distributed sources
• Do not require strict currency of data
• Oriented toward complex, often statistical queries
• Often based on materialized views of operational data
  • Views which have been expanded into real tables

Tools for DSS

Ad hoc SQL-style queries
• Optimized for large, complex data
On-Line Analytic Processing (OLAP)
• Queries optimized for aggregation operations
• Data is viewed as multidimensional array
• Influenced by end-user tools such as spreadsheets
Data mining
• Exploratory data analysis
• Looking for interesting unanticipated patterns in the data
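
The OLAP view of data as a multidimensional array, queried by aggregation, can be illustrated with a small sketch. The sales figures, the dimension names, and the `roll_up` helper are all made up for illustration; they are not from the talk.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, store, month) -> units sold.
sales = {
    ("pen", "Salt Lake", "Jan"): 10,
    ("pen", "Salt Lake", "Feb"): 12,
    ("pen", "Provo", "Jan"): 7,
    ("ink", "Salt Lake", "Jan"): 4,
    ("ink", "Provo", "Feb"): 5,
}

def roll_up(cube, keep):
    """Sum the cube over every dimension whose position is not in `keep`."""
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in keep)] += value
    return dict(totals)

by_product = roll_up(sales, (0,))           # aggregate away store and month
by_product_store = roll_up(sales, (0, 1))   # aggregate away month only
```

Each call collapses one or more dimensions, which is the kind of aggregation query an OLAP tool is tuned for.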

Data Warehousing

[Figure: external data sources pass through EXTRACT, TRANSFORM, LOAD, and REFRESH into the data warehouse, which has a metadata repository and serves OLAP, data mining, and visualization.]

Creating And Maintaining A Warehouse

Challenges
• Schema design for integrated information
• Operations
  • Cleaning (curation): filling gaps, correcting errors
  • Transforming: making consistent with new schema
  • Loading: also sorting and summarizing
  • Refreshing: incorporating updates to operational data
  • Purging: aging out old data
Role of metadata
• Sources of data, schema conversion information, refresh history, etc.

OLAP Naturally Leads to Data Mining

Data mining seeks interesting trends or patterns in large datasets
• An example of exploratory data analysis
• Related to knowledge discovery and machine learning
Mining for rules
• Association rules: motivated by retail market basket analysis

Market Basket Analysis

Market basket
• A collection of items purchased by a customer in one transaction
• Retailers want to learn of items often purchased together
  • For promotional and display grouping purposes
• Simple tabular representation
  • Purchases(transid, custid, date, item, price, quantity)
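
A minimal sketch of how rows of the Purchases table can be grouped into market baskets. The rows below are invented sample data, and `to_baskets` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical rows of Purchases(transid, custid, date, item, price, quantity).
purchases = [
    (111, 201, "2002-05-01", "pen", 35.0, 2),
    (111, 201, "2002-05-01", "ink", 2.0, 1),
    (111, 201, "2002-05-01", "milk", 1.0, 3),
    (112, 105, "2002-06-03", "pen", 30.0, 1),
    (112, 105, "2002-06-03", "ink", 2.0, 1),
    (113, 106, "2002-05-10", "pen", 30.0, 1),
    (113, 106, "2002-05-10", "milk", 1.0, 1),
    (114, 201, "2002-06-01", "pencil", 1.0, 4),
]

def to_baskets(rows):
    """Group rows by transid; a basket is the set of items in one transaction."""
    baskets = defaultdict(set)
    for transid, _custid, _date, item, _price, _quantity in rows:
        baskets[transid].add(item)
    return dict(baskets)

baskets = to_baskets(purchases)
```

Price and quantity are ignored here because association rules only ask which items occur together.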

Association Rules

Seek rules of the form: { pen } => { ink }
• Meaning: If a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction

Important Measures for Association Rules

Support
• % of transactions containing all items mentioned in rule
• Low support reduces interest in the rule
Confidence
• % of transactions containing the LHS that also contain RHS
• Indicates degree of correlation
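
The two measures can be written down directly from their definitions. This is a sketch over four invented baskets; the function names are illustrative, and values are returned as fractions rather than percentages.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Among transactions containing the LHS, the fraction also containing the RHS."""
    lhs, rhs = set(lhs), set(rhs)
    lhs_count = sum(1 for t in transactions if lhs <= t)
    both_count = sum(1 for t in transactions if (lhs | rhs) <= t)
    return both_count / lhs_count if lhs_count else 0.0

# Hypothetical baskets, one set of items per transaction.
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pencil"},
]

rule_support = support(transactions, {"pen", "ink"})          # 2 of 4 baskets
rule_confidence = confidence(transactions, {"pen"}, {"ink"})  # 2 of 3 pen baskets
```

For { pen } => { ink } on this data, support is 2/4 and confidence is 2/3.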

Using Association Rules For Prediction

Always somewhat risky
• Because ultimate goal is understanding causality
• Which is not directly reflected in transaction data

There Can Be High Support and Confidence

… but no causality
Example: pencils and pens are often bought together
• And pens and ink are often bought together
• Hence pencils and ink are often bought together
But there is no causal link between pencils and ink
• Hence sale promotions on pencils and ink probably won’t be effective

Finding Association Rules

Seek rules with:
• Support greater than minsup
• Confidence greater than minconf
Steps
• Find frequent item sets
  • Sets of items with support >= minsup
• Break each frequent item set into LHS and RHS of candidate rules
  • Keep those with confidence >= minconf
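
The second step, breaking one frequent item set into candidate rules, might look like the sketch below. It assumes the item set has already been found frequent; the baskets, the `minconf` value, and the helper names are illustrative.

```python
from itertools import combinations

def count(transactions, items):
    """Number of transactions containing every item in `items`."""
    return sum(1 for t in transactions if items <= t)

def rules_from_itemset(transactions, itemset, minconf):
    """Split a frequent item set into LHS => RHS and keep the confident rules."""
    itemset = frozenset(itemset)
    whole = count(transactions, itemset)
    rules = []
    for size in range(1, len(itemset)):        # every non-empty proper subset as LHS
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            lhs_count = count(transactions, lhs)
            conf = whole / lhs_count if lhs_count else 0.0
            if conf >= minconf:
                rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

# Hypothetical baskets.
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pencil"},
]

rules = rules_from_itemset(transactions, {"pen", "ink"}, minconf=0.7)
```

On this data only { ink } => { pen } survives: every ink basket contains a pen, but only two of three pen baskets contain ink.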

Testing Candidate Rules

Confidence calculation for each candidate rule
• Maintain two counters: lhscount, rhscount
• Scan entire customer transaction table
• Count in lhscount occurrences of all items in LHS
• If LHS is present, tally in rhscount if all items in RHS are present
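
The counter scheme above, sketched for several candidate rules in a single pass over the table. Confidence is then rhscount / lhscount. Testing all candidates in one scan is an assumption made here for illustration, as are the baskets and the name `check_candidates`.

```python
def check_candidates(transactions, candidates, minconf):
    """One scan of the transactions, with [lhscount, rhscount] per candidate rule."""
    counters = [[0, 0] for _ in candidates]
    for basket in transactions:
        for i, (lhs, rhs) in enumerate(candidates):
            if lhs <= basket:              # all LHS items present
                counters[i][0] += 1
                if rhs <= basket:          # and all RHS items present too
                    counters[i][1] += 1
    kept = []
    for (lhs, rhs), (lhscount, rhscount) in zip(candidates, counters):
        if lhscount and rhscount / lhscount >= minconf:
            kept.append((lhs, rhs, rhscount / lhscount))
    return kept

# Hypothetical baskets and candidate rules (LHS, RHS).
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pencil"},
]
candidates = [({"pen"}, {"ink"}), ({"ink"}, {"pen"})]

kept = check_candidates(transactions, candidates, minconf=0.7)
```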

Identifying Frequent Item Sets

The a priori property:
• Every subset of a frequent item set is also a frequent item set
This leads to an iterative algorithm
• Identify frequent item sets of one item
• Iteratively, seek to extend frequent item sets by adding an item

Finding Frequent Itemsets

foreach item,
    check if it is a frequent itemset
repeat
    foreach new frequent itemset Ik with k items
        generate all itemsets Ik+1 with k+1 items, Ik ⊂ Ik+1
    scan all transactions once and check if the generated (k+1)-itemsets are frequent
until no new frequent itemsets are found
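
A runnable sketch of the level-wise loop above, under simplifying assumptions: the baskets are invented, `minsup` is a fraction, and each candidate is counted with its own pass over the data for clarity, where the pseudocode counts all candidates of one size in a single scan.

```python
def frequent_itemsets(transactions, minsup):
    """Level-wise search: extend frequent k-itemsets by one item, then recheck."""
    n = len(transactions)

    def is_frequent(itemset):
        return sum(1 for t in transactions if itemset <= t) / n >= minsup

    items = sorted({item for t in transactions for item in t})
    level = {frozenset([i]) for i in items if is_frequent(frozenset([i]))}
    # By the a priori property, only frequent single items can extend an itemset.
    frequent_items = sorted(i for s in level for i in s)
    found = set(level)
    while level:
        # Candidates with k+1 items that contain a frequent k-itemset.
        candidates = {s | {i} for s in level for i in frequent_items if i not in s}
        level = {c for c in candidates if is_frequent(c)}
        found |= level
    return found

# Hypothetical baskets.
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "milk"},
    {"pencil"},
]

found = frequent_itemsets(transactions, minsup=0.5)
```

With minsup = 0.5 the loop stops after the three-item candidate { pen, ink, milk } fails, since it appears in only one of four baskets.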

Example: Mining Simulated Combustion Data

Joint work with
• Brijesh Garabadu, School of Computing
• Zoran Djurisic, Chem. & Fuels Engg.
The problem
• Combustion model for powdered coal furnaces
• Which conditions control NOx pollution?

The Data

Multidimensional space
• Pressure, fuel mix, oxygen concentration
• Can explore (simulate) any combination
But which to look at?
Need to:
• Locate relevant subspaces
• Characterize important events
• Develop causal hypotheses

Techniques Applied

Cluster analysis
• Which datasets are similar?
Neural networks
• Which datasets are interesting?
Decision trees
• Which features best explain similarities?

Cluster Analysis: Unsupervised Learning

At outset, category structure of the data is unknown
• All that is known is a collection of observations
Objective: To discover a category structure which fits the observations
• i.e. finding natural groups in data
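
One common way to find natural groups is k-means, sketched below in plain Python. The talk does not say which clustering algorithm was used, so k-means is a stand-in here, and the 2-D observations are invented.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of floats; centroids start at the first k points."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled observations with two visible groups.
observations = [
    (1.0, 1.0), (5.0, 5.0), (1.2, 0.8),
    (5.2, 4.9), (0.9, 1.1), (4.8, 5.1),
]

centroids, clusters = kmeans(observations, k=2)
```

No labels are supplied; the two groups emerge from the distances alone, which is what makes the learning unsupervised.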

Combustion Application

Cluster analysis was used to detect relationships among various species
• Are the behaviors of any two species related?
• Is the concentration of one species dependent on that of one or more other species?
One confirmed hypothesis:
• CH reaches its peak concentration either before or at the same time as H reaches its peak concentration
• An important engineering observation
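
A hypothesis of this kind can be checked on one simulation run by comparing peak times. The sketch below uses invented concentration series sampled at the same time steps; it is not the analysis from the talk.

```python
def peak_index(series):
    """Index of the first maximum in a concentration time series."""
    return max(range(len(series)), key=series.__getitem__)

def peaks_no_later(a, b):
    """True if series `a` peaks before, or at the same step as, series `b`."""
    return peak_index(a) <= peak_index(b)

# Hypothetical concentrations over five time steps of one run.
ch = [0.0, 0.3, 0.9, 0.4, 0.1]
h = [0.0, 0.1, 0.5, 0.8, 0.6]

ch_first = peaks_no_later(ch, h)
```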

Artificial Neural Networks

A general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples
Combustion application
• Finding out different kinds of pattern (increasing / decreasing, etc.) in the lifetime of a species during the combustion process
• This can be used to prove various hypotheses as well as to detect patterns of specific species in previously unseen data

Neural Networks: Supervised Learning

Application Technique

Training set data are labeled by the user
• These labeled data are used to train the ANN
The ANN is then used to classify previously unseen data
• e.g., species in a particular combustion
• Into a particular pattern class
For example, NO shows two different trends under differing conditions
A trained ANN can be used to classify the datasets according to the trend of NO
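
The train-then-classify workflow can be sketched with a single perceptron, the simplest possible stand-in for an ANN. The talk does not give the network architecture or features; the series, the labels (+1 rising, -1 falling), and the use of successive differences as features are assumptions for illustration.

```python
def differences(series):
    """Feature vector: successive differences of a concentration series."""
    return [b - a for a, b in zip(series, series[1:])]

def train_perceptron(examples, epochs=10):
    """Perceptron rule: adjust weights only when a labeled example is misclassified."""
    weights, bias = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        for features, label in examples:
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            predicted = 1 if activation > 0 else -1
            if predicted != label:
                weights = [w + label * x for w, x in zip(weights, features)]
                bias += label
    return weights, bias

def classify(model, features):
    weights, bias = model
    return 1 if sum(w * x for w, x in zip(weights, features)) + bias > 0 else -1

# Hypothetical user-labeled training series: +1 = rising trend, -1 = falling trend.
labeled = [
    ([0.0, 0.2, 0.5, 0.9], 1),
    ([0.9, 0.6, 0.2, 0.0], -1),
    ([0.1, 0.15, 0.3, 0.4], 1),
    ([0.5, 0.4, 0.35, 0.1], -1),
]
model = train_perceptron([(differences(s), y) for s, y in labeled])

# Previously unseen series.
rising = classify(model, differences([0.1, 0.2, 0.4, 0.7]))
falling = classify(model, differences([0.9, 0.5, 0.2, 0.1]))
```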

Decision Trees

Characterize data by features
• e.g., species concentration at an instant
Categorize data sets
• Manually, or use ANN
• e.g., according to the trend of NO
Use decision tree algorithm to discover clustering criteria

Sample Output

=== Classifier model (full training set) ===

J48 pruned tree
---------------------

CO <= 0.002945
|   OH <= 0.000016
|   |   CO <= 0.000166: yes (17.0/1.0)
|   |   CO > 0.000166: no (3.0)
|   OH > 0.000016: yes (30.0)
CO > 0.002945: no (60.0/1.0)
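
Read as code, the pruned tree is a short chain of threshold tests. In Weka's J48 output, a leaf annotation such as (17.0/1.0) gives the number of training instances reaching the leaf and how many of them were misclassified. The function name below is illustrative, and the slide does not state what the "yes" and "no" classes stand for.

```python
def classify_j48(co, oh):
    """Transcription of the pruned tree above; returns the predicted class label."""
    if co <= 0.002945:
        if oh <= 0.000016:
            return "yes" if co <= 0.000166 else "no"
        return "yes"
    return "no"

# Hypothetical concentration values, one per leaf of the tree.
examples = [
    classify_j48(co=0.0001, oh=0.00001),
    classify_j48(co=0.001, oh=0.00001),
    classify_j48(co=0.001, oh=0.001),
    classify_j48(co=0.01, oh=0.0),
]
```

Only two features, CO and OH, are needed to separate the 110 training instances with two errors.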

Research Opportunities

Try it!
• In your area, on your data, for new results
Features
• Definition, efficient extraction
Community building
• Sharing data mining results

PMML

Predictive Model Markup Language
XML-based representation of association rules
Developed by Data Mining Group
• Industrial and university research collaboration

An Excellent Tutorial

Used for material in this talk
• Data Mining Scientific and Engineering Applications
  • Tutorial at SC2001, November 12, 2001, by R. Grossman, C. Kamath and V. Kumar
  • http://www-users.cs.umn.edu/~kumar/Presentation/sc2001.html
