scientific applications of data mining
DESCRIPTION
TRANSCRIPT
Scientific Applications of Data Mining
Bioinformatics Seminar
August 28, 2002
Gary Lindstrom
School of Computing
University of Utah
Outline
What is data mining?Where has it been successfully
applied?How can it be applied to scientific
applications?Research Opportunities
What Is Data Mining?
One definition (Robert Grossman)• Data mining is the semi-automatic
discovery of patterns, associations, anomalies, structures, and changes in large data sets
Data Mining
Characteristics • Large data, vs. small data• Discovery, not validation• Data driven, not hypothesis driven• Automated, not manual application
Supported by• Statistics, machine learning, databases,
high performance computing
The Data Gap
Exponential growth of data• More automation, greater throughput,
more models, e.g. simulated
But: linear increase in number of researchers• Sift the sand, rather than searching a
sensor
Classical Data Mining Applications
Retail• Market basket analysis
Political science• Targeting campaign resources
Financial• Exploiting market trends & imbalances
Decision Support Systems
Generic term for analytic and historic uses of DBs• Contrast with: operational uses• Commonly known as On-Line
Transaction Processing (OLTP)
Data warehouses• Data culled from operational DBs, with
history and derived summary data
Data Warehouses vs. Databases
• Replicate data from distributed sources• Do not require strict currency of data• Oriented toward complex, often
statistical queries• Often based on materialized views of
operational data Views which have been expanded into real
tables
Tools for DSS Ad hoc SQL-style queries
• Optimized for large, complex data On-Line Analytic Processing (OLAP)
• Queries optimized for aggregation operations• Data is viewed as multidimensional array• Influenced by end-user tools such as spreadsheets
Data mining • Exploratory data analysis• Looking for interesting unanticipated patterns in
the data
Data Warehousing
External Data Source
Metadata Repository
SERVES OLAPEXTRACTTRANSFORMLOADREFRESH
Data Warehouse
Data Mining
Visualization
Creating And Maintaining A Warehouse
Challenges• Schema design for integrated information• Operations
Cleaning (curation): filling gaps, correcting errors Transforming: making consistent with new schema Loading: also sorting and summarizing Refreshing: incorporate updates to operation data Purging: aging out old data
Role of metadata• Sources of data, schema conversion
information, refresh history, etc.
OLAP Naturally Leads to Data Mining
Seeks interesting trends or patterns in large datasets• An example of exploratory data analysis• Related to knowledge discovery and machine
learning
Mining for rules• Association rules: motivated by retail market
basket analysis
Market Basket Analysis
Market basket• A collection of items purchased by a customer
in one transaction• Retailers want to learn of items often
purchased together For promotional and display grouping purposes
• Simple tabular representation Purchases(transid, custid, date, item, price, quantity)
Association Rules
Seek rules of the form:{ pen } => { ink }
• Meaning: If a pen is purchased in a transaction, it is
likely that ink will also be purchased in that transaction
Important Measures for Association Rules
Support• % of transactions containing all items
mentioned in rule• Low support reduces interest in the rule
Confidence• % of transactions containing the LHS
that also contain RHS• Indicates degree of correlation
Using Association Rules For Prediction
Always somewhat risky• Because ultimate goal is understanding
causality• Which is not directly reflected in
transaction data
There Can Be High Support and Confidence
… but no causality Example: pencils and pens are often
bought together• And pens and ink are often bought together• Hence pencils and ink are often bought
together But there is no causal link between pencils
and ink• Hence sale promotions on pencils and ink
probably won’t be effective
Finding Association Rules
Seek rules with:• Support greater than minsup• Confidence greater than minconf
Steps• Find frequent item sets
Sets of items with support >= minsup
• Break each frequent item set into LHS and RHS of candidate rules Keep those with confidence >= minconf
Testing Candidate Rules
Confidence calculation for each candidate rule• Maintain two counters: lhscount,
rhscount• Scan entire customer transaction table• Count in lhscount occurrences of all
items in LHS• If LHS is present, tally in rhscount if all
items in RHS are present
Identifying Frequent Item Sets
The a priori property:• Every subset of a frequent item set is
also a frequent item set
This leads to an iterative algorithm• Identify frequent item sets of one item• Iteratively, seek to extend frequent item
sets by adding an item
Finding Frequent Itemsets
foreach item, check if it is a frequent itemsetrepeat foreach new frequent itemset Ik with k items generate all itemsets Ik+1 with k+1 items, Ik Ik+1
Scan all transactions once and check if the generated k+1-itemsets are frequentuntil no new frequent itemsets are found
foreach item, check if it is a frequent itemsetrepeat foreach new frequent itemset Ik with k items generate all itemsets Ik+1 with k+1 items, Ik Ik+1
Scan all transactions once and check if the generated k+1-itemsets are frequentuntil no new frequent itemsets are found
Example: Mining Simulated Combustion Data
Joint work with• Brijesh Garabadu, School of Computing• Zoran Djurisic, Chem. & Fuels Engg.
The problem• Combustion model for powdered coal
furnaces• Which conditions control NOx pollution?
The Data
Multidimensional space• Pressure, fuel mix, oxygen concentration• Can explore (simulate) any combination
But which to look at?
Need to:• Locate relevant subspaces• Characterize important events• Develop causal hypotheses
Techniques Applied
Cluster analysis• Which datasets are similar?
Neural networks• Which datasets are interesting?
Decision trees• Which features best explain similarities?
Cluster Analysis: Unsupervised Learning
At outset, category structure of the data is unknown• All that is known is a collection of
observations
Objective: To discover a category structure which fits the observation• i.e. finding natural groups in data
Combustion Application
Cluster analysis was used to detect relationships among various species• Are the behaviors of any two species related? • Is the concentration of one species dependent
on that of one or more other species? One confirmed hypothesis:
• CH reaches it peak concentration either before or at the same time as H reaches its peak concentration
• An important engineering observation
Artificial Neural Networks A general, practical method for learning
real-valued, discrete-values, and vector-values function from examples
Combustion application• Finding out different kinds of pattern
(increasing / decreasing, etc) in the lifetime of a species during the combustion process
• This can be used to prove various hypothesis as well as to detect patterns of specific species in previously unseen data
Neural Networks: Supervised Learning
Application Technique
Training set data are labeled by the user• These labeled data are used to train the ANN
The ANN is then used to classify previously unseen data• e.g., species in a particular combustion• Into a particular pattern class
For example, NO shows two different trends under differing conditions
A trained ANN can be used to classify the datasets according to the trend of NO
Decision Trees
Characterize data by features• e.g., species concentration at an instant
Categorize data sets• Manually, or use ANN• e.g., according to the trend of NO
Use decision tree algorithm to discover clustering criteria
Sample Output
=== Classifier model (full training set) ===J48 pruned tree---------------------CO <= 0.002945| OH <= 0.000016| | CO <= 0.000166: yes (17.0/1.0)| | CO > 0.000166: no (3.0)| OH > 0.000016: yes (30.0)CO > 0.002945: no (60.0 / 1.0)
Research Opportunities
Try it!• In your area, on your data, for new
results
Features• Definition, efficient extraction
Community building• Sharing data mining results
PMML
Predictive Model Markup LanguageXML based representation of
association rulesDeveloped by Data Mining Group
• Industrial and university research collaboration
An Excellent Tutorial
Used for material in this talk• Data Mining Scientific and Engineering
Applications Tutorial at SC2001, November 12, 2001 by
R. Grossman, C. Kamath and V. Kumar
http://www-users.cs.umn.edu/ ~kumar/Presentation/sc2001.html