April 19, 2023
Introduction to Data Mining
Chen, Chun-Hsien
Department of Information Management
Chang Gung University
Outline
• Motivation for data mining
• What is data mining?
• Applications of data mining
• Data mining process
• Main data mining techniques
• Classification of data mining systems
Motivation
• Data explosion problem
  - Automated data collection tools and mature database technology
  - 1 million new transactions per hour in the Walmart database
  - Tremendous number of Web pages
  - 40 billion photos on Facebook
  - Big data in the cloud
• We are drowning in data but starving for the knowledge needed to make decisions
• Solution: data mining
  - One of the 10 emerging technologies that will change the world in the near future (MIT Technology Review)
What Is Data Mining?
• Formal definition of data mining
  - Automatic extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) knowledge (rules, regularities, patterns, trends, affinities) from large amounts of data
• Alternative names
  - Business intelligence, knowledge discovery in databases (KDD), data/pattern analysis, knowledge extraction, data dredging, information harvesting, data archaeology, etc.
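Example: Mining a Concept Hierarchy
[Figure: concept hierarchy tree with five levels, from most general to most specific]
Level     Example nodes
all       all
region    Europe, ..., North_America
country   Germany, ..., Spain, Canada, ..., Mexico
city      Frankfurt, ..., Vancouver, ..., Toronto
office    L. Chan, ..., M. Wind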
Part of International Sales Data
Region           Country   City          Office
North American   USA       New York      Queen
North American   Canada    Vancouver     L. Chan
North American   USA       L.A.          Bay Area
North American   USA       Boston        Northern Area
North American   Canada    Toronto       Central
North American   USA       Boston        Southern Area
North American   USA       New York      Queen
North American   USA       L.A.          Bay Area
North American   Mexico    Mexico City   Empire
North American   Canada    Toronto       Central
North American   USA       New York      Manhattan
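One way to see what a concept hierarchy buys us is to summarize the sales records at different levels (office, city, country, region). The sketch below is illustrative only: the `roll_up` helper, the `LEVELS` mapping, and the tuple layout are assumptions made for this example, not something the slides define.

```python
from collections import Counter

# (region, country, city, office) rows taken from the sales data above
SALES = [
    ("North American", "USA", "New York", "Queen"),
    ("North American", "Canada", "Vancouver", "L. Chan"),
    ("North American", "USA", "L.A.", "Bay Area"),
    ("North American", "USA", "Boston", "Northern Area"),
    ("North American", "Canada", "Toronto", "Central"),
    ("North American", "USA", "Boston", "Southern Area"),
    ("North American", "USA", "New York", "Queen"),
    ("North American", "USA", "L.A.", "Bay Area"),
    ("North American", "Mexico", "Mexico City", "Empire"),
    ("North American", "Canada", "Toronto", "Central"),
    ("North American", "USA", "New York", "Manhattan"),
]

# Hypothetical mapping from level name to its column position in each row
LEVELS = {"region": 0, "country": 1, "city": 2, "office": 3}


def roll_up(rows, level):
    """Count records grouped by the path from the region down to `level`."""
    depth = LEVELS[level] + 1
    return Counter(row[:depth] for row in rows)


by_country = roll_up(SALES, "country")
by_city = roll_up(SALES, "city")
```

Moving from `by_city` to `by_country` corresponds to climbing one level of the hierarchy: the same records are grouped more coarsely.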
Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on database technology, statistics, artificial intelligence, information science, machine learning, and visualization]
Evolution of Database Technology
• 1960s: data collection, database creation, network DBMS
• 1970s: relational data model, relational DBMS
• 1980s: advanced data models (extended-relational, object-oriented, spatial, and temporal databases, etc.)
• 1990s onward: data mining, data warehousing, multimedia databases, and the Web
Applications of Data Mining
• Decision support
  - Consumer understanding, service improvement
  - Market trend analysis and management
  - Risk analysis and management
  - Fraud detection and management
  - Medical decision support systems
• Other applications
  - Text mining
  - Web analysis
  - Biomedical informatics
Market Analysis and Management (1/2)
• Data sources for analysis
  - Credit card transactions, customer questions (FAQ), customer complaint calls, public lifestyle studies
• Market basket analysis and cross-selling
  - Associations/correlations between product sales
  - Prediction based on the association information
Market Analysis and Management (2/2)
• Customer profiling
  - Find clusters of “model” customers who share the same characteristics: interests, spending habits, income level, etc.
  - Data mining can tell you which types of customers buy which products (by clustering or classification techniques)
• Identifying personalized customer requirements
  - Identify the best products for different customers
  - Use prediction to find which factors will attract new customers
Risk Management and Analysis
• Financial planning and asset evaluation
  - Cash flow analysis and prediction
  - Asset evaluation
  - Time series analysis (trend analysis)
• Competitive analysis and market segmentation
  - Monitoring market directions and competitors
  - Setting pricing strategy in a highly competitive market
  - Grouping customers; class-based pricing procedures (multi-brand, multi-style strategies)
Fraud Detection and Management
• Applications
  - Health care, credit card services
• Approach
  - Use historical data to build models of fraudulent behavior, then use data mining to help identify similar instances
• Examples
  - Money laundering: detect suspicious money transactions
  - Medical insurance: detect professional patients and rings of doctors
Other Applications
• Text mining
  - News classification: find related articles
  - CRM data analysis: analyze customer Q&As
  - Medical informatics: automatic classification of medical reports
• Web mining: mining web access logs
  - Analyzing the effectiveness of web marketing
  - Improving Web site organization
  - Discovering customer preferences and behavior
• Biomedical informatics
  - Finding genes related to genetic diseases
  - Drug discovery
Steps in the KDD Process (Technically)
[Figure: Databases → Data Preprocessing → Relevant Data → Data Mining → Patterns → Evaluation/Presentation → Knowledge]
• Data mining: the core step of the KDD process
Main Steps of a KDD Process (Fully)
• Domain knowledge acquisition
  - Learning relevant prior knowledge and the goals of the application
• Data collection and preprocessing (may take 60% of the effort!)
  - Data selection and integration: creating a target data set
  - Data cleaning, data transformation, and data reduction (in the cloud)
• Data mining
  - Choosing the data mining functions: association, classification, clustering, regression, summarization
  - Choosing the mining algorithm(s)
  - Searching for knowledge patterns of interest
• Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Mining On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced databases and information repositories
  - Spatial databases
  - Time-series data (temporal data)
  - Text databases and multimedia databases
  - Object-oriented databases
  - Heterogeneous and legacy databases
  - Web sites
Steps in the KDD Process
[Figure: Databases → Data Preprocessing → Relevant Data]
Why Data Preprocessing?
• Data in the real world is dirty
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - Noisy: containing errors or outliers
  - Inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  - Quality decisions must be based on quality data
Major Tasks in Data Preprocessing
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
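To make three of these tasks concrete, here is a minimal sketch of data cleaning, data transformation, and data discretization on a single numeric attribute. The function names, the mean-fill strategy, min-max scaling, and equal-width binning are all choices made for this illustration; the slides do not prescribe specific methods.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Data discretization: map each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, None, 40, 60]  # hypothetical attribute with one missing value
cleaned = fill_missing(ages)
scaled = min_max(cleaned)
binned = equal_width_bins(cleaned, 2)
```

Data integration and data reduction are left out here because they depend on the schemas and volume of the actual sources.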
Steps in the KDD Process
[Figure: Databases → Data Preprocessing → Relevant Data → Data Mining → Patterns]
Main Data Mining Techniques
Association Rule Mining
Classification and Prediction
Cluster Analysis
Regression Analysis
Outlier Analysis
Trend Analysis
Main Data Mining Techniques (1/5)
• Association rule mining: find association rules (correlation and causality)
  - Form of association rules: X ⇒ Y [s, c]
  - Simple form example: computer ⇒ software [s = 1%, c = 75%]
  - Detailed form examples
    - sales(T, “computer”) ⇒ sales(T, “software”) [support = 1%, confidence = 75%]
    - buy(T, “Beer”) ⇒ buy(T, “Diaper”) [support = 2%, confidence = 70%]
    - age(X, “20..29”) ∧ income(X, “30..39K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%]
Association Rule Mining (Support and Confidence)
• Given a transaction database, find all rules X ⇒ Y that meet the minimum support and confidence
  - Support, s: the probability that a transaction contains both X and Y
  - Confidence, c: the conditional probability that a transaction containing X also contains Y
• I = {i1, i2, i3, ..., in}: the set of all items
• T ⊆ I: a transaction

Transaction ID   Items Bought (T)
0001             A, B, C
0002             A, C
0003             A, D
0004             B, E, F

• A ⇒ C (50%, 66.7%)
• C ⇒ A (50%, 100%)
[Figure: Venn diagram of customers who buy X, customers who buy Y, and customers who buy both]
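The support and confidence figures for the four-transaction example can be checked with a few lines of code. This is a minimal sketch; the function names `support` and `confidence` and the dictionary layout are assumptions made for illustration.

```python
# Transaction database from the slide: transaction ID -> items bought
TRANSACTIONS = {
    "0001": {"A", "B", "C"},
    "0002": {"A", "C"},
    "0003": {"A", "D"},
    "0004": {"B", "E", "F"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(X and Y) / support(X)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


s_ac = support({"A", "C"}, TRANSACTIONS)            # support of {A, C}
c_a_to_c = confidence({"A"}, {"C"}, TRANSACTIONS)   # confidence of A => C
c_c_to_a = confidence({"C"}, {"A"}, TRANSACTIONS)   # confidence of C => A
```

Both rules share the same support because support depends only on the combined itemset; the confidences differ because A appears in three transactions while C appears in two.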
Main Data Mining Techniques (2/5): Supervised Learning
• Use a training set to construct a model for forecasting the outcome of future events
• Two main types
  - Classification
    - Finding models that distinguish classes for future forecasts
    - e.g., loan approval, customer classification, fingerprint recognition
    - Model representation: decision tree, neural network
  - Prediction
    - Predict unknown or missing numerical values for future forecasts
    - e.g., stock price prediction
    - Model representation: linear regression, neural network
Classification vs. Prediction
• Both use a training set to construct a model for forecasting the outcome of future events
• Classification
  - Predicts categorical class labels (mainly for two-class problems)
  - Constructs a classification model to classify new data
• Prediction
  - Predicts numerical values
  - Constructs a continuous-valued (mathematical) function to predict unknown or missing values
• Typical applications
  - Credit card approval
  - Medical diagnosis and treatment
  - Pattern recognition
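An Example of Classification (Fruit Classifier)
[Figure: input features → Classifier → output class label]
• Input features: shape (e.g., round, oval) and color (e.g., red, orange, yellow)
• Output class labels: Apple, Orange, Mango
• shape = round, color = red → Apple
• shape = round, color = orange → Orange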
A General Classifier
[Figure: input features → Classifier → output class label]
Model of Supervised Learning
• The model has the form $y = f(x_1, x_2, \ldots, x_n)$
[Figure: input features x1, x2, ..., xn → Classifier/Predictor → output y]
• Main issues:
  - What are x1, ..., xn?
  - How to get the model f?
  - How to collect training data with output y?
Stages of Classification Tasks (Construction and Usage of Classification Model)
• Model construction: Training Data (I, O) → Classification Learning Algorithms → Classifier Model
• Model usage: input features → Classifier Model → output class label
Classification Methods
• Decision tree induction algorithms
• Bayesian classifiers
• Backpropagation neural networks
• Support vector machines (SVM)
• k-nearest neighbor classifier
An Example of Training Dataset
age     income   student   credit_rating   buys_PC
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

• Input features (I): age, income, student, credit_rating
• Class label (O): buys_PC
• This follows an example from Quinlan’s ID3
An Example of Classification Model (A Decision Tree for Predicting buys_PC)
age?
  <=30: student?
    no: no
    yes: yes
  31…40: yes
  >40: credit rating?
    excellent: no
    fair: yes
Legend: a name ending in “?” is a test attribute, a branch label is an attribute value, and a leaf (yes/no) is a class label.
Extracting Classification Rules from Tree
• Rules are easier for humans to understand
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rule examples
  - IF age = “<=30” AND student = “no” THEN buys_PC = “no”
  - IF age = “<=30” AND student = “yes” THEN buys_PC = “yes”
  - IF age = “31…40” THEN buys_PC = “yes”
  - IF age = “>40” AND credit_rating = “fair” THEN buys_PC = “yes”
  - IF age = “>40” AND credit_rating = “excellent” THEN buys_PC = “no”
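Because each root-to-leaf path becomes one IF-THEN rule, the rule set can be written as a small function and checked against the 14-row training dataset. This is a sketch under assumptions: the `classify` function name and the tuple layout are invented for the example, age ranges are written with ASCII dots, and for the age > 40 branch the outcomes are taken from the training table (fair gives yes, excellent gives no).

```python
# Training rows: (age, income, student, credit_rating, buys_PC)
TRAINING = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def classify(age, student, credit_rating):
    """Apply the IF-THEN rules: one rule per root-to-leaf path of the tree."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # age == ">40": the leaf depends on credit rating (fair -> yes, per the table)
    return "yes" if credit_rating == "fair" else "no"


# Rows where the rules disagree with the recorded class label
errors = [row for row in TRAINING
          if classify(row[0], row[2], row[3]) != row[4]]
```

Note that `income` never appears in a rule: the tree does not test it, so it plays no part in the prediction.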
A Decision Tree for CAD Screening (Constructed from 500 Records)
[Figure: decision tree for CAD screening]
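Discussion of the Extracted Classification Rules (from the Decision Tree for CAD Screening)
• In most cases, combining hypertension status with hsCRP concentration is enough to tell whether a person is a potential CAD patient. These cases account for 81% of the records (405 cases), with a misclassification rate of 2.47% (= 10/405).
• Conclusion: hypertension and hsCRP concentration are the main risk factors for distinguishing whether a person is a potential CAD patient.
• In some cases, Age and HDLc information must also be taken into account to tell whether a person is a potential CAD patient. These cases account for 14.6% of the records (73 cases), with a somewhat higher misclassification rate of 5.48% (= 4/73).
• Together, the two situations above involve 5 rules and cover 95.6% of the records.
• Rule examples (the following 2 rules cover 81% of the cases)
  - IF hypertension AND hsCRP > 0.316 THEN CAD
  - IF no hypertension AND hsCRP < 0.545 THEN Normal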
Main Data Mining Techniques (3/5): Cluster Analysis
• Cluster analysis (unsupervised learning)
  - The class label is unknown: group data to form new classes
  - e.g., customer profiling (Amazon.com)
• Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
• Mainly for exploratory analysis
Example of Cluster Analysis
[Figure: three clusters A, B, and C, with points X, Y, and Z as outliers]
• Difficulty: the distribution of high-dimensional data cannot be inspected visually.
Major Clustering Approaches
• Partitioning algorithms: construct various partitions and then evaluate them by some criterion
• Hierarchical algorithms: create a hierarchical decomposition of the set of records using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: quantize the data space into a finite number of cells that form a grid structure on which clustering is performed
• Model-based: hypothesize a model for each cluster and find the best fit of the records to the given models
Hierarchical Clustering
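• Uses a distance matrix as the clustering criterion.
[Figure: objects a, b, c, d, e clustered in five steps]
• Agglomerative (AGNES), Step 0 to Step 4: {a}, {b}, {c}, {d}, {e} → {a, b} → {d, e} → {c, d, e} → {a, b, c, d, e}
• Divisive (DIANA), Step 0 to Step 4: the same sequence in reverse, starting from {a, b, c, d, e} and splitting down to single objects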
A Dendrogram Shows Hierarchically Merged Clusters
• Decompose the data objects into several levels of nested clusters (a tree of clusters), called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
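The bottom-up (agglomerative) procedure and the idea of cutting the dendrogram can be sketched in a few lines. This is an illustration under assumptions, not the AGNES algorithm as published: it uses one-dimensional points, single-linkage distance, and hypothetical function names, and it simply stops merging once the requested number of clusters is reached.

```python
def single_link(c1, c2):
    """Single-linkage distance: the closest pair of points across two clusters."""
    return min(abs(p - q) for p in c1 for q in c2)


def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Returns the final clusters and the merge history, which is the
    information a dendrogram draws.
    """
    clusters = [[p] for p in points]  # Step 0: every point is its own cluster
    history = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        dist = single_link(clusters[i], clusters[j])
        merged = sorted(clusters[i] + clusters[j])
        history.append((dist, merged))
        # Delete j first (j > i) so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters, history


# Hypothetical 1-D data; stopping at 2 clusters is like cutting the dendrogram there.
clusters, history = agglomerate([1.0, 2.0, 6.0, 7.0, 15.0], 2)
```

The merge distances in `history` never decrease under single linkage, which is why a horizontal cut through the dendrogram yields a well-defined clustering.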
Gene Expression Analysis by Clustering
• Finding differentially regulated genes
[Figure: clustering of gene expression data]
Profile of Stroke Patients (Diagnosed by Indices of Chinese Medicine)
Main Data Mining Techniques (4/5): Example of Linear Regression
[Figure: scatter plot of y against x with the fitted line y = x + 1; the value Y1 at x = X1 is unknown]
• Predict y’s value at X1 using linear regression
• y = f(x); what is f?
Regression Analysis and Log-Linear Models in Prediction
• Linear regression: $Y = \alpha + \beta X$
  - The two parameters, $\alpha$ and $\beta$, specify the line and are estimated from the training data
  - Apply the least squares criterion to the training samples $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$
• Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n$
  - Analyze $b_1, b_2, \ldots, b_n$ to find the contribution of each variable
• Log-linear models
  - Example: estimate the probability $p(a, b, c, d) = \alpha_{abc}\,\beta_{abd}\,\gamma_{acd}\,\delta_{bcd}$
  - $\log p(a, b, c, d) = \log \alpha_{abc} + \log \beta_{abd} + \log \gamma_{acd} + \log \delta_{bcd}$
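For simple linear regression, the least squares criterion has a closed-form solution: the slope is the ratio of the x-y co-variation to the x variation, and the intercept makes the line pass through the point of means. The sketch below writes the intercept as `alpha` and the slope as `beta`; those names, the function name `fit_line`, and the sample data are choices made for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of (alpha, beta) for the line y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Hypothetical training samples lying exactly on the line y = x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0, 4.0]
alpha, beta = fit_line(xs, ys)
prediction = alpha + beta * 5.0  # predict y at a new x value
```

With noisy data the fitted line no longer passes through every sample, but the same two formulas still give the line that minimizes the sum of squared vertical errors.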
Other Data Mining Techniques (5/5)
• Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - It can be treated as noise or an exception, but it is quite useful in fraud detection and rare-event analysis
• Trend analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
• Other pattern-directed or statistical analyses
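One simple statistical way to flag objects that do not comply with the general behavior of the data is a z-score test. The slides do not name a specific outlier method, so the z-score rule, the threshold of 2.0, the function name, and the sample amounts below are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


amounts = [20, 22, 19, 21, 23, 20, 500]  # hypothetical transaction amounts
outliers = z_score_outliers(amounts)
```

A single extreme value inflates both the mean and the standard deviation, so on small samples this test is conservative; more robust variants use the median instead.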
Are All the “Discovered” Patterns Interesting?
• A data mining system or query may generate thousands of patterns, and not all of them are interesting.
• How to screen a large number of patterns: interestingness measures
  - A pattern is interesting if it is easily understood, potentially useful, novel, valid on new or test data with some degree of certainty, or if it validates a hypothesis that the user seeks to confirm
• Objective vs. subjective measures for pattern screening
  - Objective: based on statistics and structures of data patterns, e.g., support, confidence, etc.
  - Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
Can We Find All and Only Interesting Patterns?
• Completeness vs. optimization
  - Find all the interesting patterns: completeness
    - Can a data mining system find all the interesting patterns?
  - Search for only interesting patterns: optimization
    - Can a data mining system find only the interesting patterns?
• Approaches
  - First generate all the patterns and then filter out the uninteresting ones
  - Generate only the interesting patterns: mining query optimization
Classification Scheme of DM Techniques
• General functionality
  - Descriptive data mining
  - Predictive data mining
• Different views, different classifications
  - Kinds of databases to be mined
  - Kinds of knowledge to be discovered
  - Kinds of techniques utilized
  - Kinds of applications adapted
A Multi-Dimensional View of DM Technique Classification
• Databases to be mined
  - Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
• Knowledge to be mined
  - Association, classification, clustering, trend, characterization, discrimination, deviation and outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
• Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
• Applications adapted
  - Retail, banking, stock market analysis, telecommunication, fraud analysis, Web mining, biomedical informatics, etc.
Summary for Data Mining
• Data mining: automatic discovery of interesting knowledge from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data preprocessing, data mining, pattern evaluation, and knowledge presentation
• Main data mining functions: association rule mining (ARM), classification, clustering, outlier and trend analysis, characterization, etc.
Thank You!
Have a Nice Day!