Data Mining: A Database Perspective, presented by YC Liu
Posted on 19-Dec-2015
Data Mining: A Database Perspective
Presented by YC Liu
References
• Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Chapter 6.
• M.-S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview from a Database Perspective", IEEE Transactions on Knowledge and Data Engineering, 8(6):866-883, 1996.
• J. Liu, Y. Pan, K. Wang, and J. Han, "Mining Frequent Item Sets by Opportunistic Projection", in Proc. 2002 Int. Conf. on Knowledge Discovery and Data Mining (KDD'02), Edmonton, Canada, July 2002.
Outline
• Introduction
• Mining Association Rules
• Multilevel Data Generalization, Summarization, and Characterization
• Data Classification
• Clustering Analysis
• (Pattern-Based Similarity Search)
• (Mining Path Traversal Patterns)
• (Recommendation)
• (Web Mining)
• (Text Mining)
Introduction (1/5)
• Knowledge Discovery in Databases: a process of nontrivial extraction of implicit, previously unknown, and potentially useful information.
Introduction (2/5)
• Main benefits
  – Mining knowledge from databases
  – Understanding user behavior
  – Helping enterprises make decisions
  – Increasing business opportunities
• Why did data mining arise?
  – The widespread use of product bar codes
  – The computerization of businesses
  – Millions of databases in active use
  – Years of accumulated business transaction data

[Diagram: Data → Knowledge]
Introduction (3/5)
Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process.

[Pipeline: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Introduction (4/5)
Challenges of Data Mining (1/2)
• Handling of different types of data
• Efficiency and scalability of data mining algorithms
• Usefulness, certainty, and expressiveness of data mining results
• Expression of various kinds of data mining requests and results
Introduction (5/5)
Challenges of Data Mining (2/2)
• Interactive mining of knowledge at multiple abstraction levels
• Mining information from different sources of data
• Protection of privacy and data security
An Overview of Data Mining Techniques
• Classifying data mining techniques
  – What kinds of databases to work on
    • Relational databases, transaction databases, spatial databases, temporal databases, ...
  – What kinds of knowledge to be mined
    • Association rules, classification, clustering, ...
  – What kinds of techniques to be utilized
    • Generalization-based mining, pattern-based mining, mining based on statistics or mathematical theories
Mining Different Kinds of Knowledge from Databases
– Association rules
– Data generalization, summarization, and characterization
– Data classification
– Data clustering
– Pattern-based similarity search
– Path traversal patterns
– Recommendation
– Web mining
– Text mining
Mining Association Rules
• An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅ (I being the set of all items).
• The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y.
• The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y.
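These two measures can be computed directly from a transaction list. A minimal sketch (the function names and the data layout are my own, not from the slides):

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, x, y):
    """c for the rule X => Y: support(X u Y) divided by support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# The four-transaction example used in these slides
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support(transactions, {"A", "C"}))       # 0.5
print(confidence(transactions, {"A"}, {"C"}))  # 0.666...
```

Support is a plain frequency over the whole database, while confidence is the conditional frequency restricted to transactions that already contain the rule body.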
What Is Association Mining?
• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
  – Cross-marketing and attached-mailing applications; other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns.
• Examples:
  – Rule form: "Body ⇒ Head [support, confidence]"
  – buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
  – major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
Association Rule: Basic Concepts
• Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in one visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
  – E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
  – * ⇒ Maintenance Agreement (what the store should do to boost Maintenance Agreement sales)
  – Home Electronics ⇒ * (what other products should the store stock up on?)
Rule Measures: Support and Confidence
• Find all rules X & Y ⇒ Z with minimum confidence and support
  – support s: the probability that a transaction contains {X ∪ Y ∪ Z}
  – confidence c: the conditional probability that a transaction containing {X ∪ Y} also contains Z

  Transaction ID  Items Bought
  2000            A, B, C
  1000            A, C
  4000            A, D
  5000            B, E, F

With minimum support 50% and minimum confidence 50%, we have:
  – A ⇒ C (50%, 66.6%)
  – C ⇒ A (50%, 100%)
[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]
Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (based on the types of values handled)
  – buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  – age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
• Single-dimensional vs. multidimensional associations
  – age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
• Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?
• Various extensions
  – Correlation, causality analysis
    • Association does not necessarily imply correlation or causality
  – Maxpatterns and closed itemsets
  – Constraints enforced
    • E.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?
Mining Association Rules: An Example

  Transaction ID  Items Bought
  2000            A, B, C
  1000            A, C
  4000            A, D
  5000            B, E, F

Min. support 50%, min. confidence 50%

  Frequent Itemset  Support
  {A}               75%
  {B}               50%
  {C}               50%
  {A, C}            50%

For the rule A ⇒ C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.
Mining Association Rules
• Steps for mining association rules:
  – Discover all large itemsets
  – Use the large itemsets to generate the association rules for the database
• To identify the large itemsets: the Apriori algorithm
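The level-wise search named above can be sketched as follows. This is a minimal illustration, not the paper's exact pseudocode; the function and variable names are my own:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all large (frequent) itemsets as {frozenset: support}.

    Level-wise Apriori sketch: build length-k candidates from the large
    (k-1)-itemsets, prune any candidate with an infrequent subset
    (the Apriori principle), then count supports against the database."""
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    large = {}
    for i in items:  # L1: large 1-itemsets
        c = frozenset([i])
        if (s := supp(c)) >= min_support:
            large[c] = s
    result = dict(large)
    k = 2
    while large:
        prev = set(large)
        # join: union pairs of large (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune: every (k-1)-subset of a candidate must itself be large
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        large = {c: s for c in candidates if (s := supp(c)) >= min_support}
        result.update(large)
        k += 1
    return result

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(transactions, 0.5))
# contains {A}: 0.75, {B}: 0.5, {C}: 0.5, {A, C}: 0.5
```

On the slides' four-transaction database this reproduces exactly the frequent-itemset table shown earlier: {A, C} survives because both of its subsets are large, while {A, B} and {B, C} fail the support count.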
Mining Generalized and Multi-level Association Rules
• Interesting associations among data items often occur at a relatively high concept level
Interestingness of Discovered Association Rules
• Example 1 (Aggarwal & Yu, PODS'98)
  – Among 5000 students:
    • 3000 play basketball
    • 3750 eat cereal
    • 2000 both play basketball and eat cereal
  – play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
  – play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

               basketball  not basketball  sum (row)
  cereal       2000        1750            3750
  not cereal   1000        250             1250
  sum (col.)   3000        2000            5000
Interestingness of Discovered Association Rules
• An association rule A ⇒ B is interesting if its confidence exceeds a certain measure, i.e., if

    P(A ∪ B) / P(A) − P(B) > d

  where d is a suitable constant.
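The left-hand side of the measure (confidence minus the expected confidence P(B)) can be evaluated directly. A sketch on the basketball/cereal counts above; the data layout as one pseudo-record per student is my own:

```python
def interest(transactions, a, b):
    """P(A and B)/P(A) - P(B): positive values suggest A raises the
    chance of B; near-zero or negative values flag misleading rules."""
    n = len(transactions)
    p_a = sum(1 for t in transactions if a <= t) / n
    p_b = sum(1 for t in transactions if b <= t) / n
    p_ab = sum(1 for t in transactions if (a | b) <= t) / n
    return p_ab / p_a - p_b

# The 5000-student example, one record per cell of the contingency table
students = ([{"basketball", "cereal"}] * 2000 +
            [{"cereal"}] * 1750 +
            [{"basketball"}] * 1000 +
            [set()] * 250)
print(interest(students, {"basketball"}, {"cereal"}))
# 2000/3000 - 3750/5000 = -0.0833...: the rule is misleading
```

The negative value confirms the slide's point: playing basketball slightly lowers, rather than raises, the probability of eating cereal, despite the rule's 66.7% confidence.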
Improving the Efficiency of Mining Association Rules
• Database scan reduction
  – FP-tree, ...
• Sampling
• Incremental updating of discovered association rules
• Parallel data mining
Classification
• A process of learning a function that maps a data item into one of several predefined classes.
• Every classification method based on inductive-learning algorithms takes as input a set of samples, each consisting of a vector of attribute values and a corresponding class.
• Predicts categorical class labels
• Classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Classification Process (1): Model Construction

Training Data:

  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

Classification algorithms applied to the training data yield the classifier (model):

  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction

Testing Data:

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
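Applying the learned rule to the testing data and the unseen tuple can be sketched in a few lines (a toy check; the function name is my own):

```python
def tenured(rank, years):
    """The classifier learned on the previous slide:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Scoring the rule against the testing data above: it gets three right
# and misclassifies Merlisa (Associate Prof, 7 years, not tenured).
testing = [("Tom", "Assistant Prof", 2, "no"),
           ("Merlisa", "Associate Prof", 7, "no"),
           ("George", "Professor", 5, "yes"),
           ("Joseph", "Assistant Prof", 7, "yes")]
correct = sum(tenured(r, y) == label for _, r, y, label in testing)
print(f"{correct}/{len(testing)} correct on the testing data")  # 3/4
print(tenured("Professor", 4))  # the unseen tuple (Jeff, Professor, 4) -> yes
```

This illustrates the two-phase process on the slides: the model is built once from training data, then evaluated on held-out testing data before being used on genuinely unseen records.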
Data Classification
• Decision-tree-based classification method
  – Decision Tree Learning System, ID3
  – Evaluation functions
    • Information Gain, based on the entropy  I(T) = −Σ_i p_i ln(p_i)
    • Gini Index  gini(T) = 1 − Σ_{j=1}^{n} p_j²
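Both evaluation functions are one-liners over a node's class distribution. A sketch, using ln as in the formula above (ID3 traditionally uses log base 2, which only rescales the entropy):

```python
from math import log

def entropy(probs):
    """I(T) = -sum_i p_i ln(p_i); 0 for a pure node."""
    return -sum(p * log(p) for p in probs if p > 0)

def gini(probs):
    """gini(T) = 1 - sum_j p_j^2; 0 for a pure node, largest when uniform."""
    return 1 - sum(p * p for p in probs)

# Class distribution of the 14-record buys_computer data: 9 yes, 5 no
probs = [9 / 14, 5 / 14]
print(entropy(probs))  # ~0.652
print(gini(probs))     # ~0.459
```

A split's score is the drop from the parent's impurity to the weighted impurity of its children; the attribute with the largest drop is chosen as the test.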
Training Dataset

  age     income  student  credit_rating  buys_computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31..40  high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31..40  low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31..40  medium  no       excellent      yes
  31..40  high    yes      fair           yes
  >40     medium  no       excellent      no

This follows an example from Quinlan's ID3.
Output: A Decision Tree for "buys_computer"

  age?
  ├─ <=30   → student?        (no → no;  yes → yes)
  ├─ 31..40 → yes
  └─ >40    → credit rating?  (excellent → no;  fair → yes)
Performance Improvement
• Database indices
• Attribute-oriented induction
• Two-phase multiattribute extraction
  – Inference power
  – Feature extraction phase
  – Feature combination phase
Clustering Analysis
• Clustering: the process of grouping physical or abstract objects into classes of similar objects.
• Clustering analysis: constructing a meaningful partitioning of a large set of objects based on a "divide and conquer" methodology.
• Methods:
  – Statistical analysis (Bayesian classification method)
  – Probability analysis
Clustering Based on Randomized Search
• PAM (Partitioning Around Medoids)
• CLARA (Clustering LARge Applications)
• CLARANS (Clustering Large Applications based upon RANdomized Search)
PAM (Partitioning Around Medoids) (1987)
• PAM (Kaufman and Rousseeuw, 1987); built into S-Plus
• Uses real objects to represent the clusters:
  1. Select k representative objects arbitrarily
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
  3. For each pair of i and h:
     • If TC_ih < 0, i is replaced by h
     • Then assign each non-selected object to the most similar representative object
  4. Repeat steps 2-3 until there is no change
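The steps above can be sketched as a swap loop. For brevity this sketch accepts the first cost-lowering swap rather than evaluating every TC_ih and taking the best one, as full PAM does; the names are illustrative:

```python
import random

def pam(points, k, dist, seed=0):
    """PAM sketch: start from k arbitrary medoids and keep replacing a
    medoid i with a non-medoid h whenever the swap lowers the total
    distance (i.e. whenever the total swapping cost TC_ih is negative)."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)

    def total_cost(meds):
        return sum(min(dist(p, m) for m in meds) for p in points)

    improved = True
    while improved:
        improved = False
        for i in list(medoids):
            for h in points:
                if h in medoids:
                    continue
                trial = [h if m == i else m for m in medoids]
                if total_cost(trial) < total_cost(medoids):  # TC_ih < 0
                    medoids = trial
                    improved = True
    # assign each non-selected object to its most similar medoid
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
print(pam(pts, 2, manhattan))  # two clusters: three low points, three high
```

Each accepted swap strictly lowers the total distance, so the loop terminates at a local optimum; on the toy data any local optimum separates the two point groups.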
PAM Clustering: total swapping cost TC_ih = Σ_j C_jih

[Four diagrams illustrate the cost contribution C_jih of a non-selected object j when medoid i is swapped for h; t denotes another current medoid:]
  • j is assigned to another medoid t and stays there: C_jih = 0
  • j is reassigned from i to the new medoid h: C_jih = d(j, h) − d(j, i)
  • j is reassigned from i to another medoid t: C_jih = d(j, t) − d(j, i)
  • j is reassigned from t to the new medoid h: C_jih = d(j, h) − d(j, t)
CLARA (Clustering LARge Applications) (1990)
• CLARA (Kaufman and Rousseeuw, 1990)
  – Built into statistical analysis packages, such as S+
• Draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weaknesses:
  – Efficiency depends on the sample size
  – A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
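CLARA's sampling idea can be sketched as follows. For brevity the best medoids within each sample are found exhaustively here, where the real algorithm would run PAM; the names are illustrative:

```python
import random
from itertools import combinations

def clara(points, k, dist, n_samples=5, sample_size=4, seed=0):
    """CLARA sketch: draw several random samples, find the best k medoids
    inside each sample, and keep the medoid set whose total distance over
    the WHOLE data set (not just the sample) is lowest."""
    rng = random.Random(seed)

    def cost(medoids):  # evaluated on all points, not just the sample
        return sum(min(dist(p, m) for m in medoids) for p in points)

    best = None
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = min(combinations(sample, k), key=cost)
        if best is None or cost(medoids) < cost(best):
            best = medoids
    return best

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
print(clara(pts, 2, manhattan))  # one medoid from each of the two groups
```

Scoring candidate medoid sets against the whole data set, while searching only within samples, is exactly the trade-off described above: efficiency depends on the sample size, and a biased sample can miss the true medoids entirely.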
Focusing Methods
• CLARANS assumes that all the objects to be clustered are stored in main memory
• The most computationally expensive step of CLARANS is calculating the total distances between the two clusters
• Reducing the number of objects considered:
  – Only the most central objects of a leaf node of the R*-tree are used to compute the medoids of the clusters
• Restricting the access:
  – Focus on relevant clusters
  – Focus on a cluster
BIRCH (Balanced Iterative Reducing and Clustering)
• An incremental method that can adjust its memory requirements to the amount of memory available
• Clustering Features (CF)
  – Summarize information about subclusters of points instead of storing all the points
• CF trees
  – Branching factor B and threshold T
    • By changing the threshold value we can change the size of the tree
  – An arbitrary clustering algorithm is then used to cluster the leaf nodes of the CF-tree
Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)
  N:  number of data points
  LS: Σ_{i=1}^{N} X_i   (linear sum of the N data points)
  SS: Σ_{i=1}^{N} X_i²  (square sum of the N data points)

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
  CF = (5, (16, 30), (54, 190))
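A sketch of computing and merging clustering features (the function names are my own; the example reproduces the slide's five points):

```python
def clustering_feature(points):
    """CF = (N, LS, SS) for d-dimensional points: the count, the
    per-dimension linear sum, and the per-dimension square sum."""
    dims = range(len(points[0]))
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in dims)
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)
    return n, ls, ss

def merge_cf(cf1, cf2):
    """CF additivity: the CF of the union of two disjoint subclusters is
    the component-wise sum, which lets BIRCH grow the tree incrementally
    without revisiting the raw points."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))  # (5, (16, 30), (54, 190)), as on the slide
```

From (N, LS, SS) alone one can recover a subcluster's centroid and radius, which is all the CF-tree needs to decide where an incoming point belongs.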
CF Tree (branching factor B = 7, leaf capacity L = 6)

[Diagram: the root holds entries CF1..CF6, each with a child pointer; non-leaf nodes hold entries CF1..CF5 with child pointers; leaf nodes hold up to L CF entries and are chained together by prev/next pointers.]
Data Generalization, Summarization, and Characterization
• Data generalization: a process which abstracts a large set of relevant data in a database from a low concept level to relatively high ones
• Approaches:
  1. Data cube approach
  2. Attribute-oriented induction approach
Data Cube Approach
• Multidimensional databases, OLAP, ...
• The general idea of the approach is to materialize certain expensive computations that are frequently requested
  – Such as count, sum, average, max, min, ...
  – Fast response time and flexible views of the data from different angles at different abstraction levels
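The materialization idea can be sketched as precomputing one aggregate table per group-by (a toy full cube over SUM; the sales records are illustrative, not from the slides):

```python
from itertools import combinations

def materialize_cube(rows, dims, measure):
    """Precompute SUM for every group-by over every subset of the
    dimensions, so queries at any abstraction level become cheap
    dictionary lookups instead of scans of the raw data."""
    cube = {}
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            agg = cube.setdefault(group, {})
            for row in rows:
                key = tuple(row[d] for d in group)
                agg[key] = agg.get(key, 0) + row[measure]
    return cube

sales = [{"region": "north", "item": "pc", "amount": 3},
         {"region": "north", "item": "tv", "amount": 2},
         {"region": "south", "item": "pc", "amount": 5}]
cube = materialize_cube(sales, ("region", "item"), "amount")
print(cube[("region",)][("north",)])  # 5
print(cube[()][()])                   # 10: the fully generalized total
```

The empty group-by corresponds to the most generalized view; each added dimension drops one abstraction level, which is the flexibility the slide describes.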
Attribute-Oriented Induction Approach
• Essential background knowledge: the concept hierarchy
• Steps:
  – Retrieve the initial relation
  – Attribute removal
  – Concept-tree climbing
  – Vote propagation
  – Threshold control
  – Rule transformation
Concept Hierarchy and Concept Tree
• The concept hierarchy must be defined before induction. The most general concept is represented by "ANY" or "ALL", and the most specific concepts correspond to the particular values of the attribute in the database. For example, the concept hierarchy of the attribute Birth place can be represented as follows.
Example
• Suppose we want to find the characteristic rule of graduate students:
Example
• The Concept Hierarchy Table of the attributes
Example
• Select from the database the records whose Status attribute is Graduate. At the same time, a "Vote" column is added to every record of the table to count, during induction, the number of original records matching that tuple.
Example: Attribute Removal
• Remove every attribute that has no higher-level concept in the concept hierarchy.
Example: Concept-Tree Climbing and Vote Propagation
• If an attribute value has a higher-level concept in the concept hierarchy, it is replaced by that higher-level value. In this example history, physics, math, ... are replaced by science, ...
• If identical tuples appear after the attribute values climb, merge them into a single tuple and accumulate their votes into the generalized tuple.
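One climbing-plus-propagation step can be sketched as follows (the hierarchy, attribute names, and vote counts here are illustrative, not the slide's full example):

```python
def generalize(rows, hierarchy, attr):
    """One concept-tree climbing step: replace each value of `attr` with
    its parent in the concept hierarchy, then merge tuples that became
    identical and accumulate their votes."""
    merged = {}
    for row in rows:
        row = dict(row)
        row[attr] = hierarchy.get(row[attr], row[attr])
        vote = row.pop("Vote")
        key = tuple(sorted(row.items()))
        merged[key] = merged.get(key, 0) + vote
    return [dict(key, Vote=v) for key, v in merged.items()]

# history, physics, math climb to their parent concept "science"
hierarchy = {"history": "science", "physics": "science", "math": "science"}
rows = [{"Major": "history", "Vote": 20},
        {"Major": "physics", "Vote": 15},
        {"Major": "math", "Vote": 10}]
print(generalize(rows, hierarchy, "Major"))  # [{'Major': 'science', 'Vote': 45}]
```

The Vote total is preserved across the step, which is what lets threshold control later judge whether a generalized tuple covers enough of the original data.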
Example: Concept-Tree Climbing and Vote Propagation
Example: Threshold Control and Rule Transformation
• Induction completed
• Threshold control