Data Mining: A Database Perspective, presented by YC Liu
Posted on 19-Dec-2015
Data Mining: A Database Perspective
Presented by YC Liu
References
• Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Chapter 6.
• M.-S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview from a Database Perspective", IEEE Transactions on Knowledge and Data Engineering, 8(6):866-883, 1996.
• J. Liu, Y. Pan, K. Wang, and J. Han, "Mining Frequent Item Sets by Opportunistic Projection", in Proc. 2002 Int. Conf. on Knowledge Discovery and Data Mining (KDD'02), Edmonton, Canada, July 2002.
Outline
• Introduction
• Mining Association Rules
• Multilevel Data Generalization, Summarization, and Characterization
• Data Classification
• Clustering Analysis
• (Pattern-Based Similarity Search)
• (Mining Path Traversal Patterns)
• (Recommendation)
• (Web Mining)
• (Text Mining)
Introduction (1/5)
• Knowledge Discovery in Databases: a process of nontrivial extraction of implicit, previously unknown, and potentially useful information.
Introduction (2/5)
• Main benefits
  – Mining knowledge from databases
  – Understanding user behavior
  – Helping enterprises make decisions
  – Increasing business opportunities
• Why did data mining arise?
  – The widespread use of product bar codes
  – The computerization of businesses
  – Millions of databases in active use
  – Years of accumulated business transaction data

[Diagram: Data → Knowledge]
Introduction (3/5)
Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process.

[Pipeline: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Introduction (4/5)
Challenges of Data Mining (1/2)
• Handling of different types of data
• Efficiency and scalability of data mining algorithms
• Usefulness, certainty, and expressiveness of data mining results
• Expression of various kinds of data mining requests and results
Introduction (5/5)
Challenges of Data Mining (2/2)
• Interactive mining of knowledge at multiple abstraction levels
• Mining information from different sources of data
• Protection of privacy and data security
An Overview of Data Mining Techniques
• Classifying data mining techniques
  – What kinds of databases to work on
    • Relational databases, transaction databases, spatial databases, temporal databases, ...
  – What kinds of knowledge to be mined
    • Association rules, classification, clustering, ...
  – What kinds of techniques to be utilized
    • Generalization-based mining, pattern-based mining, mining based on statistics or mathematical theories
Mining Different Kinds of Knowledge from Databases
– Association rules
– Data generalization, summarization, and characterization
– Data classification
– Data clustering
– Pattern-based similarity search
– Path traversal patterns
– Recommendation
– Web mining
– Text mining
Mining Association Rules
• An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅ (I being the set of all items).
• The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y.
• The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y.
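These two measures can be computed directly from a transaction list. A minimal sketch (the function names and the data layout are my own, not from the slides):

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, x, y):
    """c for the rule X => Y: support(X u Y) divided by support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# The four-transaction example used in these slides
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support(transactions, {"A", "C"}))       # 0.5
print(confidence(transactions, {"A"}, {"C"}))  # 0.666...
```

Support is a plain frequency over the whole database, while confidence is the conditional frequency restricted to transactions that already contain the rule body.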
What Is Association Mining?
• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
  – Cross-marketing and attached-mailing applications; other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns.
• Examples:
  – Rule form: "Body ⇒ Head [support, confidence]"
  – buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
  – major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
Association Rule: Basic Concepts
• Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in one visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
  – E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
  – * ⇒ Maintenance Agreement (what the store should do to boost Maintenance Agreement sales)
  – Home Electronics ⇒ * (what other products should the store stock up on?)
Rule Measures: Support and Confidence
• Find all rules X & Y ⇒ Z with minimum confidence and support
  – support s: the probability that a transaction contains {X ∪ Y ∪ Z}
  – confidence c: the conditional probability that a transaction containing {X ∪ Y} also contains Z

  Transaction ID  Items Bought
  2000            A, B, C
  1000            A, C
  4000            A, D
  5000            B, E, F

With minimum support 50% and minimum confidence 50%, we have:
  – A ⇒ C (50%, 66.6%)
  – C ⇒ A (50%, 100%)
[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]
Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (based on the types of values handled)
  – buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  – age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
• Single-dimensional vs. multidimensional associations
  – age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
• Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?
• Various extensions
  – Correlation, causality analysis
    • Association does not necessarily imply correlation or causality
  – Maxpatterns and closed itemsets
  – Constraints enforced
    • E.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?
Mining Association Rules: An Example

  Transaction ID  Items Bought
  2000            A, B, C
  1000            A, C
  4000            A, D
  5000            B, E, F

Min. support 50%, min. confidence 50%

  Frequent Itemset  Support
  {A}               75%
  {B}               50%
  {C}               50%
  {A, C}            50%

For the rule A ⇒ C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.
Mining Association Rules
• Steps for mining association rules:
  – Discover all large itemsets
  – Use the large itemsets to generate the association rules for the database
• To identify the large itemsets: the Apriori algorithm
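The level-wise search named above can be sketched as follows. This is a minimal illustration, not the paper's exact pseudocode; the function and variable names are my own:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all large (frequent) itemsets as {frozenset: support}.

    Level-wise Apriori sketch: build length-k candidates from the large
    (k-1)-itemsets, prune any candidate with an infrequent subset
    (the Apriori principle), then count supports against the database."""
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    large = {}
    for i in items:  # L1: large 1-itemsets
        c = frozenset([i])
        if (s := supp(c)) >= min_support:
            large[c] = s
    result = dict(large)
    k = 2
    while large:
        prev = set(large)
        # join: union pairs of large (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune: every (k-1)-subset of a candidate must itself be large
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        large = {c: s for c in candidates if (s := supp(c)) >= min_support}
        result.update(large)
        k += 1
    return result

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(transactions, 0.5))
# contains {A}: 0.75, {B}: 0.5, {C}: 0.5, {A, C}: 0.5
```

On the slides' four-transaction database this reproduces exactly the frequent-itemset table shown earlier: {A, C} survives because both of its subsets are large, while {A, B} and {B, C} fail the support count.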
Mining Generalized and Multi-level Association Rules
• Interesting associations among data items often occur at a relatively high concept level
Interestingness of Discovered Association Rules
• Example 1 (Aggarwal & Yu, PODS'98)
  – Among 5000 students:
    • 3000 play basketball
    • 3750 eat cereal
    • 2000 both play basketball and eat cereal
  – play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
  – play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

               basketball  not basketball  sum (row)
  cereal       2000        1750            3750
  not cereal   1000        250             1250
  sum (col.)   3000        2000            5000
Interestingness of Discovered Association Rules
• An association rule A ⇒ B is interesting if its confidence exceeds a certain measure, i.e., if

    P(A ∪ B) / P(A) − P(B) > d

  where d is a suitable constant.
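The left-hand side of the measure (confidence minus the expected confidence P(B)) can be evaluated directly. A sketch on the basketball/cereal counts above; the data layout as one pseudo-record per student is my own:

```python
def interest(transactions, a, b):
    """P(A and B)/P(A) - P(B): positive values suggest A raises the
    chance of B; near-zero or negative values flag misleading rules."""
    n = len(transactions)
    p_a = sum(1 for t in transactions if a <= t) / n
    p_b = sum(1 for t in transactions if b <= t) / n
    p_ab = sum(1 for t in transactions if (a | b) <= t) / n
    return p_ab / p_a - p_b

# The 5000-student example, one record per cell of the contingency table
students = ([{"basketball", "cereal"}] * 2000 +
            [{"cereal"}] * 1750 +
            [{"basketball"}] * 1000 +
            [set()] * 250)
print(interest(students, {"basketball"}, {"cereal"}))
# 2000/3000 - 3750/5000 = -0.0833...: the rule is misleading
```

The negative value confirms the slide's point: playing basketball slightly lowers, rather than raises, the probability of eating cereal, despite the rule's 66.7% confidence.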
Improving the Efficiency of Mining Association Rules
• Database scan reduction
  – FP-tree, ...
• Sampling
• Incremental updating of discovered association rules
• Parallel data mining
Classification
• A process of learning a function that maps a data item into one of several predefined classes.
• Every classification method based on inductive-learning algorithms takes as input a set of samples, each consisting of a vector of attribute values and a corresponding class.
• Predicts categorical class labels
• Classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Classification Process (1): Model Construction

Training Data:

  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

Classification algorithms applied to the training data yield the classifier (model):

  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction

Testing Data:

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
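Applying the learned rule to the testing data and the unseen tuple can be sketched in a few lines (a toy check; the function name is my own):

```python
def tenured(rank, years):
    """The classifier learned on the previous slide:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Scoring the rule against the testing data above: it gets three right
# and misclassifies Merlisa (Associate Prof, 7 years, not tenured).
testing = [("Tom", "Assistant Prof", 2, "no"),
           ("Merlisa", "Associate Prof", 7, "no"),
           ("George", "Professor", 5, "yes"),
           ("Joseph", "Assistant Prof", 7, "yes")]
correct = sum(tenured(r, y) == label for _, r, y, label in testing)
print(f"{correct}/{len(testing)} correct on the testing data")  # 3/4
print(tenured("Professor", 4))  # the unseen tuple (Jeff, Professor, 4) -> yes
```

This illustrates the two-phase process on the slides: the model is built once from training data, then evaluated on held-out testing data before being used on genuinely unseen records.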
Data Classification
• Decision-tree-based classification method
  – Decision Tree Learning System, ID3
  – Evaluation functions
    • Information Gain, based on the entropy  I(T) = −Σ_i p_i ln(p_i)
    • Gini Index  gini(T) = 1 − Σ_{j=1}^{n} p_j²
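Both evaluation functions are one-liners over a node's class distribution. A sketch, using ln as in the formula above (ID3 traditionally uses log base 2, which only rescales the entropy):

```python
from math import log

def entropy(probs):
    """I(T) = -sum_i p_i ln(p_i); 0 for a pure node."""
    return -sum(p * log(p) for p in probs if p > 0)

def gini(probs):
    """gini(T) = 1 - sum_j p_j^2; 0 for a pure node, largest when uniform."""
    return 1 - sum(p * p for p in probs)

# Class distribution of the 14-record buys_computer data: 9 yes, 5 no
probs = [9 / 14, 5 / 14]
print(entropy(probs))  # ~0.652
print(gini(probs))     # ~0.459
```

A split's score is the drop from the parent's impurity to the weighted impurity of its children; the attribute with the largest drop is chosen as the test.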
Training Dataset

  age     income  student  credit_rating  buys_computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31..40  high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31..40  low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31..40  medium  no       excellent      yes
  31..40  high    yes      fair           yes
  >40     medium  no       excellent      no

This follows an example from Quinlan's ID3.
Output: A Decision Tree for "buys_computer"

  age?
  ├─ <=30   → student?        (no → no;  yes → yes)
  ├─ 31..40 → yes
  └─ >40    → credit rating?  (excellent → no;  fair → yes)
Performance Improvement
• Database indices
• Attribute-oriented induction
• Two-phase multiattribute extraction
  – Inference power
  – Feature extraction phase
  – Feature combination phase
Clustering Analysis
• Clustering: the process of grouping physical or abstract objects into classes of similar objects.
• Clustering analysis: constructing a meaningful partitioning of a large set of objects based on a "divide and conquer" methodology.
• Methods:
  – Statistical analysis (Bayesian classification method)
  – Probability analysis
Clustering Based on Randomized Search
• PAM (Partitioning Around Medoids)
• CLARA (Clustering LARge Applications)
• CLARANS (Clustering Large Applications based upon RANdomized Search)
PAM (Partitioning Around Medoids) (1987)
• PAM (Kaufman and Rousseeuw, 1987); built into S-Plus
• Uses real objects to represent the clusters:
  1. Select k representative objects arbitrarily
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
  3. For each pair of i and h:
     • If TC_ih < 0, i is replaced by h
     • Then assign each non-selected object to the most similar representative object
  4. Repeat steps 2-3 until there is no change
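The steps above can be sketched as a swap loop. For brevity this sketch accepts the first cost-lowering swap rather than evaluating every TC_ih and taking the best one, as full PAM does; the names are illustrative:

```python
import random

def pam(points, k, dist, seed=0):
    """PAM sketch: start from k arbitrary medoids and keep replacing a
    medoid i with a non-medoid h whenever the swap lowers the total
    distance (i.e. whenever the total swapping cost TC_ih is negative)."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)

    def total_cost(meds):
        return sum(min(dist(p, m) for m in meds) for p in points)

    improved = True
    while improved:
        improved = False
        for i in list(medoids):
            for h in points:
                if h in medoids:
                    continue
                trial = [h if m == i else m for m in medoids]
                if total_cost(trial) < total_cost(medoids):  # TC_ih < 0
                    medoids = trial
                    improved = True
    # assign each non-selected object to its most similar medoid
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
print(pam(pts, 2, manhattan))  # two clusters: three low points, three high
```

Each accepted swap strictly lowers the total distance, so the loop terminates at a local optimum; on the toy data any local optimum separates the two point groups.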
PAM Clustering: total swapping cost TC_ih = Σ_j C_jih

[Four diagrams illustrate the cost contribution C_jih of a non-selected object j when medoid i is swapped for h; t denotes another current medoid:]
  • j is assigned to another medoid t and stays there: C_jih = 0
  • j is reassigned from i to the new medoid h: C_jih = d(j, h) − d(j, i)
  • j is reassigned from i to another medoid t: C_jih = d(j, t) − d(j, i)
  • j is reassigned from t to the new medoid h: C_jih = d(j, h) − d(j, t)
CLARA (Clustering LARge Applications) (1990)
• CLARA (Kaufman and Rousseeuw, 1990)
  – Built into statistical analysis packages, such as S+
• Draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weaknesses:
  – Efficiency depends on the sample size
  – A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
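CLARA's sampling idea can be sketched as follows. For brevity the best medoids within each sample are found exhaustively here, where the real algorithm would run PAM; the names are illustrative:

```python
import random
from itertools import combinations

def clara(points, k, dist, n_samples=5, sample_size=4, seed=0):
    """CLARA sketch: draw several random samples, find the best k medoids
    inside each sample, and keep the medoid set whose total distance over
    the WHOLE data set (not just the sample) is lowest."""
    rng = random.Random(seed)

    def cost(medoids):  # evaluated on all points, not just the sample
        return sum(min(dist(p, m) for m in medoids) for p in points)

    best = None
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = min(combinations(sample, k), key=cost)
        if best is None or cost(medoids) < cost(best):
            best = medoids
    return best

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
print(clara(pts, 2, manhattan))  # one medoid from each of the two groups
```

Scoring candidate medoid sets against the whole data set, while searching only within samples, is exactly the trade-off described above: efficiency depends on the sample size, and a biased sample can miss the true medoids entirely.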
Focusing Methods
• CLARANS assumes that all the objects to be clustered are stored in main memory
• The most computationally expensive step of CLARANS is calculating the total distances between the two clusters
• Reducing the number of objects considered:
  – Only the most central objects of a leaf node of the R*-tree are used to compute the medoids of the clusters
• Restricting the access:
  – Focus on relevant clusters
  – Focus on a cluster
BIRCH (Balanced Iterative Reducing and Clustering)
• An incremental method that can adjust its memory requirements to the amount of memory available
• Clustering Features (CF)
  – Summarize information about subclusters of points instead of storing all the points
• CF trees
  – Branching factor B and threshold T
    • By changing the threshold value we can change the size of the tree
  – An arbitrary clustering algorithm is then used to cluster the leaf nodes of the CF-tree
Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)
  N:  number of data points
  LS: Σ_{i=1}^{N} X_i   (linear sum of the N data points)
  SS: Σ_{i=1}^{N} X_i²  (square sum of the N data points)

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
  CF = (5, (16, 30), (54, 190))
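A sketch of computing and merging clustering features (the function names are my own; the example reproduces the slide's five points):

```python
def clustering_feature(points):
    """CF = (N, LS, SS) for d-dimensional points: the count, the
    per-dimension linear sum, and the per-dimension square sum."""
    dims = range(len(points[0]))
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in dims)
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)
    return n, ls, ss

def merge_cf(cf1, cf2):
    """CF additivity: the CF of the union of two disjoint subclusters is
    the component-wise sum, which lets BIRCH grow the tree incrementally
    without revisiting the raw points."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))  # (5, (16, 30), (54, 190)), as on the slide
```

From (N, LS, SS) alone one can recover a subcluster's centroid and radius, which is all the CF-tree needs to decide where an incoming point belongs.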
CF Tree (branching factor B = 7, leaf capacity L = 6)

[Diagram: the root holds entries CF1..CF6, each with a child pointer; non-leaf nodes hold entries CF1..CF5 with child pointers; leaf nodes hold up to L CF entries and are chained together by prev/next pointers.]
Data Generalization, Summarization, and Characterization
• Data generalization: a process which abstracts a large set of relevant data in a database from a low concept level to relatively high ones
• Approaches:
  1. Data cube approach
  2. Attribute-oriented induction approach
Data Cube Approach
• Multidimensional databases, OLAP, ...
• The general idea of the approach is to materialize certain expensive computations that are frequently requested
  – Such as count, sum, average, max, min, ...
  – Fast response time and flexible views of the data from different angles at different abstraction levels
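The materialization idea can be sketched as precomputing one aggregate table per group-by (a toy full cube over SUM; the sales records are illustrative, not from the slides):

```python
from itertools import combinations

def materialize_cube(rows, dims, measure):
    """Precompute SUM for every group-by over every subset of the
    dimensions, so queries at any abstraction level become cheap
    dictionary lookups instead of scans of the raw data."""
    cube = {}
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            agg = cube.setdefault(group, {})
            for row in rows:
                key = tuple(row[d] for d in group)
                agg[key] = agg.get(key, 0) + row[measure]
    return cube

sales = [{"region": "north", "item": "pc", "amount": 3},
         {"region": "north", "item": "tv", "amount": 2},
         {"region": "south", "item": "pc", "amount": 5}]
cube = materialize_cube(sales, ("region", "item"), "amount")
print(cube[("region",)][("north",)])  # 5
print(cube[()][()])                   # 10: the fully generalized total
```

The empty group-by corresponds to the most generalized view; each added dimension drops one abstraction level, which is the flexibility the slide describes.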
Attribute-Oriented Induction Approach
• Essential background knowledge: the concept hierarchy
• Steps:
  – Retrieve the initial relation
  – Attribute removal
  – Concept-tree climbing
  – Vote propagation
  – Threshold control
  – Rule transformation
Concept Hierarchy and Concept Tree
• The concept hierarchy must be defined before induction. The most general concept is represented by "ANY" or "ALL", and the most specific concepts correspond to the particular values of the attribute in the database. For example, the concept hierarchy of the attribute Birth place can be represented as follows.
Example
• Suppose we want to find the characteristic rule of graduate students:
Example
• The Concept Hierarchy Table of the attributes
Example
• Select from the database the records whose Status attribute is Graduate. At the same time, a "Vote" column is added to every record of the table to count, during induction, the number of original records matching that tuple.
Example: Attribute Removal
• Remove every attribute that has no higher-level concept in the concept hierarchy.
Example: Concept-Tree Climbing and Vote Propagation
• If an attribute value has a higher-level concept in the concept hierarchy, it is replaced by that higher-level value. In this example history, physics, math, ... are replaced by science, ...
• If identical tuples appear after the attribute values climb, merge them into a single tuple and accumulate their votes into the generalized tuple.
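One climbing-plus-propagation step can be sketched as follows (the hierarchy, attribute names, and vote counts here are illustrative, not the slide's full example):

```python
def generalize(rows, hierarchy, attr):
    """One concept-tree climbing step: replace each value of `attr` with
    its parent in the concept hierarchy, then merge tuples that became
    identical and accumulate their votes."""
    merged = {}
    for row in rows:
        row = dict(row)
        row[attr] = hierarchy.get(row[attr], row[attr])
        vote = row.pop("Vote")
        key = tuple(sorted(row.items()))
        merged[key] = merged.get(key, 0) + vote
    return [dict(key, Vote=v) for key, v in merged.items()]

# history, physics, math climb to their parent concept "science"
hierarchy = {"history": "science", "physics": "science", "math": "science"}
rows = [{"Major": "history", "Vote": 20},
        {"Major": "physics", "Vote": 15},
        {"Major": "math", "Vote": 10}]
print(generalize(rows, hierarchy, "Major"))  # [{'Major': 'science', 'Vote': 45}]
```

The Vote total is preserved across the step, which is what lets threshold control later judge whether a generalized tuple covers enough of the original data.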
Example: Concept-Tree Climbing and Vote Propagation
Example: Threshold Control and Rule Transformation
• Induction completed
• Threshold control