Page 1
Data Mining
Page 2
Outline
• What is data mining?
• Data Mining Tasks
– Association
– Classification
– Clustering
• Data mining Algorithms
• Are all the patterns interesting?
Page 3
What is Data Mining:
• The huge number of databases and web pages makes manual information extraction next to impossible (remember the favored statement: I will bury them in data!)
• Many other disciplines (statistics, AI, information retrieval) lack scalable algorithms to extract information and/or rules from databases
• Necessity to find relationships among data
Page 4
What is Data Mining:
• Discovery of useful, possibly unexpected data patterns
• Subsidiary issues:
– Data cleansing
– Visualization
– Warehousing
Page 5
Examples
• A big objection to one such program was that it looked for so many vague connections that it was sure to find bogus ones and thus violate the privacy of innocents.
• The Rhine Paradox: a great example of how not to conduct scientific research.
Page 6
Rhine Paradox --- (1)
• David Rhine was a parapsychologist in the 1950’s who hypothesized that some people had Extra-Sensory Perception.
• He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue.
• He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!
Page 7
Rhine Paradox --- (2)
• He told these people they had ESP and called them in for another test of the same type.
• Alas, he discovered that almost all of them had lost their ESP.
• What did he conclude?– Answer on next slide.
Page 8
Rhine Paradox --- (3)
• He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
Page 9
A Concrete Example
• This example illustrates a problem with intelligence-gathering.
• Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
• We want to find people who at least twice have stayed at the same hotel on the same day.
Page 10
The Details
• 10^9 people being tracked.
• 1000 days.
• Each person stays in a hotel 1% of the time (10 days out of 1000).
• Hotels hold 100 people (so 10^5 hotels).
• If everyone behaves randomly (i.e., no evil-doers) will the data mining detect anything suspicious?
Page 11
Calculations --- (1)
• Probability that persons p and q will be at the same hotel on day d:
– 1/100 * 1/100 * 10^-5 = 10^-9.
• Probability that p and q will be at the same hotel on two given days:
– 10^-9 * 10^-9 = 10^-18.
• Pairs of days:
– 5*10^5.
Page 12
Calculations --- (2)
• Probability that p and q will be at the same hotel on some two days:
– 5*10^5 * 10^-18 = 5*10^-13.
• Pairs of people:
– 5*10^17.
• Expected number of suspicious pairs of people:
– 5*10^17 * 5*10^-13 = 250,000.
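The back-of-the-envelope numbers above can be checked directly; a minimal sketch using exact binomial pair counts instead of the slide's rounded 5*10^5 and 5*10^17:

```python
from math import comb

people = 10**9
days = 1000
hotels = 10**5

# both p and q are in *some* hotel (1% each) and it is the *same* hotel
p_same_hotel = (1 / 100) * (1 / 100) * (1 / hotels)   # 10^-9
p_two_given_days = p_same_hotel ** 2                  # 10^-18

pairs_of_days = comb(days, 2)       # about 5 * 10^5
pairs_of_people = comb(people, 2)   # about 5 * 10^17

expected_suspicious = pairs_of_people * pairs_of_days * p_two_given_days
print(round(expected_suspicious))   # roughly 250,000 (about 249,750 with exact pair counts)
```

The exact count comes out slightly below the slide's 250,000 only because the slide rounds the two pair counts.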
Page 13
Conclusion
• Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
• Analysts have to sift through 250,010 candidates to find the 10 real cases.
– Not gonna happen.
– But how can we improve the scheme?
Page 14
Appetizer
• Consider a file consisting of 24,471 records. The file contains at least two condition attributes, A and D.

A\D     0      1    total
0      9272   232    9504
1     14695   272   14967
total 23967   504   24471
Page 15
Appetizer (con’t)
• Probability that a person has A: P(A) = 0.6
• Probability that a person has D: P(D) = 0.02
• Conditional probability that a person has D given that it has A: P(D|A) = P(AD)/P(A) = (272/24471)/0.6 = 0.02
• P(A|D) = P(AD)/P(D) = 0.54
• What can we say about dependencies between A and D?

A\D     0      1    total
0      9272   232    9504
1     14695   272   14967
total 23967   504   24471
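The conditional probabilities can be recomputed from the table's exact counts (the slide rounds P(A) to 0.6); a small sketch:

```python
# counts from the A/D contingency table (the slide rounds P(A) to 0.6)
n_AD = 272      # A = 1 and D = 1
n_A = 14967     # A = 1 total
n_D = 504       # D = 1 total
n = 24471       # all records

p_A, p_D, p_AD = n_A / n, n_D / n, n_AD / n

p_D_given_A = p_AD / p_A    # about 0.018
p_A_given_D = p_AD / p_D    # about 0.54

# lift < 1 suggests A and D are, if anything, slightly negatively dependent
lift = p_AD / (p_A * p_D)
print(round(p_D_given_A, 3), round(p_A_given_D, 2), round(lift, 2))
```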
Page 16
Appetizer(3)
• So far we have not asked anything that statistics would not have asked. So is data mining just another word for statistics?
• We hope that the response will be a resounding NO.
• The major difference is that statistical methods work with random data samples, whereas the data in databases is not necessarily random.
• The second difference is the size of the data set.
• The third difference is that statistical samples do not contain “dirty” data.
Page 17
Architecture of a Typical Data Mining System
• Databases and data warehouse (fed through data cleaning, data integration, and filtering)
• Database or data warehouse server
• Data mining engine
• Pattern evaluation
• Knowledge base
• Graphical user interface
Page 18
Data Mining Tasks
• Association (correlation and causality)
– Multi-dimensional vs. single-dimensional association
– age(X, “20..29”) ^ income(X, “20..29K”) -> buys(X, “PC”) [support = 2%, confidence = 60%]
– contains(T, “computer”) -> contains(T, “software”) [1%, 75%]
– What is support? The percentage of tuples in the database with age between 20 and 29, income between 20K and 29K, and that buy a PC.
– What is confidence? The probability that if a person is between 20 and 29 with income between 20K and 29K, then that person buys a PC.
• Clustering: grouping data that are close together into the same cluster.
– What does “close together” mean?
Page 19
Distances between data
• Distance between data is a measure of dissimilarity between data: d(i,j) >= 0; d(i,j) = d(j,i); d(i,j) <= d(i,k) + d(k,j)
• Euclidean distance between <x1,x2,…,xk> and <y1,y2,…,yk>: sqrt((x1-y1)^2 + … + (xk-yk)^2)
• Standardize variables by finding the standard deviation of X and dividing each xi by it
• Covariance(X,Y) = (1/k) * Sum_i (xi - mean(X)) * (yi - mean(Y))
• Boolean variables and their distances
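A minimal sketch of the Euclidean distance, standardization, and covariance formulas above (the function names are ours, not from the slides):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two k-dimensional points."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def standardize(values):
    """Divide each value by the standard deviation of the variable."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [v / sd for v in values]

def covariance(x, y):
    """Covariance(X, Y) = (1/k) * sum (xi - mean(X)) * (yi - mean(Y))."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)

print(euclidean((1.0, 2.0, 3.0), (4.0, 6.0, 3.0)))  # sqrt(9 + 16 + 0) = 5.0
```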
Page 20
Data Mining Tasks
• Outlier analysis– Outlier: a data object that does not comply with the general behavior of
the data
– It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
Page 21
Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of patterns,
not all of them are interesting.
– Suggested approach: Human-centered, query-based, focused mining
• Interestingness measures: A pattern is interesting if it is easily
understood by humans, valid on new or test data with some degree
of certainty, potentially useful, novel, or validates some hypothesis
that a user seeks to confirm
• Objective vs. subjective interestingness measures:
– Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
– Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.
Page 22
Are All the “Discovered” Patterns Interesting? - Example
tea \ coffee     1     0   total
1               20     5     25
0               70     5     75
total           90    10    100

Conditional probability that if one buys coffee, one also buys tea is 20/90 = 2/9.
Conditional probability that if one buys tea she also buys coffee is 20/25 = 0.8.
However, the probability that she buys coffee is 0.9.
So, is it a significant inference that if a customer buys tea she also buys coffee?
Are buying tea and buying coffee independent activities?
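The independence question can be settled with the lift P(tea ∧ coffee) / (P(tea) P(coffee)); a small check using the slide's counts:

```python
n = 100                          # total customers
tea, coffee, both = 25, 90, 20   # counts from the contingency table

conf_tea_to_coffee = both / tea            # P(coffee | tea) = 0.8
p_coffee = coffee / n                      # 0.9

# lift: P(tea AND coffee) / (P(tea) * P(coffee)); exactly 1 would mean independence
lift = (both / n) / ((tea / n) * (coffee / n))
print(conf_tea_to_coffee, round(lift, 3))  # 0.8 0.889
```

Confidence 0.8 is below the base rate 0.9 and the lift is below 1, so buying tea slightly lowers the chance of buying coffee: the 80% confidence is not a significant inference.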
Page 23
How to measure Interestingness
• RI = |X ∧ Y| - |X||Y|/N
• Support and confidence of X -> Y: |X ∧ Y|/N is the support and |X ∧ Y|/|X| is the confidence
• Chi^2: (|XY| - E(|XY|))^2 / E(|XY|)
• J(X->Y) = P(Y) * ( P(X|Y)*log(P(X|Y)/P(X)) + (1-P(X|Y))*log((1-P(X|Y))/(1-P(X))) )
• Sufficiency(X->Y) = P(X|Y)/P(X|!Y); Necessity(X->Y) = P(!X|Y)/P(!X|!Y)
• Interestingness of X->Y: NC = 1 - N(X->Y)*P(Y), if N(X->Y) is less than 1, or 0 otherwise
Page 24
Can We Find All and Only Interesting Patterns?
• Find all the interesting patterns: Completeness– Can a data mining system find all the interesting patterns?
– Association vs. classification vs. clustering
• Search for only interesting patterns: Optimization– Can a data mining system find only the interesting patterns?
– Approaches
• First generate all the patterns and then filter out the uninteresting ones.
• Generate only the interesting patterns—mining query optimization
Page 25
Clustering
• Partition data set into clusters, and one can store cluster
representation only
• Can be very effective if data is clustered but not if data
is “smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms.
Page 26
Example: Clusters
[Figure: scatter plot of points forming several dense clusters, with a few isolated points marked as outliers]
Page 27
Sampling
• Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
• Choose a representative subset of the data– Simple random sampling may have very poor performance in the
presence of skew
• Develop adaptive sampling methods– Stratified sampling:
• Approximate the percentage of each class (or subpopulation of interest) in the overall database
• Used in conjunction with skewed data
• Sampling may not reduce database I/Os (page at a time).
Page 28
Sampling
• Raw data can be sampled as SRSWOR (simple random sample without replacement) or SRSWR (simple random sample with replacement).
Page 29
Sampling
[Figure: raw data on the left reduced to a cluster/stratified sample on the right]
Page 30
Discretization
• Three types of attributes:
– Nominal — values from an unordered set
– Ordinal — values from an ordered set
– Continuous — real numbers
• Discretization: divide the range of a continuous attribute into intervals
– Some classification algorithms only accept categorical attributes.
– Reduce data size by discretization
– Prepare for further analysis
Page 31
Discretization
• Discretization – reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Page 32
Discretization
1. Sort the attribute values
2. Select a cut point
3. Evaluate the measure
4. If the measure is satisfied, stop (done); otherwise split/merge and return to step 2
Page 33
Discretization
• Dynamic vs Static
• Local vs Global
• Top-Down vs Bottom-Up
• Direct vs Incremental
Page 34
Discretization – Quality Evaluation
• Total number of Intervals
• The Number of Inconsistencies
• Predictive Accuracy
• Complexity
Page 35
Discretization - Binning
• Equal-width — the range between the min and max values is split into intervals of equal width
• Equal-frequency — each bin contains approximately the same number of data points
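Both binning strategies are easy to sketch; the function names and sample data below are illustrative, not from the slides:

```python
def equal_width_bins(data, k):
    """Split [min, max] into k equal-width intervals; return a bin index per value."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in data]

def equal_frequency_bins(data, k):
    """Give each bin approximately len(data)/k values (by rank after sorting)."""
    ranked = sorted(range(len(data)), key=lambda i: data[i])
    bins = [0] * len(data)
    for rank, i in enumerate(ranked):
        bins[i] = min(int(rank / (len(data) / k)), k - 1)
    return bins

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(data, 3))        # [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(equal_frequency_bins(data, 3))    # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Note how the skewed values land: equal width puts only two values in the first bin, while equal frequency forces three per bin.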
Page 36
Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
E(S, T) = (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2)
• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
• The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., Ent(S) - E(T, S) > δ
• Experiments show that it may reduce data size and improve classification accuracy
Page 37
Data Mining Primitives, Languages, and System Architectures
• Data mining primitives: What defines a data
mining task?
• A data mining query language
• Design graphical user interfaces based on a
data mining query language
• Architecture of data mining systems
Page 38
Why Data Mining Primitives and Languages?
• Data mining should be an interactive process – User directs what to be mined
• Users must be provided with a set of primitives to be used to communicate with the data mining system
• Incorporating these primitives in a data mining query language– More flexible user interaction – Foundation for design of graphical user interface– Standardization of data mining industry and practice
Page 39
What Defines a Data Mining Task ?
• Task-relevant data
• Type of knowledge to be mined
• Background knowledge
• Pattern interestingness measurements
• Visualization of discovered patterns
Page 40
Task-Relevant Data (Minable View)
• Database or data warehouse name
• Database tables or data warehouse cubes
• Condition for data selection
• Relevant attributes or dimensions
• Data grouping criteria
Page 41
Types of knowledge to be mined
• Characterization
• Discrimination
• Association
• Classification/prediction
• Clustering
• Outlier analysis
• Other data mining tasks
Page 42
A Data Mining Query Language (DMQL)
• Motivation
– A DMQL can provide the ability to support ad-hoc and interactive
data mining
– By providing a standardized language like SQL
• Hope to achieve an effect similar to that of SQL on relational databases
• Foundation for system development and evolution
• Facilitate information exchange, technology transfer,
commercialization and wide acceptance
• Design
– DMQL is designed with the primitives described earlier
Page 43
Syntax for DMQL
• Syntax for specification of
– task-relevant data
– the kind of knowledge to be mined
– concept hierarchy specification
– interestingness measure
– pattern presentation and visualization
• Putting it all together — a DMQL query
Page 44
Syntax for task-relevant data specification
• use database database_name, or use data
warehouse data_warehouse_name
• from relation(s)/cube(s) [where condition]
• in relevance to att_or_dim_list
• order by order_list
• group by grouping_list
• having condition
Page 45
Specification of task-relevant data
Page 46
Syntax for specifying the kind of knowledge to be mined
• Characterization
Mine_Knowledge_Specification ::=
mine characteristics [as pattern_name]
analyze measure(s)
• Discrimination
Mine_Knowledge_Specification ::=
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
• Association
Mine_Knowledge_Specification ::=
mine associations [as pattern_name]
Page 47
Syntax for specifying the kind of knowledge to be mined (cont.)
• Classification
Mine_Knowledge_Specification ::=
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
• Prediction
Mine_Knowledge_Specification ::=
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i = value_i}}
Page 48
Syntax for concept hierarchy specification
• To specify what concept hierarchies to use:
use hierarchy <hierarchy> for <attribute_or_dimension>
• We use different syntax to define different types of hierarchies
– schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]
– set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
Page 49
Syntax for concept hierarchy specification (Cont.)
– operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
– rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost) < $50
level_1: medium-profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) <= $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
Page 50
Syntax for interestingness measure specification
• Interestingness measures and thresholds can be
specified by the user with the statement:
with <interest_measure_name> threshold = threshold_value
• Example:with support threshold = 0.05
with confidence threshold = 0.7
Page 51
Syntax for pattern presentation and visualization specification
• We have syntax which allows users to specify the display of discovered patterns in one or more forms
display as <result_form>
• To facilitate interactive viewing at different concept level, the following syntax is defined:
Multilevel_Manipulation ::= roll up on attribute_or_dimension
| drill down on attribute_or_dimension
| add attribute_or_dimension
| drop attribute_or_dimension
Page 52
Putting it all together: the full specification of a DMQL query
use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchases P, items_sold S, works_at W, branch B
where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID
and P.cust_ID = C.cust_ID and P.method_paid = ``AmEx''
and P.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID
and B.address = ``Canada" and I.price >= 100
with noise threshold = 0.05
display as table
Page 53
DMQL and SQL
• DMQL: Describe general characteristics of graduate students in the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in “graduate”
• Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD”}
Page 54
Decision Trees
Example:
• Conducted a survey to see which customers were interested in a new model car
• Want to select customers for an advertising campaign

Training set:
custId  car     age  city  newCar
c1      taurus  27   sf    yes
c2      van     35   la    yes
c3      van     40   sf    yes
c4      taurus  22   sf    yes
c5      merc    50   la    no
c6      taurus  25   la    no
Page 55
One Possibility
age < 30?
  Y: city = sf?  Y -> likely,  N -> unlikely
  N: car = van?  Y -> likely,  N -> unlikely
Page 56
Another Possibility
car = taurus?
  Y: city = sf?  Y -> likely,  N -> unlikely
  N: age < 45?   Y -> likely,  N -> unlikely
Page 57
Issues
• A decision tree cannot be “too deep”: there would not be statistically significant amounts of data for the lower decisions
• Need to select tree that most reliably predicts outcomes
Page 58
Top-Down Induction of Decision Tree
Attributes = {Outlook, Temperature, Humidity, Wind}
Outlook?
  sunny    -> Humidity?  high -> no,    normal -> yes
  overcast -> yes
  rain     -> Wind?      strong -> no,  weak -> yes
PlayTennis = {yes, no}
Page 59
Entropy and Information Gain
• S contains si tuples of class Ci for i = {1, …, m}
• Information measure required to classify any arbitrary tuple:
I(s1, s2, …, sm) = - Sum_{i=1..m} (si/s) log2(si/s)
• Entropy of attribute A with values {a1, a2, …, av}:
E(A) = Sum_{j=1..v} ((s1j + … + smj)/s) I(s1j, …, smj)
• Information gained by branching on attribute A:
Gain(A) = I(s1, s2, …, sm) - E(A)
Page 60
Example: Analytical Characterization
• Task– Mine general characteristics describing graduate students
using analytical characterization
• Given– attributes name, gender, major, birth_place, birth_date,
phone#, and gpa– Gen(ai) = concept hierarchies on ai
– Ui = attribute analytical thresholds for ai
– Ti = attribute generalization thresholds for ai
– R = attribute relevance threshold
Page 61
Example: Analytical Characterization (cont’d)
• 1. Data collection– target class: graduate student– contrasting class: undergraduate student
• 2. Analytical generalization using Ui
– attribute removal• remove name and phone#
– attribute generalization• generalize major, birth_place, birth_date and gpa• accumulate counts
– candidate relation: gender, major, birth_country, age_range and gpa
Page 62
Example: Analytical characterization (3)
• 3. Relevance analysis– Calculate expected info required to classify an arbitrary tuple
– Calculate entropy of each attribute: e.g. major
I(s1, s2) = I(120, 130) = -(120/250) log2(120/250) - (130/250) log2(130/250) = 0.9988
For major=”Science”: S11=84 S21=42 I(s11,s21)=0.9183
For major=”Engineering”: S12=36 S22=46 I(s12,s22)=0.9892
For major=”Business”: S13=0 S23=42 I(s13,s23)=0
(S11 = number of grad students in “Science”; S21 = number of undergrad students in “Science”)
Page 63
Example: Analytical Characterization (4)
• Calculate expected info required to classify a given sample if S is partitioned according to the attribute
• Calculate information gain for each attribute
– Information gain for all attributes
E(major) = (126/250) I(s11, s21) + (82/250) I(s12, s22) + (42/250) I(s13, s23) = 0.7873

Gain(major) = I(s1, s2) - E(major) = 0.2115
Gain(gender) = 0.0003
Gain(birth_country) = 0.0407
Gain(major) = 0.2115
Gain(gpa) = 0.4490
Gain(age_range) = 0.5971
Page 64
Example: Analytical characterization (5)
• 4. Initial working relation (W0) derivation
– R = 0.1
– remove irrelevant/weakly relevant attributes from candidate relation => drop gender, birth_country
– remove contrasting class candidate relation
• 5. Perform attribute-oriented induction on W0 using Ti
major age_range gpa count
Science 20-25 Very_good 16
Science 25-30 Excellent 47
Science 20-25 Excellent 21
Engineering 20-25 Excellent 18
Engineering 25-30 Excellent 18
Initial target class working relation W0: Graduate students
Page 65
What Is Association Mining?
• Association rule mining:– Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:– Basket data analysis, cross-marketing, catalog design, loss-
leader analysis, clustering, classification, etc.
• Examples.
– Rule form: “Body -> Head [support, confidence]”.
– buys(x, “diapers”) -> buys(x, “beers”) [0.5%, 60%]
– major(x, “CS”) ^ takes(x, “DB”) -> grade(x, “A”) [1%, 75%]
Page 66
Association Rule Mining
Sales records (market-basket data):

transaction id  customer id  products bought
tran1           cust33       p2, p5, p8
tran2           cust45       p5, p8, p11
tran3           cust12       p1, p9
tran4           cust40       p5, p8, p11
tran5           cust12       p2, p9
tran6           cust12       p9

• Trend: Products p5, p8 often bought together
• Trend: Customer 12 likes product p9
Page 67
Association Rule
• Itemset: {p1, p3, p8}
• Support: number of baskets where all these products appear
• High-support set: one whose support exceeds threshold s
• Problem: find all high-support sets
Page 68
Association Rule: Basic Concepts
• Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items– E.g., 98% of people who purchase tires and auto accessories also
get automotive services done
• Applications
– * Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
– Home Electronics * (What other products should the store stock up on?)
– Attached mailing in direct marketing
– Detecting “ping-pong”ing of patients, faulty “collisions”
Page 69
Rule Measures: Support and Confidence
• Find all the rules X & Y -> Z with minimum confidence and support
– support, s: probability that a transaction contains {X, Y, Z}
– confidence, c: conditional probability that a transaction having {X, Y} also contains Z

TID   Items Bought
2000  A, B, C
1000  A, C
4000  A, D
5000  B, E, F

With minimum support 50% and minimum confidence 50%, we have
– A -> C (50%, 66.6%)
– C -> A (50%, 100%)

[Venn diagram: customers buying beer, customers buying diapers, and customers buying both]
Page 70
Mining Association Rules—An Example
Min. support 50%, min. confidence 50%.

TID   Items Bought
2000  A, B, C
1000  A, C
4000  A, D
5000  B, E, F

For rule A -> C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A,C}             50%

The Apriori principle: any subset of a frequent itemset must be frequent.
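The support and confidence figures for A -> C can be verified directly from the four transactions (the helper names are ours):

```python
# the four transactions from the slide
D = {2000: {"A", "B", "C"}, 1000: {"A", "C"}, 4000: {"A", "D"}, 5000: {"B", "E", "F"}}

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in D.values()) / len(D)

def confidence(lhs, rhs):
    """confidence(lhs -> rhs) = support(lhs | rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))                   # 0.5  (50%)
print(round(confidence({"A"}, {"C"}), 3))    # 0.667 (the slide writes 66.6%)
```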
Page 71
Mining Frequent Itemsets: the Key Step
• Find the frequent itemsets: the sets of items that have
minimum support– A subset of a frequent itemset must also be a frequent
itemset
• i.e., if {AB} is a frequent itemset, both {A} and {B} should be a
frequent itemset
– Iteratively find frequent itemsets with cardinality from 1 to k (k-
itemset)
• Use the frequent itemsets to generate association
rules.
Page 72
The Apriori Algorithm
• Join Step: Ck is generated by joining Lk-1with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
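A compact Python rendering of the pseudo-code above; it treats min_support as an absolute count, which is an assumption of this sketch:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Apriori sketch; min_support is an absolute count (our assumption)."""
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_support}
    frequent = set(Lk)
    k = 2
    while Lk:
        # join step: unions of L(k-1) members that give k-itemsets
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # prune step: every (k-1)-subset must itself be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # scan the database, keep candidates with enough support
        Lk = {c for c in Ck if sum(c <= t for t in transactions) >= min_support}
        frequent |= Lk
        k += 1
    return frequent

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(T, min_support=2)
print(sorted(map(sorted, result)))
```

On the slide's four-transaction database this finds the nine frequent itemsets {1}, {2}, {3}, {5}, {1 3}, {2 3}, {2 5}, {3 5}, and {2 3 5}.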
Page 73
The Apriori Algorithm — Example
Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D -> C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1 (min support 2): {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D -> counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}
Scan D -> {2 3 5}:2
L3: {2 3 5}:2
Page 74
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruningforall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
Page 75
How to Count Supports of Candidates?
• Why is counting supports of candidates a problem?
– The total number of candidates can be very huge
– One transaction may contain many candidates
• Method:– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a transaction
Page 76
Example of Generating Candidates
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace
• Pruning:
– acde is removed because ade is not in L3
• C4={abcd}
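The join and prune steps can be checked against this example (the helper name is ours):

```python
from itertools import combinations

def gen_candidates(prev_L, k):
    """Join L(k-1) with itself, then prune candidates with an infrequent (k-1)-subset."""
    joined = {a | b for a in prev_L for b in prev_L if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in prev_L for s in combinations(c, k - 1))}

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = gen_candidates(L3, 4)
print(sorted("".join(sorted(c)) for c in C4))   # ['abcd']: acde is pruned (ade not in L3)
```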
Page 77
Criticism to Support and Confidence
• Example 1: (Aggarwal & Yu, PODS98)– Among 5000 students
• 3000 play basketball• 3750 eat cereal• 2000 both play basket ball and eat cereal
– play basketball -> eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
– play basketball -> not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

            basketball  not basketball  sum(row)
cereal      2000        1750            3750
not cereal  1000        250             1250
sum(col.)   3000        2000            5000
Page 78
Criticism to Support and Confidence (Cont.)
• Example 2:– X and Y: positively correlated,– X and Z, negatively related– support and confidence of X=>Z dominates
• We need a measure of dependent or correlated events
• P(B|A)/P(B) is also called the lift of rule A => B
X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Rule   Support  Confidence
X=>Y   25%      50%
X=>Z   37.50%   75%

corr(A,B) = P(A ∧ B) / (P(A) P(B))
Page 79
Other Interestingness Measures: Interest
• Interest (correlation, lift)
– taking both P(A) and P(B) in consideration
– P(A^B)=P(B)*P(A), if A and B are independent events
– A and B negatively correlated, if the value is less than 1; otherwise A
and B positively correlated
Interest(A, B) = P(A ∧ B) / (P(A) P(B))

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Itemset  Support  Interest
X,Y      25%      2
X,Z      37.50%   0.9
Y,Z      12.50%   0.57
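The interest (lift) values in the table can be recomputed from the X, Y, Z rows (the helper names are ours):

```python
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def p(v):
    return sum(v) / len(v)

def p_and(u, v):
    return sum(a and b for a, b in zip(u, v)) / len(u)

def interest(u, v):
    """Interest (lift): P(A AND B) / (P(A) * P(B))."""
    return p_and(u, v) / (p(u) * p(v))

print(interest(X, Y))             # 2.0  -> positively correlated
print(round(interest(X, Z), 2))   # 0.86 (the slide rounds to 0.9) -> negative
print(round(interest(Y, Z), 2))   # 0.57 -> negative
```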
Page 80
• Classification: – predicts categorical class labels– classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in classifying new data
• Prediction: – models continuous-valued functions, i.e., predicts unknown or
missing values
• Typical Applications– credit approval– target marketing– medical diagnosis– treatment effectiveness analysis
Classification vs. Prediction
Page 81
Classification Process: Model Construction
TrainingData
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no
ClassificationAlgorithms
IF rank = ‘professor’OR years > 6THEN tenured = ‘yes’
Classifier(Model)
Page 82
Classification Process: Use the Model in Prediction
Classifier
TestingData
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
Page 83
Supervised vs. Unsupervised Learning
• Supervised learning (classification)– Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
– New data is classified based on the training set
• Unsupervised learning (clustering)– The class labels of training data are unknown
– Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
Page 84
Training Dataset
age    income  student  credit_rating
<=30   high    no       fair
<=30   high    no       excellent
31…40  high    no       fair
>40    medium  no       fair
>40    low     yes      fair
>40    low     yes      excellent
31…40  low     yes      excellent
<=30   medium  no       fair
<=30   low     yes      fair
>40    medium  yes      fair
<=30   medium  yes      excellent
31…40  medium  no       excellent
31…40  high    yes      fair
>40    medium  no       excellent
This follows an example from Quinlan’s ID3
Page 85
Output: A Decision Tree for “buys_computer”
age?
  <=30   -> student?        no -> no,         yes -> yes
  30..40 -> yes
  >40    -> credit rating?  excellent -> no,  fair -> yes
Page 86
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)– Tree is constructed in a top-down recursive divide-and-conquer manner– At start, all the training examples are at the root– Attributes are categorical (if continuous-valued, they are discretized in
advance)– Examples are partitioned recursively based on selected attributes– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning– All samples for a given node belong to the same class– There are no remaining attributes for further partitioning – majority voting
is employed for classifying the leaf– There are no samples left
Page 87
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n
elements of class N
– The amount of information, needed to decide if an arbitrary example
in S belongs to P or N is defined as
I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Page 88
Information Gain in Decision Tree Induction
• Assume that using attribute A a set S will be partitioned into sets {S1, S2 , …, Sv}
– If Si contains pi examples of P and ni examples of N, the
entropy, or the expected information needed to classify objects in all subtrees Si is
• The encoding information that would be gained by branching on A
E(A) = Sum_{i=1..v} ((pi + ni)/(p + n)) I(pi, ni)

Gain(A) = I(p, n) - E(A)
Page 89
Attribute Selection by Information Gain Computation
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
I(p, n) = I(9, 5) =0.940
Compute the entropy for age:

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
30…40   4   0   0
>40     3   2   0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.69

Hence Gain(age) = I(p, n) - E(age) = 0.25

Similarly:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
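The entropy and gain numbers above can be reproduced directly (the I helper mirrors the I(p, n) formula from the earlier slide):

```python
import math

def I(*counts):
    """I(s1, ..., sm) = -sum (si/s) log2(si/s), skipping empty classes."""
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c)

i_total = I(9, 5)   # class distribution: 9 "yes", 5 "no"

# partition by age: <=30 -> (2 yes, 3 no), 30..40 -> (4, 0), >40 -> (3, 2)
e_age = 5 / 14 * I(2, 3) + 4 / 14 * I(4, 0) + 5 / 14 * I(3, 2)
gain_age = i_total - e_age

print(round(i_total, 3), round(e_age, 3), round(gain_age, 3))  # 0.94 0.694 0.247
```

With more decimals Gain(age) is 0.247; the slide rounds it to 0.25, and age still has the highest gain of the four attributes.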
Page 90
Gini Index (IBM IntelligentMiner)
• If a data set T contains examples from n classes, the gini index gini(T) is defined as
gini(T) = 1 - Sum_{j=1..n} pj^2
where pj is the relative frequency of class j in T.
• If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
• The attribute providing the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
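A minimal sketch of the two gini formulas; the example class counts are illustrative, not from the slides:

```python
def gini(counts):
    """gini(T) = 1 - sum of pj^2 over class relative frequencies."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(left, right):
    """Weighted gini of a binary split: (N1/N) gini(T1) + (N2/N) gini(T2)."""
    n1, n2 = sum(left), sum(right)
    return n1 / (n1 + n2) * gini(left) + n2 / (n1 + n2) * gini(right)

# illustrative class counts: a node with 9 "yes" and 5 "no" examples
print(round(gini([9, 5]), 3))                  # 0.459
print(round(gini_split([6, 1], [3, 4]), 3))    # 0.367
```

A split is worth making when its weighted gini is lower than the parent's, as here (0.367 < 0.459).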
Page 91
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules• One rule is created for each path from the root to a leaf• Each attribute-value pair along a path forms a conjunction• The leaf node holds the class prediction• Rules are easier for humans to understand• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
Page 92
Avoid Overfitting in Classification
• The generated tree may overfit the training data
  – Too many branches, some of which may reflect anomalies due to noise or outliers
  – Results in poor accuracy for unseen samples
• Two approaches to avoid overfitting
  – Prepruning: Halt tree construction early by not splitting a node if this would result in the goodness measure falling below a threshold
    • Difficult to choose an appropriate threshold
  – Postpruning: Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the "best pruned tree"
Page 93
Approaches to Determine the Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross validation, e.g., 10-fold cross validation
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
• Use minimum description length (MDL) principle:
– halting growth of the tree when the encoding is minimized
Page 94
Scalable Decision Tree Induction Methods in Data Mining Studies
• SLIQ (EDBT’96 — Mehta et al.)
  – builds an index for each attribute; only the class list and the current attribute list reside in memory
• SPRINT (VLDB’96 — J. Shafer et al.)
  – constructs an attribute list data structure
• PUBLIC (VLDB’98 — Rastogi & Shim)
  – integrates tree splitting and tree pruning: stops growing the tree earlier
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
  – separates the scalability aspects from the criteria that determine the quality of the tree
  – builds an AVC-list (attribute, value, class label)
Page 95
Bayesian Theorem
• Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes’ theorem:

      P(h|D) = P(D|h) P(h) / P(D)

• MAP (maximum a posteriori) hypothesis:

      hMAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)

• Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost
Page 98
Bayesian classification
• The classification problem may be formalized using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple X=<x1,…,xk> is of class C.
• E.g. P(class=N | outlook=sunny,windy=true,…)
• Idea: assign to sample X the class label C such that P(C|X) is maximal
Page 99
Estimating a-posteriori probabilities
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
• Problem: computing P(X|C) directly is infeasible!
Page 100
Naïve Bayesian Classification
• Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as their i-th attribute in class C
• If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function
• Computationally easy in both cases
Page 101
Play-tennis example: estimating P(xi|C)
Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N
outlook
P(sunny|p) = 2/9 P(sunny|n) = 3/5
P(overcast|p) = 4/9 P(overcast|n) = 0
P(rain|p) = 3/9 P(rain|n) = 2/5
temperature
P(hot|p) = 2/9 P(hot|n) = 2/5
P(mild|p) = 4/9 P(mild|n) = 2/5
P(cool|p) = 3/9 P(cool|n) = 1/5
humidity
P(high|p) = 3/9 P(high|n) = 4/5
P(normal|p) = 6/9 P(normal|n) = 2/5
windy
P(true|p) = 3/9 P(true|n) = 3/5
P(false|p) = 6/9 P(false|n) = 2/5
P(p) = 9/14
P(n) = 5/14
Page 102
Play-tennis example: classifying X
• An unseen sample X = <rain, hot, high, false>
• P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
• P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
• Sample X is classified in class n (don’t play)
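These numbers can be reproduced with a short script over the 14 training tuples (the helper names are mine; estimates are raw relative frequencies with no smoothing, exactly as on the slide):

```python
from collections import Counter, defaultdict

# The 14 play-tennis tuples: (outlook, temperature, humidity, windy) -> class
data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

class_counts = Counter(row[-1] for row in data)
# value_counts[label][attribute index][value] = frequency within that class
value_counts = defaultdict(lambda: defaultdict(Counter))
for *attrs, label in data:
    for i, v in enumerate(attrs):
        value_counts[label][i][v] += 1

def score(x, label):
    """P(x1|C) * ... * P(xk|C) * P(C) under the naive independence assumption."""
    n = class_counts[label]
    p = n / len(data)
    for i, v in enumerate(x):
        p *= value_counts[label][i][v] / n
    return p

x = ("rain", "hot", "high", "false")
print(round(score(x, "P"), 6))                     # 0.010582
print(round(score(x, "N"), 6))                     # 0.018286
print(max(("P", "N"), key=lambda c: score(x, c)))  # N
```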
Page 103
Association-Based Classification
• Several methods for association-based classification
  – ARCS: Quantitative association mining and clustering of association rules (Lent et al’97)
    • It beats C4.5 in (mainly) scalability and also accuracy
  – Associative classification (Liu et al’98)
    • It mines high-support, high-confidence rules of the form “cond_set => y”, where y is a class label
  – CAEP (Classification by aggregating emerging patterns) (Dong et al’99)
    • Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
    • Mine EPs based on minimum support and growth rate
Page 104
What Is Prediction?
• Prediction is similar to classification
  – First, construct a model
  – Second, use the model to predict unknown values
• Major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
• Prediction is different from classification
  – Classification predicts categorical class labels
  – Prediction models continuous-valued functions
Page 105
Regression Analysis and Log-Linear Models in Prediction

• Linear regression: Y = α + β X
  – The two parameters α and β specify the line and are to be estimated using the data at hand, by applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, …
• Multiple regression: Y = b0 + b1 X1 + b2 X2
  – Many nonlinear functions can be transformed into the above
• Log-linear models:
  – The multi-way table of joint probabilities is approximated by a product of lower-order tables
  – Probability: p(a, b, c, d) = αab βac χad δbcd
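As an illustration of the least squares criterion, the closed-form estimates of α and β can be computed directly (the function name and data points are mine, invented for this sketch):

```python
def fit_line(xs, ys):
    """Least-squares estimates for Y = alpha + beta * X:
    beta = S_xy / S_xx, alpha = y_bar - beta * x_bar."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Made-up points lying close to Y = 1 + 2X
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.0, 9.1, 10.9]
alpha, beta = fit_line(xs, ys)
print(round(alpha, 2), round(beta, 2))  # 1.06 1.98
```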
Page 107
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
  – create thematic maps in GIS by clustering feature spaces
  – detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
  – Document classification
  – Cluster Weblog data to discover groups of similar access patterns
Page 108
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Page 109
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
  – high intra-class similarity
  – low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Page 110
Types of Data in Cluster Analysis
• Data matrix (n objects × p variables):

      [ x11  …  x1f  …  x1p ]
      [  :       :       :  ]
      [ xi1  …  xif  …  xip ]
      [  :       :       :  ]
      [ xn1  …  xnf  …  xnp ]

• Dissimilarity matrix (n × n, lower triangular):

      [   0                          ]
      [ d(2,1)    0                  ]
      [ d(3,1)  d(3,2)    0          ]
      [   :       :       :          ]
      [ d(n,1)  d(n,2)    …     0    ]
Page 111
Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function d(i, j), which is typically a metric
• There is a separate “quality” function that measures the “goodness” of a cluster.
• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
• Weights should be associated with different variables based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.
Page 112
Similarity and Dissimilarity Between Objects
• Distances are normally used to measure the similarity or
dissimilarity between two data objects
• Some popular ones include the Minkowski distance:

      d(i, j) = (|xi1 − xj1|^q + |xi2 − xj2|^q + … + |xip − xjp|^q)^(1/q)

  where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer

• If q = 1, d is the Manhattan distance:

      d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
Page 113
Similarity and Dissimilarity Between Objects
• If q = 2, d is the Euclidean distance:

      d(i, j) = sqrt(|xi1 − xj1|^2 + |xi2 − xj2|^2 + … + |xip − xjp|^2)

  – Properties
    • d(i, j) ≥ 0
    • d(i, i) = 0
    • d(i, j) = d(j, i)
    • d(i, j) ≤ d(i, k) + d(k, j)

• Also one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures.
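Both special cases fall out of a one-line Minkowski implementation (function name and sample vectors are my own, for illustration):

```python
def minkowski(i, j, q):
    """d(i, j) = (sum_f |x_if - x_jf|^q)^(1/q) for a positive integer q."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)

i, j = (1, 2, 3), (4, 6, 3)
print(minkowski(i, j, 1))  # q = 1, Manhattan: 7.0
print(minkowski(i, j, 2))  # q = 2, Euclidean: 5.0
```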
Page 114
Binary Variables
• A contingency table for binary data
                      Object j
                     1      0     sum
      Object i   1   a      b     a+b
                 0   c      d     c+d
               sum  a+c    b+d     p

• Simple matching coefficient (invariant, if the binary variable is symmetric):

      d(i, j) = (b + c) / (a + b + c + d)

• Jaccard coefficient (noninvariant if the binary variable is asymmetric):

      d(i, j) = (b + c) / (a + b + c)
Page 115
Dissimilarity between Binary Variables
• Example
– gender is a symmetric attribute
– the remaining attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0
Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack   M       Y      N      P       N       N       N
Mary   F       Y      N      P       N       P       N
Jim    M       Y      P      N       N       N       N

      d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
      d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
      d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
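The Jaccard computation for this example can be reproduced as follows (the function name and 0/1 vectors are mine, derived from the slide's convention of Y and P as 1 and N as 0, with the symmetric gender attribute excluded):

```python
def binary_dissim(i, j, asymmetric=True):
    """Jaccard (asymmetric) or simple-matching (symmetric) dissimilarity
    from the a, b, c, d contingency counts of two 0/1 vectors."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    if asymmetric:                      # Jaccard: ignore 0-0 matches
        return (b + c) / (a + b + c)
    return (b + c) / (a + b + c + d)    # simple matching

# Fever, Cough, Test-1..Test-4 encoded with Y/P -> 1, N -> 0
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(round(binary_dissim(jack, mary), 2))  # 0.33
print(round(binary_dissim(jack, jim), 2))   # 0.67
print(round(binary_dissim(jim, mary), 2))   # 0.75
```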
Page 116
Major Clustering Methods
• Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
• Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
Page 117
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen’67): Each cluster is represented by the center of the cluster
– k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
Page 118
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in 4 steps:
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when no new assignments occur
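A minimal sketch of these steps in Python, using random initial seeds (the sample points, seed value, and function name are mine, not from the slides):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means on 2-D points: pick initial seeds, assign each
    point to the nearest centroid, recompute centroids as cluster
    means, and repeat until the assignment stabilizes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2 +
                                        (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        new = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
               if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # no centroid moved: assignments are stable
            break
        centroids = new
    return centroids, clusters

# Two well-separated groups of three points each
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7)]
centroids, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```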
Page 119
The K-Means Clustering Method
• Example
  (figure: four scatter plots on 0..10 axes showing successive k-means iterations)
Page 120
Comments on the K-Means Method
• Strength
  – Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
  – Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
• Weakness
  – Applicable only when the mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
  – Not suitable for discovering clusters with non-convex shapes