
Page 1

Chapter 20

Data Analysis and Mining

Page 2

Data Analysis and Mining

Decision Support Systems

• Obtain high-level information out of the detailed information stored in transaction-processing systems (DBs), for decision making

Data Analysis and OLAP (OnLine Analytical Processing)

• Develop tools/techniques for generating summarized data from very large DBs for data analysis

Data Warehousing

• Integrate data from multiple sources with a unified schema at a single site, where materialized views are created

Data Mining

• Adopt AI/statistical analysis techniques for knowledge discovery in very large DBs

Page 3

Decision Support Systems
Decision-support systems are used to make business decisions, often based on data collected by online transaction-processing systems.

Examples of business decisions:

What items to stock?

What insurance premium to charge?

To whom to send advertisements?

Examples of data used for making decisions:

Retail sales transaction details

Customer profiles (income, age, gender, etc.)

Page 4

Data Mining
Data mining (DM) is the process of semi-automatically analyzing large databases to find useful patterns.

Prediction (a DM technique) based on past history

• Predict if a credit card applicant poses a good credit risk, using some attributes (income, job type, age) and past payment history

• Predict if a pattern of phone calling-card usage is likely to be fraudulent

Some examples of prediction mechanisms:

Classification

• Given a new item whose class is unknown, predict to which class it belongs

Regression formulae

• Given a set of mappings for an unknown function, predict the function result for a new parameter value

Page 5

Classification Rules
Classification rules help assign new objects to classes.

e.g., given a new automobile insurance applicant, should (s)he be classified as low risk, medium risk, or high risk?

Classification rules for above example could use a variety of data, such as educational level, salary, age, etc.

∀ person P, P.degree = “masters” and P.income > 100,000 ⇒ P.credit = excellent

∀ person P, P.degree = “bachelors” and (P.income ≥ 75,000 and P.income ≤ 100,000) ⇒ P.credit = good

Rules are not necessarily exact: there may be some misclassifications

Classification rules can be shown compactly as a decision tree
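To make the rule form concrete, here is a minimal Python sketch (not from the chapter) that applies the two rules above to applicant records; the Applicant record and the ‘average’ fallback class are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Applicant:          # hypothetical record; fields mirror the rule attributes
    degree: str
    income: int

def classify_credit(p: Applicant) -> str:
    # Rule 1: masters degree and income > 100,000  =>  excellent
    if p.degree == "masters" and p.income > 100_000:
        return "excellent"
    # Rule 2: bachelors degree and 75,000 <= income <= 100,000  =>  good
    if p.degree == "bachelors" and 75_000 <= p.income <= 100_000:
        return "good"
    # No rule fires: fall back to a default class (an assumption, not in the slides)
    return "average"

print(classify_credit(Applicant("masters", 120_000)))   # excellent
print(classify_credit(Applicant("bachelors", 80_000)))  # good
```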

Page 6

Decision Tree

Page 7

Construction of Decision Trees
Training set: a data sample in which the classification is already known.

Greedy top-down generation of decision trees

Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition (on samples) for the node

Leaf node:

• all (or most) of the items at the node belong to the same class, or

• all attributes have been considered, and no further partitioning is possible

Decision trees can also be represented as sets of IF-THEN rules

Page 8

Construction of Decision Trees
Example. A decision tree for the concept Play_Tennis.

The instance (Outlook = Sunny, Humidity = High, Temperature = Hot, Wind = Strong) is classified as a negative instance (i.e., predicting Play_Tennis = no).
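The tree itself appears only as a figure on the slide; the sketch below assumes the standard Play_Tennis tree (Outlook at the root, Humidity under Sunny, Wind under Rain) and shows how the quoted instance is walked down to the negative leaf.

```python
# A decision tree encoded as nested dicts: each internal node maps an attribute
# to a dict of {attribute value: subtree or class label}.
# The tree structure is an assumption (the standard Play_Tennis tree);
# the slide shows it only as a figure.
play_tennis_tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, instance):
    # Walk down the tree until a leaf (a plain class label) is reached.
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree

instance = {"Outlook": "Sunny", "Humidity": "High",
            "Temperature": "Hot", "Wind": "Strong"}
print(classify(play_tennis_tree, instance))   # -> "No", a negative instance
```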

Page 9

Decision-Tree Induction: Attribute Selection
Information gain (Gain) measure: select the test attribute at each node N in the decision tree T

The attribute A with the highest Gain (or greatest entropy reduction) is chosen as the test/partitioning attribute of N

A reflects the least randomness or impurity in the partition

The expected information needed (I) to classify a given sample in the entire data set is given by

I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i

where si (1 ≤ i ≤ m) is the number of data samples belonging to class Ci in the set S, which contains s data samples in total,

m is the number of distinct classes C1, C2, …, Cm,

pi is the probability that an arbitrary sample belongs to Ci, i.e., pi = si / s, and

log2 is used because information is encoded in bits

Page 10

Decision-Tree Induction: Entropy
Use entropy to determine the expected information of an attribute

Entropy (or expected information) E of an attribute A:
Let v be the number of distinct values {a1, a2, …, av} of A

Suppose A partitions S into v subsets {S1, S2, …, Sv}, where Sj (1 ≤ j ≤ v) contains those samples of S that have value aj of A

If A is the selected node label, i.e., the best attribute for splitting, then {S1, …, Sv} correspond to the branches grown from the node labeled A

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \dots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \dots, s_{mj})

where (s1j + s2j + … + smj) / s is the weight of Sj (the samples having value aj of A) in S, and

sij (1 ≤ i ≤ m) is the number of data samples of class Ci in Sj

The smaller the E(A) value, the greater the purity of the subset partition {S1, S2, …, Sv}

Page 11

Decision-Tree Induction: Information Gain
Given the entropy of A:

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \dots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \dots, s_{mj})

I(s1j, s2j, …, smj) is defined as

I(s_{1j}, s_{2j}, \dots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij}

where pij = sij / |Sj| is the probability that a sample in Sj belongs to Ci

The information gain of attribute A, i.e., the information gained by using A (as a node label), is defined as

Gain(A) = I(s1, s2, …, sm) − E(A)

Gain(A) is the expected reduction in entropy from splitting on the values of A

The attribute with the highest information gain is chosen for the given set of samples S (applied recursively)
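To make the formulas concrete, here is a small Python sketch (not from the chapter) of I, E(A), and Gain(A) over training samples represented as dicts; the function names and the dict-of-attributes representation are assumptions.

```python
from math import log2
from collections import Counter, defaultdict

def expected_info(class_counts):
    # I(s1, ..., sm): expected information needed to classify a sample,
    # given the number of samples in each class (0 * log2 0 treated as 0)
    s = sum(class_counts)
    return -sum((c / s) * log2(c / s) for c in class_counts if c)

def entropy(rows, attr, label):
    # E(A): expected information after partitioning `rows` on attribute `attr`,
    # weighted by the size of each subset Sj
    parts = defaultdict(list)
    for row in rows:
        parts[row[attr]].append(row[label])
    s = len(rows)
    return sum(len(p) / s * expected_info(Counter(p).values()) for p in parts.values())

def gain(rows, attr, label):
    # Gain(A) = I(s1, ..., sm) - E(A)
    overall = Counter(row[label] for row in rows)
    return expected_info(overall.values()) - entropy(rows, attr, label)
```

With `rows` a list of dicts such as {"age": "<= 30", ..., "buys_computer": "yes"}, a call like gain(rows, "age", "buys_computer") reproduces the Gain(Age) computation worked through on the following slides.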

Page 12

Decision-Tree Induction
Example. Given the following training (sample) data, use a decision tree to predict which customers buy computers.

Two classes of samples {yes, no}, i.e., S1 = ‘yes’, S2 = ‘no’, and m = 2

|S1| = 9, |S2| = 5, and s = |S| = 14

Since I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i,

I(s1, s2) = I(9, 5) = −9/14 log2(9/14) − 5/14 log2(5/14) = 0.94

which is the expected information needed to classify a sample of the whole set

Next, compute the entropy of each attribute, i.e., Age, Income, etc.

Consider Age, with 3 distinct values: “<= 30”, “31..40”, “> 40”

Page 13

Example: Consider Age

Since E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \dots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \dots, s_{mj}) and I(s_{1j}, s_{2j}, \dots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij}:

• Age = “<= 30”: s11 = 2, s21 = 3, I(s11, s21) = −2/5 log2(2/5) − 3/5 log2(3/5) = 0.971

• Age = “31..40”: s12 = 4, s22 = 0, I(s12, s22) = −4/4 log2(4/4) − 0/4 log2(0/4) = 0 (taking 0 log2 0 = 0)

• Age = “> 40”: s13 = 3, s23 = 2, I(s13, s23) = −3/5 log2(3/5) − 2/5 log2(2/5) = 0.971

• E(Age) = 5/14 · I(s11, s21) + 4/14 · I(s12, s22) + 5/14 · I(s13, s23) = 0.694

• Gain(Age) = I(s1, s2) − E(Age) = 0.94 − 0.694 = 0.246
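The arithmetic above is easy to check mechanically; below is a minimal standalone sketch using only the class counts quoted on the slide.

```python
from math import log2

def info(counts):
    # I(s1, s2): expected information for the given class counts (0 log2 0 = 0)
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c)

# (yes, no) counts for each value of Age, taken from the slide
age_partitions = {"<= 30": (2, 3), "31..40": (4, 0), "> 40": (3, 2)}
s = 14
i_all = info((9, 5))                                                # 0.940
e_age = sum(sum(c) / s * info(c) for c in age_partitions.values())  # 0.694
print(f"Gain(Age) = {i_all - e_age:.3f}")
# prints 0.247; the slide's 0.246 comes from rounding I and E before subtracting
```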

Page 14

Example: Consider Credit_Rating

Two classes of samples {yes, no}: S1 = ‘yes’, S2 = ‘no’, I(s1, s2) = 0.94

Since E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \dots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \dots, s_{mj}) and I(s_{1j}, s_{2j}, \dots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij} (here v = 2):

• Credit_Rating = “Fair”: s11 = 6 and s21 = 2,
  I(s11, s21) = −6/8 log2(6/8) − 2/8 log2(2/8) = 0.81

• Credit_Rating = “Excellent”: s12 = 3 and s22 = 3,
  I(s12, s22) = −3/6 log2(3/6) − 3/6 log2(3/6) = 1

• E(Credit_Rating) = 8/14 · I(s11, s21) + 6/14 · I(s12, s22) = 0.89

• Gain(Credit_Rating) = I(s1, s2) − E(Credit_Rating) = 0.94 − 0.89 = 0.05

Page 15

Example: Consider Income

Two classes of samples {yes, no}: S1 = ‘yes’, S2 = ‘no’, I(s1, s2) = 0.94

Since E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \dots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \dots, s_{mj}) and I(s_{1j}, s_{2j}, \dots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij} (here v = 3):

• Income = “High”: s11 = 2 and s21 = 2,
  I(s11, s21) = −2/4 log2(2/4) − 2/4 log2(2/4) = 1

• Income = “Low”: s12 = 3 and s22 = 1,
  I(s12, s22) = −3/4 log2(3/4) − 1/4 log2(1/4) = 0.81

• Income = “Medium”: s13 = 4 and s23 = 2,
  I(s13, s23) = −4/6 log2(4/6) − 2/6 log2(2/6) = 0.92

• E(Income) = 4/14 · 1 + 4/14 · 0.81 + 6/14 · 0.92 = 0.91

• Gain(Income) = I(s1, s2) − E(Income) = 0.94 − 0.91 = 0.03

Page 16

Example: Consider Student

Two classes of samples {yes, no}: S1 = ‘yes’, S2 = ‘no’, I(s1, s2) = 0.94

Since E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \dots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \dots, s_{mj}) and I(s_{1j}, s_{2j}, \dots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij} (here v = 2):

• Student = “Yes”: s11 = 6 and s21 = 1,
  I(s11, s21) = −6/7 log2(6/7) − 1/7 log2(1/7) = 0.59

• Student = “No”: s12 = 3 and s22 = 4,
  I(s12, s22) = −3/7 log2(3/7) − 4/7 log2(4/7) = 0.98

• E(Student) = 7/14 · 0.59 + 7/14 · 0.98 = 0.78

• Gain(Student) = I(s1, s2) − E(Student) = 0.94 − 0.78 = 0.15

Page 17

Decision-Tree Induction: Example
Since Gain(Income) = 0.03, Gain(Student) = 0.15, Gain(Rating) = 0.05, and Gain(Age) = 0.246, Age is the chosen attribute (to split):

• A node is created and labeled with Age

• Branches are grown for each of the attribute’s values

• The “31..40” branch, whose samples all belong to the ‘Yes’ class, already forms a leaf node (in the ‘Yes’ class)

Page 18

Decision-Tree Induction: Example
Consider the leftmost branch (Age = “<= 30”).

Since Gain(Student) = 0.971, Gain(Income) = 0.571, and Gain(Rating) = 0.02,

• A node is created and labeled with Student
• Branches are grown for each of the attribute’s values

Samples on the Age = “<= 30” branch:

Income   Student   Rating      Class
High     No        Fair        No
High     No        Excellent   No
Medium   No        Fair        No
Low      Yes       Fair        Yes
Medium   Yes       Excellent   Yes

Student = “No” subset (a leaf node, in the ‘No’ class):

Income   Rating      Class
High     Fair        No
High     Excellent   No
Medium   Fair        No

Student = “Yes” subset (a leaf node, in the ‘Yes’ class):

Income   Rating      Class
Low      Fair        Yes
Medium   Excellent   Yes

Student is the chosen attribute (to split).

Page 19

Decision-Tree Induction: Example
Example. Given the training (sample) data, the final decision tree constructed to predict which customers buy computers has Age at the root: the “31..40” branch is a ‘Yes’ leaf, and the “<= 30” branch is split further on Student.

Page 20

Association Rules
Retail shops are often interested in associations between the different items that people buy.

Someone who buys bread is quite likely also to buy milk

A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts

Association information can be used in several ways
e.g., when a customer buys a particular book, an online shop may suggest associated books

Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks

Left-hand side: antecedent; right-hand side: consequent

An association rule must have an associated population; the population consists of a set of instances
• e.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population

Page 21

Association Rules (Cont.)
Rules have an associated support, as well as an associated confidence.

Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule
e.g., suppose only 0.001% of all purchases include both milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is then low

Confidence is a measure of how often the consequent is true when the antecedent is true
e.g., the rule bread ⇒ milk has a confidence of 80% if 80% of the purchases that include bread also include milk
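A small sketch of how support and confidence can be computed over a set of transactions; the toy transactions and function names are illustrative, not from the slides.

```python
# Each transaction is the set of items in one purchase (toy data, for illustration only)
transactions = [
    {"bread", "milk"}, {"bread", "milk", "butter"}, {"bread", "screwdriver"},
    {"bread", "milk", "eggs"}, {"milk", "eggs"},
]

def support(itemset):
    # fraction of all transactions containing every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # fraction of transactions containing the antecedent that also contain the consequent
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))        # 0.6
print(confidence({"bread"}, {"milk"}))   # 0.75
```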

Page 22

Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support ≥ 2%)

Naïve algorithm

1. Consider all possible sets of relevant items

2. For each set find its support (i.e., count how many transactions purchase all items in the set)

Large itemsets: sets with sufficiently high support

3. Use large itemsets to generate association rules

From itemset S generate the rule (S − s) ⇒ s for each item s ∈ S

• Support of rule = support(S)

• Confidence of rule = support(S) / support(S − s)
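Step 3 above (rule generation from a large itemset) might be sketched as follows; the precomputed support values and the minimum-confidence threshold are hypothetical.

```python
# Hypothetical precomputed supports for the large itemsets (fractions of all transactions)
support = {
    frozenset({"bread"}): 0.8,
    frozenset({"milk"}): 0.8,
    frozenset({"bread", "milk"}): 0.6,
}

def rules_from_itemset(S, min_confidence=0.7):
    """Generate rules (S - {s}) => {s} for each item s in S, as on the slide."""
    rules = []
    for s in S:
        antecedent = S - {s}
        if not antecedent:
            continue
        conf = support[S] / support[antecedent]   # confidence = support(S) / support(S - s)
        if conf >= min_confidence:
            rules.append((set(antecedent), s, support[S], conf))
    return rules

print(rules_from_itemset(frozenset({"bread", "milk"})))
# e.g. [({'milk'}, 'bread', 0.6, 0.75), ({'bread'}, 'milk', 0.6, 0.75)] (in some order)
```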

Page 23

Finding Support
Determine the support of itemsets via a single pass over the set of transactions

Large itemsets: sets with a high count at the end of the pass

If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass

Optimization: Once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered.

The a priori technique to find large itemsets:
Pass 1: count the support of all sets with just one item; eliminate those items with low support

Pass i: candidates include every set of i items such that all of its (i − 1)-item subsets are large

• Count the support of all candidates; stop if there are no candidates
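A minimal sketch of the a priori passes described above; the transactions and the support threshold are illustrative, not from the slides.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all large (frequent) itemsets as {frozenset: support fraction}."""
    n = len(transactions)
    # Pass 1: count 1-item sets and keep those with enough support
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    large = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(large)
    k = 2
    while large:
        # Pass k: candidates are k-item sets all of whose (k-1)-item subsets are large
        items = sorted({i for s in large for i in s})
        candidates = [
            frozenset(c) for c in combinations(items, k)
            if all(frozenset(sub) in large for sub in combinations(c, k - 1))
        ]
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        large = {s: c / n for s, c in counts.items() if c / n >= min_support}
        result.update(large)
        k += 1
    return result

transactions = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
print(apriori(transactions, min_support=0.5))
```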