Data Mining (Overview)

Presentation overview

• Introduction
• Association Rules
• Classification
• Clustering
• Similar Time Sequences
• Similar Images
• Outliers
• WWW
• Summary

Background

• Corporations have huge databases containing a wealth of information

• Business databases potentially constitute a goldmine of valuable business information

• Very little functionality in database systems to support data mining applications

• Data mining: The efficient discovery of previously unknown patterns in large databases

Applications

• Fraud Detection
• Loan and Credit Approval
• Market Basket Analysis
• Customer Segmentation
• Financial Applications
• E-Commerce
• Decision Support
• Web Search

Data Mining Techniques

• Association Rules
• Sequential Patterns
• Classification
• Clustering
• Similar Time Sequences
• Similar Images
• Outlier Discovery
• Text/Web Mining

Examples of Discovered Patterns

• Association rules
  – 98% of people who purchase diapers also buy beer
• Classification
  – People with age less than 25 and salary > 40k drive sports cars
• Similar time sequences
  – Stocks of companies A and B perform similarly
• Outlier detection
  – Residential customers for telecom company with businesses at home

Association Rules

• Given:
  • A database of customer transactions
  • Each transaction is a set of items

• Find all rules X => Y that correlate the presence of one set of items X with another set of items Y

• Example: 98% of people who purchase diapers and baby food also buy beer.

• Any number of items in the consequent/antecedent of a rule
• Possible to specify constraints on rules (e.g., find only rules involving expensive imported products)

Association Rules

• Sample Applications

• Market basket analysis

• Attached mailing in direct marketing

• Fraud detection for medical insurance

• Department store floor/shelf planning

Confidence and Support

• A rule must have some minimum user-specified confidence
  1 & 2 => 3 has 90% confidence if, when a customer bought 1 and 2, the customer also bought 3 in 90% of cases.
• A rule must have some minimum user-specified support (how frequently the rule occurs)
  1 & 2 => 3 should hold in some minimum percentage of transactions to have business value

Example

• For minimum support = 50%, minimum confidence = 50%, we have the following rules

1 => 3 with 50% support and 66% confidence (1 & 3 occurred together in 50% of cases, but 3 occurred in only 2/3 of the cases in which 1 occurred)

3 => 1 with 50% support and 100% confidence (3 & 1 occurred together in 50% of cases, and whenever 3 occurred, 1 occurred too)

Transaction Id   Purchased Items
1                {1, 2, 3}
2                {1, 4}
3                {1, 3}
4                {2, 5, 6}
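The support and confidence figures in this example can be checked with a short Python sketch. The function names `support` and `confidence` are illustrative choices, not part of the original slides; the transactions are the four rows of the table.

```python
# Transactions from the example table (transaction id -> purchased items).
transactions = {
    1: {1, 2, 3},
    2: {1, 4},
    3: {1, 3},
    4: {2, 5, 6},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Support of the whole rule divided by support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({1, 3}))        # 0.5  (support of both 1 => 3 and 3 => 1)
print(confidence({1}, {3}))   # about 0.667
print(confidence({3}, {1}))   # 1.0
```

Both rules share the same support because support depends only on the combined itemset; only the confidence changes with the direction of the rule.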

Quantitative Association Rules

• Quantitative attributes (e.g. age, income)
• Categorical attributes (e.g. make of car)

[Age: 30..39] and [Married: Yes] => [NumCars:2]

CID   Age   Married   NumCars
1     23    No        1
2     25    Yes       1
3     29    No        0
4     34    Yes       2
5     38    Yes       2

min support = 40%, min confidence = 50%

Definition?

Temporal Association Rules

• Can describe the rich temporal character in data

• Example:

– {diaper} => {beer} (support = 5%, confidence = 87%)

– Support of this rule may jump to 25% between 6 and 9 PM on weekdays

• Problem: How to find rules that follow interesting user-defined temporal patterns

• Challenge is to design efficient algorithms that do much better than finding every rule in every time unit

Correlation Rules

• Association rules do not capture correlations
• Example:
  • Suppose 90% of customers buy coffee, 25% buy tea, and 20% buy both tea and coffee
  • {tea} => {coffee} has high support (0.2) and confidence (0.8)
  • Yet {tea, coffee} are not positively correlated
  • The expected support of customers buying both, if the two purchases were independent, is 0.9 * 0.25 = 0.225, which is above the observed 0.2
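The arithmetic behind the tea and coffee example can be written out as a small Python sketch. The ratio called `lift` here is a commonly used name for observed support divided by expected support; the slides do not use that term, so treat it as an added label.

```python
p_coffee = 0.90   # fraction of customers who buy coffee
p_tea = 0.25      # fraction of customers who buy tea
p_both = 0.20     # fraction of customers who buy both

confidence = p_both / p_tea        # confidence of {tea} => {coffee}
expected_both = p_coffee * p_tea   # support expected if purchases were independent
lift = p_both / expected_both      # below 1: bought together less often than expected

print(round(confidence, 3), round(expected_both, 3), round(lift, 3))
```

The confidence of 0.8 looks high, but it is lower than the 0.9 baseline rate of coffee buyers, which is why the rule alone is misleading.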

Sequential Patterns

• Given:
  • A sequence of customer transactions
  • Each transaction is a set of items

• Find all maximal sequential patterns supported by more than a user-specified percentage of customers

• Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction

• 10% is the support of the pattern

Classification

• Given:
  • Database of tuples, each assigned a class label
• Develop a model/profile for each class
  • Example profile (good credit): (25 <= age <= 40 and income > 40k) or (married = YES)
• Sample applications:
  • Credit card approval (good, bad)
  • Bank locations (good, fair, poor)
  • Treatment effectiveness (good, fair, poor)
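A class profile of this kind is just a boolean predicate over a tuple's attributes. The sketch below encodes the good-credit example, reading the age condition as 25 <= age <= 40; the function name `good_credit` and the sample inputs are illustrative assumptions.

```python
def good_credit(age, income, married):
    """Good-credit profile: (25 <= age <= 40 and income > 40k) or married."""
    return (25 <= age <= 40 and income > 40_000) or married

print(good_credit(30, 55_000, False))  # True: age and income condition holds
print(good_credit(22, 55_000, False))  # False: too young and not married
print(good_credit(22, 10_000, True))   # True: married
```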

Decision Trees

[Figure: decision tree for customer churn. Node labels: 50 churners / 50 non-churners; 30 churners / 50 non-churners. Splits: new technology phone vs. old technology phone; customer <= 2.3 years vs. customer > 2.3 years; age <= 55 vs. age > 55]

A decision tree is a predictive model that makes a prediction on the basis of a series of decisions

Decision Trees

A DT creates a segmentation of the original data set. This segmentation is done to predict some piece of information. The records that fall in each segment are similar with respect to the information being predicted. The DT and its algorithms may be complex, but the results are presented in an easy-to-understand way, which is quite useful to the business user.

Decision Trees

DT in business:

• Automation: A very favorable technique for automating data mining and predictive modeling. Decision trees embed automated solutions to things that other techniques leave as a burden to the user (4/4)
• Clarity: The models are viewed as a tree of simple decisions based on familiar predictors, or as a set of rules. The user can confirm the DT or modify it by hand on the basis of his own expertise (4/4)
• ROI: Because DT work well with relational databases, they provide well-integrated solutions with highly accurate models (3/4)

Decision Trees

• Pros
  • Fast execution time
  • Generated rules are easy to interpret by humans
  • Scale well for large data sets
  • Can handle high dimensional data
• Cons
  • Cannot capture correlations among attributes
  • Consider only axis-parallel cuts

Clustering

• Given:
  • Data points and number of desired clusters K
• Group the data points into K clusters
  • Data points within clusters are more similar than across clusters
• Sample applications:
  • Customer segmentation
  • Market basket customer analysis
  • Attached mailing in direct marketing
  • Clustering companies with similar growth

Where to use clustering and nearest-neighbor prediction

• Clustering for clarity
  – A high-level view
  – Segmentation
• Clustering for outlier analysis
  – To see records that stick out from the rest
  – e.g. Wine distributors produce a certain level of profit. One store produces significantly lower profit. It turns out that the distributor was delivering to, but not collecting payment from, one of its customers.
• Nearest neighbor for prediction
  – Objects “near” to each other have similar prediction values.
  – Examples: finding “more documents like this one” among journal articles; predicting the next value of a stock price based on a time series.

Outlier Discovery

• Sometimes clustering is performed to see when one record sticks out of the rest

• E.g. One store stands out as producing significantly lower profit. Closer examination shows that the distributor was not collecting payment from one of his customers

• E.g. A sale of men’s suits is being held in all branches of a department store. All stores but one have seen at least a 100% jump in revenue. It turns out that this store had advertised via radio rather than TV, as the other stores did

• Sample applications:
  • Credit card fraud detection
  • Telecom fraud detection
  • Customer segmentation
  • Medical analysis

Outlier Discovery

• Given:
  • Data points and number of outliers (= n) to find
• Find top n outlier points
  • Outliers are considerably dissimilar from the remainder of the data

Statistical Approaches

• Model the underlying distribution that generates the dataset (e.g. normal distribution)
• Use discordancy tests depending on
  • data distribution
  • distribution parameter (e.g. mean, variance)
  • number of expected outliers
• Drawbacks
  • Most tests are for a single attribute
  • In many cases, the data distribution may not be known
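As a rough illustration of the statistical approach, the sketch below ranks the values of a single attribute by how many standard deviations they lie from the mean and returns the top n. This is a simplified stand-in for a formal discordancy test, it assumes roughly normal data, and the `profits` values are made up for the example.

```python
import statistics

def top_n_outliers(values, n):
    """Return the n values whose z-scores have the largest magnitude."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return sorted(values, key=lambda v: abs(v - mean) / stdev, reverse=True)[:n]

# Hypothetical single-attribute data: profit per store.
profits = [102, 98, 101, 97, 100, 103, 99, 40]
print(top_n_outliers(profits, 1))  # [40]
```

The sketch also shows both drawbacks named above: it handles only one attribute, and its ranking is meaningful only if the assumed distribution fits the data.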

Differences between the nearest-neighbor technique and clustering

Nearest neighbors

• Used for prediction and consolidation
• Space is defined by the problem to be solved
• Generally only uses distance metrics to determine nearness

Clustering

• Used for consolidating data into a high-level view and general grouping of records into like behaviors
• Space is defined as default n-dimensional space, or by the user, or as a predefined space driven by past experience
• Can use metrics other than distance to determine nearness of two records, e.g. linking points together

How clustering and nearest-neighbor work

• Looking at n-dimensional space
  – The distance between the cluster and a given data point is often measured from the center of mass of the cluster
  – The center can be calculated
    • By simply averaging the income and age of each record
    • By the square-error criterion
    • Other
  – Many clustering problems have hundreds of dimensions. Our intuition works only in 2- or 3-dimensional space

[Figure: customers of a golf equipment business plotted by age (axis up to 100 yrs.) and income (axis up to $120,000), showing three clusters and several outliers]

Cluster 1: retirees with modest income
Cluster 2: middle-aged weekend golfers
Cluster 3: wealthy youth with exclusive club membership

Traditional Algorithms

Partitional algorithms

• Enumerate K partitions optimizing some criterion

• Example: square-error criterion

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$$

• m_i is the mean of cluster C_i
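The square-error criterion sums, over all clusters, the squared distance of each point to the mean of its cluster. A minimal Python sketch follows; the helper names and the two small clusters are illustrative assumptions.

```python
def centroid(points):
    """Mean of a cluster, computed per dimension."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def square_error(clusters):
    """Sum over clusters of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for points in clusters:
        m = centroid(points)
        total += sum(sum((p[d] - m[d]) ** 2 for d in range(len(m))) for p in points)
    return total

clusters = [
    [(1.0, 1.0), (3.0, 1.0)],      # mean (2, 1), contributes 1 + 1
    [(10.0, 10.0), (10.0, 14.0)],  # mean (10, 12), contributes 4 + 4
]
print(square_error(clusters))  # 10.0
```

A partitional algorithm would search over assignments of points to K clusters and keep the one with the smallest value of this criterion.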

How is “nearness” defined

• The “trivial” case

ID   Name    Prediction   Age   Balance($)   Income   Eyes   Gender
5    Carla   Yes          21    2300         High     Blue   F
     Sue     ??           21    2300         High     Blue   F

A record exactly the same as the record to be predicted is considered “close.” However, exact matches are unlikely to be found

• The Manhattan Distance metric adds up the differences between each predictor between the historical record and the record to be predicted

• The Euclidean Distance metric calculates distance the Pythagorean way (the square of the hypotenuse is equal to the sum of the squares of the other two sides)

• Others…

• The Manhattan Distance metric (an example)

ID   Name    Prediction   Age   Balance($)   Income   Eyes    Gender
5    Carla   Yes          21    2300         High     Blue    F
     Carl    No           27    5400         High     Brown   M

Calculating the difference between ages (6 years) and balances ($3100) is simple. For the eye color predictor, use e.g. match = 0, mismatch = 1. For income, assign numbers: high = 3, medium = 2, low = 1.

3108 = 6 + 3100 + 0 + 1 + 1

The result must be normalized (e.g. each difference scaled to 0-100):

225 = 6 + 19 + 0 + 100 + 100
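The unnormalized Manhattan distance between Carla and Carl can be reproduced with the Python sketch below. The dictionary layout and function names are illustrative assumptions; the encodings (income high = 3, medium = 2, low = 1; match = 0, mismatch = 1) follow the example. The normalized figure of 225 depends on value ranges that the slides do not state, so it is not reproduced here.

```python
INCOME_RANK = {"low": 1, "medium": 2, "high": 3}

def differences(a, b):
    """Per-predictor differences, using the encodings from the example."""
    return [
        abs(a["age"] - b["age"]),
        abs(a["balance"] - b["balance"]),
        abs(INCOME_RANK[a["income"]] - INCOME_RANK[b["income"]]),
        0 if a["eyes"] == b["eyes"] else 1,
        0 if a["gender"] == b["gender"] else 1,
    ]

def manhattan(a, b):
    """Sum of the per-predictor differences."""
    return sum(differences(a, b))

def euclidean(a, b):
    """Square root of the sum of squared per-predictor differences."""
    return sum(d * d for d in differences(a, b)) ** 0.5

carla = {"age": 21, "balance": 2300, "income": "high", "eyes": "blue", "gender": "F"}
carl = {"age": 27, "balance": 5400, "income": "high", "eyes": "brown", "gender": "M"}

print(manhattan(carla, carl))  # 3108
```

The balance difference dominates the total, which is why the differences need to be normalized before they are added.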

• Calculating dimension weights

  – Different dimensions may have different weights
    • e.g. In text classification not all words (dimensions) are created equal: “entrepreneur” is significant, “the” is not.

• Two methods
  – The inverse frequency of the word is used: “the” 1/10,000, “entrepreneur” 1/10
  – The importance of the word to the topic to be predicted: “entrepreneur” and “venture capital” will be given higher weight than “tornado” when the topic is starting a small business
  – Dimension weights have also been calculated via adaptive algorithms, where random weights are tried initially and then slowly modified to improve the accuracy of the system (neural networks, genetic algorithms)

Hierarchy of Clusters

• The hierarchy of clusters is viewed as a tree in which the smallest clusters merge to create the next highest level of clusters.

– Agglomerative technique starts with as many clusters as there are records. The clusters that are nearest each other are merged to form the next largest cluster. This merging is continued until a hierarchy of clusters is built.

– Divisive technique takes the opposite approach. It starts with all the records in one cluster, then tries to split that cluster into smaller pieces, and so on.

• The hierarchy allows the end user to choose the level to work with

[Figure: tree of clusters, with the large single cluster at one end and the smallest clusters at the other]
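The agglomerative technique can be sketched in a few lines of Python. This version works on one-dimensional points and measures the distance between two clusters by their closest members (single link); the slides do not say which linkage is meant, so that choice, the function name, and the sample points are assumptions.

```python
def agglomerate(points):
    """Single-link agglomerative clustering on 1-D points.

    Returns one clustering per level of the hierarchy, from one cluster
    per record up to a single large cluster.
    """
    clusters = [[p] for p in sorted(points)]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append([sorted(c) for c in clusters])
    return levels

for level in agglomerate([1, 2, 6, 7, 15]):
    print(level)
```

Each printed line is one level of the hierarchy, so choosing a level to work with amounts to picking one of the returned clusterings.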

Similar Time Sequences

• Given:
  • A set of time-series sequences
• Find
  • All sequences similar to the query sequence
  • All pairs of similar sequences
  (whole matching vs. subsequence matching)
• Sample Applications
  • Financial market
  • Market basket data analysis
  • Scientific databases
  • Medical Diagnosis

Whole Sequence Matching

Basic Idea

• Extract k features from every sequence
• Every sequence is then represented as a point in k-dimensional space
• Use a multi-dimensional index to store and search these points
  – Spatial indices do not work well for high dimensional data
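The basic idea can be sketched in Python as follows. The slides do not say which k features are extracted, so this example assumes segment means as the features, and it uses a linear scan in place of a real multi-dimensional index. All names and the sample sequences are illustrative.

```python
import math

def features(seq, k):
    """Represent a sequence by k segment means (an assumed feature choice)."""
    n = len(seq)  # assumes n >= k so that every segment is non-empty
    out = []
    for i in range(k):
        seg = seq[i * n // k:(i + 1) * n // k]
        out.append(sum(seg) / len(seg))
    return out

def distance(u, v):
    """Euclidean distance between two points in k-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def most_similar(query, database, k=4):
    """Linear scan standing in for the multi-dimensional index."""
    q = features(query, k)
    return min(database, key=lambda name: distance(q, features(database[name], k)))

database = {
    "rising": [1, 2, 3, 4, 5, 6, 7, 8],
    "falling": [8, 7, 6, 5, 4, 3, 2, 1],
    "flat": [4, 4, 5, 5, 4, 4, 5, 5],
}
print(most_similar([1, 2, 3, 5, 5, 6, 8, 8], database))  # rising
```

Keeping k small is what makes an index usable at all, since spatial indices degrade as the number of dimensions grows.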

Similar Time Sequences

• Sequences are normalized with amplitude scaling and offset translation

• Two subsequences are considered similar if one lies within an envelope of a specified width around the other, ignoring outliers

• Two sequences are said to be similar if they have enough non-overlapping time-ordered pairs of similar subsequences

Similar Sequences Found

VanEck International Fund
Fidelity Selective Precious Metal and Mineral Fund

Two similar mutual funds in different fund groups

Similar Images

• Given:
  • A set of images
• Find:
  • All images similar to a given image
  • All pairs of similar images
• Sample applications:
  • Medical diagnosis
  • Weather prediction
  • Web search engine for images
  • E-commerce

Similar Images

• QBIC [Nib93, FSN95], [JFS95], WBIIS [WWWFS98]
  • Generates a single signature per image
  • Fails when the images contain similar objects, but at different locations or in varying sizes
• [Smi97]
  • Divides an image into individual objects
  • Manual extraction can be very tedious and time consuming
  • Inaccurate in identifying objects and not robust

WALRUS

• Automatically extract regions from an image based on the complexity of images

• A single signature is used for each region
• Two images are considered to be similar if they have enough similar region pairs

WALRUS

Similarity Model

WALRUS (Overview)

Image Indexing Phase

• Compute wavelet signatures for sliding windows
• Cluster windows to generate regions
• Insert regions into spatial index (R* tree)

Image Querying Phase

• Compute wavelet signatures for sliding windows
• Cluster windows to generate regions
• Find matching regions using spatial index
• Compute similarity between query image and target images


Web Mining: Challenges

• Today’s search engines are plagued by problems:
  – the abundance problem (99% of info of no interest to 99% of people)
  – limited coverage of the Web (internet sources hidden behind search interfaces)
  – limited query interface based on keyword-oriented search
  – limited customization to individual users

The Web is ...

• The web is a huge collection of documents
  • Semistructured (HTML, XML)
  • Hyper-link information
  • Access and usage information
  • Dynamic (i.e. new pages are constantly being generated)

Web Mining

• Web Content Mining
  • Extract concept hierarchies/relations from the web
  • Automatic categorization
• Web Log Mining
  • Trend analysis (i.e. web dynamics info)
  • Web access association/sequential pattern analysis
• Web Structure Mining
  • Google: A page is important if important pages point to it
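The idea that a page is important if important pages point to it can be sketched as a simple power iteration in the style of PageRank. The damping factor of 0.85, the iteration count, and the four-page toy graph are assumptions chosen for illustration; this is not a description of any production ranking system.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if not targets:
                # A page with no out-links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for t in targets:
                    new_rank[t] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical link graph: every link target also appears as a key.
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = page_rank(links)
print(max(ranks, key=ranks.get))  # c
```

Page "a" ends up ranked well above "b" and "d" even though only one page links to it, because that one page is the highly ranked "c".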


Improving Search/Customization

• Learn about user’s interests based on access patterns

• Provide users with pages, sites and advertisements of interest

Summary

• Data mining:

• Good science: leading position in the research community

• Recent progress for large databases: association rules, classification, clustering, similar time sequences, similar image retrieval, outlier discovery, etc.

• Many papers were published in major conferences

• Still promising and rich field with many challenging research issues

• Maturing in industry
