data mining: concepts and techniques - knowledge and...


Page 1: Data Mining: Concepts and Techniques - Knowledge and …courses.dbnet.ntua.gr/fsr/5681/data_mining-3.pdf · ©Jiawei Han and Micheline Kamber ... 2006 Data Mining: Concepts and Techniques

January 20, 2006 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques

— Slides for Textbook —— potpourri —

©Jiawei Han and Micheline Kamber

http://www.cs.sfu.ca

Potpourri composed by

Yannis Theodoridis (May 2001)

Contents

Introduction
Data Warehouses
Data Preprocessing
Data Mining Functionality
Association Rules
Classification
Clustering
Trend Analysis
Social Impact
A prototype system: DBMiner

What Is Data Mining?

Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases

Alternative names and their “inside stories”:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

What is not data mining?
Expert systems or small ML/statistical programs

Data Mining Applications

Data mining is a young discipline with wide and diverse applications

A nontrivial gap exists between general principles of data mining and domain-specific, effective data mining tools for particular applications

Some application domains (covered in this chapter)

Biomedical and DNA data analysis
Financial data analysis
Retail industry
Telecommunication industry

Commercial Data Mining Tools

Commercial data mining systems have little in common

Different data mining functionality or methodology
May even work with completely different kinds of data sets

Need a multi-dimensional view in selection
Data types: relational, transactional, text, time sequence, spatial?
System issues

Running on only one or on several operating systems?
A client/server architecture?
Provide Web-based interfaces and allow XML data as input and/or output?

Commercial Data Mining Tools

Data sources
ASCII text files, multiple relational data sources
Support ODBC connections (OLE DB, JDBC)?

Data mining functions and methodologies
One vs. multiple data mining functions
One vs. variety of methods per function

More data mining functions and methods per function provide the user with greater flexibility and analysis power

Coupling with DB and/or data warehouse systems
Four forms of coupling: no coupling, loose coupling, semitight coupling, and tight coupling

Ideally, a data mining system should be tightly coupled with a database system

Commercial Data Mining Tools

Scalability
Row (or database size) scalability
Column (or dimension) scalability
Curse of dimensionality: it is much more challenging to make a system column scalable than row scalable

Visualization tools
“A picture is worth a thousand words”
Visualization categories: data visualization, mining result visualization, mining process visualization, and visual data mining

Data mining query language and graphical user interface
Easy-to-use and high-quality graphical user interface
Essential for user-guided, highly interactive data mining

Examples of Data Mining Systems (1)

IBM Intelligent Miner
A wide range of data mining algorithms
Scalable mining algorithms
Toolkits: neural network algorithms, statistical methods, data preparation, and data visualization tools
Tight integration with IBM's DB2 relational database system

Microsoft SQL Server 2000
Integrates DB and OLAP with mining
Supports the OLE DB for DM standard

Examples of Data Mining Systems (2)

SGI MineSet
Multiple data mining algorithms and advanced statistics
Advanced visualization tools

SAS Enterprise Miner
A variety of statistical analysis tools
Data warehouse tools and multiple data mining algorithms

Clementine (SPSS)
An integrated data mining development environment for end-users and developers
Multiple data mining algorithms and visualization tools

Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.

[Figure: the KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Data Mining and Business Intelligence

[Figure: pyramid of layers with increasing potential to support business decisions, from bottom to top:
Data Sources (Paper, Files, Information Providers, Database Systems, OLTP)
Data Warehouses / Data Marts (OLAP, MDA)
Data Exploration (Statistical Analysis, Querying and Reporting)
Data Mining (Information Discovery)
Data Presentation (Visualization Techniques)
Making Decisions
Roles labelled beside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA]

Architecture of a Typical DM System

[Figure: layered architecture, from top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases and Data Warehouse. A Knowledge-base sits beside the upper layers.]

Data Mining Functionalities (1)

Concept description: Characterization and discrimination

Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

Association (correlation and causality)

Multi-dimensional vs. single-dimensional association

age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%]

contains(T, “computer”) ⇒ contains(T, “software”) [support = 1%, confidence = 75%]

Data Mining Functionalities (2)

Classification and Prediction
Finding models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Presentation: decision-tree, classification rule, neural network

Prediction: Predict some unknown or missing numerical values

Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity

Data Mining Functionalities (3)

Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis

Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis

Other pattern-directed or statistical analyses

Why Data Preprocessing?

Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results!
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data

Association Rule: Basic Concepts

Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
Find: all rules that correlate the presence of one set of items with that of another set of items

E.g., 98% of people who purchase tires and auto accessories also get automotive services done

Applications
* ⇒ Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
Home Electronics ⇒ * (What other products should the store stock up on?)
Attached mailing in direct marketing
Detecting “ping-pong”ing of patients, faulty “collisions”

Rule Measures: Support & Confidence

Find all the rules X & Y ⇒ Z with minimum confidence and support

support, s: probability that a transaction contains {X Y Z}
confidence, c: conditional probability that a transaction having {X Y} also contains Z

Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F

With minimum support 50% and minimum confidence 50%, we have

A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)

[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
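The two measures can be checked by hand on the four transactions of this slide. The sketch below is a minimal illustration, not part of the original slides; the function names `support` and `confidence` are chosen here for readability.

```python
from fractions import Fraction

# Transactions copied from the slide's table (transaction ID -> items bought).
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent):
    """Conditional probability that a transaction with `antecedent` also has `consequent`."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

for lhs, rhs in [("A", "C"), ("C", "A")]:
    s = support(set(lhs) | set(rhs))
    c = confidence(lhs, rhs)
    print(f"{lhs} => {rhs}: support {float(s):.1%}, confidence {float(c):.1%}")
```

Running it should reproduce the slide's figures: A ⇒ C at 50% support and about 66.7% confidence, and C ⇒ A at 50% support and 100% confidence.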

Visualization of Association Rule Using Plane Graph

Visualization of Association Rule Using Rule Graph

Rule Mining: A Road Map

Boolean vs. quantitative associations (based on the types of values handled)
buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “UMiner”) [0.2%, 60%]
age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]

Single-dimensional vs. multi-dimensional associations (see the examples above)
Single-level vs. multiple-level analysis

What brands of beers are associated with what brands of diapers?
Various extensions

Correlation, causality analysis
Association does not necessarily imply correlation or causality

Maxpatterns and closed itemsets
Constraints enforced

E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?

Mining Association Rules: An Example

Min. support 50%
Min. confidence 50%

Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A ⇒ C:

support = support({A & C}) = 50%
confidence = support({A & C}) / support({A}) = 66.6%

The Apriori principle:

Any subset of a frequent itemset must be frequent

Mining Frequent Itemsets: the Key Step

Find the frequent itemsets: the sets of items that have minimum support

A subset of a frequent itemset must also be a frequent itemset

i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets

Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)

Use the frequent itemsets to generate association rules.

The Apriori Algorithm

Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Pseudo-code:

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;

The Apriori Algorithm: Example

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D to count C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1)
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D to count C2
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2)
itemset
{2 3 5}

Scan D to get L3
itemset   sup
{2 3 5}   2
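The trace on this slide can be reproduced with a short level-wise implementation. The sketch below is illustrative only. It assumes a minimum support count of 2, which is inferred from the tables ({4}, with count 1, is dropped from L1). The join step is simplified to "union of two frequent k-itemsets of size k+1"; after pruning, this yields the same candidates as the prefix-based join described in the slides.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One database scan: count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})  # L1
    frequent = dict(level)
    k = 1
    while level:
        # Join step (simplified): unions of two frequent k-itemsets with k + 1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        level = count(candidates)  # L(k+1)
        frequent.update(level)
        k += 1
    return frequent

# Database D from the slide.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(D, min_count=2)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

The printed itemsets should match L1, L2, and L3 above, ending with {2 3 5} at count 2.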

Candidates Generation

Suppose the items in Lk-1 are listed in an order

Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

How to Count Supports of Candidates?

Why is counting supports of candidates a problem?

The total number of candidates can be huge
One transaction may contain many candidates

Method:

Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a transaction

Example of Generating Candidates

L3 = {abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:

acde is removed because ade is not in L3

C4 = {abcd}
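The join and prune steps can be sketched directly on this example. The code below is a minimal illustration; the function name `generate_candidates` and the representation of itemsets as sorted tuples are assumptions made for this sketch.

```python
from itertools import combinations

def generate_candidates(prev_level):
    """Build C_k from L_(k-1); itemsets are sorted tuples of items."""
    prev = sorted(prev_level)
    prev_set = set(prev)
    k_minus_1 = len(prev[0])
    joined = []
    # Step 1 (self-join): same first k-2 items, last item of p < last item of q.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.append(p + (q[-1],))
    # Step 2 (prune): drop c if any (k-1)-subset of c is missing from L_(k-1).
    return [
        c for c in joined
        if all(s in prev_set for s in combinations(c, k_minus_1))
    ]

L3 = [tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]]
C4 = generate_candidates(L3)
print(["".join(c) for c in C4])
```

The join produces abcd and acde; pruning removes acde because ade is not in L3, leaving only abcd.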

Mining Distance-based Association Rules

Binning methods do not capture the semantics of interval data

Distance-based partitioning, more meaningful discretization considering:

density/number of points in an interval
“closeness” of points in an interval

Price($)   Equi-width (width $10)   Equi-depth (depth 2)   Distance-based
7          [0,10]                   [7,20]                 [7,7]
20         [11,20]                  [22,50]                [20,22]
22         [21,30]                  [51,53]                [50,53]
50         [31,40]
51         [41,50]
53         [51,60]
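The equi-depth and distance-based columns of the table can be reproduced with a few lines of code. This is a sketch under stated assumptions: the slides do not give a partitioning procedure at this point, so the rule "start a new interval when the gap to the previous point exceeds `max_gap`" and the value `max_gap=5` are stand-ins chosen here, not the method of the slides.

```python
prices = [7, 20, 22, 50, 51, 53]

def equi_depth(values, depth):
    """Split the sorted values into consecutive groups of `depth` points."""
    ordered = sorted(values)
    groups = [ordered[i:i + depth] for i in range(0, len(ordered), depth)]
    return [(g[0], g[-1]) for g in groups]

def gap_based(values, max_gap):
    """Start a new interval whenever two neighbours are more than `max_gap` apart.

    Hypothetical stand-in for distance-based partitioning.
    """
    ordered = sorted(values)
    groups = [[ordered[0]]]
    for v in ordered[1:]:
        if v - groups[-1][-1] > max_gap:
            groups.append([v])
        else:
            groups[-1].append(v)
    return [(g[0], g[-1]) for g in groups]

print(equi_depth(prices, depth=2))
print(gap_based(prices, max_gap=5))
```

Equi-depth splits 22 and 50 into the same interval although they are far apart, while the gap-based grouping keeps close prices together, which is the point the table makes.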

Clusters and Distance Measurements

S[X] is a set of N tuples t1, t2, …, tN, projected on the attribute set X
The diameter of S[X]:

d(S[X]) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} dist_X(t_i[X], t_j[X])}{N(N-1)}

dist_X: distance metric, e.g., Euclidean distance or Manhattan distance
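The diameter is the average distance over all ordered pairs of tuples. A minimal sketch follows, using Manhattan distance and the price clusters from the earlier binning table as sample data. Returning 0 for a single point is an assumption made here, since the formula is undefined for N = 1.

```python
def manhattan(p, q):
    """Manhattan distance between two equal-length tuples."""
    return sum(abs(a - b) for a, b in zip(p, q))

def diameter(points, dist=manhattan):
    """Sum of dist over all ordered pairs, divided by N * (N - 1)."""
    n = len(points)
    if n < 2:
        return 0.0  # assumption: a single point has diameter 0
    total = sum(dist(p, q) for p in points for q in points)
    return total / (n * (n - 1))

print(diameter([(20,), (22,)]))         # tight cluster
print(diameter([(50,), (51,), (53,)]))  # tight cluster
print(diameter([(7,), (20,), (22,)]))   # spread-out group
```

A small diameter indicates a dense cluster; the spread-out group has a much larger value than the two tight ones.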

Clusters and Distance Measurements (Cont.)

The diameter, d, assesses the density of a cluster C_X, where

d(C_X) \le d_0^X, \quad |C_X| \ge s_0

Finding clusters and distance-based rules

the density threshold, d_0, replaces the notion of support

modified version of the BIRCH clustering algorithm

Interestingness Measurements

Objective measures

Two popular measurements: support and confidence

Subjective measures (Silberschatz & Tuzhilin, KDD95)

A rule (pattern) is interesting if
it is unexpected (surprising to the user); and/or
actionable (the user can do something with it)

Criticism of Support and Confidence

Example 1: (Aggarwal & Yu, PODS98)

Among 5000 students
3000 play basketball
3750 eat cereal
2000 both play basketball and eat cereal

play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

             basketball   not basketball   sum(row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum(col.)    3000         2000             5000
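The argument can be made numeric by dividing each rule's confidence by the overall frequency of its consequent, the ratio P(B|A)/P(B) that the next slide calls lift. The sketch below uses only the counts from the contingency table; the helper name `rule_stats` is chosen for illustration.

```python
def rule_stats(n_both, n_antecedent, n_consequent, n_total):
    """Return (support, confidence, lift) of the rule antecedent => consequent."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift

# Counts from the contingency table.
total, basketball, cereal, both = 5000, 3000, 3750, 2000

eats = rule_stats(both, basketball, cereal, total)
skips = rule_stats(basketball - both, basketball, total - cereal, total)

for name, (s, c, lift) in [("eat cereal", eats), ("not eat cereal", skips)]:
    print(f"play basketball => {name}: "
          f"support {s:.1%}, confidence {c:.1%}, lift {lift:.2f}")
```

The first rule has a ratio below 1 (basketball players eat cereal less often than students overall), while the second has a ratio above 1 despite its lower support and confidence.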

Criticism of Support and Confidence

Example 2:
X and Y: positively correlated
X and Z: negatively related
yet the support and confidence of X=>Z dominate

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Rule    Support   Confidence
X=>Y    25%       50%
X=>Z    37.50%    75%

We need a measure of dependent or correlated events

corr_{A,B} = \frac{P(A \cup B)}{P(A) P(B)}

Here P(A ∪ B) is the probability that a transaction contains both A and B.

P(B|A)/P(B) is also called the lift of rule A => B

Other Interestingness Measures: Interest

Interest (correlation, lift)

\frac{P(A \wedge B)}{P(A) P(B)}

taking both P(A) and P(B) in consideration

P(A^B) = P(B)*P(A), if A and B are independent events

A and B negatively correlated, if the value is less than 1; otherwise A and B positively correlated

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Itemset   Support   Interest
X,Y       25%       2
X,Z       37.50%    0.9
Y,Z       12.50%    0.57
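The interest values in the table follow directly from the three bit vectors. A minimal sketch, with helper names `prob` and `interest` chosen for illustration:

```python
# Columns from the slide: one entry per transaction.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def prob(*columns):
    """Fraction of transactions in which every given column is 1."""
    rows = list(zip(*columns))
    return sum(all(r) for r in rows) / len(rows)

def interest(a, b):
    """P(A and B) / (P(A) * P(B)); equals 1 when A and B are independent."""
    return prob(a, b) / (prob(a) * prob(b))

for name, (a, b) in {"X,Y": (X, Y), "X,Z": (X, Z), "Y,Z": (Y, Z)}.items():
    print(f"{name}: support {prob(a, b):.2%}, interest {interest(a, b):.2f}")
```

The computed interest for X,Z is about 0.86, which the slide rounds to 0.9. Only X,Y exceeds 1, matching the positive correlation noted in Example 2.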

Association Rules: Summary

Association rule mining

probably the most significant contribution from the database community in KDD

A large number of papers have been published
Many interesting issues have been explored

An interesting research direction

Association analysis in other types of data: spatial data, multimedia data, time series data, etc.

Training Dataset

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

This follows an example from Quinlan’s ID3
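ID3 picks the attribute to split on by information gain, so the choice of age as the root of the tree can be checked on this dataset. The sketch below is illustrative: the function names are chosen here, and the age value of the third row, which appears as "30…40" in the extracted slide, is assumed to be the same bracket as "31…40".

```python
from collections import Counter
from math import log2

COLUMNS = ["age", "income", "student", "credit_rating", "buys_computer"]
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),  # assumption: "30…40" means "31…40"
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, label_index=-1):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    labels = [r[label_index] for r in rows]
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[label_index] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
best = max(gains, key=gains.get)
for name, g in gains.items():
    print(f"{name}: {g:.3f}")
print("split on:", best)
```

With these rows, age has the largest gain, which is consistent with age being tested at the root of the decision tree for buys_computer.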

Output: A Decision Tree for “buys_computer”

[Figure: decision tree]
age?
    <=30: student?
        no: no
        yes: yes
    31…40: yes
    >40: credit rating?
        excellent: no
        fair: yes

Presentation of Classification Results

AGNES (Agglomerative Nesting)

Introduced in Kaufman and Rousseeuw (1990)

Implemented in statistical analysis packages, e.g., S-Plus

Use the Single-Link method and the dissimilarity matrix.

Merge nodes that have the least dissimilarity

Go on in a non-descending fashion

Eventually all nodes belong to the same cluster

[Figure: three scatter plots, each with both axes running from 0 to 10]
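The merging procedure described for AGNES can be sketched in a few lines. This is a naive illustration, not the implementation found in any statistics package; Euclidean distance and the sample points are assumptions made here.

```python
def single_link_agnes(points):
    """Agglomerative clustering; returns the merges as (distance, cluster_a, cluster_b)."""
    def dist(p, q):
        # Assumption: Euclidean distance as the dissimilarity measure.
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the two closest members.
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, list(clusters[i]), list(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

points = [(1, 1), (1, 2), (4, 4), (5, 4), (9, 9)]
merges = single_link_agnes(points)
for d, a, b in merges:
    print(f"{d:.3f}: {a} + {b}")
```

The printed merge distances never decrease from one step to the next, and the last merge leaves all points in a single cluster.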

DBSCAN: Density Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points, with Eps = 1cm and MinPts = 5]
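The three point roles in the figure can be illustrated with a small classifier. This sketch covers only the labelling of points, not the full clustering procedure. It assumes Euclidean distance, counts a point as a member of its own Eps-neighbourhood, and uses made-up sample points.

```python
def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'outlier'."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    # Eps-neighbourhood of each point; the point itself is included (assumption).
    neighbours = {p: [q for q in points if dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbours[p]) >= min_pts}

    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbours[p]):
            labels[p] = "border"  # not dense itself, but within Eps of a core point
        else:
            labels[p] = "outlier"
    return labels

# Hypothetical sample data; MinPts = 5 as in the figure.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (2, 1), (5, 5)]
labels = classify_points(points, eps=1.0, min_pts=5)
for p, label in labels.items():
    print(p, label)
```

Here (1, 1) and (0.5, 0.5) have five points within Eps and are core points, the points near them are border points, and (5, 5) is an outlier.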

[Figure: reachability plot. Vertical axis: reachability-distance, with marks at ε, ε′, and undefined; horizontal axis: cluster-order of the objects]

Constraint-Based Clustering Analysis

Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem

Mining Time-Series and Sequence Data

Time-series plot

Mining Time-Series and Sequence Data: Trend analysis

A time series can be illustrated as a time-series graph which describes a point moving with the passage of time

Categories of Time-Series Movements

Long-term or trend movements (trend curve)

Cyclic movements or cycle variations, e.g., business cycles

Seasonal movements or seasonal variations

i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years.

Irregular or random movements

Social Impacts: Threat to Privacy and Data Security?

Is data mining a threat to privacy and data security?
“Big Brother”, “Big Banker”, and “Big Business” are carefully watching you
Profiling information is collected every time

You use your credit card, debit card, supermarket loyalty card, or frequent flyer card, or apply for any of the above
You surf the Web, reply to an Internet newsgroup, subscribe to a magazine, rent a video, join a club, or fill out a contest entry form
You pay for prescription drugs, or present your medical care number when visiting the doctor

Collection of personal data may be beneficial for companies and consumers, but there is also potential for misuse

Protect Privacy and Data Security

Fair information practices
International guidelines for data privacy protection
Cover aspects relating to data collection, purpose, use, quality, openness, individual participation, and accountability
Purpose specification and use limitation
Openness: Individuals have the right to know what information is collected about them, who has access to the data, and how the data are being used

Develop and use data security-enhancing techniques
Blind signatures
Biometric encryption
Anonymous databases

OLAP (Summarization) Display Using MS/Excel 2000

Market-Basket-Analysis (Association)—Ball graph

Display of Association Rules in Rule Plane Form

Display of Decision Tree (Classification Results)

Display of Clustering (Segmentation) Results

3D Cube Browser

Trends in Data Mining (1)

Scalable data mining methods
Constraint-based mining: use of constraints to guide data mining systems in their search for interesting patterns

Application exploration
Development of application-specific data mining systems
Invisible data mining (mining as built-in function)

Integration of data mining with database systems, data warehouse systems, and Web database systems
Quality assessment

Trends in Data Mining (2)

Standardization of data mining language
A standard will facilitate systematic development, improve interoperability, and promote the education and use of data mining systems in industry and society

Visual data mining
Uncertainty handling
New methods for mining complex types of data

More research is required towards the integration of data mining methods with existing data analysis techniques for the complex types of data

Web mining
Privacy protection and information security in data mining