data mining theory and practice dr. azuraliza abu bakar

Post on 18-Dec-2015


Data Mining

Theory and Practice

Dr. Azuraliza Abu Bakar

http://www.ftsm.ukm.my/jabatan/ts/aab/index.htm

What is Pattern Recognition

Pattern Recognition by Humans
– perceptual
– specialized
– decision making

Pattern Recognition by Computers
– benefit of automated pattern recognition
– advantage in complex calculations

Pattern Recognition from Data (Data Mining)

Pattern Recognition from Data

Pattern recognition from data is the process of learning from past data: studying the dependencies in the data and extracting knowledge from them.

What is Data?

 

No.  Studies   Education  Works  Income (D)
1    Poor      SPM        Poor   None
2    Poor      SPM        Good   Low
3    Moderate  SPM        Poor   Low
4    Moderate  Diploma    Poor   Low
5    Poor      SPM        Poor   None
6    Moderate  Diploma    Poor   Low
7    Good      MSC        Good   Medium
:
99   Poor      SPM        Good   Low
100  Moderate  Diploma    Poor   Low

What is Knowledge?

studies(Poor) AND work(Poor) => income(None)

studies(Poor) AND work(Good) => income(Low)

education(Diploma) => income(Low)

education(MSc) => income(Medium) OR income(High)

studies(Moderate) => income(Low)

studies(Good) => income(Medium) OR income(High)

education(SPM) AND work(Good) => income(Low)
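
The rules above can be read as condition/outcome pairs and checked mechanically against a record. The following is only a minimal sketch under that reading: the names `RULES`, `matching_rules`, and `predict_income` are hypothetical, and intersecting the outcomes of all fired rules is an assumption made here for illustration, not a method stated in the slides.

```python
# Each rule pairs a set of attribute conditions with the income values it allows.
# Attribute values follow the table; "Mod" is written out as "Moderate".
RULES = [
    ({"studies": "Poor", "work": "Poor"}, {"None"}),
    ({"studies": "Poor", "work": "Good"}, {"Low"}),
    ({"education": "Diploma"}, {"Low"}),
    ({"education": "MSc"}, {"Medium", "High"}),
    ({"studies": "Moderate"}, {"Low"}),
    ({"studies": "Good"}, {"Medium", "High"}),
    ({"education": "SPM", "work": "Good"}, {"Low"}),
]


def matching_rules(record, rules=RULES):
    """Return the rules whose every condition holds for the record."""
    return [
        (conditions, outcomes)
        for conditions, outcomes in rules
        if all(record.get(attr) == value for attr, value in conditions.items())
    ]


def predict_income(record, rules=RULES):
    """Intersect the outcomes of all fired rules.

    Returns an empty set if no rule fires or if the fired rules disagree.
    """
    fired = matching_rules(record, rules)
    if not fired:
        return set()
    return set.intersection(*(outcomes for _, outcomes in fired))


if __name__ == "__main__":
    # Record 7 of the table: Good studies, MSc education, Good work.
    print(predict_income({"studies": "Good", "education": "MSc", "work": "Good"}))
```

Note that the income value "None" is a string label from the table, not Python's `None`.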

What is Data Mining?

Extraction of knowledge from data

Exploration and analysis of large quantities of data to discover meaningful patterns.

Discover Knowledge

How does data mining look into data?

[Figure: diagram whose labels read "Data", "Data", "Data"]

Data Mining : Motivation

Huge amounts of data

Important need for turning data into useful information

The fast-growing amount of data, collected and stored in large and numerous databases, has exceeded the human ability to comprehend it without powerful tools

Questions

What goods should be promoted to this customer?

What is the probability that a certain customer will respond to a planned promotion?

Can one predict the most profitable securities to buy/sell during the next trading session?

Will this customer default on a loan or pay back on schedule?

What medical diagnosis should be assigned to this patient?

What kinds of cars should be sold this year?

Data Mining is simply...

Finding relationships

Making predictions

Data Mining: One Step of KDD

[Figure labels: Task, KDD, Data mining, Techniques]

Data Mining as a Step of KDD

Databases, flat files -> Cleaning and Integration -> Data Warehouse -> Selection and Transformation -> Data Mining -> Patterns -> Evaluation & Presentation -> Knowledge
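
The stages named in this slide can be pictured as a chain of functions, each consuming the output of the previous one. The sketch below is a toy illustration only: the function names, the "drop incomplete records" cleaning policy, and the frequency-count "mining" step are all assumptions chosen to keep the example short, and the integration and warehouse stages are left out.

```python
from collections import Counter


def clean(records):
    """Cleaning: drop records that contain a missing (None) value."""
    return [r for r in records if None not in r.values()]


def select(records, attrs):
    """Selection and transformation: keep the chosen attributes as tuples."""
    return [tuple(r[a] for a in attrs) for r in records]


def mine(tuples):
    """Data mining (toy version): count each attribute-value combination."""
    return Counter(tuples)


def evaluate(patterns, min_count):
    """Evaluation: keep only patterns seen at least min_count times."""
    return {p: n for p, n in patterns.items() if n >= min_count}


def kdd(records, attrs, min_count):
    """Run the stages in order: clean -> select -> mine -> evaluate."""
    return evaluate(mine(select(clean(records), attrs)), min_count)


if __name__ == "__main__":
    # Hypothetical records; the third one has a missing value and is dropped.
    records = [
        {"studies": "Poor", "works": "Poor", "income": "None"},
        {"studies": "Poor", "works": "Poor", "income": "None"},
        {"studies": "Moderate", "works": None, "income": "Low"},
        {"studies": "Good", "works": "Good", "income": "Medium"},
    ]
    print(kdd(records, ["studies", "works", "income"], min_count=2))
```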

Early Steps of Data Mining

Data preprocessing
– handling incomplete data, noisy data, uncertain data

Data discretization/representation
– transforms data into suitable values for the mining algorithm to find patterns

Data selection
– selects the suitable data for mining purposes
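
As a rough illustration of these three early steps, the sketch below fills in missing values, maps a numeric attribute to interval labels, and keeps only selected attributes. Everything here is an assumption made for the example: the helper names are hypothetical, filling with the most frequent value is just one possible policy for incomplete data, and any cut points passed to `discretize` are illustrative rather than taken from the slides.

```python
from collections import Counter


def fill_missing(rows, attr):
    """Preprocessing: replace missing (None) values of attr with the
    most frequent observed value."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**r, attr: mode if r[attr] is None else r[attr]} for r in rows]


def discretize(value, cut_points, labels):
    """Discretization: map a number to the label of its interval.

    cut_points must be sorted ascending and len(labels) == len(cut_points) + 1.
    A value equal to a cut point falls into the higher interval.
    """
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


def select(rows, attrs):
    """Selection: keep only the attributes needed for mining."""
    return [{a: r[a] for a in attrs} for r in rows]


if __name__ == "__main__":
    # Hypothetical rows with one missing value.
    rows = [
        {"sex": "Male", "chol": 233},
        {"sex": None, "chol": 286},
        {"sex": "Male", "chol": 204},
    ]
    rows = fill_missing(rows, "sex")
    for r in rows:
        # Illustrative cut points, not clinical thresholds from the slides.
        r["chol"] = discretize(r["chol"], [200, 240], ["low", "medium", "high"])
    print(select(rows, ["chol"]))
```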

Data Mining Techniques

Decision Trees

Neural Network

Genetic Algorithms

Fuzzy Set Theory

Rough Set Theory

Statistical Method (Regression Analysis)

Kinds of DB
– Relational
– Data warehouse
– Transactional DB
– Advanced DB system
– Flat files
– WWW

Kinds of Knowledge
– Classification
– Association
– Clustering
– Prediction
– …

Classification of Data Mining Systems


Techniques used
– DB-oriented techniques
– Statistics
– Machine learning
– Pattern recognition
– Neural networks
– Rough sets, etc.

Application adapted
– Finance
– Marketing
– Medical
– Stock
– Telecommunication, etc.

Data Mining: Confluence of Multiple Disciplines

DATA MINING
– Database technology
– Statistics
– Machine learning
– Information science
– Neural networks
– Pattern recognition
– Visualization
– Information retrieval
– High-performance computing
– Spatial data analysis

Data Mining

What are we looking at?

What are we looking for?

Data Mining Tasks

– Prediction
– Classification
– Clustering
– Association Rules
– Sequential Analysis
– Deviation analysis
– Similarity analysis
– Trend analysis

Classification

Classification algorithm

Training data

No.  Studies   Education  Works  Income (D)
1    Poor      SPM        Poor   None
2    Poor      SPM        Good   Low
3    Moderate  SPM        Poor   Low
4    Moderate  Diploma    Poor   Low
5    Poor      SPM        Poor   None
6    Moderate  Diploma    Poor   Low
7    Good      MSC        Good   Medium
:
99   Poor      SPM        Good   Low
100  Moderate  Diploma    Poor   Low

Classification Rules

If studies=“poor” and work=“poor” then Income=“None”

Classification

Test data

Studies   Education  Works  Income (D)
Moderate  Diploma    Poor   ?
Poor      SPM        Poor   ?
Moderate  Diploma    Poor   ?
Good      MSC        Good   ?
:

New data (studies=“poor” and work=“poor”) -> Classification rules -> classify -> Income=“None”
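
The two classification slides describe a train-then-classify flow: an algorithm turns training data into rules, and the rules are then applied to test or new data. A minimal sketch of that flow is given below. It uses a plain lookup-table learner (majority income per attribute combination), which is a stand-in chosen for brevity and not one of the classifiers named in this deck. The names `TRAIN`, `learn_rules`, and `classify` are hypothetical, and only the rows visible in the training table are included.

```python
from collections import Counter, defaultdict

# The visible rows of the training table: (studies, education, works, income).
TRAIN = [
    ("Poor", "SPM", "Poor", "None"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "SPM", "Poor", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "SPM", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "MSc", "Good", "Medium"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
]


def learn_rules(rows):
    """Training: for each attribute combination, remember the most
    frequent income value seen with it."""
    counts = defaultdict(Counter)
    for studies, education, works, income in rows:
        counts[(studies, education, works)][income] += 1
    return {cond: c.most_common(1)[0][0] for cond, c in counts.items()}


def classify(rules, studies, education, works, default="unknown"):
    """Classification: look up the learned decision for a new record."""
    return rules.get((studies, education, works), default)


if __name__ == "__main__":
    rules = learn_rules(TRAIN)
    # The four test records listed on the test-data slide.
    for record in [
        ("Moderate", "Diploma", "Poor"),
        ("Poor", "SPM", "Poor"),
        ("Moderate", "Diploma", "Poor"),
        ("Good", "MSc", "Good"),
    ]:
        print(record, "->", classify(rules, *record))
```

A lookup table cannot generalize to combinations it has never seen, which is why `classify` falls back to a default; the classifiers listed in the deck address exactly that gap.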

Types of Classifiers

Statistical Classifier
– Bayesian approach
– Multiple Regression
– K-nearest neighbour
– Naïve Bayes
– Causal Network
– Discriminant Analysis

Neural Classifier
– Hopfield Network
– Multilayer Perceptron
– Radial Basis Function
– Kohonen Networks

Rough Classifier

DATASET

 

No.  Studies   Education  Works  Income (D)
1    Poor      SPM        Poor   None
2    Poor      SPM        Good   Low
3    Moderate  SPM        Poor   Low
4    Moderate  Diploma    Poor   Low
5    Poor      SPM        Poor   None
6    Moderate  Diploma    Poor   Low
7    Good      MSC        Good   Medium
:
99   Poor      SPM        Good   Low
100  Moderate  Diploma    Poor   Low

RULES

studies(Poor) AND work(Poor) => income(None)

studies(Poor) AND work(Good) => income(Low)

education(Diploma) => income(Low)

education(MSc) => income(Medium) OR income(High)

studies(Moderate) => income(Low)

studies(Good) => income(Medium) OR income(High)

education(SPM) AND work(Good) => income(Low)

Comparing Classifiers

– Predictive Accuracy
– Speed
– Robustness
– Scalability
– Interpretability

Data Mining : Problems and Challenges

Noisy data

Difficult Training Set

Dynamic Databases

Large Databases

Incomplete Data

Performance Issues

Cost of the Learning Set
- number of examples necessary for training
- cost of assuring good accuracy

Time and Memory Constraints
- time complexity of the learning phase
- time taken for evaluation
- time it takes to reach a certain level of accuracy

Predictive Ability
- the ability to predict the correct decision for test or unseen data
- involves the generation of rules
- measuring the quality or accuracy of rules
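
One simple way to measure the quality of a rule is to count how many records it covers and how often its decision is right on those records. The sketch below does that for the visible rows of the income table. The measure names (coverage, accuracy) and the helper `rule_quality` are assumptions made for illustration, since the slides do not define a specific quality measure.

```python
# Visible rows of the income table used throughout the slides.
ROWS = [
    {"studies": "Poor", "education": "SPM", "works": "Poor", "income": "None"},
    {"studies": "Poor", "education": "SPM", "works": "Good", "income": "Low"},
    {"studies": "Moderate", "education": "SPM", "works": "Poor", "income": "Low"},
    {"studies": "Moderate", "education": "Diploma", "works": "Poor", "income": "Low"},
    {"studies": "Poor", "education": "SPM", "works": "Poor", "income": "None"},
    {"studies": "Moderate", "education": "Diploma", "works": "Poor", "income": "Low"},
    {"studies": "Good", "education": "MSc", "works": "Good", "income": "Medium"},
    {"studies": "Poor", "education": "SPM", "works": "Good", "income": "Low"},
    {"studies": "Moderate", "education": "Diploma", "works": "Poor", "income": "Low"},
]


def rule_quality(rows, condition, decision):
    """Return (coverage, accuracy) of the rule `condition => decision`.

    coverage: fraction of all rows that satisfy the condition.
    accuracy: among covered rows, the fraction with the stated decision
              (0.0 if the rule covers no rows).
    """
    attr, value = decision
    covered = [r for r in rows if all(r[k] == v for k, v in condition.items())]
    correct = [r for r in covered if r[attr] == value]
    coverage = len(covered) / len(rows)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


if __name__ == "__main__":
    # studies(Moderate) => income(Low) holds for every covered row.
    print(rule_quality(ROWS, {"studies": "Moderate"}, ("income", "Low")))
    # A hypothetical weaker rule, education(SPM) => income(Low), for contrast.
    print(rule_quality(ROWS, {"education": "SPM"}, ("income", "Low")))
```

On these nine rows the first rule covers four records and is right on all of them, while the contrast rule covers five records and is right on three.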

DATA

No.  AGE  SEX     CP              TRESTBPS  CHOL  FBS  RESTECG   THALACH  EXANG  OLDPEAK  SLOPE      CA  THAL          DISEASE
1    63   Male    Typical angina  145       233   T    LV hyper  150      No     2.3      Downslope  0   Fixed         No
2    67   Male    Asymp           160       286   F    LV hyper  108      Yes    1.5      Flat       3   Normal        Yes
3    67   Male    Asymp           120       229   F    LV hyper  129      Yes    2.6      Flat       2   Reversable    Yes
4    37   Male    Non-anginal     130       250   F    Normal    187      No     3.5      Downslope  0   Normal        No
5    41   Female  Atypical        130       204   F    LV hyper  172      No     1.4      Upsloping  0   Normal        No
6    56   Male    Atypical        120       236   F    Normal    178      No     0.8      Upsloping  0   Normal        No
7    62   Female  Asymp           140       268   F    LV hyper  160      No     3.6      Downslope  2   Normal        Yes
8    57   Female  Asymp           120       354   F    Normal    163      Yes    0.6      Upsloping  0   Normal        No
9    63   Male    Asymp           130       254   F    LV hyper  147      No     1.4      Flat       1   Reversable    Yes
10   53   Male    Asymp           140       203   T    LV hyper  155      Yes    3.1      Downslope  0   Reversable    Yes
11   57   Male    Asymp           140       192   F    Normal    148      No     0.4      Flat       0   Fixed defect  No
12   56   Female  Atypical        140       294   F    LV hyper  153      No     1.3      Flat       0   Normal        No
13   56   Male    Non-anginal     130       256   T    LV hyper  142      Yes    0.6      Flat       1   Fixed defect  Yes
14   44   Male    Atypical        120       263   F    Normal    173      No     0        Upsloping  0   Reversable    No
15   52   Male    Non-anginal     172       199   T    Normal    162      No     0.5      Upsloping  0   Reversable    No
16   57   Male    Non-anginal     150       168   F    Normal    174      No     1.6      Upsloping  0   Normal        No
17   48   Male    Atypical        110       229   F    Normal    168      No     1        Downslope  0   Reversable    Yes
18   54   Male    Asymp           140       239   F    Normal    160      No     1.2      Upsloping  0   Normal        No
19   48   Female  Non-anginal     130       275   F    Normal    139      No     0.2      Upsloping  0   Normal        No
20   49   Male    Atypical        130       266   F    Normal    171      No     0.6      Upsloping  0   Normal        No

Samples of the CLEV Dataset (before scaling)

oldpeak(0.7) => disease(No)

oldpeak(4.4) => disease(Yes)

chol(233) AND restecg(LV hypertrophy) => disease(No)

chol(204) AND restecg(LV hypertrophy) => disease(No)

chol(236) AND restecg(Normal) => disease(No)

chol(203) AND restecg(LV hypertrophy) => disease(Yes)

chol(294) AND restecg(LV hypertrophy) => disease(No)

chol(275) AND restecg(Normal) => disease(No)

chol(266) AND restecg(Normal) => disease(No)

chol(247) AND restecg(Normal) => disease(No)

chol(219) AND restecg(LV hypertrophy) => disease(No)

chol(266) AND restecg(LV hypertrophy) => disease(Yes)

chol(304) AND restecg(Normal) => disease(No)

chol(254) AND restecg(Normal) => disease(Yes)

chol(267) AND restecg(Normal) => disease(Yes)

chol(264) AND restecg(LV hypertrophy) => disease(No)

chol(234) AND restecg(LV hypertrophy) => disease(No)

Rules generated from data mining process
