Introduction to Data Mining
Chen, Chun-Hsien, Department of Information Management, Chang Gung University
Thursday, September 10, 2015

Upload: melanie-smith

Post on 27-Dec-2015


Page 1:

April 19, 2023 Introduction to Data Mining 1

Introduction to Data Mining

Chen, Chun-Hsien

Department of Information Management

Chang Gung University

Page 2:

Outline

• Motivation for data mining
• What is data mining?
• Applications of data mining
• Data mining process
• Main data mining techniques
• Classification of data mining systems

Page 3:

Motivation

• Data explosion problem
    • Automated data collection tools and mature database technology
    • 1 million new transactions per hour in the Walmart database
    • Tremendous number of Web pages
    • 40 billion photos on Facebook
    • Big data in clouds
• We are drowning in data but starving for the knowledge needed to make decisions
• Solution: data mining
    • One of the 10 emerging technologies that will change the world in the near future (MIT Technology Review)

Page 4:

What Is Data Mining?

• Formal definition of data mining
    • Automatic extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) knowledge (rules, regularities, patterns, trends, affinities) from large amounts of data
• Alternative names
    • Business intelligence, knowledge discovery in databases (KDD), data/pattern analysis, knowledge extraction, data dredging, information harvesting, data archeology, etc.

Page 5:

Example: Mining a Concept Hierarchy

[Figure: concept hierarchy tree with levels all > region > country > city > office]
• all
    • Europe
        • Germany: Frankfurt, ...
        • Spain, ...
    • North_America
        • Canada: Vancouver (offices L. Chan, ..., M. Wind), Toronto, ...
        • Mexico, ...

Page 6:

Part of International Sales Data

Region         Country  City         Office
North America  USA      New York     Queen
North America  Canada   Vancouver    L. Chan
North America  USA      L.A.         Bay Area
North America  USA      Boston       Northern Area
North America  Canada   Toronto      Central
North America  USA      Boston       Southern Area
North America  USA      New York     Queen
North America  USA      L.A.         Bay Area
North America  Mexico   Mexico City  Empire
North America  Canada   Toronto      Central
North America  USA      New York     Manhattan
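One way a concept hierarchy can be suggested from a table like this is to count the distinct values in each column: a column with fewer distinct values tends to sit higher in the hierarchy. The sketch below applies that heuristic to the rows above. The heuristic is an assumption made for illustration, not necessarily the method behind the slide, and the names `rows`, `columns`, `distinct`, and `hierarchy` are hypothetical.

```python
# Rows of the sales table: (Region, Country, City, Office).
rows = [
    ("North America", "USA", "New York", "Queen"),
    ("North America", "Canada", "Vancouver", "L. Chan"),
    ("North America", "USA", "L.A.", "Bay Area"),
    ("North America", "USA", "Boston", "Northern Area"),
    ("North America", "Canada", "Toronto", "Central"),
    ("North America", "USA", "Boston", "Southern Area"),
    ("North America", "USA", "New York", "Queen"),
    ("North America", "USA", "L.A.", "Bay Area"),
    ("North America", "Mexico", "Mexico City", "Empire"),
    ("North America", "Canada", "Toronto", "Central"),
    ("North America", "USA", "New York", "Manhattan"),
]
columns = ["Region", "Country", "City", "Office"]

# Count the distinct values per column.
distinct = {
    name: len({row[i] for row in rows})
    for i, name in enumerate(columns)
}

# Assumed heuristic: fewer distinct values means a more general level.
hierarchy = sorted(columns, key=lambda name: distinct[name])

print(distinct)
print(" > ".join(hierarchy))
```

On this sample the ordering agrees with the levels in the hierarchy figure (region, country, city, office), but on other data the heuristic can fail, so its output is best treated as a suggestion to be checked by a domain expert.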

Page 7:

Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Artificial Intelligence, Information Science, Machine Learning, and Visualization]

Page 8:

Evolution of Database Technology

• 1960s: Data collection, database creation, network DBMS
• 1970s: Relational data model, relational DBMS
• 1980s: Advanced data models (extended-relational, OO, spatial, temporal databases, etc.)
• 1990s onward: Data mining, data warehousing, multimedia databases, and the Web

Page 9:

Applications of Data Mining

• Decision support
    • Consumer understanding, service improvement
    • Market trend analysis and management
    • Risk analysis and management
    • Fraud detection and management
    • Medical decision support systems
• Other applications
    • Text mining
    • Web analysis
    • Biomedical informatics

Page 10:

Market Analysis and Management (1/2)

• Data sources for analysis
    • Credit card transactions, customer questions (FAQ), customer complaint calls, public lifestyle studies
• Market basket analysis and cross-selling
    • Associations/correlations between product sales
    • Prediction based on the association information

Page 11:

Market Analysis and Management (2/2)

• Customer profiling
    • Find clusters of “model” customers who share the same characteristics: interests, spending habits, income level, etc.
    • Data mining can tell you what types of customers buy what products (by clustering or classification techniques)
• Identifying personalized customer requirements
    • Identifying the best products for different customers
    • Use prediction to find what factors will attract new customers

Page 12:

Risk Management and Analysis

• Finance planning and asset evaluation
    • Cash flow analysis and prediction
    • Asset evaluation
    • Time series analysis (trend analysis)
• Competitive analysis and market segmentation
    • Monitoring market directions and competitors
    • Setting pricing strategy in a highly competitive market
    • Grouping customers / a class-based pricing procedure (multi-brand, multi-style strategies)

Page 13:

Fraud Detection and Management

• Applications
    • Health care, credit card services
• Approach
    • Use historical data to build models of fraudulent behavior, and use data mining to help identify similar instances
• Examples
    • Money laundering: detect suspicious money transactions
    • Medical insurance: detect professional patients and rings of doctors

Page 14:

Other Applications

• Text mining
    • News classification: find related articles
    • CRM data analysis: analyze customer Q&As
    • Medical informatics: automatic classification of medical reports
• Web mining: mining Web access logs
    • Analyzing the effectiveness of Web marketing
    • Improving Web site organization
    • Discovering customer preferences and behavior
• Biomedical informatics
    • Finding genes related to genetic diseases
    • Drug discovery

Page 15:

Steps in KDD Process (Technically)

[Figure: flow from Databases to Knowledge, with stages labeled Relevant Data, Data Preprocessing, Data Mining, Pattern, and Evaluation/Presentation]

• Data mining: the core step of the KDD process

Page 16:

Main Steps of a KDD Process (Fully)

• Domain knowledge acquisition
    • Learning relevant prior knowledge and the goals of the application
• Data collection and preprocessing (may take 60% of the effort!)
    • Data selection and integration: creating a target data set
    • Data cleaning, data transformation, and data reduction (in the cloud)
• Data mining
    • Choosing the data mining functions: association, classification, clustering, regression, summarization
    • Choosing the mining algorithm(s)
    • Searching for knowledge patterns of interest
• Pattern evaluation and knowledge presentation
    • Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

Page 17:

Mining On What Kind of Data?

• Relational databases
• Data warehouses
• Transactional databases
• Advanced databases and information repositories
    • Spatial databases
    • Time-series data (temporal data)
    • Text databases and multimedia databases
    • Object-oriented databases
    • Heterogeneous and legacy databases
    • Web sites

Page 18:

Steps in KDD Process

[Figure: the KDD flow diagram, showing the stages Databases, Relevant Data, and Data Preprocessing]

Page 19:

Why Data Preprocessing?

• Data in the real world is dirty
    • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • Noisy: containing errors or outliers
    • Inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
    • Quality decisions must be based on quality data

Page 20:

Major Tasks in Data Preprocessing

Data cleaning

Data integration

Data transformation

Data reduction

Data discretization
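A few of these tasks can be illustrated on a toy data set. The sketch below shows one possible form of cleaning (filling a missing value, unifying inconsistent codes), transformation (min-max scaling), and discretization (binning age). The records, the alias table, the choice of the median as fill value, and all function names are assumptions made for illustration.

```python
from statistics import median

# Hypothetical raw records: one missing income, inconsistent country codes.
records = [
    {"age": 25, "income": 30000, "country": "USA"},
    {"age": 41, "income": None, "country": "U.S.A."},
    {"age": 35, "income": 50000, "country": "usa"},
    {"age": 52, "income": 70000, "country": "Canada"},
]

# Assumed mapping from lowercase spellings to one canonical code.
COUNTRY_ALIASES = {"usa": "USA", "u.s.a.": "USA", "canada": "Canada"}


def clean(rows):
    """Data cleaning: fill missing incomes with the median, unify codes."""
    known = [r["income"] for r in rows if r["income"] is not None]
    fill_value = median(known)
    cleaned_rows = []
    for r in rows:
        income = r["income"] if r["income"] is not None else fill_value
        country = COUNTRY_ALIASES.get(r["country"].lower(), r["country"])
        cleaned_rows.append({"age": r["age"], "income": income, "country": country})
    return cleaned_rows


def min_max(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def discretize_age(age):
    """Data discretization: map a numeric age to one of three intervals."""
    if age <= 30:
        return "<=30"
    if age <= 40:
        return "31..40"
    return ">40"


cleaned = clean(records)
incomes_scaled = min_max([r["income"] for r in cleaned])
age_bins = [discretize_age(r["age"]) for r in cleaned]
print(cleaned, incomes_scaled, age_bins)
```

The median is used here only because it is less sensitive to extreme values than the mean; other fill strategies may suit other data better.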

Page 21:

Steps in KDD Process

[Figure: the KDD flow diagram, showing the stages Databases, Relevant Data, Data Preprocessing, Data Mining, and Pattern]

Page 22:

Main Data Mining Techniques

Association Rule Mining

Classification and Prediction

Cluster Analysis

Regression Analysis

Outlier Analysis

Trend Analysis

Page 23:

Main Data Mining Techniques (1/5)

• Association rule mining
    • Find association rules (correlation and causality)
    • Form of association rules: X ⇒ Y [s, c]
    • Simple form example: computer ⇒ software [s = 1%, c = 75%]
    • Detailed form examples
        • sales(T, “computer”) ⇒ sales(T, “software”) [support = 1%, confidence = 75%]
        • buy(T, “Beer”) ⇒ buy(T, “Diaper”) [support = 2%, confidence = 70%]
        • age(X, “20..29”) ^ income(X, “30..39K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%]

Page 24:

Association Rule Mining (Support and Confidence)

• Given a transaction database, find all the rules X ⇒ Y with minimum support and confidence
    • Support, s: the probability that a transaction contains both X and Y
    • Confidence, c: the conditional probability that a transaction containing X also contains Y
• I = {i1, i2, i3, ..., in}: the set of all items
• T ⊆ I: a transaction

Transaction ID  Items Bought (T)
0001            A, B, C
0002            A, C
0003            A, D
0004            B, E, F

• A ⇒ C (50%, 66.7%)
• C ⇒ A (50%, 100%)

[Figure: Venn diagram of customers who buy X, customers who buy Y, and customers who buy both]
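The support and confidence of the two example rules can be checked directly from the four transactions. This is a minimal sketch of the definitions only, not a full rule-mining algorithm; the function names `support` and `confidence` are illustrative.

```python
# The four transactions from the slide, keyed by transaction ID.
transactions = {
    "0001": {"A", "B", "C"},
    "0002": {"A", "C"},
    "0003": {"A", "D"},
    "0004": {"B", "E", "F"},
}


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(lhs, rhs):
    """Estimate of P(rhs | lhs): support of lhs and rhs together over support of lhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)


print(support({"A", "C"}), confidence({"A"}, {"C"}))  # rule A => C
print(support({"A", "C"}), confidence({"C"}, {"A"}))  # rule C => A
```

Both rules share the same support because support depends only on the combined itemset {A, C}; the confidences differ because A appears in three transactions while C appears in two.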

Page 25:

Main Data Mining Techniques (2/5): Supervised Learning

• Use a training set to construct a model for forecasting the outcome of future events. Two main types:
• Classification
    • Finding models that distinguish classes, for future forecasting
    • e.g., loan approval, customer classification, fingerprint recognition
    • Model representation: decision tree, neural network
• Prediction
    • Predict some unknown/missing numerical values, for future forecasting
    • e.g., stock price prediction
    • Model representation: linear regression, neural network

Page 26:

Classification vs. Prediction

• Use a training set to construct a model for forecasting the outcome of future events
• Classification
    • Predicts categorical class labels (mainly for two-class problems)
    • Constructs a classification model to classify new data
• Prediction
    • Predicts numerical values
    • Constructs a continuous-valued (mathematical) function to predict unknown or missing values
• Typical applications
    • Credit card approval
    • Medical diagnosis and treatment
    • Pattern recognition

Page 27:

April 19, 2023 Main Data Mining Techniques for Biomedical Informatics 27

An Example of Classification (Fruit Classifier)

[Figure: input features → Classifier → output class label. Feature values shown: oval, red, orange, yellow. Class labels shown: Apple, Orange, Mango]
• shape = round, color = red → Apple
• shape = round, color = orange → Orange

Page 28:

A General Classifier

[Figure: input features → Classifier → output class label]

Page 29:

Model of Supervised Learning

• The model takes the form $y = f(x_1, x_2, \ldots, x_n)$

[Figure: input features x1, x2, ..., xn → Classifier/Predictor → output y]

Main issues:
• What are x1, …, xn?
• How to get the model f?
• How to collect training data with output y?

Page 30:

Stages of Classification Tasks (Construction and Usage of Classification Model)

• Model construction: Training Data (I, O) → Classification Learning Algorithms → Classifier Model
• Model usage: input features → Classifier Model → output class label

Page 31:

April 19, 2023 Data Mining: Concepts and Techniques 31

Classification Methods

• Decision tree induction algorithms
• Bayesian classifiers
• Backpropagation neural networks
• SVM (support vector machines)
• k-nearest neighbor classifier

Page 32:

An Example of Training Dataset

age     income  student  credit_rating  buys_PC
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

This follows an example from Quinlan’s ID3

• Input features (I): age, income, student, credit_rating
• Class label (O): buys_PC
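Since the table is attributed to Quinlan's ID3, it may help to see how ID3-style attribute selection would treat it. The sketch below computes the entropy of the class label and the information gain of each input feature, then picks the feature with the largest gain as the root test. It covers only the selection step, not full tree induction, and the names `DATA`, `ATTRS`, `entropy`, and `info_gain` are illustrative.

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit_rating"]

# The 14 training rows; the last field is the class label buys_PC.
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, attr_index):
    """Reduction in label entropy obtained by splitting on one attribute."""
    base = entropy([r[-1] for r in rows])
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {name: info_gain(DATA, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)

for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:.3f}")
print("root test attribute:", best)
```

The attribute with the largest gain is `age`, which is consistent with the decision tree for buys_PC testing age at its root.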

Page 33:

An Example of Classification Model (A Decision Tree for Predicting buys_PC)

[Figure: decision tree; internal nodes are test attributes, branches are attribute values, leaves are class labels]
age?
    <=30: student?
        no: no
        yes: yes
    31..40: yes
    >40: credit rating?
        excellent: no
        fair: yes

Page 34:

Extracting Classification Rules from Tree

• Rules are easier for humans to understand
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rule examples (the two rules for age “>40” are stated so that they agree with the training dataset)
    • IF age = “<=30” AND student = “no” THEN buys_PC = “no”
    • IF age = “<=30” AND student = “yes” THEN buys_PC = “yes”
    • IF age = “31…40” THEN buys_PC = “yes”
    • IF age = “>40” AND credit_rating = “fair” THEN buys_PC = “yes”
    • IF age = “>40” AND credit_rating = “excellent” THEN buys_PC = “no”
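IF-THEN rules of this kind map directly onto conditional code. The sketch below encodes one rule per root-to-leaf path, with the two age ">40" outcomes chosen to agree with the rows of the training dataset (fair gives "yes", excellent gives "no"). The function name `predict_buys_pc` and the sample rows are illustrative.

```python
def predict_buys_pc(age, student, credit_rating):
    """Apply the IF-THEN rules, one branch per root-to-leaf path."""
    if age == "<=30":
        # Young customers: the outcome depends on student status.
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        # Middle group: always predicted to buy.
        return "yes"
    # Remaining case, age ">40": the outcome depends on the credit rating.
    return "yes" if credit_rating == "fair" else "no"


# A few rows from the training table: (age, student, credit_rating, buys_PC).
samples = [
    ("<=30", "no", "fair", "no"),
    ("<=30", "yes", "excellent", "yes"),
    ("31..40", "no", "excellent", "yes"),
    (">40", "yes", "fair", "yes"),
    (">40", "no", "excellent", "no"),
]

for age, student, rating, label in samples:
    print(age, student, rating, "->", predict_buys_pc(age, student, rating), "expected", label)
```

Income never appears in the rules because the tree does not test it on any path.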

Page 35:

A Decision Tree for CAD Screening (Constructed from 500 Records)

Page 36:

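Discussion on Extracted Classification Rules (from Decision Tree for CAD Screening)

• In most cases, combining hypertension status with the hsCRP concentration is enough to tell whether a person is a potential CAD patient. These cases account for 81% of the records (405 cases), with a misclassification rate of 2.47% (= 10/405).
    • Conclusion: hypertension and hsCRP concentration are the main risk factors for distinguishing potential CAD patients.
• In some cases, Age and HDLc must also be taken into account to tell whether a person is a potential CAD patient. These cases account for 14.6% of the records (73 cases), with a misclassification rate of 5.48% (= 4/73), which is somewhat higher.
• Together, the two situations above comprise 5 rules and cover 95.6% of the records.
• Rule examples (the following 2 rules cover 81% of the cases)
    • IF hypertension AND hsCRP > 0.316 THEN CAD
    • IF no hypertension AND hsCRP < 0.545 THEN Normal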

Page 37:

Main Data Mining Techniques (3/5): Cluster Analysis

• Cluster analysis (unsupervised learning)
    • Class label is unknown: group data to form new classes
    • e.g., customer profiling (Amazon.com)
• Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
• Mainly for exploratory analysis

Page 38:

Example of Cluster Analysis

[Figure: scatter plot with 3 clusters (A, B, and C) and points X, Y, and Z as outliers]

• Difficulty: the distribution of high-dimensional data cannot be inspected visually.

Page 39:

Major Clustering Approaches

• Partitioning algorithms: construct various partitions and then evaluate them by some criterion
• Hierarchy algorithms: create a hierarchical decomposition of the set of records using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: quantize the data space into a finite number of cells that form a grid structure, on which clustering is performed
• Model-based: a model is hypothesized for each of the clusters, and the goal is to find the best fit of the records to the given models
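As a concrete instance of the partitioning approach, the sketch below runs k-means on one-dimensional points. The slide does not name k-means, so its use here is an assumption, as are the sample points, the initial centroids, and the function name `k_means_1d`.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(centroids)
    groups = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the group of its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        # Update step: each centroid moves to the mean of its group
        # (an empty group keeps its previous centroid).
        new_centroids = [
            sum(g) / len(g) if g else c
            for g, c in zip(groups, centroids)
        ]
        if new_centroids == centroids:
            break  # converged: the assignment can no longer change
        centroids = new_centroids
    return centroids, groups


points = [1, 2, 3, 10, 11, 12]
centroids, groups = k_means_1d(points, centroids=[1, 12])
print(centroids, groups)
```

The criterion being improved is the within-cluster squared distance. The result depends on the initial centroids, so in practice the algorithm is usually restarted from several initializations.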

Page 40:

Hierarchical Clustering

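• Use a distance matrix as the clustering criterion.

[Figure: objects a, b, c, d, e. Agglomerative (AGNES), step 0 to step 4: a and b merge into {a, b}; d and e merge into {d, e}; c joins to give {c, d, e}; finally everything merges into {a, b, c, d, e}. Divisive (DIANA) runs in the opposite direction, its step 0 starting from {a, b, c, d, e} and its step 4 ending with single objects.]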

Page 41:

A Dendrogram Shows Hierarchically Merged Clusters

• Decompose the data objects into several levels of nested clusters, organized as a tree called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
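The bottom-up merging recorded by a dendrogram can be sketched as follows. The code repeatedly merges the two closest clusters and stops once `k` clusters remain, which corresponds to cutting the dendrogram at that level. Single linkage (the distance between the closest members), the one-dimensional sample points, and the function name `agglomerate` are assumptions made for illustration.

```python
def agglomerate(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # step 0: every object is its own cluster
    merges = []                       # merge history, i.e. the dendrogram levels
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters], merges


clusters, merges = agglomerate([1, 2, 6, 7, 15], k=3)
print(clusters)
for left, right, dist in merges:
    print("merged", left, "and", right, "at distance", dist)
```

Running the loop down to `k=1` would yield the complete merge history, from which any cut level can be read off afterwards.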

Page 42:

Gene Expression Analysis by Clustering

Finding differentially regulated genes

Clustering

Page 43:

Profile of Stroke Patients (Diagnosed by Indices of Chinese Medicine)

Page 44:

Main Data Mining Techniques (4/5): Example of Linear Regression

[Figure: scatter plot of y against x with the fitted line y = x + 1; the value Y1 at x = X1 is to be predicted]

• Predict y’s value at X1 using linear regression
• y = f(x); what is f?

Page 45:

Regression Analysis and Log-Linear Models in Prediction

• Linear regression: $Y = \alpha + \beta X$
    • The two parameters, $\alpha$ and $\beta$, specify the line and are estimated from the training data
    • Apply the least squares criterion to the training samples $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$
• Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n$
    • Analyze $b_1, b_2, \ldots, b_n$ to find the contribution of each variable
• Log-linear models
    • Example: estimate the probability
      $p(a, b, c, d) = \alpha_{abc}\,\beta_{abd}\,\gamma_{acd}\,\delta_{bcd}$
      $\log p(a, b, c, d) = \log \alpha_{abc} + \log \beta_{abd} + \log \gamma_{acd} + \log \delta_{bcd}$
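For the single-variable case, the least squares estimates have a closed form: the slope is the sum of products of the deviations of X and Y from their means, divided by the sum of squared deviations of X, and the intercept then follows from the two means. The sketch below applies this to sample points that lie exactly on the line y = x + 1. The sample points and the function name `fit_line` are illustrative.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for the line Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


xs = [0, 1, 2, 3]
ys = [1, 2, 3, 4]  # points on the line y = x + 1
alpha, beta = fit_line(xs, ys)

x1 = 4
y1 = alpha + beta * x1  # predicted value at a new point X1
print(alpha, beta, y1)
```

With noisy data the fitted line would no longer pass through every point, but the same two formulas still give the line with the smallest sum of squared vertical errors.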

Page 46:

Other Data Mining Techniques (5/5)

• Outlier analysis
    • Outlier: a data object that does not comply with the general behavior of the data
    • It can be considered noise or an exception, but it is quite useful in fraud detection and rare-event analysis
• Trend analysis
    • Trend and deviation: regression analysis
    • Sequential pattern mining, periodicity analysis
• Other pattern-directed or statistical analyses
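One simple statistical way to flag objects that do not follow the general behavior of the data is the z-score: how many standard deviations a value lies from the mean. The slides do not prescribe a specific method, so the z-score test, the threshold of 2, the sample amounts, and the function name `z_score_outliers` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts with one unusually large value.
amounts = [10, 12, 11, 13, 12, 11, 10, 50]
outliers = z_score_outliers(amounts)
print(outliers)
```

Because a large outlier inflates both the mean and the standard deviation, this test can miss outliers in small samples; measures based on the median are a common, more robust alternative.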

Page 47:

Are All the “Discovered” Patterns Interesting?

• A data mining system or query may generate thousands of patterns, and not all of them are interesting.
• How to screen a large number of patterns: interestingness measures
    • A pattern is interesting if it is easily understood, potentially useful, novel, valid on new or test data with some degree of certainty, or if it validates some hypothesis that a user seeks to confirm
• Objective vs. subjective measures for pattern screening
    • Objective: based on statistics and structures of data patterns, e.g., support, confidence, etc.
    • Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.

Page 48:

Can We Find All and Only Interesting Patterns?

• Completeness vs. optimization
    • Find all the interesting patterns: completeness
        • Can a data mining system find all the interesting patterns?
    • Search for only interesting patterns: optimization
        • Can a data mining system find only the interesting patterns?
• Approaches
    • First generate all the patterns and then filter out the uninteresting ones
    • Generate only the interesting patterns (mining query optimization)

Page 49:

Classification Scheme of DM Techniques

• General functionality
    • Descriptive data mining
    • Predictive data mining
• Different views, different classifications
    • Kinds of databases to be mined
    • Kinds of knowledge to be discovered
    • Kinds of techniques utilized
    • Kinds of applications adapted

Page 50:

A Multi-Dimensional View of DM Technique Classification

• Databases to be mined
    • Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
• Knowledge to be mined
    • Association, classification, clustering, trend, characterization, discrimination, deviation and outlier analysis, etc.
    • Multiple/integrated functions and mining at multiple levels
• Techniques utilized
    • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
• Applications adapted
    • Retail, banking, stock market analysis, telecommunication, fraud analysis, Web mining, biomedical informatics, etc.

Page 51:

Summary for Data Mining

Data mining: automatic discovery of interesting knowledge from large amounts of data

A natural evolution of database technology, in great demand, with wide applications

A KDD process includes data pre-processing, data mining, pattern evaluation, and knowledge presentation

Main data mining functions: association rule mining (ARM), classification, clustering, outlier and trend analysis, characterization, etc.

Page 52:

Thank You !!!!

Have a Nice Day !