Thursday, September 10, 2015

April 19, 2023

Introduction to Data Mining

Chen. Chun-Hsien

Department of Information Management

Chang Gung University


Outline

• Motivation for data mining
• What is data mining?
• Applications of data mining
• Data mining process
• Main data mining techniques
• Classification of data mining systems


Motivation

Data explosion problem
• Automated data collection tools and mature database technology
• 1 million new transactions per hour in the Walmart database
• Tremendous number of Web pages
• 40 billion photos on Facebook
• Big data in clouds

We are drowning in data, but starving for knowledge to make decisions

Solution: data mining
• One of the 10 emerging technologies that will change the world in the near future (MIT Technology Review)


What Is Data Mining?

Formal definition of data mining
• Automatic extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) knowledge (rules, regularities, patterns, trends, affinities) from large amounts of data

Alternative names
• Business intelligence, knowledge discovery in databases (KDD), data/pattern analysis, knowledge extraction, data dredging, information harvesting, data archeology, etc.


Example : Mining a Concept Hierarchy

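[Figure: concept hierarchy, from the most general level to the most specific]
• all: all
• region: Europe, North_America, ...
• country: Germany, Spain, Canada, Mexico, ...
• city: Frankfurt, Vancouver, Toronto, ...
• office: L. Chan, M. Wind, ...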


Part of International Sales Data

Region          Country  City         Office
North American  USA      New York     Queen
North American  Canada   Vancouver    L. Chan
North American  USA      L.A.         Bay Area
North American  USA      Boston       Northern Area
North American  Canada   Toronto      Central
North American  USA      Boston       Southern Area
North American  USA      New York     Queen
North American  USA      L.A.         Bay Area
North American  Mexico   Mexico City  Empire
North American  Canada   Toronto      Central
North American  USA      New York     Manhattan
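As a rough illustration of how a concept hierarchy could be read off this table, the sketch below nests the distinct values as region, country, city, office. The level order is taken from the hierarchy slide. The distinct-value count is a common heuristic (attributes with fewer distinct values tend to sit higher in the hierarchy) and is an assumption here, not something the slides state. The names `ROWS`, `build_hierarchy`, and `distinct_counts` are hypothetical.

```python
from collections import defaultdict

# Rows copied from the "Part of International Sales Data" table.
ROWS = [
    ("North American", "USA", "New York", "Queen"),
    ("North American", "Canada", "Vancouver", "L. Chan"),
    ("North American", "USA", "L.A.", "Bay Area"),
    ("North American", "USA", "Boston", "Northern Area"),
    ("North American", "Canada", "Toronto", "Central"),
    ("North American", "USA", "Boston", "Southern Area"),
    ("North American", "USA", "New York", "Queen"),
    ("North American", "USA", "L.A.", "Bay Area"),
    ("North American", "Mexico", "Mexico City", "Empire"),
    ("North American", "Canada", "Toronto", "Central"),
    ("North American", "USA", "New York", "Manhattan"),
]


def build_hierarchy(rows):
    """Nest distinct values as region -> country -> city -> sorted offices."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for region, country, city, office in rows:
        tree[region][country][city].add(office)
    # Convert to plain dicts and sorted lists for readable output.
    return {
        region: {
            country: {city: sorted(offices) for city, offices in cities.items()}
            for country, cities in countries.items()
        }
        for region, countries in tree.items()
    }


def distinct_counts(rows):
    """Count distinct values per column (assumed heuristic for level order)."""
    names = ("region", "country", "city", "office")
    return {name: len({row[i] for row in rows}) for i, name in enumerate(names)}


hierarchy = build_hierarchy(ROWS)
counts = distinct_counts(ROWS)
```

Here `counts` grows from region to office, which agrees with the level order shown on the hierarchy slide.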


Confluence of Multiple Disciplines

[Figure: data mining at the center, drawing on database technology, statistics, artificial intelligence, information science, machine learning, and visualization]


Evolution of Database Technology

1960s: Data collection, database creation, network DBMS

1970s: Relational data model, relational DBMS

1980s: Advanced data models (extended-relational, OO, spatial, temporal databases, etc.)

1990s onward: Data mining, data warehousing, multimedia databases, and the Web


Applications of Data Mining

Decision support
• Consumer understanding, service improvement
• Market trend analysis and management
• Risk analysis and management
• Fraud detection and management
• Medical decision support systems

Other applications
• Text mining
• Web analysis
• Biomedical informatics


Market Analysis and Management (1/2)

Data sources for analysis
• Credit card transactions, customer questions (FAQ), customer complaint calls, public lifestyle studies

Market basket analysis and cross-selling
• Associations/correlations between product sales
• Prediction based on the association information


Market Analysis and Management (2/2)

Customer profiling
• Find clusters of “model” customers who share the same characteristics: interests, spending habits, income level, etc.
• Data mining can tell you what types of customers buy what products (by clustering or classification techniques)

Identifying personalized customer requirements
• Identifying the best products for different customers
• Use prediction to find what factors will attract new customers


Risk Management and Analysis

Financial planning and asset evaluation
• Cash flow analysis and prediction
• Asset evaluation
• Time series analysis (trend analysis)

Competitive analysis and market segmentation
• Monitoring market directions and competitors
• Setting pricing strategy in a highly competitive market
• Grouping customers into classes and applying a class-based pricing procedure (multi-brand, multi-style strategies)


Fraud Detection and Management

Applications
• Health care, credit card services

Approach
• Use historical data to build models of fraudulent behavior, and use data mining to help identify similar instances

Examples
• Money laundering: detect suspicious money transactions
• Medical insurance: detect professional patients and rings of doctors


Other Applications

Text mining
• News classification: find related articles
• CRM data analysis: analyze customer Q&As
• Medical informatics: automatic classification of medical reports

Web mining: mining web access logs
• Analyzing the effectiveness of web marketing
• Improving Web site organization
• Discovering customer preferences and behavior

Biomedical informatics
• Finding genes related to genetic diseases
• Drug discovery


Steps in KDD Process (Technically)

[Figure: Databases → Relevant Data → Data Preprocessing → Data Mining → Pattern → Evaluation/Presentation → Knowledge]

Data mining
• The core step of the KDD process


Main Steps of a KDD Process (Fully)

Domain knowledge acquisition
• Learning relevant prior knowledge and the goals of the application

Data collection and preprocessing (may take 60% of the effort!)
• Data selection and integration: creating a target data set
• Data cleaning, data transformation, and data reduction (in Cloud)

Data mining
• Choosing the data mining function: association, classification, clustering, regression, summarization
• Choosing the mining algorithm(s)
• Searching for knowledge patterns of interest

Pattern evaluation and knowledge presentation
• Visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge


Mining On What Kind of Data?

• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
  • Spatial databases
  • Time-series data (temporal data)
  • Text databases and multimedia databases
  • Object-oriented databases
  • Heterogeneous and legacy databases
  • Web sites


Steps in KDD Process

[Figure: Databases → Relevant Data → Data Preprocessing]


Why Data Preprocessing?

Data in the real world is dirty
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results!
• Quality decisions must be based on quality data
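To make two of these problems concrete, here is a minimal sketch that repairs incomplete and inconsistent values in a tiny, made-up record set. The records, the `COUNTRY_CODES` mapping, and the choice of mean imputation are all assumptions for illustration; noisy values are not handled here.

```python
from statistics import mean

# Hypothetical records: one missing income, three spellings of the same country.
records = [
    {"id": 1, "country": "USA", "income": 52000},
    {"id": 2, "country": "U.S.A.", "income": None},  # incomplete: missing value
    {"id": 3, "country": "usa", "income": 48000},     # inconsistent: code differs
    {"id": 4, "country": "Canada", "income": 50000},
]

# Assumed mapping from lower-cased raw codes to one canonical name.
COUNTRY_CODES = {"usa": "USA", "u.s.a.": "USA", "canada": "Canada"}


def clean(rows):
    """Return copies of rows with canonical country names and imputed incomes."""
    known = [row["income"] for row in rows if row["income"] is not None]
    fill_value = mean(known)  # simple mean imputation, one option among many
    cleaned = []
    for row in rows:
        fixed = dict(row)
        fixed["country"] = COUNTRY_CODES.get(row["country"].lower(), row["country"])
        if fixed["income"] is None:
            fixed["income"] = fill_value
        cleaned.append(fixed)
    return cleaned


cleaned = clean(records)
```

Mean imputation is used only because it is short. Depending on the data, dropping the record or predicting the missing value may be more appropriate.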


Major Tasks in Data Preprocessing

Data cleaning

Data integration

Data transformation

Data reduction

Data discretization


Steps in KDD Process

[Figure: Databases → Relevant Data → Data Preprocessing → Data Mining → Pattern]


Main Data Mining Techniques

Association Rule Mining

Classification and Prediction

Cluster Analysis

Regression Analysis

Outlier Analysis

Trend Analysis


Main Data Mining Techniques (1/5)

Association rule mining
• Find association rules (correlation and causality)
• Form of association rules: X ⇒ Y [s, c]
• Simple form example: computer ⇒ software [s = 1%, c = 75%]
• Detailed form examples
  • sales(T, “computer”) ⇒ sales(T, “software”) [support = 1%, confidence = 75%]
  • buy(T, “Beer”) ⇒ buy(T, “Diaper”) [support = 2%, confidence = 70%]
  • age(X, “20..29”) ∧ income(X, “30..39K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%]


Association Rule Mining (Support and Confidence)

Given a transaction database, find all the rules X ⇒ Y that meet the minimum support and confidence
• I = {i1, i2, i3, ..., in}: set of all items
• T ⊆ I: a transaction
• Support, s: probability that a transaction contains both X and Y
• Confidence, c: conditional probability that a transaction containing X also contains Y

Transaction ID  Items Bought (T)
0001            A, B, C
0002            A, C
0003            A, D
0004            B, E, F

Example rules (support, confidence)
• A ⇒ C (50%, 66.6%)
• C ⇒ A (50%, 100%)

[Figure: Venn diagram of customers who buy X, customers who buy Y, and customers who buy both]
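The two example rules can be checked directly against the four transactions. The sketch below computes support and confidence as defined on this slide; the function names and the dictionary layout are illustrative choices, not part of the original material.

```python
# The four transactions from the slide, keyed by transaction ID.
TRANSACTIONS = {
    "0001": {"A", "B", "C"},
    "0002": {"A", "C"},
    "0003": {"A", "D"},
    "0004": {"B", "E", "F"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Estimate P(rhs | lhs) as support(lhs and rhs) / support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)


s_ac = support({"A", "C"}, TRANSACTIONS)
c_a_to_c = confidence({"A"}, {"C"}, TRANSACTIONS)
c_c_to_a = confidence({"C"}, {"A"}, TRANSACTIONS)
```

A and C occur together in 2 of 4 transactions (support 50%). A occurs in 3 transactions and C in 2, which gives the confidences 2/3 and 2/2 quoted for the two rules.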


Main Data Mining Techniques (2/5): Supervised Learning

Use a training set to construct a model for forecasting the outcome of future events. Two main types:

Classification
• Finding models that distinguish classes for future forecasts
• e.g., loan approval, customer classification, fingerprint recognition
• Model representation: decision tree, neural network

Prediction
• Predict unknown or missing numerical values for future forecasts
• e.g., stock price prediction
• Model representation: linear regression, neural network


Classification vs. Prediction

Use a training set to construct a model for forecasting the outcome of future events

Classification
• Predicts categorical class labels (mainly for two-class problems)
• Constructs a classification model to classify new data

Prediction
• Predicts numerical values
• Constructs a continuous-valued (mathematical) function to predict unknown or missing values

Typical applications
• Credit card approval
• Medical diagnosis and treatment
• Pattern recognition


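An Example of Classification (Fruit Classifier)

[Figure: a classifier takes input features and outputs a class label. Feature values shown: round, oval (shape); red, orange, yellow (color).]
• shape = round, color = red → Apple
• shape = round, color = orange → Orange
• (feature values not recoverable from the slide) → Mango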



A General Classifier

[Figure: input features → Classifier → output class label]



Model of Supervised Learning

The model has the form y = f(x1, x2, ..., xn)

[Figure: input features x1, x2, ..., xn → Classifier/Predictor → output y]

Main issues:
• What are x1, …, xn?
• How to get the model f?
• How to collect training data with output y?


Stages of Classification Tasks (Construction and Usage of Classification Model)

Model construction
[Figure: Training Data (I, O) → Classification Learning Algorithms → Classifier Model]

Model usage
[Figure: input features → Classifier Model → output class label]


Classification Methods

• Decision tree induction algorithms
• Bayesian classifiers
• Backpropagation neural networks
• Support vector machines (SVM)
• k-nearest neighbor classifier


An Example of Training Dataset

age     income  student  credit_rating  buys_PC
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Input features (I): age, income, student, credit_rating

Class label (O): buys_PC

This follows an example from Quinlan’s ID3
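Since the slide credits Quinlan's ID3, the sketch below computes the information gain of each input feature on this training set, which is the criterion ID3 uses to pick the attribute to test at a node. It is a hedged illustration rather than the lecture's own code; `DATA`, `entropy`, and `information_gain` are hypothetical names, and the age range is written as "31..40" in ASCII.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("age", "income", "student", "credit_rating")

# (age, income, student, credit_rating, buys_PC), copied from the table.
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the class label minus the weighted entropy after splitting."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
```

On this table `best` is `age`, which is consistent with a decision tree that tests age at the root.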


An Example of Classification Model (A decision tree for predicting buys_PC)

[Figure: decision tree; internal nodes are test attributes, branches are attribute values, leaves are class labels]

age?
  <= 30: student?
    no: no
    yes: yes
  31..40: yes
  > 40: credit rating?
    excellent: no
    fair: yes



Extracting Classification Rules from Tree

• Rules are easier for humans to understand
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction

Rule examples
• IF age = “<=30” AND student = “no” THEN buys_PC = “no”
• IF age = “<=30” AND student = “yes” THEN buys_PC = “yes”
• IF age = “31…40” THEN buys_PC = “yes”
• IF age = “>40” AND credit_rating = “fair” THEN buys_PC = “yes”
• IF age = “>40” AND credit_rating = “excellent” THEN buys_PC = “no”
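One way to sanity-check extracted rules is to encode them as a function and replay the training set through it. The sketch below does that for the buys_PC tree. The two rules for age ">40" are written the way the 14 training records require (fair gives "yes", excellent gives "no"). The function name `buys_pc` and the tuple layout are assumptions for illustration.

```python
def buys_pc(age, student, credit_rating):
    """Apply the IF-THEN rules, one per root-to-leaf path of the tree."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    if age == ">40":
        return "yes" if credit_rating == "fair" else "no"
    raise ValueError(f"unknown age group: {age!r}")


# (age, student, credit_rating, buys_PC) for the 14 training records;
# income is omitted because no rule tests it.
TRAINING = [
    ("<=30", "no", "fair", "no"),
    ("<=30", "no", "excellent", "no"),
    ("31..40", "no", "fair", "yes"),
    (">40", "no", "fair", "yes"),
    (">40", "yes", "fair", "yes"),
    (">40", "yes", "excellent", "no"),
    ("31..40", "yes", "excellent", "yes"),
    ("<=30", "no", "fair", "no"),
    ("<=30", "yes", "fair", "yes"),
    (">40", "yes", "fair", "yes"),
    ("<=30", "yes", "excellent", "yes"),
    ("31..40", "no", "excellent", "yes"),
    ("31..40", "yes", "fair", "yes"),
    (">40", "no", "excellent", "no"),
]

# Records whose rule-based prediction disagrees with the recorded label.
errors = [row for row in TRAINING if buys_pc(*row[:3]) != row[3]]
```

An empty `errors` list means the five rules reproduce every label in the training table.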


A Decision Tree for CAD Screening (Constructed from 500 Records)



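Discussion on Extracted Classification Rules (from Decision Tree for CAD Screening)

• In most cases, combining hypertension status with hsCRP concentration is enough to tell whether a person is a potential CAD patient. These cases account for 81% of the records (405 cases), with a misclassification rate of 2.47% (= 10/405).
• Conclusion: hypertension and hsCRP concentration are the main risk factors for distinguishing whether a person is a potential CAD patient.
• In some cases, Age and HDLc must also be taken into account to tell whether a person is a potential CAD patient. These cases account for 14.6% of the records (73 cases), with a somewhat higher misclassification rate of 5.48% (= 4/73).
• Together, the two situations above comprise 5 rules and cover 95.6% of the records.

Rule examples (the following 2 rules cover 81% of the cases)
• IF hypertension AND hsCRP > 0.316 THEN CAD
• IF no hypertension AND hsCRP < 0.545 THEN Normal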


Main Data Mining Techniques (3/5): Cluster Analysis

Cluster analysis (unsupervised learning)
• Class labels are unknown: group data to form new classes
• e.g., customer profiling (Amazon.com)
• Clustering principle: maximize the intra-class similarity and minimize the inter-class similarity
• Mainly for exploratory analysis


Example of Cluster Analysis

[Figure: 3 clusters (A, B, and C), with points X, Y, and Z as outliers]

Difficulty: the distribution of high-dimensional data cannot be inspected visually.



Major Clustering Approaches

• Partitioning algorithms: construct various partitions and then evaluate them by some criterion
• Hierarchical algorithms: create a hierarchical decomposition of the set of records using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: quantize the data space into a finite number of cells that form a grid structure on which clustering is performed
• Model-based: hypothesize a model for each cluster and find the best fit of the records to the given models



Hierarchical Clustering

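Use a distance matrix as the clustering criterion.

[Figure: objects a, b, c, d, e]
• Agglomerative (AGNES), Step 0 to Step 4: a, b, c, d, e separately → {a, b} → {d, e} → {c, d, e} → {a, b, c, d, e}
• Divisive (DIANA), Step 0 to Step 4: the same sequence in reverse, splitting {a, b, c, d, e} back into single objects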



A Dendrogram Shows Hierarchically Merged Clusters

• Decompose the data objects into several levels of nested clusters, forming a tree called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
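The merge-then-cut idea can be sketched with a small agglomerative (bottom-up) clustering of five objects. Everything specific here is an assumption for illustration: the one-dimensional coordinates in `POINTS` are made up so that the merge order is {a, b}, {d, e}, {c, d, e}, then all five, and single linkage (closest pair of members) is just one possible inter-cluster distance.

```python
from itertools import combinations

# Hypothetical 1-D positions for the five objects a..e.
POINTS = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 7.0, "e": 8.5}


def single_link(c1, c2, points):
    """Single-linkage distance: the closest pair of members across clusters."""
    return min(abs(points[p] - points[q]) for p in c1 for q in c2)


def agnes(points, max_distance=float("inf")):
    """Merge the two closest clusters until one remains or the cut is reached.

    Returns (clusters, merges); merges records each merged cluster together
    with the distance at which it was formed (the dendrogram heights).
    """
    clusters = [frozenset([name]) for name in sorted(points)]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: single_link(pair[0], pair[1], points),
        )
        distance = single_link(c1, c2, points)
        if distance > max_distance:
            break  # cutting the dendrogram at this level
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((sorted(c1 | c2), distance))
    return clusters, merges


_, full_merges = agnes(POINTS)
cut_clusters, _ = agnes(POINTS, max_distance=3.0)
```

Cutting at distance 3.0 stops before the last merge, so `cut_clusters` holds the two groups {a, b} and {c, d, e}.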


Gene Expression Analysis by Clustering

Finding differentially regulated genes

Clustering


Profile of Stroke Patients (Diagnosed by Indices of Chinese Medicine)


Main Data Mining Techniques (4/5): Example of Linear Regression

[Figure: data points in the x-y plane with the fitted line y = x + 1; the value Y1 at x = X1 is unknown]

• Predict y’s value at X1 using linear regression
• y = f(x), what is f?


Regression Analysis and Log-Linear Models in Prediction

Linear regression: Y = α + β X
• Two parameters, α and β, specify the line and are estimated from the training data
• Apply the least squares criterion to the training samples (X1, Y1), (X2, Y2), …, (Xn, Yn)

Multiple regression: Y = b0 + b1 X1 + b2 X2 + … + bn Xn
• Analyze b1, b2, …, bn to find the contribution of each variable

Log-linear models
• Example: estimate the probability p(a, b, c, d) = α_abc β_abd γ_acd δ_bcd
• log p(a, b, c, d) = log α_abc + log β_abd + log γ_acd + log δ_bcd
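For the single-variable case, the least squares estimates have a closed form: the slope is the sum of (x − x̄)(y − ȳ) divided by the sum of (x − x̄)², and the intercept is ȳ minus the slope times x̄. The sketch below applies it to made-up points that lie exactly on y = x + 1, the line used in the earlier regression example. The sample values and the query point 2.5 are assumptions.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for the line y = alpha + beta * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta = sxy / sxx                  # slope
    alpha = y_mean - beta * x_mean    # intercept
    return alpha, beta


# Hypothetical training samples lying exactly on y = x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x + 1 for x in xs]

alpha, beta = fit_line(xs, ys)
y1 = alpha + beta * 2.5  # predicted value at the assumed query point X1 = 2.5
```

With noise-free samples the fit recovers the intercept 1 and the slope 1. With real data the same formulas give the line that minimizes the sum of squared residuals.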


Main Data Mining Techniques (5/5): Other Data Mining Techniques

Outlier analysis
• Outlier: a data object that does not comply with the general behavior of the data
• It can be considered noise or an exception, but is quite useful in fraud detection and rare-event analysis

Trend analysis
• Trend and deviation: regression analysis
• Sequential pattern mining, periodicity analysis

Other pattern-directed or statistical analyses
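As one simple way to flag objects that do not follow the general behavior of the data, the sketch below uses a median-based rule: a value is reported when its distance from the median exceeds a multiple of the median absolute deviation (MAD). The slides do not prescribe a method, so the MAD rule, the threshold of 5, and the sample amounts are all assumptions.

```python
from statistics import median


def find_outliers(values, threshold=5.0):
    """Return values whose distance from the median exceeds threshold * MAD."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half the values equal the median; flag anything that differs.
        return [v for v in values if v != center]
    return [v for v in values if abs(v - center) / mad > threshold]


# Hypothetical transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 23, 20, 250]
suspicious = find_outliers(amounts)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value shifts them far less.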


Are All the “Discovered” Patterns Interesting?

• A data mining system/query may generate thousands of patterns, and not all of them are interesting.
• How to screen a large number of patterns: interestingness measures
  • A pattern is interesting if it is easily understood, potentially useful, novel, valid on new or test data with some degree of certainty, or if it validates some hypothesis that a user seeks to confirm
• Objective vs. subjective measures for pattern screening
  • Objective: based on statistics and structures of data patterns, e.g., support, confidence, etc.
  • Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.


Can We Find All and Only Interesting Patterns?

Completeness vs. optimization
• Find all the interesting patterns: completeness
  • Can a data mining system find all the interesting patterns?
• Search for only interesting patterns: optimization
  • Can a data mining system find only the interesting patterns?

Approaches
• First generate all the patterns and then filter out the uninteresting ones
• Generate only the interesting patterns (mining query optimization)


Classification Scheme of DM Techniques

General functionality
• Descriptive data mining
• Predictive data mining

Different views, different classifications
• Kinds of databases to be mined
• Kinds of knowledge to be discovered
• Kinds of techniques utilized
• Kinds of applications adapted


A Multi-Dimensional View of DM Technique Classification

Databases to be mined
• Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.

Knowledge to be mined
• Association, classification, clustering, trend, characterization, discrimination, deviation and outlier analysis, etc.
• Multiple/integrated functions and mining at multiple levels

Techniques utilized
• Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.

Applications adapted
• Retail, banking, stock market analysis, telecommunication, fraud analysis, Web mining, biomedical informatics, etc.


Summary for Data Mining

Data mining: automatic discovery of interesting knowledge from large amounts of data

A natural evolution of database technology, in great demand, with wide applications

A KDD process includes data pre-processing, data mining, pattern evaluation, and knowledge presentation

Main data mining functions: association rule mining (ARM), classification, clustering, outlier and trend analysis, characterization, etc.


Thank You !!!!

Have a Nice Day !
