overview - university of michiganjizhu/jizhu/fudan/s415overview.pdf · Œ provide better,...

32
Stats 415 - Overview Ji Zhu, Michigan Statistics 1 Overview Ji Zhu 445C West Hall 734-936-2577 [email protected]

Upload: others

Post on 12-Sep-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 1

Overview

Ji Zhu445C West Hall

[email protected]

Page 2: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 2

What is Data Mining?• Data mining is a multi-disciplinary field of study

concerned with the design of algorithms that allowcomputers to learn from large data repositories.

• Non-trivial extraction of implicit, previouslyunknown and potentially useful information fromdata

• There are many other definitions.

Page 3: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 3

Data Mining Examples and Non-Examples• Data mining

– Certain names are more prevalent in certain USlocations (O’Brien, O’Rurke, O’Reilly... in Bostonarea)

– Group together similar documents returned bysearch engine according to their context (e.g.Amazon rainforest, Amazon.com, etc.)

• Not data mining– Look up phone number in phone directory– Query a web search engine for information

about “Amazon”

Page 4: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 4

Page 5: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 5

Why Mine Data? Scientific Viewpoint• Data collected and stored at enormous speeds

(GB/hour)– remote sensors on a satellite (NASA)– telescopes scanning the skies (SDSS)– microarrays generating gene expression data

(MEDLINE)– scientific simulations generating terabytes of

data (GIS)

Page 6: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 6

• Data mining may help scientists– in classifying and segmenting data– in hypothesis formation– etc

Page 7: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 7

Why Mine Data? Commercial Viewpoint• Lots of data is being collected and warehoused

– Web data, e-commerce (Google, Yahoo, Amazon,Ebay)

– Purchases at department and grocery stores(Walmart)

– Bank/credit card transactions (Bank of America,Visa, Mastercard)

Page 8: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 8

• Competitive pressure is strong– Provide better, customized services for an edge

Page 9: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 9

Mining Large Data Sets - Motivation• There is often information “hidden” in the data that

is not readily evident.

• Human analysts may take months to discoveruseful information.

• Much of the data is never analyzed at all.

Page 10: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 10

Technological Driving Factors• Larger, cheaper memory

• Faster, cheaper processors– the CRAY of 20 years ago is now on your desk

• Success of relational databases and the Web– everybody is a “data owner”

• New ideas in machine learning and statistics

Page 11: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 11

Origins of Data Mining• Draws ideas from machine learning, pattern

recognition, statistics and database systems.

Statistics

DatabaseSystems

MachineLearning/PatternRecognition

Data Mining

Page 12: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 12

• Traditional techniques may be unsuitable due to– enormity of data– high dimensionality of data– heterogeneous, distributed nature of data

Page 13: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 13

Data Mining vs Statistics• Traditional statistics

– first hypothesize, then collect data, then analyze– often model-oriented

• Data mining– few if any a priori hypothesis– often algorithm-oriented rather than

model-oriented

Page 14: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 14

• Different?– Yes, in terms of culture, motivation; however...– Statistical ideas are very useful in data mining, e.g.,

in validating whether discovered knowledge isuseful

• Increasing overlap at the boundary of statistics anddata mining: use the tools of probability andstatistics to provide a mathematical framework for– posing data mining problems– formulating solutions to those problems

Page 15: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 15

Data Mining vs Machine Learning• To first-order, very little difference...

– Data mining relies heavily on ideas frommachine learning (and from statistics)

• Some differences between data mining andmachine learning– More emphasis in data mining on scalability– Data mining is somewhat more

applications-oriented

Page 16: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 16

Two Types of Data Mining Tasks• Prediction methods: Use some variables to predict

unknown or future values of other variables

• Description methods: Find human-interpretablepatterns that describe the data

Page 17: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 17

Examples of Data Mining Tasks• Visualization (Descriptive)

• Classification (Predictive)

• Regression (Predictive)

• Association analysis (Descriptive)

• Clustering (Descriptive)

Page 18: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 18

Classification: Definition• Given a collection of data points (training set)

– Each data point contains a set of variables, oneof the variables is the class

• Find a model for the class variable as a function ofthe values of other variables

• Goal: previously unseen data points should beassigned a class as accurately as possible– A test set is used to determine the accuracy of

the model

Page 19: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 19

Classification Example: Customer Scoring• Example: a bank has a database of 1 million past

customers, 10% of whom took out mortgages

• Use data mining to predict whether a newcustomer will take out a “mortgage or not” basedon the customer data

• Customer data– Other credit data– Demographic data on the customer

Page 20: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 20

Classification Example: Spam DetectionCustomize an email spam detection system forindividual user. Relative frequencies in a message ofmost commonly occurring words and punctuationmarks.

george you your hp free re removespam 0.00 2.26 1.38 0.02 0.52 0.13 0.28email 1.27 1.27 0.44 0.90 0.07 0.42 0.01

Page 21: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 21

Classification Example: Microarray

SID42354SID31984SID301902SIDW128368SID375990SID360097SIDW325120ESTsChr.10SIDW365099SID377133SID381508SIDW308182SID380265SIDW321925ESTsChr.15SIDW362471SIDW417270SIDW298052SID381079SIDW428642TUPLE1TUP1ERLUMENSIDW416621SID43609ESTsSID52979SIDW357197SIDW366311ESTsSMALLNUCSIDW486740ESTsSID297905SID485148SID284853ESTsChr.15SID200394SIDW322806ESTsChr.2SIDW257915SID46536SIDW488221ESTsChr.5SID280066SIDW376394ESTsChr.15SIDW321854WASWiskottHYPOTHETICALSIDW376776SIDW205716SID239012SIDW203464HLACLASSISIDW510534SIDW279664SIDW201620SID297117SID377419SID114241ESTsCh31SIDW376928SIDW310141SIDW298203PTPRCSID289414SID127504ESTsChr.3SID305167SID488017SIDW296310ESTsChr.6SID47116MITOCHONDRIAL60ChrSIDW376586HomosapiensSIDW487261SIDW470459SID167117SIDW31489SID375812DNAPOLYMERSID377451ESTsChr.1MYBPROTOSID471915ESTsSIDW469884HumanmRNASIDW377402ESTsSID207172RASGTPASESID325394H.sapiensmRNAGNALSID73161SIDW380102SIDW299104

BREAS

TRE

NAL

MELAN

OMA

MELAN

OMA

MCF7D

-repro

COLON

COLON

K562B-

repro

COLON NSCLC

LEUKEM

IARE

NAL

MELAN

OMA

BREAS

TCN

SCN

SRE

NAL

MCF7A

-repro

NSCLC

K562A-

repro

COLON CN

SNS

CLCNS

CLCLEU

KEMIA CNS

OVAR

IANBR

EAST

LEUKEM

IAME

LANOM

AME

LANOM

AOV

ARIAN

OVAR

IANNS

CLC RENA

LBR

EAST

MELAN

OMA

OVAR

IANOV

ARIAN

NSCLC RENA

LBR

EAST

MELAN

OMA

LEUKEM

IACO

LONBR

EAST

LEUKEM

IACO

LON CNS

MELAN

OMA

NSCLC

PROS

TATE

NSCLC RENA

LRE

NAL

NSCLC RENA

LLEU

KEMIA

OVAR

IANPR

OSTAT

ECO

LONBR

EAST

RENA

LUN

KNOW

N

Page 22: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 22

Deviation/Anomaly Detection• Detect significant deviations from normal behavior

• Applications:– Credit card fraud detection– Network intrusion detection

Page 23: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 23

Fraud Detection• Credit card fraud detection

– Credit card losses in the US are over 1 billion $per year

– Roughly 1 in 50k transactions are fraudulent

• Fair-Issac’s fraud detection software based onneural networks, led to reported fraud decreases of30 to 50%

• Issues: false alarm rate vs missed detection – whatis the tradeoff?

Page 24: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 24

Regression• Predict a value of a given continuous valued

variable based on the values of other variables,assuming a linear or nonlinear model ofdependency

• Examples– Predicting sales amounts of new product based

on advertising expenditure– Predicting wind velocities as a function of

temperature, humidity, air pressure, etc– Time series prediction of stock market indices

Page 25: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 25

Clustering: Definition• Given a set of data points, each having a set of

variables, find clusters such that– data points in one cluster are more “similar” to

one another, and– data points in separate clusters are “less similar”

to one another.

Page 26: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 26

• Similarity measures– Euclidean distance if variables are continuous– Other problem-specific measures

Page 27: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 27

Clustering Example: Market Segmentation• Goal: subdivide a market into distinct subsets of

customers where any subset may conceivably beselected as a market target to be reached with adistinct marketing mix.

• Collect different variables of customers based ontheir geographical and lifestyle related information

• Find clusters of similar customers

Page 28: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 28

Clustering Examples• Document clustering: find groups of documents

that are similar to each other based on theimportant terms appearing in them

• Clustering stocks based on their movements everyday

Page 29: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 29

Association Rule Discovery: Definition• Given a set of records each of which contains some

number of items from a given collection– Produce dependency rules which will predict

occurrence of an item based on occurrences ofother items

• Goal is to discover interesting local patterns in thedata rather than to characterize the data globally

Page 30: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 30

ID Items1 Bread, Coke, Milk2 Beer, Bread3 Beer, Coke, Diaper, Milk4 Beer, Bread, Diaper, Milk5 Coke, Diaper, Milk

Rules discovered

• {Coke} =⇒ {Milk}

• {Diaper, Milk} =⇒ {Beer}

Page 31: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 31

Association Rule Discovery Example• Supermarket shelf management

– Goal: To identify items that are bought togetherby sufficiently many customers

– A classic rule: If a customer buys diaper andmilk, then he is very likely to buy beer

• Amazon, Netflix

Page 32: Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better, customized services for an edge. Stats 415 - Overview Ji Zhu, Michigan Statistics 9 Mining

Stats 415 - Overview Ji Zhu, Michigan Statistics 32

Data Mining: the Downside• Hype?

• Data dredging, snooping and fishing– Finding spurious structure in data that is not

real

• Historically, “data mining” was derogatory term inthe statistics community– Rhine paradox– The Super Bowl fallacy– Bangladesh butter prices and the US stock

market