lecture 01, introduction

90
INTRODUCTION TO DATA MINING Hari Sundaram adapted from slides by Jiawei Han and Kevin Chang [email protected] http://sundaram.cs.illinois.edu

Upload: aab023

Post on 14-Jul-2016

9 views

Category:

Documents


1 download

DESCRIPTION

INTRODUCTION TO DATA MINING by Hari Sundaram Adapted from: slides by Jiawei Han and Kevin Chang

TRANSCRIPT

Page 1: Lecture 01, Introduction

INTRODUCTION TO DATA MINING

Hari Sundaram

adapted from slides by Jiawei Han and Kevin Chang

[email protected]://sundaram.cs.illinois.edu

Page 2: Lecture 01, Introduction

DAIS@UIUC: DATA MINING,

DATABASE SYSTEMS, TEXT INFORMATION

SYSTEMS, NETWORKS

Different classes in Database and Information Systems

2

Zhai

Parameswaran

Han

Chang

Sundaram

Page 3: Lecture 01, Introduction

DATA MINING

Intro. to data mining (CS412: Han, Chang, Sundaram, Spring and Fall)

Data mining: Principles and algorithms (CS512: Han, Chang, Spring and Fall)

Seminar: Advanced Topics in Data mining (CS591Han: Fall and Spring, 1 credit)

3

Page 4: Lecture 01, Introduction

DATABASE SYSTEMS

Introduction to database systems (CS411: Chang, Parameswaran, Sinha — Spring and Fall)

Advanced database systems (CS511: Chang, Parameswaran — Fall or Spring)

Seminar: Human in the Loop Data Management (CS598: Parmeswaran, Fall)

4

Page 5: Lecture 01, Introduction

TEXT INFORMATION SYSTEMS

Text information system (CS410 Zhai: Spring)

Advanced text information systems (CS510) Zhai: Fall)

5

Page 6: Lecture 01, Introduction

NETWORKS + ADVERTISING

Seminar: Social & Information Networks (CS598, Sundaram, Fall, every two years)

Social & Information Networks (CS498, Sundaram, Fall, every two years; next class: Fall 2016)

Computational Advertising (CS498, Spring every two years, starting Spring 2017)

6

Page 7: Lecture 01, Introduction

Keep in MindBIOINFORMATICS

YAHOO DAIS SEMINAR

7

Page 8: Lecture 01, Introduction

CS412 CLASS MECHANICSEverything you wanted to

know 8

Page 9: Lecture 01, Introduction

class website:

9

https://wiki.cites.illinois.edu/wiki/display/cs412sp16/Syllabus

Page 10: Lecture 01, Introduction

lectures are online

10

http://bit.ly/1ZBxRuY

Page 11: Lecture 01, Introduction

BUT..11

Page 12: Lecture 01, Introduction

WHY YOU SHOULD COME

A lot of research shows that students who come to class tend to score better in exams

We’ll be solving problems in class that will help with understanding of the material

12

Page 13: Lecture 01, Introduction

sign up on piazza!

13

piazza.com/illinois/spring2016/cs412

Page 14: Lecture 01, Introduction

assignments

14

Written assignments (3) 7x3=21 points 5x3=15%

Programming assignments (2) 13.5x2=27 points 10x2=20%

Page 15: Lecture 01, Introduction

exams

15

Mid Term Exam 41 points 30%

Final Exam 48 points 35%

Page 16: Lecture 01, Introduction

project

16

Master’s students (mandatory) 34 points 25%

extra credit, undergrad 13.5 points 10%

rest of the grade scaled to 75%

kaggle.com

Page 17: Lecture 01, Introduction

GRADING

Will grade on a curve

Max points is 137; makes no difference to the grade, but there is empirical evidence to suggest that students are happier.

Will grade undergrads and grads on the same curve— there is no difference in performance.

Grad students taking the 4 credit class will need to do an extra project worth 25% of the grade

17

Page 18: Lecture 01, Introduction

regular feedback

18

why are you excited about this class?

what are you concerned about?

Page 19: Lecture 01, Introduction

MEET THE TA’S

Meet the TA’s

19

De Liao

Xiangyu “Joe” Chen

Page 20: Lecture 01, Introduction

Hari Sundaram Associate Professor, Department of Computer Science

Web: http://sundaram.cs.illinois.edu/ Email: [email protected]

What I do: design algorithms and build systems to understand and to shape crowd behavior in large scale social networks

Current Projects • Belief Evolution • Continuum representations for

networks (Meidani, Dey) • Persuasive message synthesis

(Karahalios) • Data Inference with privacy

policies (Kravets) • Sampling of large graphs • Network Summarization (Chang) • Crowd Clustering (Parmeswaran)

Page 21: Lecture 01, Introduction

Why are you taking this class?

21

Page 22: Lecture 01, Introduction

what do you want to do?22

wade

swim

dive

Page 23: Lecture 01, Introduction

WHY DATA MINING?

23

Page 24: Lecture 01, Introduction

There is an explosive growth of data: from

terabytes to petabytes

24

Page 25: Lecture 01, Introduction

MAJOR SOURCES OF DATA

Business:

Web, e-commerce, transactions, stocks, …

Science:

Remote sensing, astronomy, bioinformatics, scientific simulation, …

Society and everyone:

news, digital cameras, YouTube

25

Page 26: Lecture 01, Introduction

“We are drowning in data, but starving for knowledge

-John Naisbitt, 1982.

Page 27: Lecture 01, Introduction

WHAT IS DATA

MINING?27

Page 28: Lecture 01, Introduction

Extraction of interesting (non-trivial, implicit,

previously unknown and potentially useful) patterns or knowledge from huge

amount of data

28

Page 29: Lecture 01, Introduction

Is Data Mining a misnomer?

29

We don’t mine for data!

Page 30: Lecture 01, Introduction

Also known as

30

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Page 31: Lecture 01, Introduction

Is everything data mining?

31

Simple search and query processing

(Deductive) expert systems

Page 32: Lecture 01, Introduction

The Knowledge discovery process

32

databases task relevant data pattern evaluation

data mining

Knowledgedata warehouse

data cleaning

database, data warehousing community view

Page 33: Lecture 01, Introduction

EXAMPLE: WEB MINING

Data cleaning

Data integration from multiple sources

Warehousing the data

Data cube construction

Data selection for data mining

Data mining

Presentation of the mining results

Patterns and knowledge to be used or stored into knowledge-base

33

Page 34: Lecture 01, Introduction

Data mining in business intelligence

34

Increasing potentialto supportbusiness decisions End User

BusinessAnalyst

DataAnalyst

DBA

DecisionMaking

Data PresentationVisualization Techniques

Data MiningInformation Discovery

Data ExplorationStatistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data WarehousesData Sources

Paper, Files, Web documents, Scientific experiments, Database Systems

Page 35: Lecture 01, Introduction

The Machine Learning / Statistics View

35

input

Knowledgedata preprocessing

Data integration, Normalization, Feature selection, Dimension reduction

data mining

Pattern discovery, Association & correlation, Classification, Clustering, Outlier analysis

post-processing

Pattern evaluation, Pattern selection, Pattern interpretation, Pattern visualization

Page 36: Lecture 01, Introduction

WHICH VIEW DO YOU PREFER?

Which view do you prefer?

KDD vs. ML/Stat. vs. Business Intelligence

Depending on the data, applications, and your focus

Data Mining vs. Data Exploration

Business intelligence view

Warehouse, data cube, reporting but not much mining

Business objectives vs. data mining tools

Supply chain example: mining vs. OLAP vs. presentation tools

Data presentation vs. data exploration

36

Page 37: Lecture 01, Introduction

APPLICATIONS OF DATA MINING

37

Page 38: Lecture 01, Introduction

Web page analysis: from web page classification, clustering to PageRank & HITS algorithms

38

Page 39: Lecture 01, Introduction

Collaborative analysis & recommender systems

39

Page 40: Lecture 01, Introduction

Basket data analysis to targeted marketing

40

Page 41: Lecture 01, Introduction

Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis

41

microarray

biological network

Page 42: Lecture 01, Introduction

Software engineering and

data mining

42

fix bugs

code completioncode optimization

estimate costs

source code

binaries

execution traces

commits

Page 43: Lecture 01, Introduction

dedicated data mining tools

43

Orcale data mining tools

SAS

MS SQL Server Tools

Page 44: Lecture 01, Introduction

DATA MINING: A MULTI-

DIMENSIONAL VIEW

44

Page 45: Lecture 01, Introduction

The Data

45

Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks

Page 46: Lecture 01, Introduction

Data Mining Functions

46

Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

Descriptive vs. predictive data mining

Multiple/integrated functions and mining at multiple levels

Page 47: Lecture 01, Introduction

Techniques

47

Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.

Page 48: Lecture 01, Introduction

Applications

48

Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Page 49: Lecture 01, Introduction

WHAT KINDS OF DATA?

49

Page 50: Lecture 01, Introduction

Database-oriented datasets and applications

50

Relational databases, data warehouse, transactional databases

Object-relational databases, Heterogeneous databases and legacy databases

Page 51: Lecture 01, Introduction

Advanced datasets and

advanced applications

51

Data streams and sensor data

Structure data, graphs, social networks and information networks

Time-series data, temporal data, sequence data (incl. bio-sequences)

Multimedia databases

Spatial data and spatiotemporal data

Text databases The World-Wide Web

Page 52: Lecture 01, Introduction

WHAT CAN WE DISCOVER

WITH THIS DATA?

52

Page 53: Lecture 01, Introduction

Generalization

53

Information integration and data warehouse construction

Data cleaning, transformation, integration, and multidimensional data model

Page 54: Lecture 01, Introduction

Generalization

54

Data cube

Scalable methods for computing (i.e., materializing) multidimensional aggregates

OLAP (online analytical processing)

Page 55: Lecture 01, Introduction

Generalization

55

Multidimensional concept description: characterization and discrimination

Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region

Page 56: Lecture 01, Introduction

Association and Correlation Analysis

56

Frequent patterns

What items are frequently purchased together at the local grocery store?

Page 57: Lecture 01, Introduction

Association and Correlation Analysis

57

Association, Correlation, and Causality

there are subtle differences between them!

Page 58: Lecture 01, Introduction

An Association Rule

58

Diapers ⇒ Beer, [0.5%, 75%]support confidence

How to mine such patterns and rules efficiently in large datasets?

How to use such patterns for classification, clustering, and other applications?

Page 59: Lecture 01, Introduction

Association and Correlation Analysis

59

correlation measures the linear dependence between two numeric variables

Page 60: Lecture 01, Introduction

What is the difference between association

and correlation?

60

Association and Correlation Analysis

Diapers ⇒ Beer, [0.5%, 75%]

Page 61: Lecture 01, Introduction

correlation ≠

causation

61

Association and Correlation Analysis

Page 62: Lecture 01, Introduction

Classification

62

Classification and label prediction

Construct models (functions) based on some training examples

Describe and distinguish classes or concepts for future prediction

Predict some unknown class labels

climate, gas mileage

Page 63: Lecture 01, Introduction

Classification

63

Typical methods

Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …

Page 64: Lecture 01, Introduction

Classification

64

Typical applicationsCredit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …

Page 65: Lecture 01, Introduction

CLUSTER ANALYSIS

Unsupervised learning (i.e., Class label is unknown)

Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns

Principle: Maximizing intra-class similarity & minimizing interclass similarity

Many methods and applications

65

Page 66: Lecture 01, Introduction

OUTLIER ANALYSIS

Noise or exception? ― the value of the outlier is application dependent

Methods: byproduct of clustering or regression analysis, …

Useful in fraud detection, rare events analysis

66

A data object that does not comply with the general behavior of the data

Page 67: Lecture 01, Introduction

SEQUENTIAL PATTERN, TREND AND EVOLUTION ANALYSIS

Trend, time-series, and deviation analysis: e.g., regression and value prediction

Sequential pattern mining

e.g., first buy digital camera, then buy large SD memory cards

Periodicity analysis

Motifs and biological sequence analysis

Approximate and consecutive motifs

Similarity-based analysis

Mining data streams

Ordered, time-varying, potentially infinite, data streams

67

Page 68: Lecture 01, Introduction

MINING GRAPHS

Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)

68

Page 69: Lecture 01, Introduction

NETWORK ANALYSIS

Social networks: actors (objects, nodes) and relationships (edges)

e.g., author networks in CS, terrorist networks

Multiple heterogeneous networks

A person could be multiple information networks: friends, family, classmates, …

Links carry a lot of semantic information: Link mining

69

political blogs

terrorist networks

Page 70: Lecture 01, Introduction

MINING THE WEB

70

Web community discovery, opinion mining, usage mining, …

Page 71: Lecture 01, Introduction

EVALUATION

Is all discovered knowledge interesting?

71

May not be representative, be transient, …

One can discover a very large number of “patterns”

Some may fit only certain dimension space (time, location, …)

Page 72: Lecture 01, Introduction

EVALUATION

Why not discover only interesting knowledge?

72

Coverage

Descriptive vs. predictive

Typicality vs. novelty

Accuracy

Timeliness

Page 73: Lecture 01, Introduction

WHAT KINDS OF

TECHNOLOGIES ARE USED?

73

Page 74: Lecture 01, Introduction

A Confluence of Technologies

74

Data MiningMachine Learning

Pattern Recognition

Statistics

Applications

Algorithms

Database Technology

High Performance Computing

Visualization

Page 75: Lecture 01, Introduction

Why a confluence?

Page 76: Lecture 01, Introduction

Massive76

Algorithms must be scalable to handle big data

Page 77: Lecture 01, Introduction

High dimensional

77

Micro-array may have tens of thousands of

dimensions

Page 78: Lecture 01, Introduction

Complex, Diverse

78

Data streams and sensor data

Time-series data, temporal data, sequence data

Structure data, graphs, social and information networks

Spatial, spatiotemporal, multimedia, text and Web data

Software programs, scientific simulations

New and sophisticated applications

Page 79: Lecture 01, Introduction

MAJOR ISSUES IN

DATA MINING79

Page 80: Lecture 01, Introduction

Mining Methodology

80

Mining various and new kinds of knowledge Mining knowledge

in multi-dimensional space

An interdisciplinary effort Boosting the power of discovery in a networked environment

Handling noise, uncertainty, and incompleteness of data

Pattern evaluation and pattern- or constraint-guided mining

Page 81: Lecture 01, Introduction

User Interaction

81

Interactive miningPresentation and visualization of data mining results

Incorporation of background knowledge

Page 82: Lecture 01, Introduction

Efficiency and Scalability

82

Parallel, distributed, stream, and incremental mining methods

Space and Time complexity of data mining algorithms

Page 83: Lecture 01, Introduction

Data Type diversity

83

Handling complex data types

Mining dynamic, networked, and global data repositories

Page 84: Lecture 01, Introduction

Data mining and society

84

what is the social impact of data mining?

Privacy-preserving

Invisible

Page 85: Lecture 01, Introduction

Privacy is an important issue

Page 86: Lecture 01, Introduction

Summary

86

Major issues

Data mining: Discovering interesting patterns and knowledge from massive amount of data

A natural evolution of science and information technology, in great demand, with wide applications

A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

Mining can be performed on a variety of data

Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.

Data mining technologies and applications

Page 87: Lecture 01, Introduction

A BRIEF HISTORY1989 IJCAI Workshop on Knowledge Discovery in Databases

Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)

1991-1994 Workshops on Knowledge Discovery in Databases

Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)

Journal of Data Mining and Knowledge Discovery (1997)

ACM SIGKDD conferences since 1998 and SIGKDD Explorations

More conferences on data mining

PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc.

ACM Transactions on KDD (2007)

87

Page 88: Lecture 01, Introduction

KDD CONFERENCES

ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)

SIAM Data Mining Conf. (SDM)

(IEEE) Int. Conf. on Data Mining (ICDM)

European Conf. on Machine Learning and Principles and practices of Knowledge Discovery and Data Mining (ECML-PKDD)

Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)

Int. Conf. on Web Search and Data Mining (WSDM)

88

Page 89: Lecture 01, Introduction

RELATED JOURNALS AND CONFERENCES

Journals:

Data Mining and Knowledge Discovery (DAMI or DMKD)

IEEE Trans. On Knowledge and Data Eng. (TKDE)

KDD Explorations

ACM Trans. on KDD

Conferences:

DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …

Web and IR conferences: WWW, SIGIR, WSDM

ML conferences: ICML, NIPS

PR conferences: CVPR, ICCV

89

Page 90: Lecture 01, Introduction

REFERENCE BOOKSE. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011

J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed. , 2011

T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, 2009

B. Liu, Web Data Mining, Springer 2006

Y. Sun and J. Han, Mining Heterogeneous Information Networks, Morgan & Claypool, 2012

P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005

I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005

T. M. Mitchell, Machine Learning, McGraw Hill, 1997

S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002

R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000

T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003

U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996

U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001

S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998

90