Introduction to Knowledge Discovery in Databases and Data Mining

Prof. Carolina Ruiz, Department of Computer Science, Worcester Polytechnic Institute


Post on 10-Jan-2016


Page 1: Introduction to knowledge  Discovery  in Databases and  Data Mining

Prof. Carolina Ruiz

Department of Computer Science

Worcester Polytechnic Institute

INTRODUCTION TO

KNOWLEDGE DISCOVERY IN DATABASES AND DATA MINING

Page 2: Introduction to knowledge  Discovery  in Databases and  Data Mining

“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad et al. 1996]

• Raw Data → Data Mining → Patterns
  • Analytical Patterns (rules, decision trees)
  • Statistical Patterns (data distribution)
  • Visual Patterns

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases." AI Magazine, 17(3), pp. 37-54. Fall 1996.

WHAT IS DATA MINING? OR MORE GENERALLY, KNOWLEDGE DISCOVERY IN DATABASES (KDD)

Page 3: Introduction to knowledge  Discovery  in Databases and  Data Mining

NEED FOR DATA MINING

• Data are being gathered and stored extremely fast

• Computational tools and techniques are needed to help humans in summarizing, understanding, and taking advantage of accumulated data

Page 4: Introduction to knowledge  Discovery  in Databases and  Data Mining


DATA ANALYSIS (KDD) PROCESS

• data sources
• data management (databases, data warehouses) → data
• data “pre”-processing (noisy/missing data, dim. reduction) → clean data
• data analysis / data mining (analytical, statistical, visual) → models
• model/pattern evaluation (quantitative, qualitative) → “good” model
• model/patterns deployment (prediction, decision support), applied to new data
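The stages of the process above can be traced end to end on a toy example. The sketch below is only an illustration under stated assumptions: the records, the field names `hours` and `passed`, and the midpoint-threshold "model" are all made up for this example and do not come from the slides. Evaluation is done on the same cleaned records for brevity; in practice a model would be evaluated on data held out from mining.

```python
from statistics import mean

# Data sources / data management: a tiny in-memory table (hypothetical records).
raw = [
    {"hours": 1.0, "passed": False},
    {"hours": 2.0, "passed": False},
    {"hours": None, "passed": True},   # missing value to be handled below
    {"hours": 6.0, "passed": True},
    {"hours": 7.0, "passed": True},
    {"hours": 3.0, "passed": False},
]

def preprocess(rows):
    """Pre-processing: drop records with missing values."""
    return [r for r in rows if r["hours"] is not None]

def mine_threshold(rows):
    """Data mining: learn a one-feature rule, here the midpoint of the class means."""
    pos = mean(r["hours"] for r in rows if r["passed"])
    neg = mean(r["hours"] for r in rows if not r["passed"])
    return (pos + neg) / 2

def predict(threshold, hours):
    """Deployment: apply the learned rule to a (possibly new) value."""
    return hours >= threshold

def evaluate(threshold, rows):
    """Quantitative evaluation: fraction of records the rule gets right."""
    correct = sum(predict(threshold, r["hours"]) == r["passed"] for r in rows)
    return correct / len(rows)

clean = preprocess(raw)                   # clean data
threshold = mine_threshold(clean)         # model
accuracy = evaluate(threshold, clean)     # model evaluation
new_prediction = predict(threshold, 5.5)  # prediction on new data

print(f"threshold={threshold}, accuracy={accuracy}, predict(5.5)={new_prediction}")
```

Each function corresponds to one stage, so a stage can be swapped out (for example, imputing missing values instead of dropping them) without touching the others.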

Page 5: Introduction to knowledge  Discovery  in Databases and  Data Mining

• Machine Learning (AI)
  • Contributes (semi-)automatic induction of empirical laws from observations & experimentation
• Statistics
  • Contributes language, framework, and techniques
• Pattern Recognition
  • Contributes pattern extraction and pattern matching techniques
• Databases
  • Contributes efficient data storage, data cleansing, and data access techniques
• Data Visualization
  • Contributes visual data displays and data exploration
• High Performance Comp.
  • Contributes techniques for efficiently handling complexity
• Application Domain
  • Contributes domain knowledge

KDD IS INTERDISCIPLINARY: TECHNIQUES COME FROM MULTIPLE FIELDS

Page 6: Introduction to knowledge  Discovery  in Databases and  Data Mining

• Confirmatory (verification)
  • Given a hypothesis, verify its validity against the data
• Exploratory (discovery)
  • Predictive patterns
    • Patterns for predicting behavior of newly encountered entities
  • Descriptive patterns
    • Patterns for presenting the behavior of observed entities in a human-understandable format

DATA MINING MODES
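The difference between the two modes can be illustrated on a handful of records. This is a minimal sketch, assuming made-up records and a simple single-attribute rule form "IF x THEN y" scored by the fraction of matching records; the 0.75 cutoff is an arbitrary choice for the example.

```python
from itertools import permutations

# Hypothetical observations: each record is the set of attributes that hold.
records = [
    {"a", "b", "d"},
    {"a", "b", "d"},
    {"a", "c"},
    {"b", "d"},
    {"a", "b"},
]

def confidence(antecedent, consequent, rows):
    """Fraction of rows containing `antecedent` that also contain `consequent`."""
    covered = [r for r in rows if antecedent in r]
    if not covered:
        return 0.0
    return sum(consequent in r for r in covered) / len(covered)

# Confirmatory mode: the analyst supplies the hypothesis "IF a THEN b"
# and checks it against the data.
hypothesis_conf = confidence("a", "b", records)

# Exploratory mode: no hypothesis is given; scan every single-attribute rule
# and keep those that reach the chosen cutoff.
attributes = sorted(set().union(*records))
discovered = {
    (x, y): confidence(x, y, records)
    for x, y in permutations(attributes, 2)
}
strong = sorted(rule for rule, c in discovered.items() if c >= 0.75)

print("IF a THEN b holds with confidence", hypothesis_conf)
print("rules found by exploration:", strong)
```

In confirmatory mode the search space is a single rule; in exploratory mode it is every candidate rule, which is why scalability matters far more for discovery.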

Page 7: Introduction to knowledge  Discovery  in Databases and  Data Mining

WHAT DO YOU WANT TO LEARN FROM YOUR DATA? KDD APPROACHES

Data

classification

regression

clustering

summarization

dependency/assoc. analysis

change/deviation detection
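Three of the approaches listed above (summarization, change/deviation detection, and regression) can be sketched on the same small data set. The points, the z-score cutoff of 1.5, and the choice of a least-squares line are assumptions made for this illustration, not methods prescribed by the slides.

```python
from statistics import mean, pstdev

# Hypothetical data: (x, y) pairs that follow y = 2x, plus one odd point.
points = [(1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0), (5, 30.0)]
ys = [y for _, y in points]

# Summarization: a compact description of the data.
summary = {"n": len(ys), "mean_y": mean(ys), "min_y": min(ys), "max_y": max(ys)}

# Change/deviation detection: flag values far from the mean (z-score above 1.5).
mu, sigma = mean(ys), pstdev(ys)
outliers = [p for p in points if abs(p[1] - mu) / sigma > 1.5]

# Regression: least-squares line fitted to the points that were not flagged.
kept = [p for p in points if p not in outliers]
kx = mean(x for x, _ in kept)
ky = mean(y for _, y in kept)
slope = sum((x - kx) * (y - ky) for x, y in kept) / sum((x - kx) ** 2 for x, _ in kept)
intercept = ky - slope * kx

print(summary)
print("deviating points:", outliers)
print(f"fitted line: y = {slope} * x + {intercept}")
```

The same data answers different questions depending on the approach: what the data looks like overall, which observations do not fit, and how one variable depends on another.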

[Figure: sample outputs for the approaches above. Surviving labels: bar chart axes (1st Qtr to 4th Qtr; East, West, North); blue, orange; nodes A, B, C, D; values 0.5, 0.75, 0.3; rule fragments IF A & B THEN, IF A & D THEN]

• IF a & b & c THEN d & k
• IF k & a THEN e
• A, B -> C 80%
• C, D -> A 22%
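A percentage attached to an association rule such as A, B -> C 80% is commonly read as the rule's confidence: among the records that contain A and B, the share that also contain C. That reading is an assumption here, since the slide does not say which measure it reports. The sketch below computes support and confidence over made-up transactions; the data was chosen so that A, B -> C comes out at 80%, and it makes no attempt to reproduce the slide's other figure.

```python
# Hypothetical transactions over items A-D.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "B"},
    {"C", "D"},
    {"B", "C", "D"},
    {"D"},
]

def support(itemset, rows):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= r for r in rows) / len(rows)

def confidence(lhs, rhs, rows):
    """Among transactions containing `lhs`, the fraction also containing `rhs`."""
    lhs_support = support(lhs, rows)
    if lhs_support == 0:
        return 0.0
    return support(lhs | rhs, rows) / lhs_support

for lhs, rhs in [({"A", "B"}, {"C"}), ({"C", "D"}, {"A"})]:
    print(
        f"{sorted(lhs)} -> {sorted(rhs)}: "
        f"support={support(lhs | rhs, transactions):.2f}, "
        f"confidence={confidence(lhs, rhs, transactions):.0%}"
    )
```

Support measures how often a rule applies at all, while confidence measures how reliable it is when it does apply; rule-mining algorithms typically filter on both.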

Page 9: Introduction to knowledge  Discovery  in Databases and  Data Mining

ACADEMIC/OPEN SOURCE DATA MINING SYSTEMS

• WEKA: Frank et al., University of Waikato, New Zealand
• RapidMiner: Klinkenberg et al., Univ. of Dortmund, Germany
• R Programming Language: Ross Ihaka and Robert Gentleman, Univ. of Auckland, New Zealand
• Python Data Mining Libraries
• and many more …

Page 10: Introduction to knowledge  Discovery  in Databases and  Data Mining

DATA MINING RESOURCES – JOURNALS

• Data Mining and Knowledge Discovery Journal

Newsletters:

• ACM SIGKDD Explorations Newsletter

Related Journals:

• TKDE: IEEE Transactions on Knowledge and Data Engineering
• TODS: ACM Transactions on Database Systems
• JACM: Journal of the ACM
• Data and Knowledge Engineering
• JIIS: Journal of Intelligent Information Systems

Page 11: Introduction to knowledge  Discovery  in Databases and  Data Mining

DATA MINING RESOURCES – CONFERENCES

• KDD: ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining

• ICDM: IEEE International Conference on Data Mining

• SDM: SIAM International Conference on Data Mining

• PKDD: European Conference on Principles and Practice of Knowledge Discovery in Databases

• PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining

• DaWaK: Intl. Conference on Data Warehousing and Knowledge Discovery

Related Conferences:

• ICML: Intl. Conf. on Machine Learning

• IDEAL: Intl. Conf. on Intelligent Data Engineering and Automated Learning

• IJCAI: International Joint Conference on Artificial Intelligence

• AAAI: American Association for Artificial Intelligence Conference

• SIGMOD/PODS: ACM Intl. Conference on Management of Data

• ICDE: International Conference on Data Engineering

• VLDB: International Conference on Very Large Data Bases

Page 12: Introduction to knowledge  Discovery  in Databases and  Data Mining

DATA MINING RESOURCES – BOOKS, DATASETS, …

See resources webpage at:

• http://web.cs.wpi.edu/~ruiz/KDDRG/resources.html

Page 13: Introduction to knowledge  Discovery  in Databases and  Data Mining

SUMMARY

• KDD is the “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”

• The KDD process includes data collection and pre-processing, data mining, and evaluation and validation of the discovered patterns

• Data mining is the discovery and extraction of patterns from data, not the extraction of data

• Important challenges in data mining: privacy, security, scalability, real-time processing, and handling non-conventional data