data mining: introductionturgaybilgin/2015-2016-bahar/... · 2016-02-19 · why data mining? the...

Data Mining: Introduction

Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes

– Data collection and data availability

Automated data collection tools, database systems, the Web, and a computerized society

– Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation, …

Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets

Evolution of Sciences

Before 1600, empirical science

1600-1950s, theoretical science

– Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.

1950s-1990s, computational science

– Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.

1990-now, data science

– The flood of data from new scientific instruments and simulations

– The ability to economically store and manage petabytes of data online

– The Internet and computing Grid that make all these archives universally accessible

– Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes.

Evolution of Database Technology

1960s:

– Data collection, database creation, IMS and network DBMS

1970s:

– Relational data model, relational DBMS implementation

1980s:

– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)

– Application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s:

– Data mining, data warehousing, multimedia databases, and Web databases

2000s:

– Stream data management and mining

– Data mining and its applications

– Web technology and global information systems

Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused

– Web data, e-commerce

– Purchases at department/grocery stores

– Bank/credit card transactions

Computers have become cheaper and more powerful

Why Mine Data? Scientific Viewpoint

Data is collected and stored at enormous speeds (GB/hour)

– remote sensors on a satellite

– telescopes scanning the skies

– microarrays generating gene expression data

– scientific simulations generating terabytes of data

Traditional techniques are infeasible for raw data

Data mining may help scientists

– in classifying and segmenting data

Mining Large Data Sets - Motivation

There is often information “hidden” in the data that is not readily evident

Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000.]

From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

What is Data Mining?

Many Definitions

– Non-trivial extraction of implicit, previously unknown and potentially useful information from data

KDD Process

Input Data → Data Pre-Processing → Data Mining → Post-Processing
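The stages named in the KDD process can be sketched as a chain of small functions. This is only an illustrative sketch: the function names, the sample records, and the choice of frequency counting as the "mining" step are assumptions, not part of the original slides.

```python
from collections import Counter

def preprocess(records):
    """Pre-processing: drop missing values, normalize case and whitespace."""
    return [r.strip().lower() for r in records if r and r.strip()]

def mine(clean_records):
    """Data mining: here, simply count how often each value occurs."""
    return Counter(clean_records)

def postprocess(counts, min_count=2):
    """Post-processing: keep only patterns frequent enough to report."""
    return {item: n for item, n in counts.items() if n >= min_count}

# Hypothetical input data, invented for illustration.
input_data = ["Milk", " bread", None, "milk ", "", "Bread", "eggs", "MILK"]

patterns = postprocess(mine(preprocess(input_data)))
print(patterns)
```

In practice each stage is far more involved, but the shape stays the same: raw input is cleaned, a mining step extracts candidate patterns, and a final step filters them for reporting.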

Data Mining in Business Intelligence

[Figure: pyramid of layers, with increasing potential to support business decisions toward the top]

Layers, top to bottom:

– Decision Making

– Data Presentation: Visualization Techniques

– Data Mining: Information Discovery

– Data Exploration: Statistical Summary, Querying, and Reporting

– Data Preprocessing/Integration, Data Warehouses

– Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

What is (not) Data Mining?

What is Data Mining?

– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)

What is not Data Mining?

– Look up phone number in phone directory
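The contrast between the two examples above can be made concrete with a small sketch. The directory entries, phone numbers, and function names below are hypothetical and exist only to illustrate the difference between retrieving a stored fact and deriving a pattern.

```python
from collections import Counter

# Hypothetical directory entries: (surname, city, phone number).
directory = [
    ("O'Brien", "Boston", "555-0101"),
    ("O'Reilly", "Boston", "555-0102"),
    ("O'Rourke", "Boston", "555-0103"),
    ("Smith", "Boston", "555-0104"),
    ("Smith", "Denver", "555-0105"),
    ("Garcia", "Denver", "555-0106"),
    ("O'Brien", "Denver", "555-0107"),
    ("Nguyen", "Denver", "555-0108"),
]

# Not data mining: a lookup retrieves a fact that is stored explicitly.
def lookup(surname, city):
    return [phone for s, c, phone in directory if s == surname and c == city]

# Closer to data mining: derive a pattern that no single record states,
# here the share of surnames starting with a given prefix in each city.
def prefix_share_by_city(prefix):
    totals, hits = Counter(), Counter()
    for surname, city, _ in directory:
        totals[city] += 1
        if surname.startswith(prefix):
            hits[city] += 1
    return {city: hits[city] / totals[city] for city in totals}

print(lookup("Smith", "Denver"))
print(prefix_share_by_city("O'"))
```

The lookup returns something a person typed in; the per-city share is a summary that only emerges from looking across many records.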

Origins of Data Mining

Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

[Figure: data mining at the intersection of machine learning/pattern recognition, statistics/AI, and database systems]

Data Mining Tasks

Prediction Methods

– Use some variables to predict unknown or future values of other variables.

Description Methods

– Find human-interpretable patterns that describe the data.

From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
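A minimal sketch of the two kinds of method, under stated assumptions: the records are invented, a least-squares line stands in for a prediction method, and a plain summary stands in for a description method. Neither is claimed to be the approach used in the cited source.

```python
from statistics import mean

# Hypothetical records: (years_as_customer, monthly_spend).
records = [(1, 20.0), (2, 30.0), (3, 40.0), (4, 50.0)]

def fit_line(points):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in points)
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    return slope, y_bar - slope * x_bar

# Predictive: use one variable to estimate an unseen value of another.
slope, intercept = fit_line(records)
predicted_spend = slope * 5 + intercept

# Descriptive: summarize the data in a form a person can read directly.
spends = [y for _, y in records]
summary = {"average_spend": mean(spends), "spend_range": (min(spends), max(spends))}

print(predicted_spend)
print(summary)
```

The prediction answers a question about a case not in the data (a fifth-year customer), while the summary only restates what the existing records already contain in a more readable form.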

Why Data Mining?—Potential Applications

Data analysis and decision support

– Market analysis and management

Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation

– Risk analysis and management

Forecasting, customer retention, quality control

– Fraud detection and detection of unusual patterns (outliers)

Other Applications

– Text mining and Web mining

– Bioinformatics and bio-data analysis
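Market basket analysis, listed above under market analysis, is commonly framed in terms of support and confidence. The sketch below computes both for item pairs; the baskets and function names are hypothetical, and a real system would use a dedicated algorithm such as Apriori rather than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def pair_support(transactions):
    """Fraction of transactions containing each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(transactions) for pair, n in counts.items()}

def confidence(transactions, antecedent, consequent):
    """Share of baskets with the antecedent that also hold the consequent."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
print(support[("bread", "milk")])
print(confidence(baskets, "diapers", "beer"))
```

Pairs are stored in sorted order, so a pair is always looked up alphabetically, for example `("bread", "milk")` rather than `("milk", "bread")`.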

Ex. 1: Market Analysis and Management

Where does the data come from?

– Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

Target marketing

– Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

– Determine customer purchasing patterns over time

Customer profiling

– What types of customers buy what products (clustering or classification)

Customer requirement analysis

– Predict what factors will attract new customers
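Finding clusters of similar customers, as described under target marketing, can be sketched with a plain k-means loop. The customer records, the two features, and the hand-picked starting centroids are all assumptions made for illustration; the slides do not prescribe a particular clustering algorithm.

```python
from math import dist
from statistics import mean

# Hypothetical customers: (annual income in thousands, monthly spend).
customers = [(20, 200), (22, 220), (25, 210), (80, 900), (85, 950), (90, 920)]

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(mean(axis) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Starting centroids are chosen by hand here; real code would randomize them.
centroids, clusters = k_means(customers, centroids=[(20, 200), (90, 920)])
print(clusters)
```

The two features are on different scales, so spend dominates the distance in this sketch; rescaling features before clustering is usually advisable.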

Ex. 2: Corporate Analysis & Risk Management

Finance planning and asset evaluation

– cash flow analysis and prediction

– cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

Resource planning

– summarize and compare the resources and spending

Ex. 3: Fraud Detection & Mining Unusual Patterns

Applications: Health care, retail, credit card service, telecomm.

– Auto insurance: fraud detection

– Money laundering: suspicious monetary transactions

– Medical insurance

Professional patients, ring of doctors.

Unnecessary or correlated screening tests

– Anti-terrorism
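One simple way to surface unusual patterns such as suspicious transactions is a robust outlier score. The sketch below uses the modified z-score based on the median absolute deviation; the transaction amounts are invented, and the 3.5 threshold is a common convention rather than anything stated in the slides.

```python
from statistics import median

# Hypothetical card transaction amounts, invented for illustration.
amounts = [42.0, 38.5, 51.0, 45.2, 39.9, 47.3, 980.0, 44.1]

def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The score uses the median and the median absolute deviation (MAD),
    which are less distorted by extreme values than mean and std dev.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # All values (nearly) identical: nothing to flag.
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]

suspicious = flag_outliers(amounts)
print(suspicious)
```

A flagged value is only a candidate for review; real fraud detection combines many signals and domain rules.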

Data Mining Tasks...

Classification [Predictive]

Clustering [Descriptive]

Association Rule Discovery [Descriptive]

Sequential Pattern Discovery [Descriptive]

Regression [Predictive]

Deviation/Anomaly/Outlier Detection [Predictive]
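Of the tasks listed above, classification can be illustrated with a k-nearest-neighbor vote. This is a minimal sketch under assumptions: the labeled examples, the two features, and the labels are hypothetical, and k-nearest-neighbor is just one of many classification techniques.

```python
from collections import Counter
from math import dist

# Hypothetical labeled examples: ((income in thousands, age), label).
training = [
    ((25, 22), "declined"),
    ((30, 25), "declined"),
    ((28, 30), "declined"),
    ((70, 40), "approved"),
    ((85, 45), "approved"),
    ((90, 38), "approved"),
]

def classify(point, examples, k=3):
    """Label a point by majority vote among its k closest examples."""
    nearest = sorted(examples, key=lambda ex: dist(point, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((80, 42), training))
print(classify((27, 24), training))
```

The model here is predictive in the sense used above: known attributes of a new record are used to estimate a label that is not yet known.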