introduction to datamining

40
1 INTRODUCTION

Upload: sas-sndp-yogam-college-konni

Post on 26-Jan-2015

575 views

Category:

Education


1 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Introduction to DataMining

1

INTRODUCTION

Page 2: Introduction to DataMining

A young, fast growing and promising field

2

Page 3: Introduction to DataMining

INTRODUCTION

Data mining (the analysis step of the "Knowledge Discovery and Data Mining" process, or KDD)

Extracting hidden information An interdisciplinary subfield of computer

science The computational process of discovering

patterns in large data sets Involving methods at the intersection of

Artificial intelligence, Machine learning, Statistics, and Database systems.

3

Page 4: Introduction to DataMining

INTORODUCTION(CONTD..)

• The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Aside from the raw analysis step, it involves• database and data management aspects

data pre-processing model inference considerations

• complexity considerations, post-processing of discovered structures, visualization, and online updating.

4

Page 5: Introduction to DataMining

Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes

Eg: Global backbone telecommunication network carry tens of

petabytes everyday

(1024 Gigabytes = 1 Terabyte)( 1024 Terabytes = 1 Petabyte)

Data collection and data availability

Automated data collection tools, database systems, Web,

computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation, …

Society and everyone: news, digital cameras,…

5

Page 6: Introduction to DataMining

Why Data Mining?

“Necessity is the mother of invention” - Data mining—Automated analysis of massive data sets

6

Page 7: Introduction to DataMining

What Motivated Data Mining?

We are drowning in data, but starving for knowledge!

7

Page 8: Introduction to DataMining

Evolution of Database Technology

Data mining can be viewed as a result of natural evolution of IT

1960s: Data collection, database creation and network DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO,

deductive, etc.) Application-oriented DBMS (spatial, scientific, engineering,

etc.)

8

Page 9: Introduction to DataMining

Evolution of Database Technology 1990s:

Data mining, data warehousing, multimedia databases, and Web databases

2000s Stream data management and mining Data mining and its applications Web technology (XML, data integration) and

global information systems

9

Page 10: Introduction to DataMining

10

Page 11: Introduction to DataMining

What Is Data Mining?

Data mining (knowledge discovery from data) Extraction of interesting (non-trivial, implicit, previously

unknown and potentially useful) patterns or knowledge from huge amount of data

Alternative names Knowledge discovery (mining) in databases (KDD),

knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Watch out: Is everything “data mining”? Simple search and query processing (Deductive) expert systems

11

Page 12: Introduction to DataMining

Data Mining: Confluence of Multiple Disciplines

Data Mining

Database Technology Statistics

MachineLearning

PatternRecognition

AlgorithmOther

Disciplines

Visualization

12

Page 13: Introduction to DataMining

Knowledge Discovery (KDD) Process

Data mining—core of knowledge discovery process

Data Cleaning

Data Integration

Databases

Data Warehouse

Task-relevant Data

Selection

Data Mining

Pattern Evaluation

13

Page 14: Introduction to DataMining

Knowledge Process

1. Data cleaning – to remove noise and inconsistent data

2. Data integration – to combine multiple source 3. Data selection – to retrieve relevant data for analysis4. Data transformation – to transform data into

appropriate form for data mining5. Data mining- An essential process where intelligent

methods are applied to extract data patterns6. Pattern Evaluation-Identify truly interesting patterns

representing knowledge based on interestingness measure

7. Knowledge presentation-visualization and representation techniques

14

Page 15: Introduction to DataMining

Example: A Web Mining Framework

Web mining usually involves Data cleaning Data integration from multiple sources Warehousing the data Data cube construction Data selection for data mining Data mining Presentation of the mining results Patterns and knowledge to be used or stored

into knowledge-base

15

Page 16: Introduction to DataMining

16

Data Mining in Business Intelligence

Increasing potentialto supportbusiness decisions End User

Business Analyst

DataAnalyst

DBA

Decision Making

Data Presentation

Visualization Techniques

Data MiningInformation Discovery

Data ExplorationStatistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data Warehouses

Data SourcesPaper, Files, Web documents, Scientific experiments, Database Systems

Page 17: Introduction to DataMining

17

KDD Process: A Typical View from ML and Statistics

Input Data Data Mining

Data Pre-Processing

Post-Processing

This is a view from typical machine learning and statistics communities

Data integrationNormalizationFeature selectionDimension reduction

Pattern discoveryAssociation & correlationClassificationClusteringOutlier analysis… … … …

Pattern evaluationPattern selectionPattern interpretationPattern visualization

Page 18: Introduction to DataMining

Data Mining: On What Kinds of Data?

Database-oriented data sets and applications Relational database, data warehouse, transactional database

Advanced data sets and advanced applications Data streams and sensor data Time-series data, temporal data, sequence data (incl. bio-sequences) Structure data, graphs, social networks and multi-linked data Object-relational databases Heterogeneous databases and legacy databases Spatial data Multimedia database Text databases The World-Wide Web

18

Page 19: Introduction to DataMining

RDBMS

A database that has a collection of tables of data items, all of which is formally described and organized according to the relational model.

Data in a single table represents a relation.

Each table schema must identify a column or group of columns, called the primary key, to uniquely identify each row.

A relationship can then be established between each row in the table and a row in another table by creating a foreign key, a column or group of columns in one table that points to the primary key of another table.

19

Page 20: Introduction to DataMining

RDBMS

• Database normalization: The relational model offers various levels of refinement of table organization and reorganization .

• DBMS of a relational database is called an RDBMS, and is the software of a relational database.

• The relational database was first defined in June 1970 by Edgar Codd, of IBM's San Jose Research Laboratory.

• Codd's view of what qualifies as an RDBMS is summarized in Codd's 12 rules.

• A relational database has become the predominant choice in storing data.

20

Page 21: Introduction to DataMining

Relational database terminology.

A relation is defined as a set of tuples that have the same attributes

21

Page 22: Introduction to DataMining

RDMS(contd..)

Example :Allelectronics(Company described by relation tables:Customer,item,employee and branch)

Relation : customer is a group of entities describing the customer information(Cust_id,cust_name, Age,Occupation,annual income, credit information and category)Tables: used to represent the relationship between or among multiple entities Database queries(SQL): For data accessing using relational operations such as join, selection and projection

22

Page 23: Introduction to DataMining

Mining Relational databases

Can go further by searching for trends or data patterns

Examples Analyze customer data to predict the risk of

customers based on their income ,age Detect deviations: sales comparison with

previous year RDBMS are one of the most commonly available

and richest information repositories for data mining

23

Page 24: Introduction to DataMining

24

What is a Data Warehouse?

Defined in many different ways, but not rigorously.

A decision support database that is maintained separately

from the organization’s operational database

Support information processing by providing a solid

platform of consolidated, historical data for analysis.

“A data warehouse is a subject-oriented, integrated, time-

variant, and nonvolatile collection of data in support of

management’s decision-making process.”—W. H. Inmon

Data warehousing:

The process of constructing and using data warehouses

Page 25: Introduction to DataMining

DATA WAREHOUSES

Is a repository of information collected from multiple sources, stored under a unified schema.

Constructed via Data cleaning Data integration Data transformation Data Loading and periodic data

refreshing

25

Page 26: Introduction to DataMining

26

Page 27: Introduction to DataMining

Data warehouse is modeled by a multidimensional data structure

Data cube: precomputation &fast access of summarized data Each dimension corresponds to an attribute or a set of

attributes in a schema Each cell stores the value of some aggregate measure

(count, sum etc) Example: In Allelectronics the cube has three dimension :• Address(with city values, U S A, Canada, Mexico)• Time (with quarter values Q1,Q2,Q3,Q4)• Item(with type values )

DATA WAREHOUSES(contd…)

27

Page 28: Introduction to DataMining

28

Multidimensional Data

Sales volume as a function of product, month, and region

Product

Region

Month

Dimensions: Product, Location, TimeHierarchical summarization paths

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

Page 29: Introduction to DataMining

29

A Sample Data Cube

Total annual salesof TVs in U.S.A.

Date

Produ

ct

Cou

ntrysum

sum TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

Page 30: Introduction to DataMining

Data mining functionalities

Tasks can be classified : Predictive(makes prediction about values of data using

known results found from different data) Descriptive( characterize properties of a target data set)

Explore the properties of the data examined

Data mining functionalities are used to specify the kinds of patterns

Characterization and Discrimination The mining of frequent patterns, associations and

correlations Classification and regression Cluster analysis Outlier analysis

30

Page 31: Introduction to DataMining

Characterization and Discrimination

Data characterization is a summarization of the general characteristics or features of a target class of data

Output of characterization can be presented in various forms Pie charts Bar charts Curves multidimensional data cube Multidimensional tablesDescriptions presented in generalized relations- Characteristic

rulesExample: In Allelectronics : Summarize the characteristic of

customers who spend more than $5000 a year at Allelectronics

this can be view in any dimension, such as on occupation to view these customers according to their type of employment.

31

Page 32: Introduction to DataMining

Data Discrimination

Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or more multiple contrasting class

Output representation similar to characterization description

Discrimination description expressed in the form of rules –Discrimination rules

Target and contrasting class specified by the userExample: User want to compare the general features of software products with

sales that increased by 10% and decreased by 30% during the same period

32

Page 33: Introduction to DataMining

Mining Frequent Patterns, Associations, Correlations

Frequent pattern Frequent item sets(Milk, bread) Frequent subsequences(Latop ,digital

camera ,memory card)

Frequent sub structures (graphs ,trees)Mining frequent patterns leads to the

discovery of interesting associations and correlation within data.

33

Page 34: Introduction to DataMining

Association analysis(example)Item frequently purchased together

buys(X, ”computer”) =>buys(X, ”software”) [support=1%, confidence=50%]

X - a variable representing a customer A confidence or certainty – 50%(chance)

1%(under analysis)Association rule- with single-dimension association rules

“computer => software[1%,50%]”.Age(X,”20..29”) ^ income(X,”40K..49K”)=>buys(X ,”laptop”) [support=2%, confidence=60%] (Multidimensional association rule)

34

Page 35: Introduction to DataMining

Classification and Regression for Predictive Analysis

Classification: the process of finding a model(function)that describes and distinguishes data classes or concepts

Model derived from analysis of a set of training data

Models are represented as Classification rules(IF-THEN rules) Decision trees Mathematical formulae or Neural networks

Regression: Statistical methodology for numeric prediction

35

Page 36: Introduction to DataMining

Cluster Analysis and Outlier Analysis Cluster Analysis:

Determining similarity among data on predefined attributes

The most similar data are grouped into clusters

Outlier Analysis Outliers: The dataset contain objects that do

not required for the model of the data Analysis of outlier data is referred to as

Outlier Analysis or Anomaly mining Detected using statstical tests

36

Page 37: Introduction to DataMining

37

Which Technologies Are Used?

Data Mining

MachineLearning

Statistics

Applications

Algorithm

PatternRecognition

High-PerformanceComputing

Visualization

Database Technology

Page 38: Introduction to DataMining

Potential Applications of Data Mining Where there are data there

are data mining applications Data analysis and decision support ( Business Intelligence)

Market analysis and management Target marketing, customer relationship management

(CRM), market basket analysis, cross selling, market segmentation

Risk analysis and management Forecasting, customer retention, improved underwriting,

quality control, competitive analysis Fraud detection and detection of unusual patterns (outliers)

Other Applications Text mining (news group, email, documents) and Web mining Stream data mining Bioinformatics and bio-data analysis

38

Page 39: Introduction to DataMining

39

Major Issues in Data Mining (1)

Mining Methodology

Mining various and new kinds of knowledge

Mining knowledge in multi-dimensional space

Data mining: An interdisciplinary effort

Boosting the power of discovery in a networked environment

Handling noise, uncertainty, and incompleteness of data

Pattern evaluation and pattern- or constraint-guided mining

User Interaction

Interactive mining

Incorporation of background knowledge

Presentation and visualization of data mining results

Page 40: Introduction to DataMining

40

Major Issues in Data Mining (2)

Efficiency and Scalability

Efficiency and scalability of data mining algorithms

Parallel, distributed, stream, and incremental mining methods

Diversity of data types

Handling complex types of data

Mining dynamic, networked, and global data repositories

Data mining and society

Social impacts of data mining

Privacy-preserving data mining

Invisible data mining