An Introduction to Data Mining

Upload: alban-gaines

Post on 16-Jan-2016


Page 1: An Introduction to Data Mining. Definition  Data mining refers to the mining or discovery of new information in terms of patterns or rules from vast

An Introduction to Data Mining


Definition

Data mining refers to the mining or discovery of new information in terms of patterns or rules from vast amounts of data.

It is the process used to find new, hidden or unexpected patterns in data to predict the future of the business.

It is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.


Data mining

Process of semi-automatically analyzing large databases to find patterns that are:
  valid: hold on new data with some certainty
  novel: non-obvious to the system
  useful: should be possible to act on the item
  understandable: humans should be able to interpret the pattern

Also known as Knowledge Discovery in Databases (KDD)


Data Mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.

The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications.

The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).
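The three stages can be sketched on toy data. This is only a minimal illustration and not a method taken from the slides: the generated records, the assumed "spend above 60" ground truth, and the single-threshold model are all hypothetical choices made for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the toy data is reproducible


def make_record():
    """Hypothetical record: (monthly_spend, responded_to_promotion)."""
    spend = random.uniform(0, 100)
    # Assumed ground truth: high spenders respond; 10% of labels are flipped as noise.
    responded = (spend > 60) != (random.random() < 0.1)
    return spend, responded


records = [make_record() for _ in range(400)]
train, holdout = records[:300], records[300:]

# Stage 1: initial exploration. Compare average spend of the two groups.
mean_spend_responders = statistics.mean(s for s, y in train if y)
mean_spend_others = statistics.mean(s for s, y in train if not y)


# Stage 2: model building with validation. The "model" is one spend threshold.
def accuracy(threshold, data):
    return sum((s > threshold) == y for s, y in data) / len(data)


best_threshold = max(range(0, 101), key=lambda t: accuracy(t, train))
# Validate on records the model has not seen ("new subsets of data").
holdout_accuracy = accuracy(best_threshold, holdout)


# Stage 3: deployment. Apply the fitted model to a new customer.
def predict(spend):
    return spend > best_threshold
```

Calling `predict(95)` then scores a new customer with the threshold chosen in stage 2. A real project would use a richer model, but the split into explore, fit and validate, then apply stays the same.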


DATA MINING

Data mining refers to extracting or 'mining' knowledge from large amounts of data.

Mining characterizes the process of extracting precious material from a set of raw materials.


The KDD process

Problem formulation
Data collection
  subset data: sampling might hurt if highly skewed data
  feature selection: principal component analysis, heuristic search
Pre-processing: cleaning
  name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values
Transformation: map complex objects, e.g. time series data, to features, e.g. frequency
Choosing mining task and mining method
Result evaluation and visualization

Knowledge discovery is an iterative process
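The pre-processing step above can be sketched as follows. This is a minimal example under stated assumptions: the records, the field names (`name`, `billing`, `income`), and the choice to fill missing values with the mean are all hypothetical, and a real cleaning step would be driven by the data at hand.

```python
# Hypothetical raw customer records (all field names are assumptions).
raw = [
    {"name": " Alice Smith ", "billing": "annual", "income": 52000},
    {"name": "alice smith", "billing": "yearly", "income": 52000},
    {"name": "Bob Jones", "billing": "monthly", "income": None},
    {"name": "Carol Diaz", "billing": "Yearly", "income": 61000},
]

# Different words with the same meaning map to one canonical value.
SYNONYMS = {"yearly": "annual"}


def normalize(record):
    """Name cleaning plus unification of values with the same meaning."""
    name = " ".join(record["name"].split()).title()
    billing = record["billing"].strip().lower()
    billing = SYNONYMS.get(billing, billing)
    return {"name": name, "billing": billing, "income": record["income"]}


def clean(records):
    # Duplicate removal: keep the first record for each (name, billing) key.
    seen = set()
    unique = []
    for rec in map(normalize, records):
        key = (rec["name"], rec["billing"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    # Supplying missing values: fill with the mean of the known incomes
    # (one simple choice among many).
    known = [r["income"] for r in unique if r["income"] is not None]
    fill = sum(known) / len(known)
    for rec in unique:
        if rec["income"] is None:
            rec["income"] = fill
    return unique


cleaned = clean(raw)
```

Here the two spellings of Alice Smith collapse into one record only because normalization runs before duplicate detection, so the order of the two steps matters.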


Knowledge Discovery Process

Phases:
1. Data Selection
2. Data Integration
3. Data Cleaning
4. Enrichment
5. Data Transformation or encoding
6. Data Mining


In data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected.

Data integration is where multiple data sources are integrated.

The data cleaning process may then correct invalid zip codes or eliminate records with incorrect phone prefixes.

Enrichment typically enhances the data with additional sources of information.

Data transformation and encoding may be done to reduce the amount of data.

Data mining techniques are used to mine different rules and patterns.
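As one illustration of the final phase, the sketch below mines simple "if X then Y" rules from store transactions using support and confidence. The transactions, the function name `mine_rules`, and both thresholds are assumptions chosen for the example; the slides do not name a specific rule-mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions (one set of items per purchase).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def mine_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for single-item rules lhs -> rhs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of transactions containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: of the transactions with lhs, the share that also have rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = mine_rules(transactions)
```

With these thresholds the pair (butter, milk) is dropped for low support, and the rule bread -> butter is dropped for low confidence even though butter -> bread is kept, which shows that confidence depends on the direction of the rule.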



Knowledge Discovery Process

[Figure: Knowledge Discovery Process. Diagram labels: Raw Data, Integration, Data Warehouse, Selection & Cleaning, Target Data, Transformation, Transformed Data, Data Mining, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]


Why Use Data Mining Today?

Human analysis skills are inadequate:
  Volume and dimensionality of the data
  High data growth rate

Availability of:
  Data
  Storage
  Computational power
  Off-the-shelf software
  Expertise


Why Data Mining

Credit ratings/targeted marketing:
  Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
  Identify likely responders to sales promotions

Fraud detection:
  Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

Customer relationship management:
  Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Data Mining helps extract such information


Data Mining Step in Detail

2.1 Data preprocessing
  Data selection: Identify target datasets and relevant fields
  Data cleaning
    Remove noise and outliers
    Data transformation
    Create common units
    Generate new fields
2.2 Data mining model construction
2.3 Model evaluation
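The preprocessing items in step 2.1 (create common units, remove outliers, generate new fields) might look like the sketch below. The order records, the field names, and the outlier rule are assumptions for illustration. The slides do not say how outliers should be detected; a median absolute deviation (MAD) rule is used here as one common choice.

```python
import statistics

# Hypothetical order records; amounts arrive in mixed units (dollars and cents).
orders = [
    {"amount": 20.0, "unit": "usd", "items": 2},
    {"amount": 2500, "unit": "cents", "items": 5},
    {"amount": 30.0, "unit": "usd", "items": 3},
    {"amount": 22.0, "unit": "usd", "items": 2},
    {"amount": 9000.0, "unit": "usd", "items": 1},  # likely a data-entry error
    {"amount": 2800, "unit": "cents", "items": 4},
]

UNITS_PER_USD = {"usd": 1, "cents": 100}


def preprocess(records, k=5.0):
    # Create common units: express every amount in dollars.
    amounts = [r["amount"] / UNITS_PER_USD[r["unit"]] for r in records]

    # Remove outliers: drop values more than k MADs away from the median.
    # (Assumes the MAD is non-zero, which holds for this toy data.)
    median = statistics.median(amounts)
    mad = statistics.median(abs(a - median) for a in amounts)

    result = []
    for rec, usd in zip(records, amounts):
        if abs(usd - median) <= k * mad:
            result.append(
                {
                    "amount_usd": usd,
                    "items": rec["items"],
                    # Generate a new field derived from existing ones.
                    "usd_per_item": usd / rec["items"],
                }
            )
    return result


prepared = preprocess(orders)
```

Unit conversion runs before outlier detection on purpose: in raw form the 2500-cent order would look like an extreme value even though it is an ordinary 25-dollar purchase.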


Preprocessing and Mining

[Figure: Preprocessing and Mining. Diagram labels: Original Data, Data Integration and Selection, Target Data, Preprocessing, Preprocessed Data, Model Construction, Patterns, Interpretation, Knowledge]


Applications

Banking: loan/credit card approval
  predict good customers based on old customers
Customer relationship management:
  identify those who are likely to leave for a competitor
Targeted marketing:
  identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
  from an online stream of events, identify fraudulent events
Manufacturing and production:
  automatically adjust knobs when process parameter changes


Applications

Medicine: disease outcome, effectiveness of treatments
  analyze patient disease history: find relationship between diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
  identify new galaxies by searching for sub clusters
Web site/store design and promotion:
  find affinity of visitor to pages and modify layout


Application Areas

Industry                  Application
Finance                   Credit Card Analysis
Insurance                 Claims, Fraud Analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data Service providers    Value added data
Utilities                 Power usage analysis


Relationship of Data Mining with other fields

Overlaps with machine learning, statistics, artificial intelligence, databases, visualization but more stress on
  scalability of number of features and instances
  stress on algorithms and architectures whereas foundations of methods and formulations provided by statistics and machine learning.
  automation for handling large, heterogeneous data


Data Mining in Use

The US Government uses Data Mining to track fraud
A Supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross Selling
Target Marketing
Holding on to Good Customers
Weeding out Bad Customers