An Introduction to Data Mining
Definition
Data mining refers to the mining or discovery of new information, in the form of patterns or rules, from vast amounts of data.
It is the process used to find new, hidden or unexpected patterns in data that can help predict future business outcomes.
It is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Data mining
Process of semi-automatically analyzing large databases to find patterns that are:
  valid: hold on new data with some certainty
  novel: non-obvious to the system
  useful: should be possible to act on the item
  understandable: humans should be able to interpret the pattern
Also known as Knowledge Discovery in Databases (KDD)
Data mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.
The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications.
The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).
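The three stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer records are synthetic, the "model" is a single balance threshold, and the names `make_customers`, `accuracy`, and `predict_default` are hypothetical rather than part of any tool mentioned in the slides.

```python
import random

def make_customers(n, seed):
    """Generate hypothetical (balance, defaulted) records for illustration."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        balance = rng.uniform(0, 10_000)
        # Assumed toy relationship: higher balances default more often.
        defaulted = rng.random() < balance / 12_000
        rows.append((balance, defaulted))
    return rows

def accuracy(rows, threshold):
    """Share of rows where 'balance > threshold' matches the default flag."""
    hits = sum((balance > threshold) == defaulted for balance, defaulted in rows)
    return hits / len(rows)

data = make_customers(1000, seed=0)

# Stage 1, exploration: look at a simple summary of the data.
default_rate = sum(defaulted for _, defaulted in data) / len(data)

# Stage 2, model building with validation: choose the threshold on one
# subset, then check the detected pattern on a held-out subset.
train, holdout = data[:700], data[700:]
candidates = range(0, 10_001, 500)
best_threshold = max(candidates, key=lambda t: accuracy(train, t))
holdout_accuracy = accuracy(holdout, best_threshold)

# Stage 3, deployment: apply the model to new records.
def predict_default(balance):
    return balance > best_threshold

print(f"default rate: {default_rate:.2f}")
print(f"threshold: {best_threshold}, holdout accuracy: {holdout_accuracy:.2f}")
```

Scoring the threshold on records that were not used to choose it is what the validation/verification stage asks for: a pattern that only holds on the training subset would not count as valid.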
Data Mining
Data mining refers to extracting or 'mining' knowledge from large amounts of data.
Mining, in the ordinary sense, is the process of extracting precious material from a set of raw materials.
The KDD process
Problem formulation
Data collection
  subset data: sampling might hurt if data is highly skewed
  feature selection: principal component analysis, heuristic search
Pre-processing: cleaning
  name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values
Transformation: map complex objects, e.g. time series data, to features, e.g. frequency
Choosing mining task and mining method
Result evaluation and visualization
Knowledge discovery is an iterative process
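As a rough illustration of the transformation step (mapping a time series to a frequency feature), the sketch below reduces a whole series to one number: its dominant frequency. The function name `dominant_frequency`, the naive discrete Fourier transform, and the synthetic readings are all assumptions made for the example, not the method of any particular tool.

```python
import cmath
import math

def dominant_frequency(series, sample_rate):
    """Return the frequency (cycles per time unit) with the largest DFT magnitude.

    Uses a naive O(n^2) discrete Fourier transform. The mean is removed
    first so the constant (zero-frequency) component does not dominate.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sample_rate / n

# Hypothetical series: 64 daily readings that repeat every 8 days.
readings = [10 + math.sin(2 * math.pi * t / 8) for t in range(64)]
feature = dominant_frequency(readings, sample_rate=1.0)
print(feature)  # cycles per day
```

A mining method can then work with this single feature per series instead of the 64 raw readings.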
Knowledge Discovery Process
Phases:
1. Data selection
2. Data integration
3. Data cleaning
4. Enrichment
5. Data transformation or encoding
6. Data mining
Data selection: data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected.
Data integration is where multiple data sources are integrated.
The data cleaning process may then correct invalid zip codes or eliminate records with incorrect phone prefixes.
Enrichment typically enhances the data with additional sources of information.
Data transformation and encoding may be done to reduce the amount of data.
Data mining techniques are used to mine different rules and patterns.
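The cleaning phase can be sketched as follows. The records, field names, and the `clean` function are hypothetical, and the rules are assumptions chosen for illustration: zip codes are taken to be five digits, "yearly" is treated as a synonym of "annual", and missing incomes are filled with the median of the known values.

```python
import re
from statistics import median

# Hypothetical raw records; field names are illustrative only.
raw = [
    {"name": "Ann Lee", "zip": "30301", "billing": "annual", "income": 52000},
    {"name": "Ann Lee", "zip": "30301", "billing": "annual", "income": 52000},
    {"name": "Bo Chen", "zip": "3O3O2", "billing": "yearly", "income": None},
    {"name": "Cy Diaz", "zip": "94105", "billing": "monthly", "income": 61000},
]

SYNONYMS = {"yearly": "annual"}        # different words, same meaning
ZIP_PATTERN = re.compile(r"\d{5}")     # assumes 5-digit zip codes

def clean(records):
    seen, cleaned = set(), []
    for rec in records:
        rec = dict(rec)                # work on a copy, leave input untouched
        rec["billing"] = SYNONYMS.get(rec["billing"], rec["billing"])
        if not ZIP_PATTERN.fullmatch(rec["zip"]):
            rec["zip"] = None          # flag invalid zip for later correction
        key = (rec["name"], rec["zip"], rec["billing"], rec["income"])
        if key in seen:                # drop exact duplicates
            continue
        seen.add(key)
        cleaned.append(rec)
    # Supply missing incomes with the median of the known values.
    known = [r["income"] for r in cleaned if r["income"] is not None]
    fill = median(known)
    for rec in cleaned:
        if rec["income"] is None:
            rec["income"] = fill
    return cleaned

cleaned = clean(raw)
for rec in cleaned:
    print(rec)
```

Here an invalid zip code is only flagged rather than corrected, since correcting it would need a reference source such as the enrichment data mentioned above.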
[Figure: Knowledge Discovery Process. Labels: Raw Data, Integration, Data Warehouse, Selection & Cleaning, Target Data, Transformation, Transformed Data, Data Mining, Patterns and Rules, Interpretation & Evaluation, Understanding, Knowledge]
Why Use Data Mining Today?
Human analysis skills are inadequate:
  Volume and dimensionality of the data
  High data growth rate
Availability of:
  Data storage
  Computational power
  Off-the-shelf software
  Expertise
Why Data Mining
Credit ratings/targeted marketing:
  Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
  Identify likely responders to sales promotions
Fraud detection:
  Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
Customer relationship management:
  Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data mining helps extract such information.
Data Mining Step in Detail
2.1 Data preprocessing
  Data selection: identify target datasets and relevant fields
  Data cleaning: remove noise and outliers
  Data transformation: create common units, generate new fields
2.2 Data mining model construction
2.3 Model evaluation
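A minimal sketch of the preprocessing step (2.1), assuming hypothetical purchase records: amounts are converted to one unit, values far from the median are dropped as outliers, and a new field is generated. The field names, the `preprocess` function, and the cutoff of five median absolute deviations are illustrative choices, not prescribed values.

```python
from statistics import median

# Hypothetical purchase records with amounts in mixed units.
records = [
    {"id": 1, "amount": 120.0, "unit": "USD", "visits": 4},
    {"id": 2, "amount": 9500.0, "unit": "cents", "visits": 5},
    {"id": 3, "amount": 110.0, "unit": "USD", "visits": 2},
    {"id": 4, "amount": 130.0, "unit": "USD", "visits": 5},
    {"id": 5, "amount": 99999.0, "unit": "USD", "visits": 1},  # likely entry error
]

UNITS_PER_USD = {"USD": 1, "cents": 100}

def preprocess(rows, max_deviations=5):
    # Data transformation: create common units (everything in USD).
    converted = [
        {**r, "amount_usd": r["amount"] / UNITS_PER_USD[r["unit"]]} for r in rows
    ]
    # Data cleaning: drop outliers using the median absolute deviation (MAD).
    center = median(r["amount_usd"] for r in converted)
    mad = median(abs(r["amount_usd"] - center) for r in converted)
    kept = [
        r for r in converted
        if abs(r["amount_usd"] - center) <= max_deviations * mad
    ]
    # Data transformation: generate a new field from existing ones.
    for r in kept:
        r["amount_per_visit"] = r["amount_usd"] / r["visits"]
    return kept

prepared = preprocess(records)
for r in prepared:
    print(r["id"], r["amount_usd"], r["amount_per_visit"])
```

The median-based rule is used here because, with only a handful of records, one extreme value would distort a mean and standard deviation enough to hide itself.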
[Figure: Preprocessing and Mining. Labels: Original Data, Data Integration and Selection, Target Data, Preprocessing, Preprocessed Data, Model Construction, Patterns, Interpretation, Knowledge]
Applications
Banking: loan/credit card approval; predict good customers based on old customers
Customer relationship management: identify those who are likely to leave for a competitor
Targeted marketing: identify likely responders to promotions
Fraud detection: telecommunications, financial transactions; from an online stream of events, identify fraudulent events
Manufacturing and production: automatically adjust knobs when process parameters change
Applications
Medicine: disease outcome, effectiveness of treatments; analyze patient disease history to find relationships between diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis: identify new galaxies by searching for sub-clusters
Web site/store design and promotion: find affinity of visitors to pages and modify layout
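One simple way to look for page affinity, sketched here under assumptions, is to count how often two pages are visited in the same session. The session data and the `page_pair_counts` helper are hypothetical, and raw co-occurrence counting is only one of several possible measures.

```python
from collections import Counter
from itertools import combinations

# Hypothetical visitor sessions: the pages each visitor viewed.
sessions = [
    ["home", "shoes", "cart"],
    ["home", "shoes"],
    ["home", "hats"],
    ["shoes", "cart"],
    ["shoes", "cart"],
]

def page_pair_counts(visits):
    """Count how many sessions contain each unordered pair of pages."""
    counts = Counter()
    for pages in visits:
        # sorted(set(...)) makes each pair appear once per session,
        # always in the same order.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return counts

pair_counts = page_pair_counts(sessions)
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs with high counts are candidates for being linked or placed close together in the site layout.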
Application Areas
Industry                 Application
Finance                  Credit card analysis
Insurance                Claims, fraud analysis
Telecommunication        Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data service providers   Value added data
Utilities                Power usage analysis
Relationship of Data Mining with other fields
Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on:
  scalability in the number of features and instances
  algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning
  automation for handling large, heterogeneous data
Data Mining in Use
The US Government uses data mining to track fraud
A supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross selling
Target marketing
Holding on to good customers
Weeding out bad customers