Chapter-1
Introduction to Data Mining
In recent years there has been explosive growth in the generation and storage of electronic information as more and more operations of enterprises are computerized. The cost of processing power and storage has been declining for many years. Major enterprises have been using database management systems for at least 30 years and have accumulated large amounts of information. Some companies, for example major banks and telecommunication companies, now have mountains of data and have started to realize that the information accumulated over the years is an important strategic asset. There is potential business intelligence hidden in this large volume of data. What these companies require are techniques that allow them to distil the most valuable information from the accumulated data. The field of data mining provides such techniques.
Data mining, or knowledge discovery in databases (KDD), deals with exploration techniques based on advanced analytical methods and tools for handling large amounts of information. It is a collection of techniques that finds novel patterns that can assist an enterprise in understanding its business better. Many data mining techniques are closely related to machine learning techniques.
The data mining process involves much hard work, including perhaps building a data warehouse if the enterprise does not have one. A typical data mining process is likely to include the following steps:
1. Requirements analysis: The enterprise decision makers need to formulate the goals that the data mining process is expected to achieve. The business problem must be clearly defined. One cannot use data mining without a good idea of what kind of outcomes the enterprise is looking for, since the technique to be used and the data that is required are likely to be different for different goals. Furthermore, if the objectives have been clearly defined, it is easier to evaluate the results of the project. Once the goals have been agreed upon, the following further steps are needed.
2. Data selection and collection: This step may include finding the best source databases for the data that is required. If the enterprise has implemented a data warehouse, then most of the data could be available there. If the data is not available in the warehouse or the enterprise does not have a warehouse, the source OLTP systems need to be identified and the required information extracted and stored in some temporary system. In some cases, only a sample of the data available may be required.
3. Cleaning and preparing data: This may not be an onerous task if a data warehouse containing the required data exists, since most of this must already have been done when data was loaded into the warehouse. Otherwise this task can be very resource intensive, and sometimes more than 50% of the effort in a data mining project is spent on this step. Essentially, a data store that integrates data from a number of databases may need to be created. When integrating data, one often encounters problems like identifying data, dealing with missing data, data conflicts, and ambiguity. An ETL (extraction, transformation and loading) tool may be used to overcome these problems.
4. Data mining exploration and validation: Once appropriate data has been collected and cleaned, it is possible to start data mining exploration. Assuming that the user has access to one or more data mining tools, a data mining model may be constructed based on the enterprise's needs. It may be possible to take a sample of data and apply a number of relevant techniques. For each technique the results should be evaluated and their significance interpreted. This is likely to be an iterative process which should lead to selection of one or more techniques that are suitable for further exploration, testing, and validation.
5. Implementing, evaluating, and monitoring: Once a model has been selected and validated, the model can be implemented for use by the decision makers. This may involve software development for generating reports, or for results visualization and explanation for managers. It may be that more than one technique is available for the given data mining task. It is then important to evaluate the results and choose the best technique. Evaluation may involve checking the accuracy and effectiveness of the technique. Furthermore, there is a need for regular monitoring of the performance of the techniques that have been implemented. It is essential that use of the tools by the managers be monitored and the results evaluated regularly. Every enterprise evolves with time and so must the data mining system. Therefore, monitoring is likely to lead from time to time to refinement of the tools and techniques that have been implemented.
6. Results visualization: Explaining the results of data mining to the decision makers is an important step of the data mining process. Most commercial data mining tools include data visualization modules. These tools are often vital in communicating the data mining results to the managers, although a problem arises when results dealing with a number of dimensions must be visualized on a two-dimensional computer screen or printout. Clever data visualization tools are being developed to display results that deal with more than two dimensions. The visualization tools available should be tried and used if found effective for the given problem.
1.1 Data Mining
DBMSs gave access to the stored data but provided no analysis of it. Analysis was required to unearth the hidden relationships within the data, i.e. for decision support. The increase in the size of databases, e.g. VLDBs, called for automated techniques for analysis, as databases had grown beyond manual extraction. The typical scientific user knew nothing of commercial business applications, and business database programmers knew nothing of massively parallel principles. The solution was for database software producers to create easy-to-use tools and form strategic relationships with hardware manufacturers. This resulted in the development of a separate specialized stream of Computer Science called Data Mining.
Data mining can be defined as, "Non-trivial extraction of implicit, previously unknown, and potentially useful information from data".
Data mining covers a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and to extract these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous but, as it stands, of low value, since no direct use can be made of it; it is the hidden information in the data that is really useful. Data mining encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies. Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. The computer is responsible for finding the patterns by identifying the underlying rules and features in the data. It is possible to 'strike gold' in unexpected places, as the data mining software extracts patterns not previously discernible, or so obvious that no one has noticed them before. In data mining, large volumes of data are sifted in an attempt to find something worthwhile.
Data mining plays a leading role in every facet of business. It is one of the ways by which a company can gain competitive advantage. Through the application of data mining, one can turn large volumes of data collected from various front-end systems, such as Transaction Processing Systems, ERP, and operational CRM, into meaningful knowledge.
At this stage it is pertinent to understand the difference between (i) data, (ii) information and (iii) knowledge.
For example, a number like 3000 is data. Adding context to data converts it into information: adding the context 'Rs' to the number 3000 makes it Rs. 3000, which conveys much more meaning than the bare number. A rule contains still more context (in a relative sense) than information. 'If the salary of a person is Rs 3000 p.m., he is poorly paid' is such a statement, and a generalized rule, for example 'if the salary of a person is between Rs 2000 and Rs 5000 p.m., he is poorly paid', represents knowledge.
So in essence, data mining technology converts the data contained in a database into rules, and generalized rules are knowledge. Put another way, it summarizes the information contained in thousands of records into a few rules, say 20 or 30, which are more actionable and meaningful.
Issues in data mining are noisy data, missing values, static data, sparse data, dynamic data, relevance, interestingness, heterogeneity, algorithm efficiency, and the size and complexity of data.
Businesses are looking for new ways to let end users find the data they need to
make decisions, serve customers and gain the competitive edge.
Some of the applications where Data Mining is used:
• Medicine - drug side effects, hospital cost analysis, genetic sequence analysis,
prediction etc.
• Finance- stock market prediction, credit assessment, fraud detection etc.
• Marketing/sales - product analysis, buying patterns, sales prediction, target
mailing, identifying 'unusual behavior' etc.
• Knowledge Acquisition
• Scientific discovery - superconductivity research, etc.
• Engineering - automotive diagnostic expert systems, fault detection etc.
1.2 Data Mining Techniques
Major techniques for converting Data into information/knowledge are:
(1) Classification/Prediction
(2) Affinity Grouping or Association Rules
(3) Clustering
(4) Visualization techniques
Classification / Prediction:
Classification/prediction involves the use of training examples where the value of the variable to be classified is already known. The model is presented with data containing the class (which is known) as well as other attributes describing the class. The underlying algorithm learns the mapping between the attribute values and the class. The attributes and the class are sometimes also known as the independent and dependent variables.
Once the model has learnt the mapping, any new data containing the attributes can be presented to the model, and the model classifies the sample into one of the learnt classes.
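The idea above can be sketched with a minimal 1-nearest-neighbour classifier: the "model" memorises training examples whose class is already known and assigns a new sample the class of the most similar example. The attribute names and figures below are invented for illustration only.

```python
# Minimal sketch of classification/prediction: training examples carry
# a known class label; an unseen sample is assigned the label of the
# nearest training example. (Data and attribute names are invented.)

def squared_distance(a, b):
    """Squared Euclidean distance between two attribute vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(training, sample):
    """Predict the class of an unseen sample from labelled training data.

    training: list of (attributes, class_label) pairs.
    sample:   attribute vector whose class is unknown.
    """
    attrs, label = min(training, key=lambda row: squared_distance(row[0], sample))
    return label

# Training examples: (income_in_thousands, age) -> credit risk class
training = [
    ((20, 25), "high-risk"),
    ((90, 50), "low-risk"),
    ((25, 30), "high-risk"),
    ((80, 45), "low-risk"),
]

print(classify(training, (85, 48)))  # a new applicant resembling the low-risk rows
```

Real tools use richer models (decision trees, neural networks), but the input/output contract is the same: labelled attributes in, a predicted class out.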
Affinity Grouping / Association Rules:
The task is to determine which items go together. A retail store can use affinity grouping for store layout (identifying items that are purchased together), profitability analysis, promotion planning, etc. It can also be used to identify cross-selling opportunities.
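Determining which items go together can be sketched as counting how often item pairs co-occur in the same transaction. The baskets below are invented; real association rule miners (e.g. Apriori-style algorithms) add support/confidence thresholds on top of exactly this kind of count.

```python
# Minimal sketch of affinity grouping: count how often pairs of items
# appear together in the same transaction. (Basket data is invented.)
from collections import Counter
from itertools import combinations

def item_pairs(transactions):
    """Count co-occurring item pairs across all transactions."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pairs = item_pairs(baskets)
# ("bread", "butter") co-occurs in 3 of 4 baskets - a candidate for
# store-layout or cross-selling decisions.
print(pairs[("bread", "butter")])  # → 3
```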
Clustering:
Clustering is the task of segmenting a diverse group into a number of smaller groups. What distinguishes clustering from classification is that clustering does not rely on predefined classes.
Cluster Analysis
o Clustering and segmentation is basically partitioning the database so
that each partition or group is similar according to some criteria or
metric
o Clustering according to similarity is a concept which appears in many
disciplines e.g. in chemistry the clustering of molecules
o Data mining applications make use of clustering according to similarity
e.g. to segment a client/customer base
o It provides sub-groups of a population for further analysis or action -
very important when dealing with very large databases
o Can be used for profile generation for target marketing, i.e. where previous responses to mailing campaigns can be used to generate a profile of people who responded, and this can be used to predict response and filter mailing lists to achieve the best response
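Partitioning a database so that each group is similar under some metric can be sketched with a tiny k-means loop: assign each point to its nearest centre, then move each centre to the mean of its group. The one-dimensional data and starting centres are invented.

```python
# Minimal k-means sketch: partition points into k groups so that each
# group is similar under a distance metric. (Data and k are invented.)

def kmeans(points, centres, rounds=10):
    """Assign each point to its nearest centre, then move each centre
    to the mean of its assigned points; repeat for a fixed number of rounds."""
    for _ in range(rounds):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda j: abs(p - centres[j]))
            clusters[nearest].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

# Two obvious segments, e.g. small vs large purchase amounts.
points = [1, 2, 3, 100, 101, 102]
centres, clusters = kmeans(points, centres=[0.0, 50.0])
print(sorted(clusters[0]), sorted(clusters[1]))  # → [1, 2, 3] [100, 101, 102]
```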
Visualization:
There is an age-old saying that one picture is worth a hundred pages of written information. Visualization of data in a multi-dimensional space brings a lot of insight into the data. It also helps to identify outliers and unusual values before one starts the analysis.
1.3 Basic Steps In Data Mining
The basic steps in Data mining are:
1. Identify and obtain data
2. Validate, Explore, Clean data
3. Transform data to right level of Granularity
4. Add derived variables
5. Choose Modeling Techniques
6. Train the Model
7. Check the Model performance
8. Choose the best model
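Steps 5-8 above can be sketched as a simple model-selection loop: train each candidate technique, check its performance on held-out data, and keep the best. The two "models" below (a majority-class rule and a mean-threshold rule) and the data are invented stand-ins for real techniques.

```python
# Sketch of steps 5-8: train candidate models, score them on a test
# split, and choose the best. (Models and data are invented.)

def train_and_evaluate(models, train, test):
    """Fit each model on the training split, score it on the test split,
    and return the best model's name plus all the scores."""
    scores = {}
    for name, fit in models.items():
        predict = fit(train)                      # step 6: train the model
        correct = sum(predict(x) == y for x, y in test)
        scores[name] = correct / len(test)        # step 7: check performance
    return max(scores, key=scores.get), scores    # step 8: choose the best

def majority(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    common = max(set(labels), key=labels.count)
    return lambda x: common

def threshold(train):
    """Predict True when the input exceeds the training mean (an assumed heuristic)."""
    cut = sum(x for x, _ in train) / len(train)
    return lambda x: x > cut

train = [(1, False), (2, False), (3, False), (10, True), (11, True)]
test = [(0, False), (12, True)]
best, scores = train_and_evaluate({"majority": majority, "threshold": threshold},
                                  train, test)
print(best, scores)
```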
Broadly, the types of data used in data mining applications are given below:
- Demographic data such as age, income, profession etc.
- Transaction data: specific to the application
- Data shared within an industry: credit reports, catalogues etc.
- Data shared by business partners
Data collected for data mining can be grouped under four major categories:
(1) Categorical: a set of values which a field can take; there is no ordering or
hierarchy in the data. Examples: profession, qualification, city, PIN code etc.
(2) Rank: similar to categorical data but with a natural ordering. Examples
are income ranges like low, middle, high, or age: young, middle-aged, old, very
old etc.
(3) Numeric: amount purchased, quantity purchased, volume of transactions,
number of items in inventory etc.
(4) Date: information about the date. This contains a wealth of information
like
(i) Day of week
(ii) Day of month
(iii) Month in year
(iv) Quarter of the year
(v) Day of the year
When analyzing the data, date fields can be used in many different ways.
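Extracting the fields (i)-(v) from a single date value is straightforward with the standard library; a small sketch:

```python
# Sketch of deriving the analysis fields (i)-(v) listed above from one
# date field, using only the Python standard library.
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def date_features(d):
    """Derive day-of-week, day-of-month, month, quarter and day-of-year."""
    return {
        "day_of_week": WEEKDAYS[d.weekday()],      # (i)
        "day_of_month": d.day,                     # (ii)
        "month_in_year": d.month,                  # (iii)
        "quarter_of_year": (d.month - 1) // 3 + 1, # (iv)
        "day_of_year": d.timetuple().tm_yday,      # (v)
    }

print(date_features(date(1993, 7, 15)))
```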
Derived Variables:
Derived variables are calculated columns not present in the original data. One set of derived variables contains values produced by data aggregation, like weekly sales from daily sales; depending on the application, the data has to be aggregated. Another important class of derived variables are those that perform calculations on original data values. Examples are:
- Debt / earning
- Income / total sales
- Credit limit - balance, etc.
The purpose of derived variables is to find useful information that is not apparent in the original data.
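Both kinds of derived variable described above can be sketched in a few lines; the sales figures and ratio inputs below are invented.

```python
# Sketch of the two kinds of derived variables: aggregation (weekly
# sales from daily sales) and a calculated ratio (debt/earning).
# All figures are invented.

def weekly_sales(daily):
    """Aggregate a list of daily sales figures into weekly totals."""
    return [sum(daily[i:i + 7]) for i in range(0, len(daily), 7)]

def debt_to_earning(debt, earning):
    """Ratio variable calculated from two original fields."""
    return debt / earning

daily = [10, 12, 9, 11, 14, 20, 25,    # week 1
         8, 13, 10, 12, 15, 22, 24]    # week 2
print(weekly_sales(daily))             # → [101, 104]
print(debt_to_earning(3000, 12000))    # → 0.25
```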
Role of Fields in Data Mining:
Input field(s)/columns: used as inputs to the model; sometimes referred to as independent variables.
Target field/columns: the field we are trying to understand, normally linked to some form of behaviour. For example:
- Customer buys / does not buy a product
- Customer is profitable / not profitable etc.
Ignored fields/columns: columns that are not used.
Normal Data Format for Data Mining:
- All data should be in a single table or database view
- Each row should correspond to an instance that is relevant to the business
- Columns with a single value, or a unique value for every row, should be ignored
Data Comes From:
Operational systems
DBMS (database management systems)
ERP (enterprise resource planning) systems
Web servers and e-commerce databases
Billing system(s)
Telecom switches
Point of sale / ATMs
Data warehouses
1.4 Data Mining Process
Stages
• Data pre-processing
o heterogeneity resolution
o data cleansing
o data warehousing
• Data mining tools applied
o extraction of patterns from the pre-processed data
• Interpretation and evaluation
o user bias, i.e. the user can direct DM tools to areas of interest
• attributes of interest in databases
• goal of discovery
• domain knowledge
• prior knowledge or belief about the domain
[Figure: the iterative process - search for patterns (queries, rules, neural nets, ML, statistics etc.), analyst reviews output, revise and refine queries, interpret results and take action.]
1.5 Data Mining and Machine Learning
Differences
• Data Mining (DM) or Knowledge Discovery in Databases (KDD) is about
finding understandable knowledge
• Machine Learning (ML) is concerned with improving performance of an agent
o training a neural network to balance a pole is part of ML, but not of
KDD
• Efficiency of the algorithm and scalability is more important in DM or KDD
o DM is concerned with very large, real-world databases
o ML typically looks at smaller data sets
• ML has laboratory type examples for the training set
• DM deals with 'real world' data
o Real world data tend to have problems such as:
• missing values
• dynamic data
• pre-existing data
• noise
1.6 Issues in Data Mining
• Noisy data
• Missing values
• Static data
• Sparse data
• Dynamic data
• Relevance
• Interestingness
• Heterogeneity
• Algorithm efficiency
• Size and complexity of data
1.7 Knowledge Representation Methods
Neural Networks
o a trained neural network can be thought of as an "expert" in the category of information it has been given to analyze
o provides projections given new situations of interest and answers "what if" questions
o problems include:
• the resulting network is viewed as a black box
• no explanation of the results is given, i.e. it is difficult for the user to interpret the results
• difficult to incorporate user intervention
• slow to train due to their iterative nature
[Figure: a network mapping input factors to an output labelled "cancer risk".] A neural net can be trained to identify the risk of cancer from a number of factors.
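The iterative training mentioned above can be sketched with the smallest possible "network": a single perceptron unit whose weights are adjusted example by example. The two risk-factor scores and labels below are invented, not real medical data.

```python
# Minimal sketch of iterative network training: a single perceptron
# learns weights from labelled examples by repeatedly correcting its
# errors. (Risk-factor scores and labels are invented.)

def train_perceptron(examples, epochs=20, rate=0.1):
    """Learn weights w and bias b so that the unit's output matches the labels."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in examples:               # target is 0 or 1
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - out                   # correct toward the label
            w = [wi + rate * err * xi for wi, xi in zip(w, x)]
            b += rate * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# (factor_a, factor_b) -> high risk? (invented, linearly separable)
data = [((0, 2), 0), ((1, 3), 0), ((8, 6), 1), ((9, 5), 1)]
w, b = train_perceptron(data)
print(predict(w, b, (9, 6)))  # → 1
```

The learned weights also illustrate the "black box" problem noted above: they fit the data but offer no human-readable explanation.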
• Decision trees
o used to represent knowledge
o built using a training set of data and can then be used to classify new objects
o problems are:
• opaque structure - difficult to understand
• missing data can cause performance problems
• they become cumbersome for large data sets
o Example: [figure of a decision tree rooted at "outlook"]
• Rules
o probably the most common form of representation
o tend to be simple and intuitive
o unstructured and less rigid
o problems are:
• difficult to maintain
• inadequate to represent many types of knowledge
o Example format:
• if X then Y
• Frames
o templates for holding clusters of related knowledge about a very
particular subject
o a natural way to represent knowledge
o has a taxonomy approach
o problem is
• more complex than rule representation
Related Technologies
• Data Warehousing
• On-line Analytical Processing, OLAP
1.8 Data Warehousing
Definition
• A data warehouse can be defined as any centralized data repository which can
be queried for business benefit
• warehousing makes it possible to
o extract archived operational data
o overcome inconsistencies between different legacy data formats
o integrate data throughout an enterprise, regardless of location, format, or
communication requirements
o incorporate additional or expert information
Characteristics of a Data warehouse:
• Subject-oriented - data organized by subject instead of application e.g.
o an insurance company would organize their data by customer, premium,
and claim, instead of by different products (auto, life, etc.)
o contains only the information necessary for decision support processing
• Integrated - encoding of data is often inconsistent e.g.
o gender might be coded as "m" and "f" or 0 and 1, but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention
• Time-variant - the data warehouse contains a place for storing data that are five
to 10 years old or older e.g.
o this data is used for comparisons, trends, and forecasting
o these data are not updated
• Non-volatile - data are not updated or changed in any way once they enter the
data warehouse
o data are only loaded and accessed
Data Warehousing Processes:
• insulate data - i.e. the current operational information
o preserves the security and integrity of mission-critical OLTP applications
o gives access to the broadest possible base of data
• retrieve data - from a variety of heterogeneous operational databases
o data is transformed and delivered to the data warehouse/store based on a
selected model (or mapping definition)
o metadata - information describing the model and definition of the source
data elements
• data cleansing - removal of certain aspects of operational data, such as low-level transaction information, which slow down the query times.
• transfer - processed data transferred to the data warehouse, a large database on
a high performance box
Uses of a Data Warehouse
• A central store against which queries are run
o uses very simple data structures with few assumptions about the relationships between data
• A data mart is a small warehouse which provides subsets of the main store and summarized information
o depending on the requirements of a specific group/department
o marts often use multidimensional databases, which can speed up query processing as they can have data structures which reflect the most likely questions
Data Warehouse Model
[Figure: optimized OLTP databases (DB2, Oracle etc.) feeding the warehouse.]
Structure of Data inside the Data Warehouse
[Figure: levels of summarization surrounded by metadata - current detail (e.g. sales detail 1992-1993), older detail (sales detail 1982-1991), lightly summarized data (e.g. regional sales by week 1983-1993, weekly sales by sub-product 1985-1993), and highly summarized data (e.g. national sales by month 1985-1993, monthly sales by product line 1981-1993).]
Criteria for a Data Warehouse
• Load Performance
o require incremental loading of new data on a periodic basis
o must not artificially constrain the volume of data
• Load Processing
o data conversions, filtering, reformatting, integrity checks, physical
storage, indexing, and metadata update
• Data Quality Management
o ensure local consistency, global consistency, and referential integrity
despite "dirty" sources and massive database size
• Query Performance
o must not be slowed or inhibited by the performance of the data
warehouse RDBMS
• Terabyte Scalability
o Data warehouse sizes are growing at astonishing rates so RDBMS must
not have any architectural limitations. It must support modular and
parallel management.
• Mass User Scalability
o Access to warehouse data must not be limited to an elite few; it has to support hundreds, even thousands, of concurrent users while maintaining acceptable query performance.
• Networked Data Warehouse
o Data warehouses rarely exist in isolation; users must be able to look at and work with multiple warehouses from a single client workstation
• Warehouse Administration
o the large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility
• The RDBMS must Integrate Dimensional Analysis
o dimensional support must be inherent in the warehouse RDBMS to
provide the highest performance for relational OLAP tools
• Advanced Query Functionality
o End users require advanced analytic calculations, sequential and
comparative analysis, and consistent access to detailed and summarized
data
Problems with Data Warehousing
• The rush of companies to jump on the bandwagon: some companies have slapped 'data warehouse' labels on traditional transaction processing products and co-opted the lexicon of the industry in order to be considered players in this fast-growing category.
1.9 Data Warehousing & OLTP
Similarities and differences between OLTP and data warehouse systems:

                   OLTP                            Data Warehouse
Purpose            Run day-to-day operations       Information retrieval and analysis
Structure          RDBMS                           RDBMS
Data Model         Normalised                      Multi-dimensional
Access             SQL                             SQL plus data analysis extensions
Type of Data       Data that runs the business     Data that analyses the business
Condition of Data  Changing, incomplete            Historical, descriptive
• OLTP systems are designed to maximize transaction capacity but they:
o cannot be repositories of facts and historical data for business analysis
o cannot quickly answer ad hoc queries
o rapid retrieval is almost impossible
o data is inconsistent and changing, duplicate entries exist, entries can be missing
o OLTP offers large amounts of raw data which is not easily understood
• Typical OLTP query is a simple aggregation e.g.
o What is the current account balance for this customer?
Data Warehouse systems
• Data warehouses are interested in query processing as opposed to transaction processing
• Typical business analysis query e.g.
o Which product line sells best in middle-America and how does this
correlate to demographic data?
OLAP (On-line Analytical Processing)
• The problem is how to process larger and larger databases
• OLAP involves many data items (many thousands or even millions) which are involved in complex relationships
• Fast response is crucial in OLAP
• Difference between OLAP and OLTP:
o OLTP servers handle mission-critical production data accessed through simple queries
o OLAP servers handle management-critical data accessed through iterative analytical investigation
Common Analytical Operations
• Consolidation - involves the aggregation of data, i.e. simple roll-ups or complex expressions involving inter-related data
o e.g. sales offices can be rolled up to districts and districts rolled up to regions
• Drill-Down - can go in the reverse direction i.e. automatically display detail
data which comprises consolidated data
• "Slicing and Dicing" - the ability to look at the database from different viewpoints, e.g.
e.g.
o one slice of the sales database might show all sales of product type
within regions;
o another slice might show all sales by sales channel within each product
type
o often performed along a time axis in order to analyze trends and find
patterns
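Consolidation and slicing can be sketched over a tiny fact table held as plain tuples; all region, district, office names and sales figures below are invented.

```python
# Sketch of consolidation ("roll-up") and slicing over a tiny fact
# table: rows are (region, district, office, sales). Names and
# figures are invented.
from collections import defaultdict

facts = [
    ("north", "d1", "o1", 100),
    ("north", "d1", "o2", 150),
    ("north", "d2", "o3", 200),
    ("south", "d3", "o4", 300),
]

def roll_up(rows, level):
    """Consolidate office-level sales up to 'district' or 'region'."""
    key = {"region": 0, "district": 1}[level]
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[3]
    return dict(totals)

def slice_by_region(rows, region):
    """One 'slice' of the cube: only the rows for a given region."""
    return [r for r in rows if r[0] == region]

print(roll_up(facts, "district"))   # offices rolled up to districts
print(roll_up(facts, "region"))     # districts rolled up to regions
```

Drill-down is simply the reverse direction: starting from a regional total and displaying the district and office rows that make it up.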
Knowledge Acquisition using Data Mining
• Expert systems are models of real-world processes
• Much of the information is available straight from the process e.g.
o in production systems, data is collected for monitoring the system
o knowledge can be extracted using data mining tools
o experts can verify the knowledge
• Example
o TIGON project - detection and diagnosis of an industrial gas turbine engine
1.10 Data Mining Projects
UU - Jordanstown
• Data mining in the N Ireland Housing Executive
• Knee disorders classification
• Fault diagnosis in a telecommunication network
• A self-learning urology patient audit system
• Policy lapse/renewal prediction
• House price prediction
UU Example
Policy lapse/renewal prediction
• Problem - predicting whether a motor insurance policy will be lapsed or renewed
• 34 attributes stored for each policy
o 14 attributes were deemed relevant
o 2 attributes were derived from underlying attributes
• Predictive accuracy
o 71%
• In a period of 3 weeks achieved the same accuracy as statistical models developed by the insurance company, which had taken much longer to develop
MKS
The Mining Kernel System
• Based on the interdisciplinary approach of data mining
NetMap
Cluster Analysis
MKS
• Data pre-processing functionality i.e.
o statistical operations for removing outliers
o sampling
o data dimensionality reduction
o dealing with missing data
• Algorithms provided for
o classification
o association
• Facility to state what is interesting to the user and to present only interesting rules
• Facility to incorporate domain knowledge into the knowledge discovery
process
Classification
Classification is a well-known task in data mining that aims to predict the class of an unseen instance as accurately as possible. While single-label classification, which assigns each rule in the classifier the most obvious label, has been widely studied [9, 11, 13, 18], little work has been done on multi-label classification. Most of the work to date on multi-label classification is related to text categorisation [10, 15]. There are many approaches for building single-class classifiers from data, such as divide-and-conquer [14] and separate-and-conquer [8]. Most traditional learning techniques derived from these approaches, such as decision trees [7, 13] and statistical and covering algorithms [11], are unable to treat problems with multiple labels.
The most common multi-label classification approach is one-versus-the-rest (OvR) [17], which constructs a set of binary classifiers obtained by training on each possible class versus all the rest. The OvR approach performs a winner-take-all strategy that assigns a real value to each class to indicate the class membership.
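The OvR idea can be sketched as follows: for each class, build a binary scorer that separates that class from all the rest, then let the largest score win. The binary scorer below is a toy centroid comparison (nearer the class's centre than the rest's), not the classifiers used in the cited work, and the data is invented.

```python
# Sketch of one-versus-the-rest (OvR): one binary scorer per class,
# each trained on that class versus all the rest; prediction is
# winner-take-all over the real-valued scores. (Toy scorer, invented data.)

def train_ovr(examples):
    """examples: list of (value, class_label). For each class, build a
    binary scorer: positive means the sample is nearer this class's
    centre than the centre of everything else."""
    scorers = {}
    for label in {l for _, l in examples}:
        pos = [x for x, l in examples if l == label]
        neg = [x for x, l in examples if l != label]
        cp, cn = sum(pos) / len(pos), sum(neg) / len(neg)
        scorers[label] = lambda x, cp=cp, cn=cn: abs(x - cn) - abs(x - cp)
    return scorers

def predict_ovr(scorers, x):
    """Winner-take-all: the class whose scorer gives the largest value."""
    return max(scorers, key=lambda label: scorers[label](x))

examples = [(1, "a"), (2, "a"), (10, "b"), (11, "b"), (20, "c"), (21, "c")]
scorers = train_ovr(examples)
print(predict_ovr(scorers, 9))  # → b
```

Note that for K classes OvR builds K binary problems, versus the K(K-1)/2 pairwise problems of the one-versus-one approach discussed next.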
Another known approach in multi-label classification is one-versus-one (OvO) [15], which constructs a classifier trained on each possible pair of classes. For K classes, this results in K(K-1)/2 binary classifiers, which may be problematic if K is large. On the other hand, the OvR approach has been criticised: since it trains several separate classification problems so that each class can be separated from the rest, problems arise such as contradictory decisions, i.e. when two or more rules predict the test instance, and no decision, i.e. when none of the resulting rules can predict the test instance [6].
Another important task in data mining is the discovery of all association rules in data. Classification and association rule discovery are similar, except that there is only one target to predict in classification, i.e. the class, while association rules can predict any attribute in the data. In recent years, a new approach that integrates association rules with classification, named associative classification, has been proposed [9, 12]. A few accurate classifiers that use associative classification have been presented in the past few years, such as CBA [12], CMAR [9], and CPAR [18]. In existing associative classification techniques, only one class label is associated with each rule derived, and thus the rules are not suitable for the prediction of multiple labels. However, multi-label classification may often be useful in practice. Consider, for example, a document which has two class labels, "Health" and "Government", and assume that the document is associated 50 times with the "Health" label and 48 times with the "Government" label, and that the number of times the document appears in the training data is 98. A traditional associative technique like CBA generates the rule associated with the "Health" label simply because it has the larger representation, and discards the other rule. However, it is very useful to generate the other rule, since it brings up useful knowledge with a large representation in the training data, and thus could take a role in classification. In this paper, a novel approach for multi-class and multi-label classification, named multi-class, multi-label associative classification (MMAC), is proposed.